Goals and Scope

Type: informational
Labels: protocol
Created: December 12, 2025
Author: Rory Byrne <rory@rory.bio>

This proposal is open for feedback.


Abstract

This document defines the goals, scope, and non-goals of the Open Science Archive (OSA) protocol. It answers what problems OSA solves and what it explicitly does not attempt to solve. Read this before the architecture overview (OEP-0004).

Motivation

Before diving into architecture and technical details, readers need to understand what OSA is for. This document provides that context.

Specification

Problems

Building serious data infrastructure is prohibitively hard

The Protein Data Bank took many years and significant funding to develop its validation pipelines, curation workflows, and quality standards. A new field wanting similar rigor faces years of development, substantial engineering investment, and the risk of building something that doesn't get adopted. Most fields can't justify this cost, so they settle for minimal archives or none at all.

Archives don't interoperate

Despite the cost of building them, many archives exist, but each one is an island with its own API, conventions, and tooling ecosystem. This fragments the landscape: a tool built for one archive must be rewritten for another, and researchers write bespoke integrations for each data source. Smaller fields get left behind entirely because they lack the user base to justify custom tooling.

Archives duplicate generic infrastructure

Every new archive re-invents submission portals, validation runners, metadata schemas, search APIs, and access control. The domain-specific parts get less attention because generic plumbing consumes the budget.

Goals

Deploy an archive in days, not years

A research group can spin up a production-grade archive with validation, submission workflows, and APIs without a dedicated engineering team.

Discover data by quality

Researchers can search for data using quality criteria, not just metadata keywords.
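As a minimal sketch of what quality-based discovery could look like, the Python below filters records by caller-chosen quality floors in addition to a keyword. The Record class, the search function, and the "completeness" attribute are hypothetical names invented for illustration; they are not part of any OSA API.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Record:
    """Hypothetical archive record; not an OSA-defined schema."""
    identifier: str
    keywords: frozenset
    quality: dict = field(default_factory=dict)  # attribute name -> numeric score


def search(records, keyword=None, min_quality=None):
    """Return records matching an optional keyword and caller-chosen quality floors."""
    floors = min_quality or {}
    hits = []
    for record in records:
        if keyword is not None and keyword not in record.keywords:
            continue
        # A record that lacks a requested attribute cannot satisfy its floor.
        if all(name in record.quality and record.quality[name] >= floor
               for name, floor in floors.items()):
            hits.append(record)
    return hits


records = [
    Record("rec-1", frozenset({"kinase"}), {"completeness": 0.98}),
    Record("rec-2", frozenset({"kinase"}), {"completeness": 0.61}),
    Record("rec-3", frozenset({"kinase"})),  # no quality attributes computed yet
]
high_quality = search(records, keyword="kinase", min_quality={"completeness": 0.9})
print([r.identifier for r in high_quality])  # prints ['rec-1']
```

The floor is supplied by the person searching, not by the archive, which keeps the judgment of what counts as good enough with users and communities.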

Plug in domain-specific logic

Communities define their own validators, curation tools, and data conversions. The protocol provides the machinery; domains provide the semantics.

Attribute every quality claim

Every quality assertion carries provenance: who computed it, when, and with what software. Users can verify a claim rather than take it on trust.
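One way to picture an attributed claim is as an immutable record that bundles the value with its provenance, plus a helper that lets a user recompute the value and compare. The field names, the QualityClaim class, and the recheck function below are assumptions made for this sketch; the protocol's actual representation is defined elsewhere.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class QualityClaim:
    """Hypothetical shape of an attributed quality claim."""
    record_id: str
    attribute: str
    value: float
    computed_by: str        # who computed it
    computed_at: datetime   # when
    software: str           # with what software
    software_version: str


def recheck(claim, recompute, tolerance=1e-9):
    """Recompute the attribute independently and compare it with the claimed value."""
    return abs(recompute(claim.record_id) - claim.value) <= tolerance


claim = QualityClaim(
    record_id="rec-1",
    attribute="completeness",
    value=0.98,
    computed_by="validator@example.org",
    computed_at=datetime(2025, 12, 12, tzinfo=timezone.utc),
    software="completeness-check",
    software_version="1.0.0",
)

# A user who has the named software can rerun it instead of trusting the claim.
print(recheck(claim, lambda record_id: 0.98))  # prints True
```

Recording the software name and version is what makes the recheck meaningful: it tells the user which computation to repeat.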

Publish immutable, citable records

Published data gets a stable identifier. Updates create new versions rather than editing existing ones, so citations remain valid.
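The versioning rule can be sketched as an append-only store in which publishing never overwrites anything. The Archive class and the "base.vN" identifier format are hypothetical choices for this example; the source does not specify an identifier scheme.

```python
from types import MappingProxyType


class Archive:
    """Toy append-only archive: each publish adds a version, nothing is edited."""

    def __init__(self):
        self._versions = {}  # base identifier -> list of read-only payloads

    def publish(self, base_id, payload):
        """Store a new version and return its versioned identifier."""
        versions = self._versions.setdefault(base_id, [])
        # Copy and wrap the payload so stored versions cannot be mutated later.
        versions.append(MappingProxyType(dict(payload)))
        return f"{base_id}.v{len(versions)}"

    def resolve(self, identifier):
        """Look up the exact version that a citation points to."""
        base_id, _, number = identifier.rpartition(".v")
        return self._versions[base_id][int(number) - 1]


archive = Archive()
first = archive.publish("osa:demo-0001", {"title": "Initial upload"})
second = archive.publish("osa:demo-0001", {"title": "Corrected upload"})

print(first, second)                    # prints osa:demo-0001.v1 osa:demo-0001.v2
print(archive.resolve(first)["title"])  # prints Initial upload
```

A citation to the first identifier still resolves to the original content after the correction is published, which is the property the goal describes.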

Non-Goals

OSA is not a storage provider

The protocol defines how archives behave, not where bytes live. Storage is the operator's choice (for example, local disk, S3, or institutional storage).

OSA is not a compute platform

Validators run during submission, but OSA is not a general-purpose compute system. It does not manage jobs, queues, or cluster resources.

OSA is not a single database

There is no central OSA database. The protocol enables federation between independent nodes.

OSA does not define domain semantics

The protocol does not say what "quality" means for any particular field. Communities define their own vocabularies and validators.

OSA does not enforce quality thresholds

The protocol computes and exposes quality attributes. It does not decide what is "good enough". That judgment belongs to users and communities.

OSA does not replace existing archives

OSA is designed to complement existing infrastructure. Index Nodes can compute attributes about data in external archives (GEO, SRA, PDB) without requiring those archives to change.

Rationale

Separating goals from architecture makes it easier to evaluate whether the architecture actually serves the goals. It also helps readers who want to understand the purpose without reading technical details.

The explicit non-goals prevent scope creep and set expectations. OSA is infrastructure for a specific set of problems, not a universal solution.

Backwards Compatibility

N/A. This is an informational document.

Security & Privacy

This document defines goals and scope. Security and privacy considerations are addressed in the relevant technical OEPs.

Open Issues

  • Should OSA define a minimal "core" vocabulary that all nodes understand, or is everything domain-specific?
  • How do we balance ease of deployment with operational security requirements?