Goals and Scope
- Type: informational
- Labels: protocol
- Created: December 12, 2025
This proposal is open for feedback.
Abstract
This document defines the goals, scope, and non-goals of the Open Science Archive (OSA) protocol. It answers what problems OSA solves and what it explicitly does not attempt to solve. Read this before the architecture overview (OEP-0004).
Motivation
Before diving into architecture and technical details, readers need to understand what OSA is for. This document provides that context.
Specification
Problems
Building serious data infrastructure is prohibitively hard
The Protein Data Bank took many years and significant funding to develop its validation pipelines, curation workflows, and quality standards. A new field wanting similar rigor faces years of development, substantial engineering investment, and the risk of building something that doesn't get adopted. Most fields can't justify this cost, so they settle for minimal archives or none at all.
Archives don't interoperate
Many archives do exist, but each one is an island with its own API, conventions, and tooling ecosystem. The result is a fragmented landscape: a tool built for one archive must be rewritten for another, and researchers write bespoke integrations for each data source. Smaller fields get left behind entirely because they lack the user base to justify custom tooling.
Archives duplicate generic infrastructure
Every new archive re-invents submission portals, validation runners, metadata schemas, search APIs, and access control. The domain-specific parts get less attention because generic plumbing consumes the budget.
Goals
Deploy an archive in days, not years
A research group can spin up a production-grade archive with validation, submission workflows, and APIs without a dedicated engineering team.
Discover data by quality
Researchers can search for data using quality criteria, not just metadata keywords.
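As a rough illustration of what searching by quality could look like, the sketch below filters an in-memory list of records by keyword and by caller-supplied limits on quality attributes. The record layout, the `resolution_angstrom` attribute, and the `search` function are all hypothetical; this document does not define a query API.

```python
from typing import Dict, List

# Hypothetical in-memory index; field names are illustrative only.
RECORDS: List[dict] = [
    {"id": "rec-1", "keywords": ["kinase"], "quality": {"resolution_angstrom": 1.8}},
    {"id": "rec-2", "keywords": ["kinase"], "quality": {"resolution_angstrom": 3.2}},
    {"id": "rec-3", "keywords": ["protease"], "quality": {"resolution_angstrom": 1.5}},
]


def search(records: List[dict], keyword: str, max_values: Dict[str, float]) -> List[str]:
    """Return ids of records that match the keyword and whose quality
    attributes are at or below the caller-supplied limits."""
    hits = []
    for record in records:
        if keyword not in record["keywords"]:
            continue
        quality = record["quality"]
        # A record lacking a requested attribute is excluded rather than assumed good.
        if all(name in quality and quality[name] <= limit
               for name, limit in max_values.items()):
            hits.append(record["id"])
    return hits


results = search(RECORDS, "kinase", {"resolution_angstrom": 2.0})
```

The limit is supplied by the caller, not built into the index, so the researcher decides what counts as acceptable for their purpose.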
Plug in domain-specific logic
Communities define their own validators, curation tools, and data conversions. The protocol provides the machinery; domains provide the semantics.
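One way to picture the split between machinery and semantics is a small validator registry: generic code runs whatever validators are registered, and a community supplies the functions. This is a minimal sketch under assumed names (`register`, `run_validators`, and the `completeness` validator are hypothetical), not the protocol's actual plug-in interface.

```python
from typing import Callable, Dict

# A validator takes a record and returns named quality attributes.
Validator = Callable[[dict], Dict[str, float]]

_REGISTRY: Dict[str, Validator] = {}


def register(name: str) -> Callable[[Validator], Validator]:
    """Decorator that adds a domain-supplied validator to the registry."""
    def decorator(fn: Validator) -> Validator:
        _REGISTRY[name] = fn
        return fn
    return decorator


@register("completeness")
def completeness(record: dict) -> Dict[str, float]:
    # Hypothetical domain rule: fraction of required metadata fields present.
    required = ("title", "authors", "license")
    present = sum(1 for key in required if record.get(key))
    return {"metadata_completeness": present / len(required)}


def run_validators(record: dict) -> Dict[str, Dict[str, float]]:
    """Generic machinery: run every registered validator on one record."""
    return {name: validator(record) for name, validator in _REGISTRY.items()}


report = run_validators({"title": "Example", "authors": ["A. Author"]})
```

The generic runner knows nothing about what `metadata_completeness` means; that knowledge lives entirely in the registered function.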
Attribute every quality claim
Every assertion carries provenance: who computed it, when, and with what software. Users can verify a claim rather than take it on trust.
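A provenance-carrying assertion might be modeled as an immutable record like the one below. The field names and example values are assumptions made for illustration; the actual schema is left to the technical OEPs.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class QualityClaim:
    """Hypothetical shape of one attributed quality assertion."""
    record_id: str          # the record the claim is about
    attribute: str          # domain-defined attribute name
    value: float            # the computed value
    computed_by: str        # who computed it
    computed_at: datetime   # when it was computed
    software: str           # what software produced it
    software_version: str   # which version of that software


claim = QualityClaim(
    record_id="example-record-1",
    attribute="resolution_angstrom",
    value=1.8,
    computed_by="validator-node.example.org",
    computed_at=datetime(2025, 12, 12, tzinfo=timezone.utc),
    software="example-validator",
    software_version="0.1.0",
)
```

Recording the software name and version lets a user rerun the same computation and compare the result with the stored value.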
Publish immutable, citable records
Published data gets a stable identifier. Updates create new versions rather than editing published records, so existing citations remain valid.
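The append-only behavior can be sketched with a toy in-memory archive in which publishing never overwrites an earlier version. The `record@vN` identifier format and the `VersionedArchive` class are hypothetical, chosen only to make the idea concrete.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class VersionedArchive:
    """Toy append-only store: each publish adds a version, none are edited."""
    _versions: Dict[str, List[bytes]] = field(default_factory=dict)

    def publish(self, record_id: str, payload: bytes) -> str:
        versions = self._versions.setdefault(record_id, [])
        versions.append(payload)
        # Hypothetical identifier format: record id plus a 1-based version number.
        return f"{record_id}@v{len(versions)}"

    def resolve(self, identifier: str) -> bytes:
        record_id, _, version = identifier.rpartition("@v")
        return self._versions[record_id][int(version) - 1]


archive = VersionedArchive()
first = archive.publish("rec-1", b"original data")
second = archive.publish("rec-1", b"corrected data")
```

After the second publish, `first` still resolves to the original bytes, so a citation that used it keeps pointing at exactly what was cited.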
Non-Goals
OSA is not a storage provider
The protocol defines how archives behave, not where bytes live. Storage is the operator's choice (local disk, S3, institutional storage).
OSA is not a compute platform
Validators run during submission, but OSA is not a general-purpose compute system. It does not manage jobs, queues, or cluster resources.
OSA is not a single database
There is no central OSA database. The protocol enables federation between independent nodes.
OSA does not define domain semantics
The protocol does not say what "quality" means for any particular field. Communities define their own vocabularies and validators.
OSA does not enforce quality thresholds
The protocol computes and exposes quality attributes. It does not decide what is "good enough". That judgment belongs to users and communities.
OSA does not replace existing archives
OSA is designed to complement existing infrastructure. Index Nodes can compute attributes about data in external archives (GEO, SRA, PDB) without requiring those archives to change.
Rationale
Separating goals from architecture makes it easier to evaluate whether the architecture actually serves the goals. It also helps readers who want to understand the purpose without reading technical details.
The explicit non-goals prevent scope creep and set expectations. OSA is infrastructure for a specific set of problems, not a universal solution.
Backwards Compatibility
N/A. This is an informational document.
Security & Privacy
This document defines goals and scope. Security and privacy considerations are addressed in the relevant technical OEPs.
Open Issues
- Should OSA define a minimal "core" vocabulary that all nodes understand, or is everything domain-specific?
- How do we balance ease of deployment with operational security requirements?