Architecture Overview

Type: technical
Labels: protocol, architecture
Created: December 7, 2025
Author: Rory Byrne <rory@rory.bio>


Abstract

This document provides a high-level overview of the Open Science Archive (OSA) architecture. It introduces the two node types (Archive Nodes and Index Nodes), explains how data flows through the system, and describes how nodes federate. For goals and scope, see OEP-0003 first.

Motivation

Scientific Context

Researchers routinely download large datasets from public archives only to discover quality issues—contamination, batch effects, missing annotations, poor sequencing quality. It is often difficult to know, before downloading, whether a dataset meets their needs.

Meanwhile, the rise of AI/ML in science has created urgent demand for high-quality training data. Models are only as good as their inputs, but finding validated, well-structured scientific data remains a manual, time-consuming process.

Technical Context

Key gaps in existing infrastructure:

  1. Infrastructure is a barrier to AI-ready data. Training AI models on scientific data requires rigorous standardisation and validation, but building the infrastructure to achieve this takes months of engineering effort. Each subfield rebuilds these systems from scratch, duplicating work and raising the barrier to entry for fields without strong informatics traditions.

  2. Quality metrics are siloed. When someone computes a quality metric for a dataset (mapping rate, contamination level, etc.), that information stays local. There's no standard way to share or discover quality assessments across the field. Researchers repeat the same QC work, or worse, skip it entirely.

  3. Fragmented APIs prevent tooling. Every archive has its own bespoke API, authentication model, and data format. This fragmentation makes it impossible to build rich, universal tools. A dashboard that works with one archive must be rewritten for another.

  4. Centralisation creates fragility. Systems that rely on a single authority become bottlenecks and single points of failure. But fully decentralised systems struggle with discovery and coordination.

OSA addresses these gaps by defining two complementary node types and a federation protocol that connects them.

Specification

Design Principles

These technical principles guide the architecture:

Domain agnosticism. The protocol doesn't prescribe schemas, semantics, or validation logic. It provides mechanisms for communities to define their own. Validators, vocabularies, and conventions are all pluggable.

Progressive schema evolution. Communities can evolve their schemas over time without breaking existing data. Start simple, add structure as understanding matures. This also enables gradual semantic alignment between fields.

Ease of deployment. Professional-grade infrastructure should be deployable in hours, not months. This lowers the barrier for fields without strong informatics traditions.

Transparency over trust. Provenance for all claims. Users see who computed what, when, with which validator version. Verify rather than trust.

Immutability. Published records don't change. New versions, not edits. Citable, stable references.

Federation without permission. Anyone can run a node. No gatekeepers, no central registry required.

Node Types

Archive Node

An Archive Node holds primary data. It is the authoritative source for records it publishes.

Responsibilities:

  • Accept depositions (draft submissions)
  • Run validators to compute quality attributes
  • Support human curation (approve/reject)
  • Publish immutable, versioned records
  • Serve records via API

Flow: Deposition → Validation → Curation → Record

Each published record has an SRN (Structured Resource Name) that identifies this node as its origin. See OEP-0005 for details.
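
The sketch below illustrates this lifecycle in Python. The state names and record fields are assumptions made for illustration only; the normative behaviour of Archive Nodes is specified in OEP-0009, and the SRN format in OEP-0005.

    from dataclasses import dataclass, field
    from enum import Enum

    class DepositionState(Enum):
        """Stages a submission passes through on an Archive Node (illustrative names)."""
        DRAFT = "draft"          # deposition accepted, not yet validated
        VALIDATED = "validated"  # validators have computed quality attributes
        APPROVED = "approved"    # curator approved (a curator may also reject)
        PUBLISHED = "published"  # immutable, versioned record with an SRN

    @dataclass(frozen=True)  # frozen: published records never change in place
    class Record:
        srn: str          # origin-identifying identifier (format defined in OEP-0005)
        version: int      # corrections produce a new version, never an edit
        attributes: dict = field(default_factory=dict)  # validator outputs with provenance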

Index Node

An Index Node computes derived attributes about data held elsewhere. It does not store raw data.

Responsibilities:

  • Point at data sources (Archive Nodes, external archives like GEO)
  • Run validators to compute vocabulary attributes
  • Store attributed values with provenance
  • Serve a queryable, federated API
  • Participate in gossip with other Index Nodes

Index Nodes enable questions like "which GEO datasets have >70% mapping rate?" without downloading terabytes of data. See OEP-0010 for details.
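
As a rough illustration, such a query against an Index Node's API might look like the snippet below. The endpoint path, parameter names, and the mapping_rate attribute identifier are assumptions for illustration; the actual query API is specified in OEP-0010.

    import requests  # third-party HTTP client

    INDEX_NODE = "https://index.example.org/api/v1"  # hypothetical Index Node endpoint

    # Ask the Index Node for records whose stored mapping_rate attribute exceeds 0.7.
    # Only attribute values and provenance are returned; the raw data stays in GEO
    # (or whichever archive holds it).
    response = requests.get(
        f"{INDEX_NODE}/records",
        params={"source": "geo", "filter": "mapping_rate>0.7"},
        timeout=30,
    )
    response.raise_for_status()

    for record in response.json()["records"]:
        print(record["srn"], record["attributes"]["mapping_rate"])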

Data Flow

Depositor                         Researcher
    │                                 ▲
    ▼                                 │
┌────────────────────────────────────────────────┐
│               Archive Node                     │
│  Deposition → Validation → Curation → Record   │
└────────────────────────────────────────────────┘
                    │
                    │ (records)
                    ▼
┌────────────────────────────────────────────────┐
│               Index Node(s)                    │
│  Run validators → Store attributes → Query     │
└────────────────────────────────────────────────┘
                    │
                    │ (federation)
                    ▼
             ┌──────────────┐
             │ Other Index  │
             │   Nodes      │
             └──────────────┘

Federation

Index Nodes form a federated network:

Discovery: DNS-based resolution. A node's domain maps to /.well-known/osa-node.json, which declares its API endpoint and capabilities. See OEP-0006.
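
For illustration, a client might resolve a node roughly as follows. The field names (api, capabilities) are placeholders; OEP-0006 is the normative reference for the document's schema.

    import requests

    def resolve_node(domain: str) -> dict:
        """Fetch a node's self-description from its well-known location."""
        url = f"https://{domain}/.well-known/osa-node.json"
        response = requests.get(url, timeout=10)
        response.raise_for_status()
        return response.json()

    # Hypothetical usage; the keys below are illustrative, not normative.
    node = resolve_node("index.example.org")
    api_endpoint = node.get("api")           # e.g. "https://index.example.org/api/v1"
    capabilities = node.get("capabilities")  # e.g. ["index"]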

Gossip: Nodes exchange:

  • Computed attribute values (enabling specialisation—one node runs Salmon, another runs FastQC)
  • Vocabulary catalogs (for discovery and UI suggestions)
  • Peer lists (for network growth)

No central registry: Nodes bootstrap via manual peer configuration or community seed lists. Gossip propagates discovery over time.

Trust model: Every attributed value carries provenance—which node computed it, using which validator, when. Users and nodes can choose their trust policy.
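
A minimal sketch of what a single gossiped attribute assertion might carry, combining the exchanged values and the provenance fields described above, is shown below. Every field name and value here is an assumption for illustration; OEP-0010 defines the actual exchange format.

    # Hypothetical shape of one gossiped attribute value with its provenance.
    attribute_assertion = {
        "subject": "srn:...",                    # resource the claim is about (SRN format: OEP-0005)
        "attribute": "mapping_rate",             # vocabulary term (OEP-0007)
        "value": 0.83,
        "provenance": {
            "computed_by": "index.example.org",  # which node ran the validator
            "validator": "salmon-qc",            # which validator image (OEP-0008)
            "validator_version": "1.4.2",
            "computed_at": "2025-12-07T12:00:00Z",
        },
    }

    # A receiving node applies its own trust policy, e.g. only accepting
    # assertions from peers it has explicitly allow-listed.
    trusted_peers = {"index.example.org"}
    accept = attribute_assertion["provenance"]["computed_by"] in trusted_peers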

Key Concepts

Below are the key architectural concepts in the protocol, together with the planned OEP in which each will be specified.

Concept                Description                                   Planned OEP
SRN                    Global identifier for resources               OEP-0005
Node Identity          How nodes identify and discover each other    OEP-0006
Vocabulary             Named, typed attributes with semantics        OEP-0007
Validator              OCI container that computes attributes        OEP-0008
Archive Node           Holds primary data                            OEP-0009
Index Node             Computes derived attributes, federates        OEP-0010
Source Adapter         Connects Index Nodes to external archives     OEP-0011
Traits & Conventions   Saved queries, submission profiles            OEP-0012

What OSA is NOT

  • Not a single database — it's a protocol for federated nodes
  • Not domain-specific — validators and vocabularies are pluggable
  • Not a data warehouse — Index Nodes don't store raw data
  • Not centralised — no master node, no registry, no single point of failure

Rationale

Two node types. Separating "holds data" (Archive) from "computes quality" (Index) enables:

  • Specialisation: institutions run archives, quality-focused groups run indexes
  • Efficiency: Index Nodes can validate external data (GEO, SRA) without duplicating storage
  • Flexibility: a node can be both, or just one

DNS-based federation. Alternatives considered:

  • Central registry: requires OSA-run infrastructure, single point of failure
  • DHT/gossip-only: complex, eventual consistency issues, hard to bootstrap
  • Query-time peer hints: requires users to know peer URLs

DNS leverages existing infrastructure, is decentralised, and follows familiar patterns (WebFinger, ActivityPub, Bluesky).

Gossip for discovery. After DNS bootstrap, gossip allows the network to grow organically. Nodes learn about peers, attributes, and vocabularies through interaction rather than configuration.

Backwards Compatibility

N/A — this is an informational document describing the architecture.

Security & Privacy

This document describes architecture at a high level. Security considerations for specific components are addressed in their respective OEPs:

  • Node identity and authentication: OEP-0006
  • Validator sandboxing: OEP-0008
  • Trust model for federated data: OEP-0010

Open Issues

  • How do we handle the transition from single-node deployments to federated networks?
  • What's the minimum viable federation (how many nodes before gossip becomes useful)?