Architecture Overview
- Type: technical
- Labels: protocol, architecture
- Created: December 7, 2025
Abstract
This document provides a high-level overview of the Open Science Archive (OSA) architecture. It introduces the two node types (Archive Nodes and Index Nodes), explains how data flows through the system, and describes how nodes federate. For goals and scope, see OEP-0003 first.
Motivation
Scientific Context
Researchers routinely download large datasets from public archives only to discover quality issues such as contamination, batch effects, missing annotations, or poor sequencing quality. Before downloading, it is often difficult to tell whether a dataset meets a researcher's needs.
Meanwhile, the rise of AI/ML in science has created urgent demand for high-quality training data. Models are only as good as their inputs, but finding validated, well-structured scientific data remains a manual, time-consuming process.
Technical Context
Key gaps in existing infrastructure:
Infrastructure is a barrier to AI-ready data. Training AI models on scientific data requires rigorous standardisation and validation, but building the infrastructure to achieve this takes months of engineering effort. Each subfield rebuilds these systems from scratch, duplicating work and raising the barrier to entry for fields with less informatics capacity.
Quality metrics are siloed. When someone computes a quality metric for a dataset (mapping rate, contamination level, etc.), that information stays local. There's no standard way to share or discover quality assessments across the field. Researchers repeat the same QC work, or worse, skip it entirely.
Fragmented APIs prevent tooling. Every archive has its own bespoke API, authentication model, and data format. This fragmentation makes it impractical to build rich, universal tools: a dashboard that works with one archive must be rewritten for another.
Centralisation creates fragility. Systems that rely on a single authority become bottlenecks and single points of failure. But fully decentralised systems struggle with discovery and coordination.
OSA addresses these gaps by defining two complementary node types and a federation protocol that connects them.
Specification
Design Principles
These technical principles guide the architecture:
Domain agnosticism. The protocol doesn't prescribe schemas, semantics, or validation logic. It provides mechanisms for communities to define their own. Validators, vocabularies, and conventions are all pluggable.
Progressive schema evolution. Communities can evolve their schemas over time without breaking existing data. Start simple, add structure as understanding matures. This also enables gradual semantic alignment between fields.
Ease of deployment. Professional-grade infrastructure should be deployable in hours, not months. This lowers the barrier for fields without strong informatics traditions.
Transparency over trust. Provenance for all claims. Users see who computed what, when, with which validator version. Verify rather than trust.
Immutability. Published records don't change. New versions, not edits. Citable, stable references.
Federation without permission. Anyone can run a node. No gatekeepers, no central registry required.
Node Types
Archive Node
An Archive Node holds primary data. It is the authoritative source for records it publishes.
Responsibilities:
- Accept depositions (draft submissions)
- Run validators to compute quality attributes
- Support human curation (approve/reject)
- Publish immutable, versioned records
- Serve records via API
Flow: Deposition → Validation → Curation → Record
Each published record has an SRN (Structured Resource Name) that identifies this node as its origin. See OEP-0005 for details.
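The flow above can be read as a small state machine. The following is a minimal sketch under stated assumptions, not the protocol's definition: the `Stage` names, the `Submission` class, and the treatment of rejection as a terminal state are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum


class Stage(Enum):
    DEPOSITION = "deposition"
    VALIDATION = "validation"
    CURATION = "curation"
    RECORD = "record"
    REJECTED = "rejected"  # assumption: a curator's rejection ends the flow


# Allowed transitions. RECORD and REJECTED are terminal, which mirrors the
# idea that a published record is never edited in place.
TRANSITIONS = {
    Stage.DEPOSITION: {Stage.VALIDATION},
    Stage.VALIDATION: {Stage.CURATION},
    Stage.CURATION: {Stage.RECORD, Stage.REJECTED},
    Stage.RECORD: set(),
    Stage.REJECTED: set(),
}


@dataclass
class Submission:
    title: str
    stage: Stage = Stage.DEPOSITION
    history: list = field(default_factory=list)

    def advance(self, target: Stage) -> None:
        """Move to the next stage, refusing transitions the flow does not allow."""
        if target not in TRANSITIONS[self.stage]:
            raise ValueError(f"cannot move from {self.stage.value} to {target.value}")
        self.history.append(self.stage)
        self.stage = target


sub = Submission("example dataset")
for step in (Stage.VALIDATION, Stage.CURATION, Stage.RECORD):
    sub.advance(step)
print(sub.stage.value)  # record
```

Because `RECORD` has no outgoing transitions in this sketch, any later change would have to be modelled as a new submission that becomes a new version.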
Index Node
An Index Node computes derived attributes about data held elsewhere. It does not store raw data.
Responsibilities:
- Point at data sources (Archive Nodes, external archives like GEO)
- Run validators to compute vocabulary attributes
- Store attributed values with provenance
- Serve a queryable, federated API
- Participate in gossip with other Index Nodes
Index Nodes enable questions like "which GEO datasets have >70% mapping rate?" without downloading terabytes of data. See OEP-0010 for details.
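To make the idea of querying attributes instead of raw data concrete, here is a minimal sketch. Everything in it is an assumption made for illustration: the `AttributedValue` fields, the `mapping_rate` attribute name, the placeholder dataset identifiers, and the in-memory list standing in for an Index Node's store. The numeric values are invented sample data.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class AttributedValue:
    dataset: str           # identifier of data held elsewhere (placeholder IDs)
    attribute: str         # vocabulary attribute name (hypothetical)
    value: float
    node: str              # provenance: which node computed the value
    validator: str         # provenance: which validator (and version) was used
    computed_at: datetime  # provenance: when it was computed


def query(store, attribute, threshold):
    """Return the datasets whose value for `attribute` exceeds `threshold`."""
    return sorted(
        {av.dataset for av in store if av.attribute == attribute and av.value > threshold}
    )


now = datetime.now(timezone.utc)
store = [
    AttributedValue("example:dataset-a", "mapping_rate", 0.91, "index.example.org", "mapper:1", now),
    AttributedValue("example:dataset-b", "mapping_rate", 0.42, "index.example.org", "mapper:1", now),
    AttributedValue("example:dataset-c", "mapping_rate", 0.78, "index.example.org", "mapper:1", now),
]

# Only the small attribute records are scanned; no raw data is touched.
print(query(store, "mapping_rate", 0.70))
```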
Data Flow
   Depositor                          Researcher
       │                                  ▲
       ▼                                  │
┌────────────────────────────────────────────────┐
│                  Archive Node                  │
│  Deposition → Validation → Curation → Record   │
└────────────────────────────────────────────────┘
                        │
                        │ (records)
                        ▼
┌────────────────────────────────────────────────┐
│                 Index Node(s)                  │
│   Run validators → Store attributes → Query    │
└────────────────────────────────────────────────┘
                        │
                        │ (federation)
                        ▼
                 ┌──────────────┐
                 │ Other Index  │
                 │    Nodes     │
                 └──────────────┘
Federation
Index Nodes form a federated network:
Discovery: DNS-based resolution. Each node is identified by its domain, which serves a document at /.well-known/osa-node.json declaring the node's API endpoint and capabilities. See OEP-0006.
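As a rough illustration of the discovery step, the sketch below derives the well-known URL from a node's domain and checks a node document for the two things this overview says it declares. The use of HTTPS, the field names `api_endpoint` and `capabilities`, and the sample document are assumptions for illustration. The actual schema is left to OEP-0006.

```python
import json
from urllib.parse import urlunsplit

WELL_KNOWN_PATH = "/.well-known/osa-node.json"


def discovery_url(domain: str) -> str:
    """Build the discovery URL for a node domain (HTTPS is assumed here)."""
    return urlunsplit(("https", domain, WELL_KNOWN_PATH, "", ""))


def parse_node_document(text: str) -> dict:
    """Parse a node document and check the fields this sketch assumes."""
    doc = json.loads(text)
    missing = {"api_endpoint", "capabilities"} - set(doc)
    if missing:
        raise ValueError(f"node document is missing fields: {sorted(missing)}")
    return doc


# Hypothetical document, as it might be served by index.example.org.
sample = """
{
  "api_endpoint": "https://index.example.org/api",
  "capabilities": ["index", "gossip"]
}
"""

print(discovery_url("index.example.org"))
node = parse_node_document(sample)
print(node["api_endpoint"], node["capabilities"])
```

A real client would fetch the URL over the network and handle timeouts and redirects. That part is omitted so the sketch stays self-contained.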
Gossip: Nodes exchange:
- Computed attribute values (enabling specialisation—one node runs Salmon, another runs FastQC)
- Vocabulary catalogs (for discovery and UI suggestions)
- Peer lists (for network growth)
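One way to picture a gossip exchange is as a merge of three sets of knowledge. The sketch below is an assumption-laden illustration, not the wire protocol: the dictionary layout, the key shape `(dataset, attribute, node, validator)`, and the attribute and validator labels are all hypothetical.

```python
def merge_gossip(local: dict, incoming: dict) -> dict:
    """Return a new state combining local knowledge with a peer's message."""
    merged_values = dict(local["values"])
    for key, value in incoming["values"].items():
        # Keys include the computing node and validator, so values from
        # different sources sit side by side instead of overwriting each other.
        merged_values.setdefault(key, value)
    return {
        "values": merged_values,
        "vocabularies": set(local["vocabularies"]) | set(incoming["vocabularies"]),
        "peers": set(local["peers"]) | set(incoming["peers"]),
    }


# Node A specialises in one validator, node B in another (labels are illustrative).
node_a = {
    "values": {("example:dataset-a", "mapping_rate", "a.example.org", "salmon"): 0.91},
    "vocabularies": {"mapping_rate"},
    "peers": {"b.example.org"},
}
message_from_b = {
    "values": {("example:dataset-a", "per_base_quality", "b.example.org", "fastqc"): 34.0},
    "vocabularies": {"per_base_quality"},
    "peers": {"c.example.org"},
}

state = merge_gossip(node_a, message_from_b)
print(len(state["values"]), sorted(state["peers"]))  # 2 ['b.example.org', 'c.example.org']
```

After the merge, node A can answer queries about an attribute it never computed and has learned of a peer it was never configured with.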
No central registry: Nodes bootstrap via manual peer configuration or community seed lists. Gossip propagates discovery over time.
Trust model: Every attributed value carries provenance—which node computed it, using which validator, when. Users and nodes can choose their trust policy.
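A trust policy can be as simple as a predicate over provenance. The following sketch assumes a `Provenance` shape with the three items named above (node, validator, time). The `make_policy` helper and its allow-list semantics are hypothetical, and the sample values are invented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Provenance:
    node: str          # which node computed the value
    validator: str     # which validator was used
    computed_at: str   # when, as an ISO 8601 timestamp


def make_policy(trusted_nodes=None, trusted_validators=None):
    """Build a predicate that accepts a value based only on its provenance.

    Passing None for either argument means "no restriction on this field".
    """
    def accept(p: Provenance) -> bool:
        node_ok = trusted_nodes is None or p.node in trusted_nodes
        validator_ok = trusted_validators is None or p.validator in trusted_validators
        return node_ok and validator_ok
    return accept


values = [
    (0.91, Provenance("a.example.org", "salmon", "2025-12-07T00:00:00Z")),
    (0.55, Provenance("unknown.example.net", "salmon", "2025-12-07T00:00:00Z")),
]

policy = make_policy(trusted_nodes={"a.example.org"})
accepted = [value for value, prov in values if policy(prov)]
print(accepted)  # [0.91]
```

Keeping the policy on the consumer side means the same gossiped data can be viewed strictly by one user and permissively by another.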
Key Concepts
Below are the key architectural concepts in the protocol, along with the planned OEP in which each will be described.
| Concept | Description | Planned OEP |
|---|---|---|
| SRN | Global identifier for resources | OEP-0005 |
| Node Identity | How nodes identify and discover each other | OEP-0006 |
| Vocabulary | Named, typed attributes with semantics | OEP-0007 |
| Validator | OCI container that computes attributes | OEP-0008 |
| Archive Node | Holds primary data | OEP-0009 |
| Index Node | Computes derived attributes, federates | OEP-0010 |
| Source Adapter | Connects Index Nodes to external archives | OEP-0011 |
| Traits & Conventions | Saved queries, submission profiles | OEP-0012 |
What OSA is NOT
- Not a single database — it's a protocol for federated nodes
- Not domain-specific — validators and vocabularies are pluggable
- Not a data warehouse — Index Nodes don't store raw data
- Not centralised — no master node, no central registry, no single point of failure
Rationale
Two node types. Separating "holds data" (Archive) from "computes quality" (Index) enables:
- Specialisation: institutions run archives, quality-focused groups run indexes
- Efficiency: Index Nodes can validate external data (GEO, SRA) without duplicating storage
- Flexibility: a node can be both, or just one
DNS-based federation. Alternatives considered:
- Central registry: requires OSA-run infrastructure, single point of failure
- DHT/gossip-only: complex, eventual consistency issues, hard to bootstrap
- Query-time peer hints: requires users to know peer URLs
DNS leverages existing infrastructure, is decentralised, and follows familiar patterns (WebFinger, ActivityPub, Bluesky).
Gossip for discovery. After DNS bootstrap, gossip allows the network to grow organically. Nodes learn about peers, attributes, and vocabularies through interaction rather than configuration.
Backwards Compatibility
N/A — this is an informational document describing the architecture.
Security & Privacy
This document describes architecture at a high level. Security considerations for specific components are addressed in their respective OEPs:
- Node identity and authentication: OEP-0006
- Validator sandboxing: OEP-0008
- Trust model for federated data: OEP-0010
Open Issues
- How do we handle the transition from single-node deployments to federated networks?
- What's the minimum viable federation (how many nodes before gossip becomes useful)?