Resource Identifiers

Type: technical
Labels: protocol, identifiers
Created: December 7, 2025
Author: Rory Byrne <rory@rory.bio>

This proposal is open for feedback.

Abstract

This OEP explores identifier schemes for resources in the Open Science Archive protocol. Rather than prescribing a specific format, it establishes the requirements that any identifier scheme must satisfy, surveys existing approaches, and proposes one candidate (Structured Resource Names) for community feedback.

Motivation

Every resource in OSA—Records, Depositions, Vocabularies, Schemas, Validators—needs an identifier. The choice of identifier scheme has far-reaching consequences for the protocol's usability, longevity, and interoperability.

Scientific data archives present unique challenges:

  • Longevity: Identifiers may be cited in papers for decades
  • Federation: Multiple independent nodes must avoid collisions
  • Machine use: Software needs to parse, route, and validate identifiers
  • Human use: Developers and researchers need to debug and discuss identifiers

Getting this wrong is costly. Changing identifier schemes after deployment breaks existing references.

Requirements

Any identifier scheme for OSA should satisfy the following properties:

Must Have

Globally unique: Two resources must never share an identifier, even across independent nodes operated by different organizations.

Resolvable: Given an identifier, there must be a defined mechanism to retrieve the resource or its metadata.

Stable: Once assigned, an identifier must continue to refer to the same resource. Identifiers should not be reassigned or recycled.

Should Have

Human-readable: Developers should be able to understand what an identifier refers to without dereferencing it. At minimum, identifiers should be pronounceable and not excessively long.

Type-aware: The identifier should indicate what kind of resource it refers to (Record, Vocabulary, etc.), enabling validation and routing without network calls.

Version-aware: For versioned resources, the identifier should support pinning to a specific version.

Decentralized minting: Nodes should be able to create identifiers without coordinating with a central authority.

Nice to Have

Persistent across migrations: If an organization changes its domain or infrastructure, existing identifiers should remain valid.

Content-addressable: Identifiers could be derived from content hashes, enabling integrity verification and deduplication.

Compatible with existing standards: Alignment with URN, DID, DOI, or other established schemes reduces implementation burden and improves interoperability.

Existing Approaches

URLs

https://archive.example.org/records/abc123

Pros: Universal, familiar, directly resolvable, existing tooling.

Cons: Conflates identity with location. When domains change, URLs break. No built-in versioning or typing.

Used by: Most web APIs, many data repositories.

DOIs (Digital Object Identifiers)

doi:10.1234/abc.5678

Pros: Designed for persistence, widely adopted in academia, resolver infrastructure exists (doi.org), citable in papers.

Cons: Opaque (no type or origin information); requires registration through a DOI registration agency (cost, bureaucracy); resolution depends on the centralized Handle System.

Used by: Academic publishing, Zenodo, Figshare, DataCite.

URNs (Uniform Resource Names)

urn:isbn:978-3-16-148410-0
urn:ietf:rfc:3986

Pros: IETF standard (RFC 8141), separates naming from resolution, extensible namespace system.

Cons: No universal resolution mechanism (each namespace defines its own), requires IANA registration for formal namespaces.

Used by: ISBN, IETF RFCs, various domain-specific schemes.
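As a rough illustration of the namespace structure, the sketch below splits a URN into its namespace identifier (NID) and namespace-specific string (NSS). The function name `split_urn` and the `Urn` type are hypothetical, and the check is deliberately loose: it does not enforce the full RFC 8141 grammar.

```python
from typing import NamedTuple


class Urn(NamedTuple):
    nid: str  # namespace identifier, e.g. "isbn"
    nss: str  # namespace-specific string, interpreted by that namespace


def split_urn(urn: str) -> Urn:
    """Split 'urn:<nid>:<nss>' into its two parts (loose check only)."""
    scheme, sep1, rest = urn.partition(":")
    nid, sep2, nss = rest.partition(":")
    if scheme.lower() != "urn" or not sep1 or not sep2 or not nid or not nss:
        raise ValueError(f"not a URN: {urn!r}")
    # The NID is case-insensitive, so normalise it; the NSS is left untouched.
    return Urn(nid.lower(), nss)


print(split_urn("urn:isbn:978-3-16-148410-0"))
print(split_urn("urn:ietf:rfc:3986"))
```

Everything after the second colon belongs to the namespace, which is why each namespace has to define its own resolution rules.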

DIDs (Decentralized Identifiers)

did:web:example.org
did:plc:abc123xyz

Pros: W3C standard, designed for decentralization, supports cryptographic verification, multiple "methods" for different tradeoffs.

Cons: Designed for entities (people, organizations) not resources, verbose, emerging ecosystem.

Used by: AT Protocol (Bluesky), identity wallets, Verifiable Credentials.

ARNs (Amazon Resource Names)

arn:aws:ec2:us-east-1:123456789012:instance/i-1234567890abcdef0

Pros: Proven at scale, encodes a partition/service/region/account/resource hierarchy, enables policy-based access control.

Cons: AWS-specific, complex syntax, assumes single operator (Amazon).

Used by: All AWS services.

UUIDs

550e8400-e29b-41d4-a716-446655440000

Pros: Trivial to generate, collisions are vanishingly unlikely (v4 is random, so uniqueness is probabilistic rather than guaranteed), no coordination required.

Cons: Opaque, no context about resource type or origin, not human-friendly, not directly resolvable.

Used by: Databases, internal systems, anywhere uniqueness matters more than readability.
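A minimal sketch of coordination-free minting with Python's standard `uuid` module follows. The helper name `mint_local_id` is an assumption for illustration only.

```python
import uuid


def mint_local_id() -> str:
    """Return a random (version 4) UUID in its canonical 36-character form."""
    return str(uuid.uuid4())


# Two nodes can call this independently without talking to each other;
# a collision is possible in principle but extremely unlikely.
print(mint_local_id())
```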

Content Identifiers (CIDs)

bafybeigdyrzt5sfp7udm7hu76uh7y26nf3efuylqabf3oca...

Pros: Derived from content hash, self-verifying, enables deduplication, immutable by design.

Cons: Long, not human-readable, requires the content before an identifier can be generated, and any content change yields a new identifier.

Used by: IPFS, Filecoin, content-addressed storage systems.
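The self-verifying property can be sketched with a plain SHA-256 digest. This is not the real CID encoding (which wraps the hash in multihash and multibase layers); the `sha256-` prefix and the function names are assumptions made for illustration.

```python
import hashlib


def content_id(data: bytes) -> str:
    """Derive an identifier from the content itself (hypothetical format)."""
    return "sha256-" + hashlib.sha256(data).hexdigest()


def verify(identifier: str, data: bytes) -> bool:
    """Check that the bytes received match the identifier that was requested."""
    return content_id(data) == identifier


original = b"sample,reads\nA,100\n"
cid = content_id(original)
print(verify(cid, original))            # same bytes: identifier matches
print(verify(cid, original + b"B,7\n"))  # changed bytes: identifier no longer matches
```

The second call shows the tradeoff named above: any edit to the content produces a different identifier.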

Analysis

Scheme Unique Resolvable Stable Readable Typed Versioned Decentralized
URLs
DOIs
URNs
DIDs
ARNs
UUIDs
CIDs

✓ = yes, ✗ = no, ◐ = partial/depends

No existing scheme fully satisfies our requirements. This suggests either:

  1. Extending an existing scheme (likely URN or DID)
  2. Defining a new scheme purpose-built for OSA

Candidate: Structured Resource Names (SRNs)

One option is to define a URN-based scheme that embeds the properties we need. We propose this as a starting point for discussion, not as a final specification.

Format

urn:osa:{node-id}:{type}:{local-id}[@{version}][#{fragment}]

Components

urn:osa: — Fixed prefix indicating an OSA identifier.

{node-id} — The originating node. Options:

  • DNS hostname (e.g., data.imperial.ac.uk) — simple, enables direct resolution, but breaks if domain changes
  • DID (e.g., did:web:data.imperial.ac.uk) — more persistent, adds complexity
  • Opaque ID with registry lookup — most persistent, requires central infrastructure

{type} — Resource type: rec, dep, vocab, schema, val, tool.

{local-id} — Node-assigned identifier, opaque to clients.

@{version} — Optional version suffix for immutable snapshots.

#{fragment} — Optional fragment for sub-resources (e.g., vocabulary attributes).
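To make the components concrete, here is a sketch of how a node might mint an SRN locally. It assumes a DNS hostname as the node-id and a random UUID as the local-id; neither choice is fixed by this proposal, and `mint_srn` is a hypothetical helper.

```python
import uuid
from typing import Optional

# Resource types listed in this proposal.
RESOURCE_TYPES = {"rec", "dep", "vocab", "schema", "val", "tool"}


def mint_srn(node_id: str, resource_type: str, version: Optional[str] = None) -> str:
    """Build an SRN without contacting any central authority."""
    if resource_type not in RESOURCE_TYPES:
        raise ValueError(f"unknown resource type: {resource_type!r}")
    local_id = uuid.uuid4().hex  # assumption: local-ids are random hex strings
    srn = f"urn:osa:{node_id}:{resource_type}:{local_id}"
    if version is not None:
        srn += f"@{version}"
    return srn


print(mint_srn("data.imperial.ac.uk", "rec", "v1"))
```

Global uniqueness here rests on the node-id being unique to the node, so the local-id only needs to be unique within that node.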

Examples

urn:osa:data.imperial.ac.uk:rec:xyz789@v1
urn:osa:archive.embl.org:vocab:rnaseq@v2.1#mapped-reads-percent
urn:osa:did:web:data.imperial.ac.uk:dep:abc123
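The examples above can be parsed offline with a few string operations. The sketch below is one possible reading of the format, not a specification: it assumes that type and local-id never contain a colon, so a DID-style node-id (which does contain colons) can be recovered by splitting from the right. The names `Srn` and `parse_srn` are hypothetical.

```python
from typing import NamedTuple, Optional

PREFIX = "urn:osa:"
RESOURCE_TYPES = {"rec", "dep", "vocab", "schema", "val", "tool"}


class Srn(NamedTuple):
    node_id: str
    resource_type: str
    local_id: str
    version: Optional[str]
    fragment: Optional[str]


def parse_srn(srn: str) -> Srn:
    """Parse urn:osa:{node-id}:{type}:{local-id}[@{version}][#{fragment}]."""
    if not srn.startswith(PREFIX):
        raise ValueError(f"not an OSA identifier: {srn!r}")
    body = srn[len(PREFIX):]
    body, _, fragment = body.partition("#")  # optional fragment comes last
    body, _, version = body.partition("@")   # optional version precedes it
    # Split from the right so that colons inside the node-id are preserved.
    parts = body.rsplit(":", 2)
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"malformed SRN: {srn!r}")
    node_id, resource_type, local_id = parts
    if resource_type not in RESOURCE_TYPES:
        raise ValueError(f"unknown resource type: {resource_type!r}")
    return Srn(node_id, resource_type, local_id, version or None, fragment or None)


print(parse_srn("urn:osa:archive.embl.org:vocab:rnaseq@v2.1#mapped-reads-percent"))
print(parse_srn("urn:osa:did:web:data.imperial.ac.uk:dep:abc123"))
```

Because the type is available without a network call, a client could reject or route an identifier before attempting resolution. The right-to-left split is a workaround for the colon ambiguity that DID node-ids introduce; a final specification would need to settle that explicitly.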

Open Questions

Node identity: Should node-id be a DNS hostname, a DID, or something else? DNS is simple but fragile. DIDs add persistence at the cost of complexity.

DID integration: If nodes have DIDs (via did:web or similar), should the SRN embed the full DID or just the hostname with an implied DID?

Registration: Should urn:osa be registered with IANA? Registration adds legitimacy but also bureaucracy.

Versioning syntax: Is @v1 the right format? Alternatives: /v1, ?version=1, separate field.

Migration: How should identifiers survive domain changes? Options include redirect protocols, DID-based persistence, or accepting breakage as rare.

Alternative: DID-Native Approach

Rather than inventing SRNs, we could use DIDs directly:

did:osa:data.imperial.ac.uk:rec:xyz789

This would require defining a did:osa method specifying:

  • Identifier format
  • Resolution process
  • CRUD operations on DID Documents

Pros: Aligns with W3C standard, potential interop with Verifiable Credentials, existing DID tooling.

Cons: DIDs are designed for entities not resources, would be non-standard usage, more complex resolution.

Alternative: Minimal Approach

Use simple URLs with conventions:

https://data.imperial.ac.uk/osa/records/xyz789/v1

Rely on HTTP redirects for persistence. Accept that URLs may break.

Pros: Simplest to implement, no new concepts, universal tooling.

Cons: Fragile, no type information, conflates identity with location.
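For comparison, a client following the URL convention would have to recover the same information from the path. The sketch below assumes the layout shown in the example (`/osa/{collection}/{local-id}[/{version}]`); that layout and the name `parse_osa_url` are assumptions, since the convention is not specified here.

```python
from typing import Optional, TypedDict
from urllib.parse import urlsplit


class OsaUrl(TypedDict):
    node: Optional[str]
    collection: str
    local_id: str
    version: Optional[str]


def parse_osa_url(url: str) -> OsaUrl:
    """Extract node, collection, local id and version from a conventional URL."""
    parts = urlsplit(url)
    segments = [s for s in parts.path.split("/") if s]
    if parts.scheme != "https" or len(segments) not in (3, 4) or segments[0] != "osa":
        raise ValueError(f"URL does not follow the assumed convention: {url!r}")
    return {
        "node": parts.hostname,
        "collection": segments[1],
        "local_id": segments[2],
        "version": segments[3] if len(segments) == 4 else None,
    }


print(parse_osa_url("https://data.imperial.ac.uk/osa/records/xyz789/v1"))
```

Everything the parser learns depends on the path layout staying fixed, which is the fragility noted in the cons.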

Next Steps

This OEP seeks feedback on:

  1. Requirements: Are the requirements complete and correctly prioritized?
  2. Existing schemes: Are there schemes we should consider that aren't listed?
  3. SRN proposal: Is this a reasonable starting point, or should we pursue a different direction?
  4. Node identity: What should node-id be? DNS hostname, DID, or hybrid?
  5. Migration: How important is surviving domain changes? What tradeoffs are acceptable?

Based on community input, a follow-up OEP will specify the chosen scheme in detail.

References