Skip to content

Appendix A: Example Schema

Appendix A: Example Schema — Omics Research Deployment

Document status: Draft v0.1 Depends on: sec3_data_model.md


This is an example deployment configuration, not part of the Hippo system spec. The entity types, fields, and relationships below illustrate how a multi-modal omics research group might configure Hippo for tracking biological samples, sequencing data files, and analysis workflows. A deployment targeting a different domain (e.g., manufacturing QC, environmental monitoring, clinical trials) would author an entirely different schema using the same LinkML features documented in sec3. Hippo's core behavior is identical regardless of schema content.


A.1 Entity Types

The following six entity types form a complete omics metadata tracking schema. Fields listed are user-defined fields only — system fields (sec3 §3.2) and availability (sec3 §3.3) are present on all entities implicitly and are not repeated here.

Subject

A biological subject who contributed one or more samples.

Field Type Required Indexed Notes
external_id string yes yes ID from upstream subject registry
species enum yes no Values: Homo sapiens, Mus musculus, Rattus norvegicus
biological_sex enum no no Values: male, female, unknown
age_at_collection float no no Age in years at time of sample collection
diagnosis string no yes Primary clinical diagnosis

Sample

A piece of biological material derived from a Subject. May be derived from another Sample (e.g., dissection, aliquoting).

Field Type Required Indexed Notes
external_id string yes yes ID from upstream sample registry
tissue_type string yes yes Tissue of origin
tissue_region string no yes Sub-region within tissue (e.g., hippocampus)
collection_date date no no
passage int no no For cell lines

Datafile

A file at a known location containing data derived from one or more Samples.

Field Type Required Indexed Notes
uri uri yes yes S3 URI, local path, or HTTPS URL
file_type enum yes yes Values: fastq, bam, vcf, tsv, h5ad, idat, ...
modality enum yes yes Values: RNASeq, WGBS, WGS, ATAC, genotyping, ...
file_size_bytes int no no
checksum_md5 string no no Integrity verification
read_count int no no For sequencing files
genome_build string no yes e.g., GRCh38
is_primary bool no yes True if raw/unprocessed

Dataset

A named, versioned logical collection of Datafiles.

Field Type Required Indexed Notes
name string yes yes Human-readable name
version string yes no Semver or date string
description string no no
is_public bool no yes Placeholder for future auth layer

Workflow

A versioned analysis pipeline definition from which WorkflowRuns are instantiated.

Field Type Required Indexed Notes
name string yes yes Pipeline name
version string yes yes Semver or git SHA
description string no no
repository_uri uri no no Git repository of pipeline definition
language enum no no Values: nextflow, snakemake, wdl, cwl, ...

WorkflowRun

A single execution of a Workflow, consuming input Datafiles and producing output Datafiles. The instance_of relationship links a WorkflowRun to its Workflow definition. Execution state is tracked via the execution_state field.

Field Type Required Indexed Notes
execution_state enum yes yes Values: pending, running, succeeded, failed
executor string no no Nextflow run ID, AWS Batch job ID, etc.
started_at datetime no no
completed_at datetime no no
parameters json yes no Full parameter set used for this run

A.2 Relationship Declarations

relationships:
  - name: donated
    from: Subject
    to: Sample
    cardinality: one-to-many
    description: "A subject donated one or more samples"

  - name: derived_from
    from: Sample
    to: Sample
    cardinality: many-to-many
    description: "A sample derived from another (e.g. dissection, aliquot)"
    properties:
      method: {type: string}

  - name: generated
    from: Sample
    to: Datafile
    cardinality: one-to-many

  - name: input_to
    from: Datafile
    to: WorkflowRun
    cardinality: many-to-many

  - name: output_of
    from: Datafile
    to: WorkflowRun
    cardinality: many-to-many

  - name: instance_of
    from: WorkflowRun
    to: Workflow
    cardinality: many-to-one
    required: true

  - name: contains
    from: Dataset
    to: Datafile
    cardinality: many-to-many

A.3 Entity Relationship Graph

                    ┌─────────────┐
                    │   Subject   │
                    └──────┬──────┘
                           │ donated (1:many)
                    ┌──────▼──────┐
               ┌────│   Sample    │◄──────────────┐
               │    └──────┬──────┘  derived_from  │
               │           │         (many:many,   │
               │           │          self-ref)    │
               │           │ generated (1:many)
               │    ┌──────▼──────┐
               │    │  Datafile   │
               │    └──┬──────┬───┘
               │       │      │
               │       │      │ output_of (many:many)
               │       │      │
               │       │  ┌───▼──────────────────────────┐
               │       │  │        WorkflowRun            │
               │       │  │  (parameters, executor)       │
               │       │  └───▲──────────┬────────────────┘
               │       │      │          │ instance_of (many:1)
               │       │      │   ┌──────▼──────┐
               │       │      │   │  Workflow   │
               │       │      │   │  (name,     │
               │       │      │   │   version)  │
               │       │      │   └─────────────┘
               │       │      │
               │       └──────┘
               │       input_to (many:many)
               │    ┌─────────────┐
               └───►│   Dataset   │
                    │  (contains  │
                    │  Datafiles) │
                    └─────────────┘

System relationships (built-in, available on all entity types):
  [any entity] ──superseded_by──► [any entity]

A.4 Complete Schema YAML

The full schema.yaml for this example deployment. This is what a deployer would author and commit to their config repository.

version: "1.0"
format: linkml

# This example does not reference any external ontology packages.
# A deployment that uses FMA anatomy terms or Ensembl gene IDs would declare them here:
# requires:
#   - hippo-reference-fma>=3.3
#   - hippo-reference-ensembl>=GRCh38.109

entities:

  Subject:
    description: "A biological subject who contributed one or more samples."
    fields:
      external_id:
        type: string
        required: true
        indexed: true
      species:
        type: enum
        values: [Homo sapiens, Mus musculus, Rattus norvegicus]
        required: true
      biological_sex:
        type: enum
        values: [male, female, unknown]
      age_at_collection:
        type: float
        description: "Age in years at time of sample collection"
      diagnosis:
        type: string
        indexed: true
        search: fts    # enables full-text search on diagnosis field

  Sample:
    description: "A piece of biological material derived from a Subject."
    fields:
      external_id:
        type: string
        required: true
        indexed: true
      tissue_type:
        type: string
        required: true
        indexed: true
      tissue_region:
        type: string
        indexed: true
        search: fts    # enables full-text search on region descriptions
      collection_date:
        type: date
      passage:
        type: int

  # Example of polymorphic extension (base: = is-a inheritance).
  # BrainSample IS a Sample — it is queryable as Sample and validates against Sample rules.
  # A brain bank deployment would extend the generic omics schema this way.
  BrainSample:
    base: Sample
    description: "A sample from a brain tissue collection."
    fields:
      brain_region:
        type: string
        indexed: true
        search: fts
      hemisphere:
        type: enum
        values: [left, right, bilateral, unknown]
      post_mortem_interval_hours:
        type: float

  Datafile:
    description: "A file at a known location containing data derived from one or more Samples."
    fields:
      uri:
        type: uri
        required: true
        indexed: true
      file_type:
        type: enum
        values: [fastq, bam, vcf, tsv, h5ad, idat, csv, other]
        required: true
        indexed: true
      modality:
        type: enum
        values: [RNASeq, WGBS, WGS, ATAC, genotyping, proteomics, metabolomics, other]
        required: true
        indexed: true
      file_size_bytes:
        type: int
      checksum_md5:
        type: string
      read_count:
        type: int
        validators:
          - type: range
            min: 0
      genome_build:
        type: string
        indexed: true
      is_primary:
        type: bool
        indexed: true

  Dataset:
    description: "A named, versioned logical collection of Datafiles."
    fields:
      name:
        type: string
        required: true
        indexed: true
        search: fts
      version:
        type: string
        required: true
      description:
        type: string
        search: fts
      is_public:
        type: bool
        indexed: true

  Workflow:
    description: "A versioned analysis pipeline definition."
    fields:
      name:
        type: string
        required: true
        indexed: true
        search: fts
      version:
        type: string
        required: true
        indexed: true
      description:
        type: string
        search: fts
      repository_uri:
        type: uri
      language:
        type: enum
        values: [nextflow, snakemake, wdl, cwl, other]

  WorkflowRun:
    description: "A single execution of a Workflow."
    fields:
      execution_state:
        type: enum
        values: [pending, running, succeeded, failed]
        required: true
        indexed: true
      executor:
        type: string
        description: "Nextflow run ID, AWS Batch job ID, etc."
      started_at:
        type: datetime
      completed_at:
        type: datetime
      parameters:
        type: json
        required: true

relationships:
  - name: donated
    from: Subject
    to: Sample
    cardinality: one-to-many
    description: "A subject donated one or more samples"

  - name: derived_from
    from: Sample
    to: Sample
    cardinality: many-to-many
    description: "A sample derived from another (e.g. dissection, aliquot)"
    properties:
      method: {type: string}

  - name: generated
    from: Sample
    to: Datafile
    cardinality: one-to-many

  - name: input_to
    from: Datafile
    to: WorkflowRun
    cardinality: many-to-many

  - name: output_of
    from: Datafile
    to: WorkflowRun
    cardinality: many-to-many

  - name: instance_of
    from: WorkflowRun
    to: Workflow
    cardinality: many-to-one
    required: true

  - name: contains
    from: Dataset
    to: Datafile
    cardinality: many-to-many

Note: The BrainSample type is included here to illustrate polymorphic extension. A deployment that doesn't need brain-specific fields would simply omit it. The donated and derived_from relationships declared from: Sample also cover BrainSample entities because BrainSample is a subtype of Sample.