Skip to content

Hippo Schema Writer's Guide

A practical guide for writing LinkML schema files for your lab. This document covers the most common schema patterns with real examples. For the full field reference, see Configuration Reference.


Quick Start

A minimal schema with two linked entity types:

id: https://example.org/my-lab
name: my_lab
prefixes:
  linkml: https://w3id.org/linkml/
  my_lab: https://example.org/my-lab/
imports:
  - linkml:types
default_range: string

classes:
  Donor:
    attributes:
      external_id:
        range: string
        required: true
      diagnosis:
        range: string
        required: true
      sex:
        range: string
      age_at_death:
        range: integer

  Sample:
    attributes:
      external_id:
        range: string
        required: true
      tissue:
        range: string
        required: true
      donor:
        range: Donor          # links this attribute to the Donor class
        required: true

Save this as schema.yaml, then point your HippoConfig at it:

from hippo.config import HippoConfig
config = HippoConfig(schema_path="schema.yaml", db_path="my_lab.db")

Schema Header

Every LinkML schema file needs a header with metadata and imports. This tells Hippo (and any LinkML tooling) how to interpret the schema.

id: https://example.org/brain-study     # unique schema identifier (URI)
name: brain_study                         # short name (alphanumeric, underscores, dashes)
prefixes:
  linkml: https://w3id.org/linkml/
  brain_study: https://example.org/brain-study/
imports:
  - linkml:types                          # imports built-in types (string, integer, date, etc.)
default_range: string                     # default type for attributes without an explicit range

Required fields:

  • id — A unique URI identifying this schema
  • name — A short name for the schema
  • prefixes — Must include linkml to use standard imports
  • imports — Almost always includes linkml:types for built-in data types

Classes

Each class defines an entity type in Hippo. Classes are defined under the top-level classes: key as a dictionary keyed by class name.

classes:
  SequencingDataset:
    description: "An RNA-seq or ATAC-seq dataset"
    attributes:
      ...

Rules: - Class names must be unique across your schema - description is optional but recommended - Use PascalCase for class names (e.g., SequencingDataset, GenomeBuild)


Attributes

Basic Attributes

Attributes are declared under a class's attributes: key. Each attribute specifies a range (data type).

attributes:
  sample_id:
    range: string
    required: true           # raises ValidationFailure if missing on create

  quality_score:
    range: float
    required: false          # optional, defaults to null

  read_count:
    range: integer

  is_paired:
    range: boolean

  collected_at:
    range: date              # YYYY-MM-DD

  processed_at:
    range: datetime          # ISO 8601: 2026-03-01T14:30:00Z

  file_uri:
    range: uri               # URI/URL string

Built-in Range Types

These types are available when you import linkml:types:

Range Description
string Text data
integer Integer numbers
float Floating-point numbers
boolean True/false values
date Date (YYYY-MM-DD)
datetime Date and time (ISO 8601)
uri URI/URL string
uriorcurie URI or compact URI (CURIE)

Enum Attributes

Restrict an attribute to a fixed set of values by defining an enum and referencing it:

enums:
  AssayType:
    permissible_values:
      RNASeq:
      ATACSeq:
      ChIPSeq:
      WGS:

classes:
  SequencingDataset:
    attributes:
      assay:
        range: AssayType
        required: true

Validation raises ValidationFailure if an entity is created with a value not in the enum's permissible_values.

You can add descriptions to enum values:

enums:
  TissueType:
    permissible_values:
      DLPFC:
        description: "Dorsolateral prefrontal cortex"
      HC:
        description: "Hippocampus"
      SN:
        description: "Substantia nigra"
      CB:
        description: "Cerebellum"
      STR:
        description: "Striatum"

Default Values

Use ifabsent to specify default values:

  status:
    range: string
    ifabsent: "string(pending)"

  priority:
    range: integer
    ifabsent: "int(0)"

  is_active:
    range: boolean
    ifabsent: "true"

Multivalued Attributes

Use multivalued: true for list-valued attributes:

  tags:
    range: string
    multivalued: true

  diagnoses:
    range: string
    multivalued: true

Linking Classes (References)

Use range to declare that an attribute points to another class. This is the foundation for entity graph traversal in Cappella and explicit relationship queries.

  donor:
    range: Donor              # the class this attribute points to
    required: true

When the range is a class (not a built-in type like string), Hippo treats the attribute as an entity reference. This enables: - Schema introspection (HippoClient.schema_references()) - Cappella collection resolver entity graph traversal - Documentation and tooling

Note

Reference attributes hold Hippo internal IDs (UUIDs), not user-facing identifiers. User-facing identifiers belong in a plain string attribute (such as external_id) or as an ExternalID.

  parent:
    range: Sample             # links back to the same class

Multi-Level Chains

Build graph traversal paths by chaining references:

classes:
  Donor:
    attributes:
      ...

  Sample:
    attributes:
      donor:
        range: Donor                       # Sample -> Donor

  SequencingDataset:
    attributes:
      sample:
        range: Sample                      # Dataset -> Sample -> Donor

With this schema, Cappella can traverse Dataset.sample -> Sample.donor -> Donor automatically when you pass criteria like donor.diagnosis=CTE.


Hippo Extensions (Annotations)

Hippo extends standard LinkML with storage and indexing annotations. These are expressed using LinkML's annotations mechanism and are specific to Hippo's storage layer.

Mark an attribute for full-text search:

  notes:
    range: string
    annotations:
      hippo_search: fts5

  description:
    range: string
    annotations:
      hippo_search: fts5

Query via HippoClient.search() or the REST API:

results = client.search("Sample", "hippocampus cortex")

Database Indexes

Speed up exact lookups on frequently-queried attributes:

  diagnosis:
    range: string
    annotations:
      hippo_index: true              # adds a B-tree index

  batch_id:
    range: string
    annotations:
      hippo_index_partial: true      # index only non-null values (smaller index)

Unique Constraints

  barcode:
    range: string
    identifier: true                 # makes this the unique primary key for the class

For non-primary-key uniqueness, use a class-level unique_keys declaration:

classes:
  Sample:
    unique_keys:
      barcode_key:
        unique_key_slots:
          - barcode
    attributes:
      barcode:
        range: string

Schema Inheritance

Use is_a to inherit all attributes from a parent class. The child class is queryable both as itself and as the parent type.

classes:
  File:
    attributes:
      file_uri:
        range: uri
        required: true
      checksum_sha256:
        range: string

  AlignmentFile:
    is_a: File                    # inherits file_uri and checksum_sha256
    attributes:
      aligner:
        range: string
      genome_build:
        range: GenomeBuild

client.query("File") returns both File and AlignmentFile entities. client.query("AlignmentFile") returns only alignment files.

Mixins

Use mixins to compose shared attribute sets without single-inheritance constraints:

classes:
  Timestamped:
    mixin: true
    attributes:
      collected_at:
        range: datetime
      processed_at:
        range: datetime

  QCAnnotation:
    mixin: true
    attributes:
      qc_status:
        range: string
      qc_score:
        range: float

  AnnotatedAlignmentFile:
    is_a: AlignmentFile
    mixins:
      - Timestamped
      - QCAnnotation            # inherits from AlignmentFile, plus both mixins

Validators

CEL-based validators enforce data quality at write time. Define them in validators.yaml (separate from the schema).

# validators.yaml
validators:
  - name: sample_name_format
    entity_type: Sample
    operations: [create, update]
    condition: 'entity.external_id.matches("^S[0-9]{3,}$")'
    message: "external_id must match S followed by 3+ digits"

  - name: age_plausible
    entity_type: Donor
    operations: [create]
    condition: 'entity.age_at_death >= 18 && entity.age_at_death <= 120'
    message: "age_at_death must be between 18 and 120"

Referential Integrity Validator

Since Hippo doesn't enforce foreign keys at the storage level, use a validator if you need hard enforcement:

validators:
  - name: donor_must_exist
    entity_type: Sample
    operations: [create]
    expand: donor        # pre-fetches the referenced entity
    condition: 'has(entity.donor) && entity.donor != ""'
    message: "Sample must have a non-empty donor reference"

Full cross-entity validation (checking the referenced entity actually exists) requires using the ref_check built-in validator preset:

  - name: valid_donor_ref
    entity_type: Sample
    preset: ref_check
    field: donor
    ref_entity_type: Donor

Namespaces

Group entity types from different domains into named namespaces to avoid collisions. In LinkML, this maps to the default_prefix for a schema file:

id: https://example.org/clinical
name: clinical
prefixes:
  linkml: https://w3id.org/linkml/
  clinical: https://example.org/clinical/
imports:
  - linkml:types
default_prefix: clinical

classes:
  Assessment:                      # fully qualified: clinical.Assessment
    attributes:
      ...

Reference across namespaces using fully-qualified names in attribute ranges:

  sample:
    range: tissue.Sample           # namespace.ClassName

Root namespace classes (no namespace declared) can be referenced without qualification.


Complete Example: Neuroscience Study

# schema.yaml -- example for a brain tissue bank

id: https://example.org/brain-tissue-bank
name: brain_tissue_bank
prefixes:
  linkml: https://w3id.org/linkml/
  btb: https://example.org/brain-tissue-bank/
imports:
  - linkml:types
default_range: string

enums:
  SexType:
    permissible_values:
      M:
      F:
      Unknown:

  TissueRegion:
    permissible_values:
      DLPFC:
        description: "Dorsolateral prefrontal cortex"
      HC:
        description: "Hippocampus"
      SN:
        description: "Substantia nigra"
      CB:
        description: "Cerebellum"
      STR:
        description: "Striatum"

  AssayType:
    permissible_values:
      RNASeq:
      ATACSeq:
      WGS:
      WES:

classes:
  Donor:
    description: "A research subject (human)"
    attributes:
      external_id:
        range: string
        required: true
        identifier: true
        annotations:
          hippo_index: true
      sex:
        range: SexType
        required: true
      age_at_death:
        range: integer
      diagnosis:
        range: string
        required: true
        annotations:
          hippo_index: true
          hippo_search: fts5
      notes:
        range: string
        annotations:
          hippo_search: fts5

  Sample:
    description: "A tissue sample from a donor"
    attributes:
      external_id:
        range: string
        required: true
        identifier: true
      donor:
        range: Donor
        required: true
        annotations:
          hippo_index: true
      tissue:
        range: TissueRegion
        required: true
      brain_region:
        range: string

  SequencingDataset:
    description: "A sequencing run for a sample"
    attributes:
      external_id:
        range: string
        required: true
      sample:
        range: Sample
        required: true
        annotations:
          hippo_index: true
      assay:
        range: AssayType
        required: true
      platform:
        range: string
      read_count:
        range: integer

  GenomeBuild:
    description: "A reference genome assembly"
    attributes:
      name:
        range: string
        required: true
        identifier: true
      source:
        range: string
      release:
        range: string
      source_uri:
        range: uri
      local_uri:
        range: uri

  AlignmentFile:
    is_a: File                     # if you have a base File class
    description: "An aligned sequencing file"
    attributes:
      dataset:
        range: SequencingDataset
        required: true
      genome_build:
        range: GenomeBuild
        required: true
      aligner:
        range: string
      aligner_version:
        range: string

Common Mistakes

Using type instead of range for attribute data types LinkML uses range to specify data types:

  diagnosis:
    range: string       # correct
Not:
  diagnosis:
    type: string        # wrong -- not valid LinkML

Forgetting imports: - linkml:types Without this import, built-in types like string, integer, date are not available. Almost every schema needs this import.

Defining enums inline on an attribute LinkML enums must be defined as separate top-level entries under enums:, then referenced by name in range:

enums:
  AssayType:
    permissible_values:
      RNASeq:
      ATACSeq:

classes:
  Dataset:
    attributes:
      assay:
        range: AssayType   # correct -- references the enum by name
Not:
      assay:
        enum_values:       # wrong -- not valid LinkML
          - RNASeq
          - ATACSeq

Using fields instead of attributes LinkML uses attributes for inline class-specific definitions:

classes:
  Donor:
    attributes:            # correct
      name:
        range: string
Not:
classes:
  Donor:
    fields:                # wrong -- not valid LinkML
      name:
        type: string

Forgetting identifier: true on the primary key attribute Just naming an attribute id or external_id does not make it an identifier. You must explicitly set identifier: true.

Not adding hippo_index annotation on reference attributes Queries filtering on reference attributes (like donor) will be slow without an index. Add hippo_index: true annotation to any attribute you filter on frequently.