Hippo Schema Writer's Guide¶
A practical guide for writing LinkML schema files for your lab. This document covers the most common schema patterns with real examples. For the full field reference, see Configuration Reference.
Quick Start¶
A minimal schema with two linked entity types:
id: https://example.org/my-lab
name: my_lab
prefixes:
  linkml: https://w3id.org/linkml/
  my_lab: https://example.org/my-lab/
imports:
  - linkml:types
default_range: string
classes:
  Donor:
    attributes:
      external_id:
        range: string
        required: true
      diagnosis:
        range: string
        required: true
      sex:
        range: string
      age_at_death:
        range: integer
  Sample:
    attributes:
      external_id:
        range: string
        required: true
      tissue:
        range: string
        required: true
      donor:
        range: Donor  # links this attribute to the Donor class
        required: true
Save this as schema.yaml, then point your HippoConfig at it:
from hippo.config import HippoConfig
config = HippoConfig(schema_path="schema.yaml", db_path="my_lab.db")
Schema Header¶
Every LinkML schema file needs a header with metadata and imports. This tells Hippo (and any LinkML tooling) how to interpret the schema.
id: https://example.org/brain-study  # unique schema identifier (URI)
name: brain_study                    # short name (alphanumeric, underscores, dashes)
prefixes:
  linkml: https://w3id.org/linkml/
  brain_study: https://example.org/brain-study/
imports:
  - linkml:types                     # imports built-in types (string, integer, date, etc.)
default_range: string                # default type for attributes without an explicit range
Required fields:
- id: A unique URI identifying this schema
- name: A short name for the schema
- prefixes: Must include linkml to use standard imports
- imports: Almost always includes linkml:types for built-in data types
Classes¶
Each class defines an entity type in Hippo. Classes are defined under the top-level classes: key as a dictionary keyed by class name.
Rules:
- Class names must be unique across your schema
- description is optional but recommended
- Use PascalCase for class names (e.g., SequencingDataset, GenomeBuild)
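A minimal sketch of the classes: layout these rules describe. The class names, descriptions, and attributes are borrowed from the examples elsewhere in this guide; treat them as placeholders for your own entity types.

```yaml
classes:
  SequencingDataset:                  # PascalCase class name, unique within the schema
    description: "A sequencing run for a sample"   # optional but recommended
    attributes:
      external_id:
        range: string
        required: true
  GenomeBuild:
    description: "A reference genome assembly"
    attributes:
      name:
        range: string
        required: true
```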
Attributes¶
Basic Attributes¶
Attributes are declared under a class's attributes: key. Each attribute specifies a range (data type).
attributes:
  sample_id:
    range: string
    required: true   # raises ValidationFailure if missing on create
  quality_score:
    range: float
    required: false  # optional, defaults to null
  read_count:
    range: integer
  is_paired:
    range: boolean
  collected_at:
    range: date      # YYYY-MM-DD
  processed_at:
    range: datetime  # ISO 8601: 2026-03-01T14:30:00Z
  file_uri:
    range: uri       # URI/URL string
Built-in Range Types¶
These types are available when you import linkml:types:
| Range | Description |
|---|---|
| string | Text data |
| integer | Integer numbers |
| float | Floating-point numbers |
| boolean | True/false values |
| date | Date (YYYY-MM-DD) |
| datetime | Date and time (ISO 8601) |
| uri | URI/URL string |
| uriorcurie | URI or compact URI (CURIE) |
Enum Attributes¶
Restrict an attribute to a fixed set of values by defining an enum and referencing it:
enums:
  AssayType:
    permissible_values:
      RNASeq:
      ATACSeq:
      ChIPSeq:
      WGS:
classes:
  SequencingDataset:
    attributes:
      assay:
        range: AssayType
        required: true
Validation raises ValidationFailure if an entity is created with a value not in the enum's permissible_values.
You can add descriptions to enum values:
enums:
  TissueType:
    permissible_values:
      DLPFC:
        description: "Dorsolateral prefrontal cortex"
      HC:
        description: "Hippocampus"
      SN:
        description: "Substantia nigra"
      CB:
        description: "Cerebellum"
      STR:
        description: "Striatum"
Default Values¶
Use ifabsent to specify default values:
status:
  range: string
  ifabsent: "string(pending)"
priority:
  range: integer
  ifabsent: "int(0)"
is_active:
  range: boolean
  ifabsent: "true"
Multivalued Attributes¶
Use multivalued: true for list-valued attributes:
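A minimal sketch of a list-valued attribute. The aliases attribute name is hypothetical; multivalued: true is the standard LinkML key that makes the value a list of the given range.

```yaml
classes:
  Sample:
    attributes:
      aliases:              # hypothetical attribute name, for illustration only
        range: string
        multivalued: true   # the value is a list of strings rather than a single string
```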
Linking Classes (References)¶
Use range to declare that an attribute points to another class. This is the foundation for entity graph traversal in Cappella and explicit relationship queries.
When the range is a class (not a built-in type like string), Hippo treats the attribute as an entity reference. This enables:
- Schema introspection (HippoClient.schema_references())
- Cappella collection resolver entity graph traversal
- Documentation and tooling
Note
Reference attributes hold Hippo internal IDs (UUIDs), not user-facing identifiers. User-facing identifiers belong in a plain string attribute (such as external_id) or as an ExternalID.
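As a sketch of the distinction drawn in the note, the fragment below reuses the Donor and Sample classes from the Quick Start: external_id is a plain string that carries the user-facing identifier, while donor is a reference because its range is a class.

```yaml
classes:
  Donor:
    attributes:
      external_id:
        range: string   # user-facing identifier, stored as plain text
        required: true
  Sample:
    attributes:
      donor:
        range: Donor    # entity reference; per the note, it holds the Donor's internal Hippo ID
        required: true
```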
Self-Referential Links¶
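In LinkML, an attribute's range can name the class that declares it, which gives a self-referential link. The sketch below assumes Hippo treats such an attribute like any other entity reference; the parent_sample attribute name is hypothetical.

```yaml
classes:
  Sample:
    attributes:
      external_id:
        range: string
        required: true
      parent_sample:        # hypothetical attribute name
        range: Sample       # points back at the same class
```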
Multi-Level Chains¶
Build graph traversal paths by chaining references:
classes:
  Donor:
    attributes:
      ...
  Sample:
    attributes:
      donor:
        range: Donor   # Sample -> Donor
  SequencingDataset:
    attributes:
      sample:
        range: Sample  # Dataset -> Sample -> Donor
With this schema, Cappella can traverse SequencingDataset.sample -> Sample.donor -> Donor automatically when you pass criteria like donor.diagnosis=CTE.
Hippo Extensions (Annotations)¶
Hippo extends standard LinkML with storage and indexing annotations. These are expressed using LinkML's annotations mechanism and are specific to Hippo's storage layer.
Full-Text Search¶
Mark an attribute for full-text search:
notes:
  range: string
  annotations:
    hippo_search: fts5
description:
  range: string
  annotations:
    hippo_search: fts5
Query via HippoClient.search() or the REST API:
Database Indexes¶
Speed up exact lookups on frequently-queried attributes:
diagnosis:
  range: string
  annotations:
    hippo_index: true          # adds a B-tree index
batch_id:
  range: string
  annotations:
    hippo_index_partial: true  # index only non-null values (smaller index)
Unique Constraints¶
For non-primary-key uniqueness, use a class-level unique_keys declaration:
classes:
  Sample:
    unique_keys:
      barcode_key:
        unique_key_slots:
          - barcode
    attributes:
      barcode:
        range: string
Schema Inheritance¶
Use is_a to inherit all attributes from a parent class. The child class is queryable both as itself and as the parent type.
classes:
  File:
    attributes:
      file_uri:
        range: uri
        required: true
      checksum_sha256:
        range: string
  AlignmentFile:
    is_a: File  # inherits file_uri and checksum_sha256
    attributes:
      aligner:
        range: string
      genome_build:
        range: GenomeBuild
client.query("File") returns both File and AlignmentFile entities. client.query("AlignmentFile") returns only alignment files.
Mixins¶
Use mixins to compose shared attribute sets without single-inheritance constraints:
classes:
  Timestamped:
    mixin: true
    attributes:
      collected_at:
        range: datetime
      processed_at:
        range: datetime
  QCAnnotation:
    mixin: true
    attributes:
      qc_status:
        range: string
      qc_score:
        range: float
  AnnotatedAlignmentFile:
    is_a: AlignmentFile
    mixins:
      - Timestamped
      - QCAnnotation  # inherits from AlignmentFile, plus both mixins
Validators¶
CEL-based validators enforce data quality at write time. Define them in validators.yaml (separate from the schema).
# validators.yaml
validators:
  - name: sample_name_format
    entity_type: Sample
    operations: [create, update]
    condition: 'entity.external_id.matches("^S[0-9]{3,}$")'
    message: "external_id must match S followed by 3+ digits"
  - name: age_plausible
    entity_type: Donor
    operations: [create]
    condition: 'entity.age_at_death >= 18 && entity.age_at_death <= 120'
    message: "age_at_death must be between 18 and 120"
Referential Integrity Validator¶
Since Hippo doesn't enforce foreign keys at the storage level, use a validator if you need hard enforcement:
validators:
  - name: donor_must_exist
    entity_type: Sample
    operations: [create]
    expand: donor  # pre-fetches the referenced entity
    condition: 'has(entity.donor) && entity.donor != ""'
    message: "Sample must have a non-empty donor reference"
Full cross-entity validation (checking the referenced entity actually exists) requires using the ref_check built-in validator preset:
Namespaces¶
Group entity types from different domains into named namespaces to avoid collisions. In LinkML, this maps to the default_prefix for a schema file:
id: https://example.org/clinical
name: clinical
prefixes:
  linkml: https://w3id.org/linkml/
  clinical: https://example.org/clinical/
imports:
  - linkml:types
default_prefix: clinical
classes:
  Assessment:  # fully qualified: clinical.Assessment
    attributes:
      ...
Reference across namespaces using fully-qualified names in attribute ranges:
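A sketch of a cross-namespace reference, assuming the dotted clinical.Assessment form shown in the comment above is what Hippo accepts in range. The Visit class and its assessment attribute are hypothetical.

```yaml
classes:
  Visit:                            # hypothetical class defined outside the clinical namespace
    attributes:
      assessment:                   # hypothetical attribute name
        range: clinical.Assessment  # assumed fully-qualified form: namespace.ClassName
```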
Root namespace classes (no namespace declared) can be referenced without qualification.
Complete Example: Neuroscience Study¶
# schema.yaml -- example for a brain tissue bank
id: https://example.org/brain-tissue-bank
name: brain_tissue_bank
prefixes:
  linkml: https://w3id.org/linkml/
  btb: https://example.org/brain-tissue-bank/
imports:
  - linkml:types
default_range: string
enums:
  SexType:
    permissible_values:
      M:
      F:
      Unknown:
  TissueRegion:
    permissible_values:
      DLPFC:
        description: "Dorsolateral prefrontal cortex"
      HC:
        description: "Hippocampus"
      SN:
        description: "Substantia nigra"
      CB:
        description: "Cerebellum"
      STR:
        description: "Striatum"
  AssayType:
    permissible_values:
      RNASeq:
      ATACSeq:
      WGS:
      WES:
classes:
  Donor:
    description: "A research subject (human)"
    attributes:
      external_id:
        range: string
        required: true
        identifier: true
        annotations:
          hippo_index: true
      sex:
        range: SexType
        required: true
      age_at_death:
        range: integer
      diagnosis:
        range: string
        required: true
        annotations:
          hippo_index: true
          hippo_search: fts5
      notes:
        range: string
        annotations:
          hippo_search: fts5
  Sample:
    description: "A tissue sample from a donor"
    attributes:
      external_id:
        range: string
        required: true
        identifier: true
      donor:
        range: Donor
        required: true
        annotations:
          hippo_index: true
      tissue:
        range: TissueRegion
        required: true
      brain_region:
        range: string
  SequencingDataset:
    description: "A sequencing run for a sample"
    attributes:
      external_id:
        range: string
        required: true
      sample:
        range: Sample
        required: true
        annotations:
          hippo_index: true
      assay:
        range: AssayType
        required: true
      platform:
        range: string
      read_count:
        range: integer
  GenomeBuild:
    description: "A reference genome assembly"
    attributes:
      name:
        range: string
        required: true
        identifier: true
      source:
        range: string
      release:
        range: string
      source_uri:
        range: uri
      local_uri:
        range: uri
  File:  # base class from the Schema Inheritance section, so is_a: File below resolves
    attributes:
      file_uri:
        range: uri
        required: true
      checksum_sha256:
        range: string
  AlignmentFile:
    is_a: File  # inherits file_uri and checksum_sha256
    description: "An aligned sequencing file"
    attributes:
      dataset:
        range: SequencingDataset
        required: true
      genome_build:
        range: GenomeBuild
        required: true
      aligner:
        range: string
      aligner_version:
        range: string
Common Mistakes¶
Using type instead of range for attribute data types
LinkML uses range to specify data types:
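A short sketch of the difference, using the tissue attribute from the Quick Start. The commented-out line shows the mistaken key; to my knowledge type is not a key LinkML recognizes on an attribute.

```yaml
classes:
  Sample:
    attributes:
      tissue:
        # type: string   # mistaken key: not how LinkML declares an attribute's data type
        range: string    # correct: range declares the data type
```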
Forgetting imports: - linkml:types
Without this import, built-in types like string, integer, date are not available. Almost every schema needs this import.
Defining enums inline on an attribute
LinkML enums must be defined as separate top-level entries under enums:, then referenced by name in range:
enums:
  AssayType:
    permissible_values:
      RNASeq:
      ATACSeq:
classes:
  Dataset:
    attributes:
      assay:
        range: AssayType  # correct -- references the enum by name
Using fields instead of attributes
LinkML uses attributes for inline class-specific definitions:
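A minimal sketch of the correct key, reusing the Dataset class from the enum example above. The commented-out fields line marks the mistake; as far as I know, LinkML classes have no fields key.

```yaml
classes:
  Dataset:
    # fields:          # mistaken key: not part of a LinkML class definition
    attributes:        # correct: inline, class-specific attribute definitions
      assay:
        range: string
```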
Forgetting identifier: true on the primary key attribute
Just naming an attribute id or external_id does not make it an identifier. You must explicitly set identifier: true.
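A sketch of an explicitly marked identifier, following the Donor class in the complete example:

```yaml
classes:
  Donor:
    attributes:
      external_id:
        range: string
        required: true
        identifier: true   # without this line, external_id is an ordinary attribute
```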
Not adding a hippo_index annotation on reference attributes
Queries that filter on reference attributes (like donor) will be slow without an index. Add a hippo_index: true annotation to any attribute you filter on frequently.