3. Data Model
Document status: Draft v0.3
Depends on: sec1_overview.md, sec2_architecture.md
Feeds into: sec3b_relational_storage.md, sec4_api_layer.md, sec6_provenance.md
3.1 Design Philosophy
Hippo's data model is config-driven and graph-shaped. Entity types, their fields, and the relationships between them are defined in a YAML or JSON schema file — not hardcoded. The SDK exposes a graph-traversal API that abstracts over the underlying storage backend. The storage layer is pluggable via the adapter pattern (see sec2): relational databases (SQLite, PostgreSQL), graph databases, or document stores can all serve as backends — the query API is identical regardless of backend. See sec3b_relational_storage.md for the relational storage mapping reference.
Every entity type in Hippo shares a common structure:
- System fields — a minimal set of fields managed automatically by Hippo, not declarable in schema config
- Availability — a boolean governing default query visibility
- User-defined fields — typed fields declared in schema config
Temporal metadata (created_at, updated_at, schema_version) is owned exclusively by the
provenance system (see sec6_provenance.md) and exposed as computed properties on entity
objects by the SDK. These values are never stored alongside entity data. The provenance log is
the single authoritative source of all temporal and audit information.
3.2 System Fields
Hippo automatically manages the following fields on every entity type. They are not declared in schema config and cannot be overridden by users.
| Field | Type | Writable by user | Description |
|---|---|---|---|
| id | UUID | No (generated by Hippo on creation) | Unique identifier for the entity |
| is_available | bool | No (managed via availability operations only) | Controls whether the entity appears in default query results |
| superseded_by | UUID (nullable) | No (set atomically by client.supersede_entity()) | ID of the replacement entity if this entity has been superseded; None otherwise |
| created_at | datetime (UTC) | No | Timestamp of entity creation, derived from the provenance log |
| updated_at | datetime (UTC) | No | Timestamp of most recent change, derived from the provenance log |
| schema_version | string | No | Schema config version at most recent change, derived from the provenance log |
superseded_by is a system column present on all entity types (applied generically by hippo migrate, not declared per-schema). It is set atomically alongside the availability change when client.supersede_entity() is called. client.get() returns the superseded_by field on all entities; it is None when not superseded. The authoritative record of the supersession is the EntitySuperseded provenance event — the column is a fast-read cache.
On computed fields: created_at is the timestamp of the first provenance event for an
entity. updated_at is the timestamp of the most recent provenance event. schema_version
is the schema config version recorded on the most recent provenance event. All three are
derived at read time by the SDK from the provenance log and presented as first-class fields
on entity response objects — callers never need to query the provenance log directly for these
values. See sec6_provenance.md for the full provenance event model.
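The derivation described above can be sketched in a few lines. This is a minimal illustration, not the SDK's implementation: `ProvenanceEvent` and `computed_fields` are hypothetical names, and the event record is reduced to the three attributes the derivation needs.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class ProvenanceEvent:
    """Hypothetical, simplified event record; the real model is in sec6."""
    entity_id: str
    timestamp: datetime
    schema_version: str


def computed_fields(events: list[ProvenanceEvent]) -> dict:
    """Derive created_at, updated_at and schema_version from an entity's events."""
    if not events:
        raise ValueError("entity has no provenance events")
    ordered = sorted(events, key=lambda e: e.timestamp)
    first, last = ordered[0], ordered[-1]
    return {
        "created_at": first.timestamp,          # first provenance event
        "updated_at": last.timestamp,           # most recent provenance event
        "schema_version": last.schema_version,  # version on the most recent event
    }


events = [
    ProvenanceEvent("abc-123", datetime(2021, 4, 16, 9, 0, tzinfo=timezone.utc), "1.1"),
    ProvenanceEvent("abc-123", datetime(2021, 4, 15, 10, 30, tzinfo=timezone.utc), "1.0"),
]
fields = computed_fields(events)
```

Because nothing is stored on the entity, the three values cannot drift from the log they are read from.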
There are no hard deletes in Hippo. Availability transitions replace deletion. All entities and all provenance history are retained permanently.
3.3 Entity Availability
Every entity carries an is_available boolean field governing whether it appears in default
query results.
The reason an entity became unavailable (archived, deleted, distributed, superseded, etc.)
is not stored on the entity itself. It is recorded on the AvailabilityChanged provenance
event that flipped the flag. This delegates all lifecycle semantics to the provenance system,
which is the single authoritative source of "what happened and why." See sec6_provenance.md
for the full provenance event model.
Default query behavior: All queries return only available entities (is_available = true)
unless the caller explicitly requests unavailable entities. Storage adapters are expected to
optimize for this filter pattern, as it applies to virtually every query. See
sec3b_relational_storage.md §3b.2 for the relational indexing strategy.
superseded_by system relationship: When an entity is superseded, a built-in
superseded_by edge is created. Setting superseded_by automatically marks the source
entity as unavailable (is_available = false). The inverse (supersedes) is a computed
relationship — queryable but not stored as a separate edge. Supersession is an atomic SDK
operation:
```python
client.supersede(
    entity_type="Sample",
    old_id="abc-123",
    new_id="def-456",
    actor="pipeline-run-789",
    reason="Corrected tissue region annotation",
)
```
This writes atomically: an AvailabilityChanged provenance event on abc-123
(is_available true → false, reason: "superseded"), a superseded_by relationship edge,
and the availability update — or rolls back entirely on failure.
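The all-or-nothing behaviour can be illustrated with a plain SQLite transaction. This is a sketch under assumptions: the `entity` and `provenance` tables and their columns are invented for the example (the actual relational layout is defined in sec3b), and the local `supersede` function is a stand-in, not the SDK method.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE entity (
        id TEXT PRIMARY KEY, is_available INTEGER NOT NULL, superseded_by TEXT);
    CREATE TABLE provenance (
        entity_id TEXT NOT NULL, event TEXT NOT NULL,
        reason TEXT, detail TEXT, actor TEXT);
    INSERT INTO entity VALUES ('abc-123', 1, NULL), ('def-456', 1, NULL);
""")


def supersede(conn, old_id, new_id, actor, reason):
    """Apply all supersession writes in one transaction, or none of them."""
    with conn:  # commits on success, rolls back if an exception escapes
        conn.execute(
            "INSERT INTO provenance VALUES (?, 'AvailabilityChanged', 'superseded', ?, ?)",
            (old_id, reason, actor),
        )
        updated = conn.execute(
            "UPDATE entity SET is_available = 0, superseded_by = ? "
            "WHERE id = ? AND is_available = 1",
            (new_id, old_id),
        ).rowcount
        if updated != 1:
            # Raising here also undoes the provenance insert above.
            raise LookupError(f"{old_id} is missing or already unavailable")


supersede(conn, "abc-123", "def-456", "pipeline-run-789",
          "Corrected tissue region annotation")
```

A second call for the same `old_id` fails on the `UPDATE` check and leaves no stray provenance row behind.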
3.4 External IDs
Every entity can carry zero or more external identifiers from upstream systems. An entity might carry IDs from multiple external systems — all pointing to the same Hippo UUID.
Cardinality: Each entity can have multiple external IDs across different systems, and multiple external IDs from the same system. Each external ID maps to exactly one entity.
Lookup: External ID lookup is a first-class query operation. Given a system name and external ID string, the SDK returns the corresponding Hippo entity.
Immutability: External ID records are immutable once written. Correction creates a new
record rather than modifying an existing one. The old (incorrect) record is invalidated by
writing an ExternalIdSuperseded provenance event that references both the old and new
external ID records. The old record is retained for audit purposes but is excluded from
lookups by default — only the most recent active mapping for a given (entity_id, system)
pair is returned. Callers can explicitly request the full history of external ID mappings
for an entity.
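The append-only correction flow might look like the following in-memory sketch. `ExternalIdIndex`, its method names, and the `"LIMS"` system name are all hypothetical; the superseded map stands in for the ExternalIdSuperseded provenance event rather than reproducing it.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)  # frozen: records are never modified after creation
class ExternalIdRecord:
    record_id: int
    entity_id: str
    system: str
    external_id: str


class ExternalIdIndex:
    """Hypothetical in-memory stand-in for the external ID store."""

    def __init__(self) -> None:
        self._records: list[ExternalIdRecord] = []
        self._superseded: dict[int, int] = {}  # old record_id -> new record_id

    def _active(self) -> list[ExternalIdRecord]:
        return [r for r in self._records if r.record_id not in self._superseded]

    def add(self, entity_id: str, system: str, external_id: str) -> ExternalIdRecord:
        for rec in self._active():
            same_key = (rec.system, rec.external_id) == (system, external_id)
            if same_key and rec.entity_id != entity_id:
                raise ValueError("external ID already maps to another entity")
        rec = ExternalIdRecord(len(self._records), entity_id, system, external_id)
        self._records.append(rec)
        return rec

    def correct(self, old: ExternalIdRecord, external_id: str) -> ExternalIdRecord:
        """Write a replacement record and mark the old one superseded."""
        new = self.add(old.entity_id, old.system, external_id)
        self._superseded[old.record_id] = new.record_id
        return new

    def lookup(self, system: str, external_id: str) -> Optional[str]:
        """Default lookup: superseded records are skipped."""
        for rec in self._active():
            if (rec.system, rec.external_id) == (system, external_id):
                return rec.entity_id
        return None

    def history(self, entity_id: str) -> list[ExternalIdRecord]:
        """Full history, including superseded records."""
        return [r for r in self._records if r.entity_id == entity_id]


idx = ExternalIdIndex()
wrong = idx.add("abc-123", "LIMS", "S-0001")
idx.correct(wrong, "S-0010")
```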
See sec3b_relational_storage.md §3b.3 for the relational table schema.
3.5 Field Type System
| Type | Description | Example |
|---|---|---|
| string | UTF-8 text, optional max length | "APOE4" |
| int | Integer | 72 |
| float | Floating point | 1.85 |
| bool | Boolean | true |
| date | Calendar date (ISO 8601) | "2021-04-15" |
| datetime | UTC timestamp | "2021-04-15T10:30:00Z" |
| enum | String constrained to a declared value set | "active" |
| json | Arbitrary JSON object or array | {"key": "value"} |
| uri | S3 URI, local path, or HTTPS URL | "s3://bucket/key" |
| ref | Foreign reference to another Hippo entity by UUID | "sample:abc-123" |
Optional search declaration on string, enum, json, and ref fields:
```yaml
fields:
  preferred_label:
    type: string
    search: fts          # full-text search (SQLite FTS5, PostgreSQL tsvector)
  description:
    type: string
    search: embedding    # vector similarity (adapter-specific model)
  synonyms:
    type: json           # list of strings treated as additional FTS tokens
    search: synonym
  fma_id:
    type: string
    indexed: true        # exact lookup only; no search: declaration
```
Supported search modes: fts, embedding, synonym. The active storage adapter declares
which modes it supports via search_capabilities(). At startup, Hippo validates that every
schema-declared mode is supported by the active adapter and fails with
SearchCapabilityError if any is not. Search returns list[ScoredMatch] (see sec2 §2.3 for
the type definition).
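A startup check of this kind could be sketched as below. The schema is represented as a plain dict, `SearchCapabilityError` is redefined locally as a stand-in, and `supported` plays the role of whatever search_capabilities() returns; the helper names are hypothetical.

```python
class SearchCapabilityError(Exception):
    """Local stand-in for the error named above."""


def declared_search_modes(schema: dict) -> dict[str, set[str]]:
    """Map each declared search mode to the 'Entity.field' names that use it."""
    modes: dict[str, set[str]] = {}
    for entity_name, entity in schema.get("entities", {}).items():
        for field_name, field in entity.get("fields", {}).items():
            mode = field.get("search")
            if mode is not None:
                modes.setdefault(mode, set()).add(f"{entity_name}.{field_name}")
    return modes


def validate_search_modes(schema: dict, supported: set[str]) -> None:
    """Raise if the schema declares a mode the adapter does not support."""
    unsupported = {
        mode: fields
        for mode, fields in declared_search_modes(schema).items()
        if mode not in supported
    }
    if unsupported:
        details = "; ".join(
            f"{mode}: {sorted(fields)}" for mode, fields in sorted(unsupported.items())
        )
        raise SearchCapabilityError(f"adapter does not support search modes: {details}")


schema = {"entities": {"AnatomyTerm": {"fields": {
    "preferred_label": {"type": "string", "search": "fts"},
    "description": {"type": "string", "search": "embedding"},
    "fma_id": {"type": "string", "indexed": True},
}}}}

# An adapter that supports every declared mode passes silently.
validate_search_modes(schema, supported={"fts", "embedding", "synonym"})
```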
ref type validation: The ref format encodes the target entity type and UUID
(e.g. "sample:abc-123"). The SDK validates at write time that (a) the referenced entity
type exists in the current schema config, and (b) the referenced entity UUID exists and is
available. If a schema migration renames an entity type, hippo migrate rewrites all ref
values in affected fields as part of the migration plan — this is a data migration step
that requires explicit confirmation.
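The two write-time checks can be sketched as follows. `RefValidationError`, `parse_ref`, and `validate_ref` are hypothetical names, and the sketch matches the type prefix as an exact string; how the prefix relates to the declared entity type name (for example, letter case) is not specified here.

```python
class RefValidationError(ValueError):
    """Hypothetical error type for this sketch."""


def parse_ref(value: str) -> tuple[str, str]:
    """Split '<type>:<uuid>' on the first colon."""
    entity_type, sep, entity_id = value.partition(":")
    if not sep or not entity_type or not entity_id:
        raise RefValidationError(f"malformed ref {value!r}; expected '<type>:<uuid>'")
    return entity_type, entity_id


def validate_ref(value: str, schema_types: set[str],
                 available_ids: dict[str, set[str]]) -> tuple[str, str]:
    entity_type, entity_id = parse_ref(value)
    # (a) the referenced entity type must exist in the current schema config
    if entity_type not in schema_types:
        raise RefValidationError(f"unknown entity type {entity_type!r}")
    # (b) the referenced entity must exist and be available
    if entity_id not in available_ids.get(entity_type, set()):
        raise RefValidationError(f"{value!r} does not exist or is unavailable")
    return entity_type, entity_id


schema_types = {"sample", "subject"}
available_ids = {"sample": {"abc-123"}}
target = validate_ref("sample:abc-123", schema_types, available_ids)
```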
3.6 Schema Config Format
Authoring format
Hippo schema config is authored directly in LinkML — a YAML or JSON format that drives code generation (Pydantic models, storage adapter models, JSON Schema for API validation) and provides interoperability with other data systems.
Both YAML and JSON are valid. The config loader accepts .yaml, .yml, and .json
extensions.
Entity type declaration
Example (omics deployment): The following is excerpted from the example omics schema in Appendix A. See appendix_a_example_schema_omics.md for the complete schema.
```yaml
version: "1.0"
format: linkml    # optional; auto-detected if absent
entities:
  Subject:
    description: "A biological subject who contributed one or more samples."
    base: BaseEntity
    fields:
      external_id:
        type: string
        required: true
        indexed: true
      species:
        type: enum
        values: [Homo sapiens, Mus musculus, Rattus norvegicus]
        required: true
      biological_sex:
        type: enum
        values: [male, female, unknown]
      age_at_collection:
        type: float
        description: "Age in years at time of sample collection"
      diagnosis:
        type: string
        indexed: true
  Sample:
    description: "A piece of biological material derived from a Subject."
    base: BaseEntity
    fields:
      external_id:
        type: string
        required: true
        indexed: true
      tissue_type:
        type: string
        required: true
        indexed: true
      tissue_region:
        type: string
        indexed: true
      collection_date:
        type: date
      passage:
        type: int
```
Relationship declaration
Example (omics deployment): See appendix_a_example_schema_omics.md §A.2 for the complete set of relationship declarations.
```yaml
relationships:
  - name: donated
    from: Subject
    to: Sample
    cardinality: one-to-many
    description: "A subject donated one or more samples"
  - name: derived_from
    from: Sample
    to: Sample
    cardinality: many-to-many
    description: "A sample derived from another (e.g. dissection, aliquot)"
    properties:
      method: {type: string}
  - name: generated
    from: Sample
    to: Datafile
    cardinality: one-to-many
  - name: input_to
    from: Datafile
    to: WorkflowRun
    cardinality: many-to-many
  - name: output_of
    from: Datafile
    to: WorkflowRun
    cardinality: many-to-many
  - name: instance_of
    from: WorkflowRun
    to: Workflow
    cardinality: many-to-one
    required: true
  - name: contains
    from: Dataset
    to: Datafile
    cardinality: many-to-many
```
Schema requires: block
A schema may declare dependencies on reference loader packages. hippo validate fails fast
with a clear install suggestion if any required loader is absent.
```yaml
version: "1.0"
requires:
  - hippo-reference-fma>=3.3
  - hippo-reference-ensembl>=GRCh38.109
  - hippo-reference-go>=2024-01-01
entities:
  Sample:
    fields:
      tissue_region:
        type: ref
        target: AnatomyTerm    # provided by hippo-reference-fma
        search: fts
```
Schema inheritance (polymorphic)
base: declares an is-a (polymorphic) relationship — not just field copying. A subtype
entity is an instance of its parent type. This has the following semantics:
- Queries: client.query("Sample") returns entities of type Sample and all subtypes (BrainSample, CellLine, etc.) unless exact_type=True is passed
- Validators: entity_types: [Sample] in a validator config covers Sample and all subtypes
- Relationships: A relationship declared from: Sample accepts any Sample subtype
- Storage: Each subtype has its own table for its additional fields; subtype queries join to the parent table. See sec3b §3b.8 for the relational mapping.
- __type__ system field: Every entity carries a read-only __type__ field containing the concrete type name (e.g. "BrainSample"), enabling callers to distinguish subtypes
Single inheritance only. A type may have one base: declaration. Cycles are rejected at
schema validation time.
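The subtype expansion and cycle rejection can be sketched over a simple map from type name to its base. This is an illustration only: `check_no_cycles` and `types_matching` are hypothetical helpers, and the real query path runs against storage rather than an in-memory dict.

```python
from typing import Optional


def check_no_cycles(bases: dict[str, Optional[str]]) -> None:
    """Walk each type's base chain; revisiting a type means a cycle."""
    for start in bases:
        seen = {start}
        parent = bases.get(start)
        while parent is not None:
            if parent in seen:
                raise ValueError(f"inheritance cycle involving {parent!r}")
            seen.add(parent)
            parent = bases.get(parent)


def types_matching(query_type: str, bases: dict[str, Optional[str]],
                   exact_type: bool = False) -> set[str]:
    """Concrete types a query for query_type should cover."""
    if exact_type:
        return {query_type}
    matches = set()
    for name in bases:
        current: Optional[str] = name
        while current is not None:  # terminates only if check_no_cycles passed
            if current == query_type:
                matches.add(name)
                break
            current = bases.get(current)
    return matches


# One base per type (single inheritance), as a name -> base map.
bases = {
    "BaseEntity": None,
    "Subject": "BaseEntity",
    "Sample": "BaseEntity",
    "BrainSample": "Sample",
    "CellLine": "Sample",
}
check_no_cycles(bases)
sample_types = types_matching("Sample", bases)
```

Here `sample_types` contains Sample, BrainSample, and CellLine, while `types_matching("Sample", bases, exact_type=True)` contains only Sample.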
Example (omics deployment): A brain tissue bank extends Sample with region-specific fields:
```yaml
entities:
  BrainSample:
    base: Sample
    description: "A sample from a brain tissue collection"
    fields:
      brain_region: {type: string, indexed: true, search: fts}
      hemisphere:
        type: enum
        values: [left, right, bilateral, unknown]
      post_mortem_interval_hours: {type: float}
```
3.7 Relationship Model
Relationships in Hippo are edge-based and typed. Each relationship connects two entities with a named, directional edge. Relationship types and their allowed cardinalities are declared in schema config. The SDK enforces cardinality constraints at write time.
Supported cardinalities: one-to-many, many-to-one, many-to-many. Self-referential
relationships (where from and to are the same entity type) are supported.
Relationship properties: Relationships can carry typed key-value properties declared in
schema config (e.g., a method property on a derived_from relationship). Properties are
optional and defined per relationship type.
Immutability: Relationships are immutable once created. To remove a relationship, its
status is set to removed and a RelationshipRemoved provenance event is written. The
original relationship record is retained for audit purposes. Queries return only active
relationships by default; callers can explicitly request removed relationships.
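A write-time cardinality check might look like the sketch below. It assumes the conventional reading of the three cardinalities (one-to-many: each target has at most one source; many-to-one: each source has at most one target), which this section does not spell out, and `check_cardinality` is a hypothetical helper operating on the set of active edges only.

```python
def check_cardinality(cardinality: str, active_edges: set[tuple[str, str]],
                      from_id: str, to_id: str) -> None:
    """Raise ValueError if adding (from_id, to_id) would violate the cardinality."""
    if cardinality == "many-to-many":
        return
    if cardinality == "one-to-many":
        # Each target may be linked from at most one source.
        conflict = any(t == to_id and f != from_id for f, t in active_edges)
    elif cardinality == "many-to-one":
        # Each source may link to at most one target.
        conflict = any(f == from_id and t != to_id for f, t in active_edges)
    else:
        raise ValueError(f"unsupported cardinality {cardinality!r}")
    if conflict:
        raise ValueError(
            f"edge {from_id} -> {to_id} violates {cardinality} cardinality"
        )


# 'donated' is declared one-to-many from Subject to Sample.
donated = {("subject-1", "sample-1")}
check_cardinality("one-to-many", donated, "subject-1", "sample-2")  # allowed
```

Removed relationships are excluded from `active_edges`, so removing an edge frees the slot it occupied.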
System relationships: The superseded_by relationship is built-in and available on all
entity types regardless of schema config. See §3.3 for supersession semantics.
See sec3b_relational_storage.md §3b.4 for the relational table schema.
3.8 Schema Versioning and Migration
Every schema config carries a version string. Hippo tracks the deployed version in internal
metadata. When versions differ, hippo migrate must run before the server starts.
hippo migrate steps:
1. Load and validate the LinkML schema
2. Diff schema against deployed version
3. Generate and print migration plan for human review
4. On confirmation (or --yes flag), apply changes and update internal metadata
5. Write a MigrationApplied provenance event
Migration rules:
| Change type | Action |
|---|---|
| New entity type | Provision new entity type with configured indexes |
| New field | Add field (nullable, or with default if provided) |
| New relationship type | No structural change — relationship storage is generic |
| New index | Create index on the specified field |
| New enum value | Extend allowed value set |
| Field type change | Reject — requires manual migration |
| Remove field | Mark deprecated; field retained in storage; excluded from API responses (see below) |
| Remove entity type | Reject — all entities must reach non-active status first |
Hippo never removes stored data automatically.
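A few of the rules in the table can be expressed as a diff over two schema dicts. This is a partial sketch: `plan_migration` and its action names are hypothetical, index and enum changes are omitted, and entity type removal is always reported as rejected without checking whether all entities have reached non-active status.

```python
def plan_migration(deployed: dict, target: dict) -> list[tuple[str, str]]:
    """Return (action, subject) pairs for a subset of the migration rules."""
    plan: list[tuple[str, str]] = []
    old_entities = deployed.get("entities", {})
    new_entities = target.get("entities", {})

    for name in sorted(new_entities.keys() - old_entities.keys()):
        plan.append(("provision_entity_type", name))
    for name in sorted(old_entities.keys() - new_entities.keys()):
        plan.append(("reject_entity_type_removal", name))

    for name in sorted(old_entities.keys() & new_entities.keys()):
        old_fields = old_entities[name].get("fields", {})
        new_fields = new_entities[name].get("fields", {})
        for field in sorted(new_fields.keys() - old_fields.keys()):
            plan.append(("add_field", f"{name}.{field}"))
        for field in sorted(old_fields.keys() - new_fields.keys()):
            # Removed from config: deprecated, but retained in storage.
            plan.append(("deprecate_field", f"{name}.{field}"))
        for field in sorted(old_fields.keys() & new_fields.keys()):
            if old_fields[field].get("type") != new_fields[field].get("type"):
                plan.append(("reject_field_type_change", f"{name}.{field}"))
    return plan


deployed = {"entities": {"Sample": {"fields": {
    "passage": {"type": "int"},
    "tissue_type": {"type": "string"},
}}}}
target = {"entities": {
    "Sample": {"fields": {
        "passage": {"type": "float"},         # type change: rejected
        "tissue_region": {"type": "string"},  # new field
    }},
    "CellLine": {"fields": {}},               # new entity type
}}
plan = plan_migration(deployed, target)
```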
Deprecated field communication: When a field is removed from schema config, it enters a
deprecated state tracked in internal metadata. Deprecated fields are excluded from SDK
response objects and API responses by default. The REST transport layer includes deprecated
field names in a X-Hippo-Deprecated-Fields response header on any endpoint that would have
returned them. The SDK exposes a schema.deprecated_fields(entity_type) method for
programmatic discovery. Callers can explicitly request deprecated fields via an
include_deprecated=True flag, which returns them with a _deprecated suffix to prevent
silent breakage.
See sec3b_relational_storage.md §3b.5 for the hippo_meta table schema and §3b.6 for
migration DDL mechanics.
3.9 Validation Rules
Example (omics deployment): The following validation rules are for Datafile fields from the example omics schema in Appendix A.
```yaml
fields:
  modality:
    type: enum
    values: [RNASeq, WGBS, ATAC, WGS, genotyping]
    required: true
  uri:
    type: uri
    required: true
    validators:
      - type: uri_scheme
        allowed: [s3, file, https]
  read_count:
    type: int
    validators:
      - type: range
        min: 0
  external_id:
    type: string
    required: true
    validators:
      - type: unique
        scope: entity_type
```
Built-in field validators for v0.1: required, unique, range, uri_scheme, regex,
min_length, max_length. These are field-level structural constraints only.
For cross-field and cross-entity business rule validation, see sec2 §2.13 (Validation
Infrastructure) — config-driven validators.yaml rules (CEL-based) and Python plugin
validators (hippo.write_validators entry points) are a separate, complementary system
that runs in the write path after field validators.
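Two of the built-in validators, range and uri_scheme, can be sketched as a small dispatcher over the validator entries shown in the config above. `validate_field` is a hypothetical function, and the `max` key is an assumption (the config example only shows `min`).

```python
from urllib.parse import urlparse


def validate_field(value, validators: list[dict]) -> list[str]:
    """Run field-level validators and return a list of error messages."""
    errors: list[str] = []
    for spec in validators:
        kind = spec["type"]
        if kind == "range":
            if "min" in spec and value < spec["min"]:
                errors.append(f"{value} is below minimum {spec['min']}")
            if "max" in spec and value > spec["max"]:  # 'max' is assumed
                errors.append(f"{value} is above maximum {spec['max']}")
        elif kind == "uri_scheme":
            scheme = urlparse(value).scheme
            if scheme not in spec["allowed"]:
                errors.append(f"scheme {scheme!r} not in {spec['allowed']}")
        else:
            raise ValueError(f"validator {kind!r} is not covered by this sketch")
    return errors


uri_rules = [{"type": "uri_scheme", "allowed": ["s3", "file", "https"]}]
count_rules = [{"type": "range", "min": 0}]

ok = validate_field("s3://bucket/key", uri_rules)
bad_scheme = validate_field("ftp://host/file.bam", uri_rules)
bad_count = validate_field(-5, count_rules)
```

Returning messages instead of raising on the first failure lets a caller report every violated constraint for a field at once.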
3.10 Extending or Replacing the Schema
Schema config is the sole input that defines Hippo's entity model. There is no built-in default schema — every deployment authors a schema appropriate to its domain. Deployments can:
- Extend — add new entity types and relationships to an existing schema
- Modify — change fields, add validators, or adjust relationships on existing types
- Replace — discard the current schema entirely and author a new one from scratch
Example (omics deployment): The following extends the omics schema from Appendix A with a new entity type for cell lines.
```yaml
entities:
  CellLine:
    description: "An immortalized cell line derived from a subject"
    base: Sample
    fields:
      cell_line_name: {type: string, required: true, indexed: true}
      passage_number: {type: int}
      culture_conditions: {type: json}
relationships:
  - name: established_from
    from: CellLine
    to: Sample
    cardinality: many-to-one
```
Run hippo migrate to apply. The new entity type is immediately queryable and ingestable
with no further changes.
3.11 Entity Type Namespacing
Entity type strings in Hippo are optionally namespace-qualified. The namespace system
partitions entity types into named scopes, allowing multiple subsystems to define their own
Sample or Subject types without collision.
Namespace syntax
- Bare string (root namespace): "Sample", "Donor". Refers to entity types declared with no namespace prefix. All existing entity types without a namespace key are in the root namespace.
- Qualified string (named namespace): "tissue.Sample", "omics.Datafile". The prefix before the dot is the namespace name.
- Explicit root prefix: "root.Donor" is equivalent to "Donor". root is the only implicit namespace; all others must be explicit.
These forms are valid wherever an entity type is accepted:
| Context | Example |
|---|---|
| SDK calls | client.put("tissue.Sample", {...}) / client.get("Donor", id) |
| REST query parameters | GET /entities?entity_type=tissue.Sample |
| Schema references.entity_type fields | references: {entity_type: tissue.Sample} |
| Provenance records | entity_type stored verbatim as "tissue.Sample" |
Root namespace canonicalization
root.Donor is normalized to "Donor" at schema load time. Only the unqualified form
appears in SchemaConfig and in storage — there is no root. prefix in persisted data.
tissue.Sample and omics.Sample are stored verbatim as distinct entity type strings.
Namespacing in schema config
Schema files may declare a top-level namespace: key. Entities in that file are scoped to
the named namespace. Files without a namespace: key contribute to the root namespace.
Multiple files sharing the same namespace have their entity lists merged at load time.
```yaml
# schemas/tissue.yaml
namespace: tissue
entities:
  Sample:
    fields:
      donor_id: {type: string, references: {entity_type: Donor}}            # root Donor
      parent_id: {type: string, references: {entity_type: tissue.Sample}}   # self-ref
```
Cross-namespace references use FQNs in references.entity_type. Namespace dependencies are
inferred from these references — no explicit depends_on: declaration is required. See
§3.12 for the full schema loading and validation model.
Backwards compatibility
Schemas with no namespace: key are unaffected. All existing unqualified entity type
strings continue to resolve to the root namespace. No data migration is required — existing
rows in storage use unqualified names, which remain valid. client.put("Sample", {...})
continues to work for root-namespace Sample entities regardless of whether any namespaced
tissue.Sample or omics.Sample entities exist in the same deployment.
3.12 Schema Namespaces
Note: This section documents the multi-file namespace system introduced for multi-team deployments. Single-file deployments with no namespace: key require no changes and are unaffected by this section.
Namespace declaration
Each schema file may declare an optional namespace: key at the top level. This key scopes
all entity types in that file to a named namespace. Files without the key contribute entities
to the root namespace.
Multiple files may share the same namespace: value — their entity lists are merged at load
time. If two files in the same namespace declare an entity with the same name, the schema
loader raises SchemaValidationError identifying the conflicting entity and both files.
```yaml
# schemas/tissue.yaml
namespace: tissue
entities:
  Sample: { ... }
  Block: { ... }
```
```yaml
# schemas/tissue_extended.yaml
namespace: tissue    # same namespace; merges with tissue.yaml above
entities:
  Slide: { ... }     # OK: distinct name
```
NamespaceRegistry
The SchemaLoader builds a NamespaceRegistry as it discovers schema files. The registry
maps (namespace, entity_name) → EntityConfig. The full registry is populated from all
files before any cross-namespace reference validation begins — this ensures forward references
work regardless of file discovery order.
NamespaceRegistry provides:
- FQN lookup: registry.get("tissue", "Sample") → EntityConfig
- Root lookup: registry.get(None, "Donor") or registry.get("root", "Donor") → same result
- Validation: raises SchemaValidationError for unknown FQNs or circular dependencies
FQN resolution rules
| Input string | Resolution |
|---|---|
| "Donor" | Root namespace entity Donor |
| "root.Donor" | Equivalent to "Donor"; normalized at registry ingestion |
| "tissue.Sample" | Entity Sample in namespace tissue |
| "ghost.Entity" | SchemaValidationError: namespace ghost not registered |
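The resolution rules in the table can be sketched as a parser plus a dictionary lookup. The names `parse_fqn`, `canonical`, and `resolve` are hypothetical, `SchemaValidationError` is redefined locally as a stand-in, and the sketch assumes a single dot separates namespace from entity name.

```python
from typing import Optional


class SchemaValidationError(Exception):
    """Local stand-in for the loader's error type."""


def parse_fqn(entity_type: str) -> tuple[Optional[str], str]:
    """Split an entity type string into (namespace, name); None means root."""
    head, sep, tail = entity_type.partition(".")
    if not sep:
        return None, head                      # bare string: root namespace
    if not head or not tail:
        raise SchemaValidationError(f"malformed entity type {entity_type!r}")
    return (None if head == "root" else head), tail


def canonical(entity_type: str) -> str:
    """Normalize 'root.X' to 'X'; leave other forms unchanged."""
    namespace, name = parse_fqn(entity_type)
    return name if namespace is None else f"{namespace}.{name}"


def resolve(registry: dict, entity_type: str):
    """Look up an entity config keyed by (namespace, name)."""
    key = parse_fqn(entity_type)
    if key not in registry:
        raise SchemaValidationError(f"Unresolved FQN reference {entity_type!r}")
    return registry[key]


# Placeholder configs; a real registry would hold EntityConfig objects.
registry = {(None, "Donor"): "donor-config", ("tissue", "Sample"): "sample-config"}
```

With this registry, `resolve(registry, "root.Donor")` and `resolve(registry, "Donor")` return the same entry, and `resolve(registry, "ghost.Entity")` raises.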
Cross-namespace reference validation
After the registry is fully populated, the schema loader validates all references.entity_type
values in all fields across all namespaces. For each reference:
- Parse the FQN into (namespace, entity_name)
- Look up in the registry
- If not found, raise SchemaValidationError identifying the unresolved FQN and the file
Circular dependency detection
The schema loader derives a namespace dependency graph from cross-namespace references. If
namespace A references an entity in namespace B, then A depends on B. The loader performs a
topological sort over this graph. If a cycle is detected (e.g., A → B → A), the loader raises
SchemaValidationError identifying the cycle path.
Dependencies are inferred — no depends_on: key is required or supported.
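One way to implement the topological sort and cycle check is Python's standard-library graphlib. This is a sketch, not the loader's code: `namespace_load_order` is a hypothetical function, `SchemaValidationError` is a local stand-in, and the string "root" is used here as the label for the root namespace.

```python
from graphlib import CycleError, TopologicalSorter


class SchemaValidationError(Exception):
    """Local stand-in for the loader's error type."""


def namespace_load_order(references: list[tuple[str, str]]) -> list[str]:
    """Order namespaces so that each comes after the ones it depends on.

    references holds (referencing_namespace, referenced_namespace) pairs
    inferred from references.entity_type values.
    """
    graph: dict[str, set[str]] = {}
    for source, target in references:
        graph.setdefault(source, set())
        graph.setdefault(target, set())
        if source != target:  # a self-reference is not a dependency
            graph[source].add(target)
    try:
        # TopologicalSorter takes node -> predecessors, i.e. dependencies.
        return list(TopologicalSorter(graph).static_order())
    except CycleError as exc:
        cycle = " → ".join(exc.args[1])  # args[1] is the list of nodes in the cycle
        raise SchemaValidationError(
            f"Circular namespace dependency detected: {cycle}"
        ) from None


order = namespace_load_order([
    ("tissue", "root"),    # tissue.Sample references root Donor
    ("tissue", "tissue"),  # tissue.Sample references itself
    ("omics", "tissue"),   # omics.Datafile references tissue.Sample
])
```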
Error messages
Namespace validation errors identify the problematic reference or cycle path and the file where it originates:
```
SchemaValidationError: Unresolved FQN reference 'ghost.Entity' in field 'sample.ghost_id'
  declared in: schemas/tissue.yaml

SchemaValidationError: Circular namespace dependency detected: tissue → omics → tissue
  first reference: schemas/omics.yaml field 'datafile.sample_id' (references tissue.Sample)
```
Backwards compatibility
Existing single-file schema.yaml deployments require no changes. The namespace: key is
optional. HippoClient method signatures are unchanged — callers pass FQNs as strings to the
same entity_type parameter they already use. No data migration is required for any existing
deployment.
3.13 Multi-tenancy
Single namespace in v0.1. Multi-tenancy is explicitly out of scope for v0.1.