3. Data Model

Document status: Draft v0.3
Depends on: sec1_overview.md, sec2_architecture.md
Feeds into: sec3b_relational_storage.md, sec4_api_layer.md, sec6_provenance.md


3.1 Design Philosophy

Hippo's data model is config-driven and graph-shaped. Entity types, their fields, and the relationships between them are defined in a YAML or JSON schema file — not hardcoded. The SDK exposes a graph-traversal API that abstracts over the underlying storage backend. The storage layer is pluggable via the adapter pattern (see sec2): relational databases (SQLite, PostgreSQL), graph databases, or document stores can all serve as backends — the query API is identical regardless of backend. See sec3b_relational_storage.md for the relational storage mapping reference.

Every entity type in Hippo shares a common structure:

  • System fields — a minimal set of fields managed automatically by Hippo, not declarable in schema config
  • Availability — a boolean governing default query visibility
  • User-defined fields — typed fields declared in schema config

Temporal metadata (created_at, updated_at, schema_version) is owned exclusively by the provenance system (see sec6_provenance.md) and exposed as computed properties on entity objects by the SDK. These values are never stored alongside entity data. The provenance log is the single authoritative source of all temporal and audit information.


3.2 System Fields

Hippo automatically manages the following fields on every entity type. They are not declared in schema config and cannot be overridden by users.

| Field | Type | Writable by user | Description |
|---|---|---|---|
| id | UUID | No (generated by Hippo on creation) | Unique identifier for the entity |
| is_available | bool | No (managed via availability operations only) | Controls whether the entity appears in default query results |
| superseded_by | UUID (nullable) | No (set atomically by client.supersede_entity()) | ID of the replacement entity if this entity has been superseded; None otherwise |
| created_at | datetime (UTC) | No | Timestamp of entity creation, derived from the provenance log |
| updated_at | datetime (UTC) | No | Timestamp of most recent change, derived from the provenance log |
| schema_version | string | No | Schema config version at most recent change, derived from the provenance log |

superseded_by is a system column present on all entity types (applied generically by hippo migrate, not declared per-schema). It is set atomically alongside the availability change when client.supersede_entity() is called. client.get() returns the superseded_by field on all entities; it is None when not superseded. The authoritative record of the supersession is the EntitySuperseded provenance event — the column is a fast-read cache.

On computed fields: created_at is the timestamp of the first provenance event for an entity. updated_at is the timestamp of the most recent provenance event. schema_version is the schema config version recorded on the most recent provenance event. All three are derived at read time by the SDK from the provenance log and presented as first-class fields on entity response objects — callers never need to query the provenance log directly for these values. See sec6_provenance.md for the full provenance event model.
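
The derivation described above can be sketched in a few lines. This is an illustrative sketch only: the ProvenanceEvent class and the computed_fields helper are hypothetical names, and the real event model is defined in sec6_provenance.md.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class ProvenanceEvent:
    """Hypothetical, simplified provenance event (not the sec6 model)."""
    entity_id: str
    timestamp: datetime
    schema_version: str


def computed_fields(events: list[ProvenanceEvent]) -> dict:
    """Derive created_at, updated_at and schema_version from an entity's events."""
    if not events:
        raise ValueError("entity has no provenance events")
    ordered = sorted(events, key=lambda e: e.timestamp)
    first, last = ordered[0], ordered[-1]
    return {
        "created_at": first.timestamp,          # first provenance event
        "updated_at": last.timestamp,           # most recent provenance event
        "schema_version": last.schema_version,  # version on the most recent event
    }


# Events may arrive in any order; sorting by timestamp makes the result stable.
events = [
    ProvenanceEvent("abc-123", datetime(2024, 3, 1, tzinfo=timezone.utc), "1.1"),
    ProvenanceEvent("abc-123", datetime(2024, 1, 5, tzinfo=timezone.utc), "1.0"),
]
fields = computed_fields(events)
```

Because nothing is cached on the entity row in this sketch, the three values cannot drift from the provenance log.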

There are no hard deletes in Hippo. Availability transitions replace deletion. All entities and all provenance history are retained permanently.


3.3 Entity Availability

Every entity carries an is_available boolean field governing whether it appears in default query results.

The reason an entity became unavailable (archived, deleted, distributed, superseded, etc.) is not stored on the entity itself. It is recorded on the AvailabilityChanged provenance event that flipped the flag. This delegates all lifecycle semantics to the provenance system, which is the single authoritative source of "what happened and why." See sec6_provenance.md for the full provenance event model.

Default query behavior: All queries return only available entities (is_available = true) unless the caller explicitly requests unavailable entities. Storage adapters are expected to optimize for this filter pattern, as it applies to virtually every query. See sec3b_relational_storage.md §3b.2 for the relational indexing strategy.

superseded_by system relationship: When an entity is superseded, a built-in superseded_by edge is created. Setting superseded_by automatically marks the source entity as unavailable (is_available = false). The inverse (supersedes) is a computed relationship — queryable but not stored as a separate edge. Supersession is an atomic SDK operation:

client.supersede_entity(
    entity_type="Sample",
    old_id="abc-123",
    new_id="def-456",
    actor="pipeline-run-789",
    reason="Corrected tissue region annotation"
)

This writes atomically: an AvailabilityChanged provenance event on abc-123 (is_available true → false, reason: "superseded"), a superseded_by relationship edge, and the availability update — or rolls back entirely on failure.
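
One way a relational adapter might obtain this all-or-nothing behavior is a single database transaction. The sketch below uses Python's built-in sqlite3 module; the entity and provenance tables, their columns, and the standalone supersede_entity function are assumptions made for illustration and do not reflect the schema in sec3b_relational_storage.md.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE entity (id TEXT PRIMARY KEY,
                         is_available INTEGER NOT NULL,
                         superseded_by TEXT);
    CREATE TABLE provenance (entity_id TEXT NOT NULL, event TEXT NOT NULL,
                             reason TEXT, detail TEXT, actor TEXT);
    INSERT INTO entity VALUES ('abc-123', 1, NULL), ('def-456', 1, NULL);
""")


def supersede_entity(conn, old_id, new_id, actor, reason):
    """Flip availability, set superseded_by and log the event in one transaction."""
    # "with conn" commits on success and rolls back if any statement raises.
    with conn:
        cur = conn.execute(
            "UPDATE entity SET is_available = 0, superseded_by = ? "
            "WHERE id = ? AND is_available = 1",
            (new_id, old_id),
        )
        if cur.rowcount != 1:
            raise LookupError(f"{old_id} is missing or already unavailable")
        found = conn.execute("SELECT 1 FROM entity WHERE id = ?", (new_id,)).fetchone()
        if found is None:
            # Raising here undoes the UPDATE above.
            raise LookupError(f"replacement {new_id} does not exist")
        conn.execute(
            "INSERT INTO provenance VALUES (?, 'AvailabilityChanged', 'superseded', ?, ?)",
            (old_id, reason, actor),
        )
```

Calling supersede_entity(conn, "abc-123", "def-456", "pipeline-run-789", "Corrected tissue region annotation") updates both tables together; a call with an unknown replacement ID leaves abc-123 available and writes no provenance row.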


3.4 External IDs

Every entity can carry zero or more external identifiers from upstream systems. An entity might carry IDs from multiple external systems — all pointing to the same Hippo UUID.

Cardinality: Each entity can have multiple external IDs across different systems, and multiple external IDs from the same system. Each external ID maps to exactly one entity.

Lookup: External ID lookup is a first-class query operation. Given a system name and external ID string, the SDK returns the corresponding Hippo entity.

Immutability: External ID records are immutable once written. Correction creates a new record rather than modifying an existing one. The old (incorrect) record is invalidated by writing an ExternalIdSuperseded provenance event that references both the old and new external ID records. The old record is retained for audit purposes but is excluded from lookups by default — only the most recent active mapping for a given (entity_id, system) pair is returned. Callers can explicitly request the full history of external ID mappings for an entity.
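
The append-only behavior can be illustrated with a small in-memory index. This is a sketch under stated assumptions: ExternalId, ExternalIdIndex and their method names are hypothetical, and the superseded set stands in for ExternalIdSuperseded provenance events.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExternalId:
    """Immutable external ID record (field names are illustrative)."""
    entity_id: str
    system: str
    value: str


class ExternalIdIndex:
    def __init__(self):
        self._records: list[ExternalId] = []        # append-only, never edited
        self._superseded: set[ExternalId] = set()   # stand-in for provenance events

    def _active(self):
        return [r for r in self._records if r not in self._superseded]

    def add(self, entity_id, system, value):
        # Each active external ID may map to exactly one entity.
        for rec in self._active():
            if (rec.system, rec.value) == (system, value) and rec.entity_id != entity_id:
                raise ValueError(f"{system}:{value} already maps to {rec.entity_id}")
        rec = ExternalId(entity_id, system, value)
        self._records.append(rec)
        return rec

    def correct(self, old, new_value):
        """Write a replacement record, then mark the old one superseded."""
        new = self.add(old.entity_id, old.system, new_value)
        self._superseded.add(old)
        return new

    def lookup(self, system, value):
        """Return the entity ID for an active mapping, or None."""
        for rec in self._active():
            if (rec.system, rec.value) == (system, value):
                return rec.entity_id
        return None

    def history(self, entity_id):
        """Full mapping history, including superseded records."""
        return [r for r in self._records if r.entity_id == entity_id]
```

After idx.correct(rec, "S-002"), a lookup of the old value returns None, while idx.history(...) still lists both records.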

See sec3b_relational_storage.md §3b.3 for the relational table schema.


3.5 Field Type System

| Type | Description | Example |
|---|---|---|
| string | UTF-8 text, optional max length | "APOE4" |
| int | Integer | 72 |
| float | Floating point | 1.85 |
| bool | Boolean | true |
| date | Calendar date (ISO 8601) | "2021-04-15" |
| datetime | UTC timestamp | "2021-04-15T10:30:00Z" |
| enum | String constrained to a declared value set | "active" |
| json | Arbitrary JSON object or array | {"key": "value"} |
| uri | S3 URI, local path, or HTTPS URL | "s3://bucket/key" |
| ref | Foreign reference to another Hippo entity by UUID | "sample:abc-123" |

Optional search declaration on string, enum, json, and ref fields:

fields:
  preferred_label:
    type: string
    search: fts          # full-text search (SQLite FTS5, PostgreSQL tsvector)
  description:
    type: string
    search: embedding    # vector similarity (adapter-specific model)
  synonyms:
    type: json           # list of strings treated as additional FTS tokens
    search: synonym
  fma_id:
    type: string
    indexed: true        # exact lookup only — no search: declaration

Supported search modes: fts, embedding, synonym. The active storage adapter declares which modes it supports via search_capabilities(). At startup, Hippo validates that every schema-declared mode is supported and fails with SearchCapabilityError otherwise. Search returns list[ScoredMatch]; see sec2 §2.3 for the type definition.
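
The startup check could look roughly like the following. It is a sketch, not Hippo's implementation: the schema is represented as a plain dict, validate_search_modes is a hypothetical helper, and attaching these fields to AnatomyTerm is an assumption for the example.

```python
class SearchCapabilityError(Exception):
    """Raised when the schema declares a search mode the adapter cannot serve."""


def validate_search_modes(schema: dict, capabilities: set[str]) -> None:
    """Compare every field's search: declaration with the adapter's capabilities."""
    unsupported = []
    for entity_name, entity in schema.get("entities", {}).items():
        for field_name, field in entity.get("fields", {}).items():
            mode = field.get("search")
            if mode is not None and mode not in capabilities:
                unsupported.append(f"{entity_name}.{field_name}: {mode}")
    if unsupported:
        raise SearchCapabilityError("unsupported search modes: " + ", ".join(unsupported))


schema = {"entities": {"AnatomyTerm": {"fields": {
    "preferred_label": {"type": "string", "search": "fts"},
    "description": {"type": "string", "search": "embedding"},
    "fma_id": {"type": "string", "indexed": True},   # no search: declaration
}}}}

# An adapter that supports every mode passes silently.
validate_search_modes(schema, {"fts", "embedding", "synonym"})
```

Collecting all unsupported fields before raising lets a single error message list every field that needs attention.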

ref type validation: The ref format encodes the target entity type and UUID (e.g. "sample:abc-123"). The SDK validates at write time that (a) the referenced entity type exists in the current schema config, and (b) the referenced entity UUID exists and is available. If a schema migration renames an entity type, hippo migrate rewrites all ref values in affected fields as part of the migration plan — this is a data migration step that requires explicit confirmation.
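
A minimal sketch of the two write-time checks follows. The function names, the RefValidationError class, and the dict-based store are hypothetical; the lowercase type prefix simply follows the "sample:abc-123" example, and how Hippo actually cases the prefix is not specified here.

```python
class RefValidationError(ValueError):
    """Raised when a ref value is malformed or points at an invalid target."""


def parse_ref(ref: str) -> tuple[str, str]:
    """Split '<entity_type>:<uuid>' into its two parts."""
    entity_type, sep, entity_id = ref.partition(":")
    if not sep or not entity_type or not entity_id:
        raise RefValidationError(f"malformed ref {ref!r}; expected '<type>:<uuid>'")
    return entity_type, entity_id


def validate_ref(ref: str, schema_types: set[str], store: dict) -> tuple[str, str]:
    entity_type, entity_id = parse_ref(ref)
    # (a) the referenced entity type must exist in the current schema config
    if entity_type not in schema_types:
        raise RefValidationError(f"unknown entity type {entity_type!r}")
    # (b) the referenced entity must exist and be available
    entity = store.get((entity_type, entity_id))
    if entity is None or not entity["is_available"]:
        raise RefValidationError(f"{ref!r} does not point at an available entity")
    return entity_type, entity_id


store = {
    ("sample", "abc-123"): {"is_available": True},
    ("sample", "old-1"): {"is_available": False},
}
target = validate_ref("sample:abc-123", {"sample"}, store)
```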


3.6 Schema Config Format

Authoring format

Hippo schema config is authored directly in LinkML — a YAML or JSON format that drives code generation (Pydantic models, storage adapter models, JSON Schema for API validation) and provides interoperability with other data systems.

# hippo.yaml
schema:
  path: ./schema.yaml           # LinkML format

Both YAML and JSON are valid. The config loader accepts .yaml, .yml, and .json extensions.

Entity type declaration

Example (omics deployment): The following is excerpted from the example omics schema in Appendix A. See appendix_a_example_schema_omics.md for the complete schema.

version: "1.0"
format: linkml            # optional — auto-detected if absent

entities:
  Subject:
    description: "A biological subject who contributed one or more samples."
    base: BaseEntity
    fields:
      external_id:
        type: string
        required: true
        indexed: true
      species:
        type: enum
        values: [Homo sapiens, Mus musculus, Rattus norvegicus]
        required: true
      biological_sex:
        type: enum
        values: [male, female, unknown]
      age_at_collection:
        type: float
        description: "Age in years at time of sample collection"
      diagnosis:
        type: string
        indexed: true

  Sample:
    description: "A piece of biological material derived from a Subject."
    base: BaseEntity
    fields:
      external_id:
        type: string
        required: true
        indexed: true
      tissue_type:
        type: string
        required: true
        indexed: true
      tissue_region:
        type: string
        indexed: true
      collection_date:
        type: date
      passage:
        type: int

Relationship declaration

Example (omics deployment): See appendix_a_example_schema_omics.md §A.2 for the complete set of relationship declarations.

relationships:
  - name: donated
    from: Subject
    to: Sample
    cardinality: one-to-many
    description: "A subject donated one or more samples"

  - name: derived_from
    from: Sample
    to: Sample
    cardinality: many-to-many
    description: "A sample derived from another (e.g. dissection, aliquot)"
    properties:
      method: {type: string}

  - name: generated
    from: Sample
    to: Datafile
    cardinality: one-to-many

  - name: input_to
    from: Datafile
    to: WorkflowRun
    cardinality: many-to-many

  - name: output_of
    from: Datafile
    to: WorkflowRun
    cardinality: many-to-many

  - name: instance_of
    from: WorkflowRun
    to: Workflow
    cardinality: many-to-one
    required: true

  - name: contains
    from: Dataset
    to: Datafile
    cardinality: many-to-many

Schema requires: block

A schema may declare dependencies on reference loader packages. hippo validate fails fast with a clear install suggestion if any required loader is absent.

version: "1.0"

requires:
  - hippo-reference-fma>=3.3
  - hippo-reference-ensembl>=GRCh38.109
  - hippo-reference-go>=2024-01-01

entities:
  Sample:
    fields:
      tissue_region:
        type: ref
        target: AnatomyTerm    # provided by hippo-reference-fma
        search: fts

Schema inheritance (polymorphic)

base: declares an is-a (polymorphic) relationship — not just field copying. A subtype entity is an instance of its parent type. This has the following semantics:

  • Queries: client.query("Sample") returns entities of type Sample and all subtypes (BrainSample, CellLine, etc.) unless exact_type=True is passed
  • Validators: entity_types: [Sample] in a validator config covers Sample and all subtypes
  • Relationships: A relationship declared from: Sample accepts any Sample subtype
  • Storage: Each subtype has its own table for its additional fields; subtype queries join to the parent table. See sec3b §3b.8 for the relational mapping.
  • __type__ system field: Every entity carries a read-only __type__ field containing the concrete type name (e.g. "BrainSample"), enabling callers to distinguish subtypes

Single inheritance only. A type may have one base: declaration. Cycles are rejected at schema validation time.
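
The query expansion and cycle rejection described above can be sketched as follows. The helper names and the dict mapping each type to its base: are assumptions for illustration; SchemaValidationError is redefined locally so the example is self-contained.

```python
from typing import Optional


class SchemaValidationError(Exception):
    pass


def check_no_cycles(bases: dict[str, Optional[str]]) -> None:
    """Walk each type's base: chain and reject any chain that revisits a type."""
    for name in bases:
        seen = {name}
        parent = bases[name]
        while parent is not None:
            if parent in seen:
                raise SchemaValidationError(f"inheritance cycle involving {parent!r}")
            seen.add(parent)
            parent = bases.get(parent)


def types_for_query(bases, entity_type: str, exact_type: bool = False) -> set[str]:
    """Return the concrete type names a query on entity_type should match."""
    result = {entity_type}
    if exact_type:
        return result
    changed = True
    while changed:  # keep adding types whose base is already in the result
        changed = False
        for name, parent in bases.items():
            if parent in result and name not in result:
                result.add(name)
                changed = True
    return result


bases = {
    "Subject": "BaseEntity",
    "Sample": "BaseEntity",
    "BrainSample": "Sample",
    "CellLine": "Sample",
}
check_no_cycles(bases)
matched = types_for_query(bases, "Sample")
```

Here matched contains Sample, BrainSample and CellLine, while exact_type=True narrows it to Sample alone.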

Example (omics deployment): A brain tissue bank extends Sample with region-specific fields:

entities:
  BrainSample:
    base: Sample
    description: "A sample from a brain tissue collection"
    fields:
      brain_region: {type: string, indexed: true, search: fts}
      hemisphere:
        type: enum
        values: [left, right, bilateral, unknown]
      post_mortem_interval_hours: {type: float}

3.7 Relationship Model

Relationships in Hippo are edge-based and typed. Each relationship connects two entities with a named, directional edge. Relationship types and their allowed cardinalities are declared in schema config. The SDK enforces cardinality constraints at write time.

Supported cardinalities: one-to-many, many-to-one, many-to-many. Self-referential relationships (where from and to are the same entity type) are supported.
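
Write-time enforcement might be sketched like this. The EdgeStore class is hypothetical, and the interpretation is an assumption: one-to-many is read as "each target has at most one source", and many-to-one as "each source has at most one target".

```python
from collections import defaultdict


class CardinalityError(ValueError):
    pass


class EdgeStore:
    def __init__(self, cardinalities: dict[str, str]):
        self._cardinalities = cardinalities   # relationship name -> cardinality
        self._edges = defaultdict(set)        # relationship name -> {(from_id, to_id)}

    def add(self, name: str, from_id: str, to_id: str) -> None:
        cardinality = self._cardinalities[name]
        edges = self._edges[name]
        if cardinality == "one-to-many":
            # A target may not gain a second, different source.
            if any(t == to_id and f != from_id for f, t in edges):
                raise CardinalityError(f"{to_id} already has a {name} source")
        elif cardinality == "many-to-one":
            # A source may not gain a second, different target.
            if any(f == from_id and t != to_id for f, t in edges):
                raise CardinalityError(f"{from_id} already has a {name} target")
        edges.add((from_id, to_id))


edges = EdgeStore({"donated": "one-to-many", "derived_from": "many-to-many"})
edges.add("donated", "subject-1", "sample-1")
edges.add("donated", "subject-1", "sample-2")   # one subject, many samples: allowed
```

Adding ("donated", "subject-2", "sample-1") would then raise CardinalityError, because sample-1 already has a donor.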

Relationship properties: Relationships can carry typed key-value properties declared in schema config (e.g., a method property on a derived_from relationship). Properties are optional and defined per relationship type.

Immutability: Relationships are immutable once created. To remove a relationship, its status is set to removed and a RelationshipRemoved provenance event is written. The original relationship record is retained for audit purposes. Queries return only active relationships by default; callers can explicitly request removed relationships.

System relationships: The superseded_by relationship is built-in and available on all entity types regardless of schema config. See §3.3 for supersession semantics.

See sec3b_relational_storage.md §3b.4 for the relational table schema.


3.8 Schema Versioning and Migration

Every schema config carries a version string. Hippo tracks the deployed version in internal metadata. When versions differ, hippo migrate must run before the server starts.

hippo migrate steps:

  1. Load and validate the LinkML schema
  2. Diff schema against deployed version
  3. Generate and print migration plan for human review
  4. On confirmation (or --yes flag), apply changes and update internal metadata
  5. Write a MigrationApplied provenance event

Migration rules:

| Change type | Action |
|---|---|
| New entity type | Provision new entity type with configured indexes |
| New field | Add field (nullable, or with default if provided) |
| New relationship type | No structural change; relationship storage is generic |
| New index | Create index on the specified field |
| New enum value | Extend allowed value set |
| Field type change | Reject; requires manual migration |
| Remove field | Mark deprecated; field retained in storage; excluded from API responses (see below) |
| Remove entity type | Reject; all entities must reach non-active status first |

Hippo never removes stored data automatically.
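
The diff step can be illustrated with a small planner that classifies changes according to a subset of the rules above. The plan_migration function and its action labels are hypothetical, schemas are plain dicts, and index, enum and relationship changes are omitted for brevity.

```python
def plan_migration(deployed: dict, target: dict) -> list[tuple[str, str]]:
    """Return (action, subject) pairs describing how target differs from deployed."""
    plan = []
    old_entities = deployed.get("entities", {})
    new_entities = target.get("entities", {})

    for name in sorted(new_entities.keys() - old_entities.keys()):
        plan.append(("provision_entity_type", name))
    for name in sorted(old_entities.keys() - new_entities.keys()):
        # Simplification: always rejected here; the real rule depends on entity status.
        plan.append(("reject_remove_entity_type", name))

    for name in sorted(old_entities.keys() & new_entities.keys()):
        old_fields = old_entities[name].get("fields", {})
        new_fields = new_entities[name].get("fields", {})
        for field in sorted(new_fields.keys() - old_fields.keys()):
            plan.append(("add_field", f"{name}.{field}"))
        for field in sorted(old_fields.keys() - new_fields.keys()):
            plan.append(("deprecate_field", f"{name}.{field}"))   # data is retained
        for field in sorted(old_fields.keys() & new_fields.keys()):
            if old_fields[field].get("type") != new_fields[field].get("type"):
                plan.append(("reject_type_change", f"{name}.{field}"))
    return plan


deployed = {"entities": {"Sample": {"fields": {
    "tissue_type": {"type": "string"},
    "passage": {"type": "int"},
    "collection_date": {"type": "date"},
}}}}
target = {"entities": {
    "Sample": {"fields": {
        "tissue_type": {"type": "string"},
        "passage": {"type": "float"},        # type change: rejected
        "tissue_region": {"type": "string"}, # new field
    }},
    "CellLine": {"fields": {}},
}}
plan = plan_migration(deployed, target)
```

A plan containing any reject_ entry would be printed for review and not applied.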

Deprecated field communication: When a field is removed from schema config, it enters a deprecated state tracked in internal metadata. Deprecated fields are excluded from SDK response objects and API responses by default. The REST transport layer includes deprecated field names in an X-Hippo-Deprecated-Fields response header on any endpoint that would have returned them. The SDK exposes a schema.deprecated_fields(entity_type) method for programmatic discovery. Callers can explicitly request deprecated fields via an include_deprecated=True flag, which returns them with a _deprecated suffix to prevent silent breakage.
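
A rough sketch of how a response might be shaped under these rules is shown below. The render_entity and deprecated_header helpers are hypothetical, and the comma-separated header value is an assumed format.

```python
def render_entity(stored: dict, deprecated: set[str], include_deprecated: bool = False) -> dict:
    """Drop deprecated fields, or return them under a _deprecated suffix on request."""
    out = {}
    for key, value in stored.items():
        if key not in deprecated:
            out[key] = value
        elif include_deprecated:
            out[f"{key}_deprecated"] = value
    return out


def deprecated_header(stored: dict, deprecated: set[str]) -> dict:
    """Build the response header naming deprecated fields the row still holds."""
    names = sorted(k for k in stored if k in deprecated)
    return {"X-Hippo-Deprecated-Fields": ", ".join(names)} if names else {}


stored = {"id": "abc-123", "tissue_type": "cortex", "passage": 4}
deprecated = {"passage"}   # removed from schema config, still retained in storage

default_view = render_entity(stored, deprecated)
full_view = render_entity(stored, deprecated, include_deprecated=True)
header = deprecated_header(stored, deprecated)
```

The suffix means a caller that still reads passage fails loudly instead of silently consuming a field that is no longer part of the schema.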

See sec3b_relational_storage.md §3b.5 for the hippo_meta table schema and §3b.6 for migration DDL mechanics.


3.9 Validation Rules

Example (omics deployment): The following validation rules are for Datafile fields from the example omics schema in Appendix A.

fields:
  modality:
    type: enum
    values: [RNASeq, WGBS, ATAC, WGS, genotyping]
    required: true
  uri:
    type: uri
    required: true
    validators:
      - type: uri_scheme
        allowed: [s3, file, https]
  read_count:
    type: int
    validators:
      - type: range
        min: 0
  external_id:
    type: string
    required: true
    validators:
      - type: unique
        scope: entity_type

Built-in field validators for v0.1: required, unique, range, uri_scheme, regex, min_length, max_length. These are field-level structural constraints only.
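
Three of these validators can be sketched as a small dispatcher. This is illustrative only: run_validators is a hypothetical name, the "pattern" key for regex is an assumed parameter name, and required and unique are omitted because they need schema and storage context.

```python
import re
from urllib.parse import urlparse


def run_validators(value, validators: list[dict]) -> list[str]:
    """Return a list of error strings; an empty list means the value passed."""
    errors = []
    for spec in validators:
        kind = spec["type"]
        if kind == "range":
            if "min" in spec and value < spec["min"]:
                errors.append(f"{value} is below minimum {spec['min']}")
            if "max" in spec and value > spec["max"]:
                errors.append(f"{value} is above maximum {spec['max']}")
        elif kind == "uri_scheme":
            scheme = urlparse(value).scheme
            if scheme not in spec["allowed"]:
                errors.append(f"scheme {scheme!r} is not one of {spec['allowed']}")
        elif kind == "regex":
            # "pattern" is an assumed key; the whole value must match.
            if re.fullmatch(spec["pattern"], value) is None:
                errors.append(f"{value!r} does not match {spec['pattern']!r}")
        else:
            raise ValueError(f"unsupported validator type: {kind}")
    return errors


uri_rules = [{"type": "uri_scheme", "allowed": ["s3", "file", "https"]}]
count_rules = [{"type": "range", "min": 0}]

ok = run_validators("s3://bucket/key", uri_rules)
bad = run_validators(-1, count_rules)
```

Returning a list instead of raising on the first failure lets a write report every violated constraint at once.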

For cross-field and cross-entity business rule validation, see sec2 §2.13 (Validation Infrastructure) — config-driven validators.yaml rules (CEL-based) and Python plugin validators (hippo.write_validators entry points) are a separate, complementary system that runs in the write path after field validators.


3.10 Extending or Replacing the Schema

Schema config is the sole input that defines Hippo's entity model. There is no built-in default schema — every deployment authors a schema appropriate to its domain. Deployments can:

  • Extend — add new entity types and relationships to an existing schema
  • Modify — change fields, add validators, or adjust relationships on existing types
  • Replace — discard the current schema entirely and author a new one from scratch

Example (omics deployment): The following extends the omics schema from Appendix A with a new entity type for cell lines.

entities:
  CellLine:
    description: "An immortalized cell line derived from a subject"
    base: Sample
    fields:
      cell_line_name: {type: string, required: true, indexed: true}
      passage_number: {type: int}
      culture_conditions: {type: json}

relationships:
  - name: established_from
    from: CellLine
    to: Sample
    cardinality: many-to-one

Run hippo migrate to apply. The new entity type is immediately queryable and ingestable with no further changes.


3.11 Entity Type Namespacing

Entity type strings in Hippo are optionally namespace-qualified. The namespace system partitions entity types into named scopes, allowing multiple subsystems to define their own Sample or Subject types without collision.

Namespace syntax

  • Bare string (root namespace): "Sample", "Donor" — refers to entity types declared with no namespace prefix. All existing entity types without a namespace key are in the root namespace.
  • Qualified string (named namespace): "tissue.Sample", "omics.Datafile" — the prefix before the dot is the namespace name.
  • Explicit root prefix: "root.Donor" is equivalent to "Donor". root is the only implicit namespace; all others must be explicit.

These forms are valid wherever an entity type is accepted:

| Context | Example |
|---|---|
| SDK calls | client.put("tissue.Sample", {...}) / client.get("Donor", id) |
| REST query parameters | GET /entities?entity_type=tissue.Sample |
| Schema references.entity_type fields | references: {entity_type: tissue.Sample} |
| Provenance records | entity_type stored verbatim as "tissue.Sample" |

Root namespace canonicalization

root.Donor is normalized to "Donor" at schema load time. Only the unqualified form appears in SchemaConfig and in storage — there is no root. prefix in persisted data. tissue.Sample and omics.Sample are stored verbatim as distinct entity type strings.

Namespacing in schema config

Schema files may declare a top-level namespace: key. Entities in that file are scoped to the named namespace. Files without a namespace: key contribute to the root namespace. Multiple files sharing the same namespace have their entity lists merged at load time.

# schemas/tissue.yaml
namespace: tissue
entities:
  Sample:
    fields:
      donor_id: {type: string, references: {entity_type: Donor}}   # root Donor
      parent_id: {type: string, references: {entity_type: tissue.Sample}}  # self-ref

Cross-namespace references use FQNs in references.entity_type. Namespace dependencies are inferred from these references — no explicit depends_on: declaration is required. See §3.12 for the full schema loading and validation model.

Backwards compatibility

Schemas with no namespace: key are unaffected. All existing unqualified entity type strings continue to resolve to the root namespace. No data migration is required — existing rows in storage use unqualified names, which remain valid. client.put("Sample", {...}) continues to work for root-namespace Sample entities regardless of whether any namespaced tissue.Sample or omics.Sample entities exist in the same deployment.


3.12 Schema Namespaces

Note: This section documents the multi-file namespace system introduced for multi-team deployments. Single-file deployments with no namespace: key require no changes and are unaffected by this section.

Namespace declaration

Each schema file may declare an optional namespace: key at the top level. This key scopes all entity types in that file to a named namespace. Files without the key contribute entities to the root namespace.

Multiple files may share the same namespace: value — their entity lists are merged at load time. If two files in the same namespace declare an entity with the same name, the schema loader raises SchemaValidationError identifying the conflicting entity and both files.

# schemas/tissue.yaml
namespace: tissue
entities:
  Sample: { ... }
  Block: { ... }

# schemas/tissue_extended.yaml
namespace: tissue          # same namespace — merges with tissue.yaml above
entities:
  Slide: { ... }           # OK — distinct name

NamespaceRegistry

The SchemaLoader builds a NamespaceRegistry as it discovers schema files. The registry maps (namespace, entity_name) → EntityConfig. The full registry is populated from all files before any cross-namespace reference validation begins — this ensures forward references work regardless of file discovery order.

NamespaceRegistry provides:

  • FQN lookup: registry.get("tissue", "Sample") → EntityConfig
  • Root lookup: registry.get(None, "Donor") or registry.get("root", "Donor") → same result
  • Validation: raises SchemaValidationError for unknown FQNs or circular dependencies

FQN resolution rules

| Input string | Resolution |
|---|---|
| "Donor" | Root namespace entity Donor |
| "root.Donor" | Equivalent to "Donor"; normalized at registry ingestion |
| "tissue.Sample" | Entity Sample in namespace tissue |
| "ghost.Entity" | SchemaValidationError (namespace ghost not registered) |
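
These resolution rules can be sketched with a small registry keyed by (namespace, entity_name). The class below is an illustration, not the actual NamespaceRegistry: the register and resolve methods, the plain-dict configs, and the error wording are assumptions.

```python
class SchemaValidationError(Exception):
    pass


class NamespaceRegistry:
    def __init__(self):
        self._entities = {}   # (namespace or None, entity_name) -> config
        self._sources = {}    # same key -> file that declared the entity

    @staticmethod
    def _normalize(namespace):
        # "root" and None both denote the root namespace.
        return None if namespace in (None, "root") else namespace

    def register(self, namespace, name, config, source_file):
        key = (self._normalize(namespace), name)
        if key in self._entities:
            raise SchemaValidationError(
                f"entity {name!r} declared in both {self._sources[key]} and {source_file}")
        self._entities[key] = config
        self._sources[key] = source_file

    def get(self, namespace, name):
        key = (self._normalize(namespace), name)
        if key not in self._entities:
            label = name if key[0] is None else f"{key[0]}.{name}"
            raise SchemaValidationError(f"unresolved FQN reference {label!r}")
        return self._entities[key]

    def resolve(self, fqn: str):
        """Resolve 'Donor', 'root.Donor' or 'tissue.Sample' to a config."""
        namespace, _, name = fqn.rpartition(".")
        return self.get(namespace or None, name)


registry = NamespaceRegistry()
registry.register(None, "Donor", {"fields": {}}, "schemas/schema.yaml")
registry.register("tissue", "Sample", {"fields": {}}, "schemas/tissue.yaml")
```

With this setup, registry.resolve("Donor") and registry.resolve("root.Donor") return the same config object, and registry.resolve("ghost.Entity") raises SchemaValidationError.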

Cross-namespace reference validation

After the registry is fully populated, the schema loader validates all references.entity_type values in all fields across all namespaces. For each reference:

  1. Parse the FQN into (namespace, entity_name)
  2. Look up in the registry
  3. If not found, raise SchemaValidationError identifying the unresolved FQN and the file

Circular dependency detection

The schema loader derives a namespace dependency graph from cross-namespace references. If namespace A references an entity in namespace B, then A depends on B. The loader performs a topological sort over this graph. If a cycle is detected (e.g., A → B → A), the loader raises SchemaValidationError identifying the cycle path.

Dependencies are inferred — no depends_on: key is required or supported.
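
A minimal sketch of the inference and cycle check, using the standard library's graphlib module (Python 3.9+). The load_order function and its input shape (namespace name mapped to the FQNs its fields reference) are assumptions for illustration.

```python
from graphlib import CycleError, TopologicalSorter


class SchemaValidationError(Exception):
    pass


def namespace_of(fqn: str) -> str:
    """Return the namespace part of an FQN, using 'root' for bare names."""
    namespace, _, _ = fqn.rpartition(".")
    return "root" if namespace in ("", "root") else namespace


def load_order(references: dict[str, list[str]]) -> list[str]:
    """Infer namespace dependencies from references and sort them topologically."""
    graph = {}
    for namespace, fqns in references.items():
        # A namespace depends on every other namespace it references.
        graph[namespace] = {namespace_of(f) for f in fqns} - {namespace}
    try:
        return list(TopologicalSorter(graph).static_order())
    except CycleError as exc:
        cycle = " → ".join(exc.args[1])   # args[1] holds the nodes of the cycle
        raise SchemaValidationError(
            f"Circular namespace dependency detected: {cycle}") from exc


order = load_order({
    "tissue": ["Donor", "tissue.Sample"],   # depends on root only
    "omics": ["tissue.Sample"],             # depends on tissue
})
```

In the resulting order, each namespace appears after the namespaces it depends on; a pair of namespaces that reference each other raises SchemaValidationError.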

Error messages

Namespace validation errors identify the problematic reference or cycle path and the file where it originates:

SchemaValidationError: Unresolved FQN reference 'ghost.Entity' in field 'sample.ghost_id'
  declared in: schemas/tissue.yaml

SchemaValidationError: Circular namespace dependency detected: tissue → omics → tissue
  first reference: schemas/omics.yaml field 'datafile.sample_id' (references tissue.Sample)

Backwards compatibility

Existing single-file schema.yaml deployments require no changes. The namespace: key is optional. HippoClient method signatures are unchanged — callers pass FQNs as strings to the same entity_type parameter they already use. No data migration is required for any existing deployment.


3.13 Multi-tenancy

Single namespace in v0.1. Multi-tenancy is explicitly out of scope for v0.1.