Hippo Data Model

This document describes Hippo's data model for users of the SDK. For the full engineering specification, see Data Model design spec.


Core Concepts

Entities

An entity is a typed data object in Hippo. Every entity has:

  • A unique internal identifier (UUID)
  • An entity type (e.g., Sample, Subject, Datafile)
  • User-defined fields as declared in the schema
  • System fields managed automatically by Hippo

Entity types (called classes in LinkML) and their attributes are defined in the schema configuration (LinkML YAML or JSON), not hardcoded in the SDK.

System Fields

Every entity carries the following read-only system fields:

Field           Type             Description
id              UUID             Unique identifier, generated on creation
is_available    bool             Controls visibility in default queries
superseded_by   UUID (nullable)  ID of the replacement entity if this entity has been superseded; None otherwise
created_at      datetime         Timestamp of entity creation (UTC), derived from the provenance log
updated_at      datetime         Timestamp of most recent change (UTC), derived from the provenance log
schema_version  string           Schema config version at most recent change, derived from the provenance log

superseded_by is set atomically by client.supersede_entity() alongside the availability change. client.get() returns it on all entities; it is None when not superseded. The authoritative record of the supersession is the EntitySuperseded provenance event — the column is a fast-read cache.

The created_at, updated_at, and schema_version fields are derived at read time from the provenance log. The current implementation caches created_at and updated_at directly on the entity record for performance, but the provenance log is authoritative — client.get() reads from provenance when available and falls back to the cached values.
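As a self-contained sketch (not the SDK implementation), deriving the temporal fields from a provenance log reduces to taking the earliest and latest event timestamps; the record shape here is a simplification for illustration:

```python
from datetime import datetime, timezone

# Hypothetical provenance log: (timestamp, operation) pairs, oldest first.
log = [
    (datetime(2024, 1, 10, 9, 0, tzinfo=timezone.utc), "CREATE"),
    (datetime(2024, 1, 15, 10, 30, tzinfo=timezone.utc), "UPDATE"),
    (datetime(2024, 2, 1, 8, 0, tzinfo=timezone.utc), "UPDATE"),
]

def derive_timestamps(log):
    """Derive created_at/updated_at from the provenance log (the authoritative source)."""
    timestamps = [ts for ts, _ in log]
    return min(timestamps), max(timestamps)

created_at, updated_at = derive_timestamps(log)
```

The cached columns on the entity row exist only to avoid this scan on every read.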


Availability Semantics

Hippo uses soft deletes — there are no hard deletes. Every entity carries an is_available boolean field:

  • is_available = true: Entity appears in default query results
  • is_available = false: Entity is hidden from default queries but retained in storage

When an entity becomes unavailable, the reason is recorded in the provenance event (e.g., archived, deleted, superseded), not on the entity itself.
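The availability semantics can be sketched over plain dicts (an illustration of the behavior, not the SDK's storage layer):

```python
# Toy entity records; only is_available matters for visibility.
entities = [
    {"id": "a", "is_available": True},
    {"id": "b", "is_available": False},  # archived, deleted, or superseded
]

def query(entities, include_archived=False):
    """Return available entities only, unless include_archived is set."""
    if include_archived:
        return list(entities)
    return [e for e in entities if e["is_available"]]
```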

Default Query Behavior

All query operations return only available entities by default:

# Returns only available samples
result = client.query("Sample", filters=[...])

To include unavailable entities, use the include_archived parameter where supported:

# Include archived entities
entity = client.get_by_external_id("EXT-123", include_archived=True)

Entity Namespaces (FQNs)

Entity type strings in Hippo are optionally namespace-qualified. Namespaces allow multiple subsystems to define their own Sample or Subject types without collision.

Namespace Syntax

  • Root namespace (no prefix): "Sample", "Donor" — entity types declared without a namespace key
  • Named namespace: "tissue.Sample", "omics.Datafile" — the prefix before the dot is the namespace name
  • Explicit root prefix: "root.Donor" is equivalent to "Donor" — normalized at schema load time
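These rules can be sketched as a small parsing helper (an illustration; the SDK's actual normalization happens at schema load time):

```python
def parse_fqn(entity_type: str) -> tuple[str, str]:
    """Split an entity type string into (namespace, class name).

    "Sample"        -> ("root", "Sample")
    "tissue.Sample" -> ("tissue", "Sample")
    "root.Donor"    -> ("root", "Donor")   # explicit root prefix normalized
    """
    namespace, sep, name = entity_type.partition(".")
    if not sep:  # no dot: root namespace
        return "root", entity_type
    return namespace, name
```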

Using FQNs in SDK Calls

FQNs are valid wherever an entity_type string is accepted:

# Root namespace (no prefix needed)
client.put("Sample", data={...})
client.get("Donor", entity_id="abc-123")
client.query("Subject", filters=[...])

# Named namespace
client.put("tissue.Sample", data={...})
client.get("tissue.Sample", entity_id="abc-123")
client.query("tissue.Sample", filters=[...])

Declaring Namespaces in Schema Config

Set the default_prefix in a LinkML schema file to scope all its classes into a namespace:

# schemas/tissue.yaml
id: https://example.org/tissue
name: tissue
prefixes:
  linkml: https://w3id.org/linkml/
  tissue: https://example.org/tissue/
imports:
  - linkml:types
default_prefix: tissue

classes:
  Sample:
    attributes:
      donor:
        range: Donor                  # root-namespace Donor
      parent:
        range: tissue.Sample          # self-ref within namespace

Schemas without a default_prefix contribute to the root namespace. Multiple files may share the same namespace; their class lists are merged at load time. Cross-namespace references use FQNs in the range value.

Existing schemas with no default_prefix are unaffected. All unqualified entity type strings continue to resolve to the root namespace; no data migration is required.


External IDs

External IDs connect Hippo entities to identifiers from upstream systems (LIMS, lab databases, etc.).

Registering External IDs

# Register an external ID for an entity
record = client.register_external_id(
    entity_id="abc-123",
    external_id="SAMPLE-001"
)

Lookup by External ID

# Find an entity by its external ID
entity = client.get_by_external_id("SAMPLE-001")

# Include archived entities in search
entity = client.get_by_external_id("SAMPLE-001", include_archived=True)

Listing External IDs

# List all external IDs for an entity
external_ids = client.list_external_ids(entity_id="abc-123")

# Include superseded (replaced) IDs
external_ids = client.list_external_ids(entity_id="abc-123", include_superseded=True)

External ID Immutability

External IDs are immutable once written. To "correct" an external ID, supersede the old one with a new one:

# Replace an external ID with a corrected value
new_record = client.supersede(
    entity_id="abc-123",
    old_external_id="SAMPLE-001-INCORRECT",
    new_external_id="SAMPLE-001-CORRECTED"
)

This creates a new active external ID record and marks the old one as superseded. Both records are retained for audit purposes.
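As an illustration of the resulting records — assuming a simplified record shape with a superseded_by field, which is a hypothetical stand-in for the actual return type — the active ID is the one no newer record replaces:

```python
# Hypothetical external ID records after a supersession; field names are
# assumptions for illustration, not the SDK's exact schema.
records = [
    {"external_id": "SAMPLE-001-INCORRECT", "superseded_by": "SAMPLE-001-CORRECTED"},
    {"external_id": "SAMPLE-001-CORRECTED", "superseded_by": None},
]

def active_external_ids(records):
    """Active IDs are those not superseded by a newer record."""
    return [r["external_id"] for r in records if r["superseded_by"] is None]
```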


Relationships

Relationships are typed, directional edges between entities. They can be declared in the schema using class-level attributes with a class range:

classes:
  Sample:
    attributes:
      derived_from:
        range: Sample
        multivalued: true
        description: "Samples this was derived from"

Supported cardinalities:

  • one-to-many: One entity relates to many (e.g., Subject → Samples)
  • many-to-one: Many entities relate to one (e.g., Sample → Subject)
  • many-to-many: Bidirectional many-to-many (e.g., Sample ↔ Sample via derived_from)

Graph Traversal

Use the expand parameter to fetch related entities in a single query:

# Fetch a sample with its subject
sample = client.get(
    entity_type="Sample",
    entity_id="abc-123",
    expand="subject"
)

The expand parameter supports nested paths:

# Fetch sample → subject → diagnosis
sample = client.get(
    entity_type="Sample",
    entity_id="abc-123",
    expand="subject.diagnosis"
)
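Conceptually, expand replaces reference fields with the referenced entities along the dotted path. A toy resolver over an in-memory registry shows the idea (entity shapes and field names here are assumptions for illustration):

```python
# Toy registry of entities keyed by id; reference fields hold the target id.
registry = {
    "abc-123": {"id": "abc-123", "subject": "subj-1"},
    "subj-1": {"id": "subj-1", "diagnosis": "diag-1"},
    "diag-1": {"id": "diag-1", "code": "D33.0"},
}

def get_expanded(registry, entity_id, expand):
    """Fetch an entity, inlining each field along the dotted expand path."""
    entity = dict(registry[entity_id])
    node = entity
    for field in expand.split("."):
        child = dict(registry[node[field]])  # replace the id with the entity
        node[field] = child
        node = child
    return entity

sample = get_expanded(registry, "abc-123", "subject.diagnosis")
```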

Entity Supersession

Entity supersession replaces one entity with another. This is used when an entity needs to be corrected or updated in a way that preserves the full audit trail of the old record.

supersede_entity()

client.supersede_entity() is an atomic operation that:

  1. Marks the old entity as unavailable (is_available = false)
  2. Sets superseded_by on the old entity to the new entity's UUID
  3. Writes an EntitySuperseded provenance event on the old entity
  4. Creates a superseded_by relationship edge from old to new
  5. Writes an EntityUpdated provenance event on the new entity

All five writes succeed together or roll back entirely on failure.

client.supersede_entity(
    entity_id="abc-123",
    replacement_id="def-456",
    actor="pipeline-run-789",
    reason="Corrected tissue region annotation"
)

Both entities are retained — there are no hard deletes. The old entity remains queryable via client.get() (which returns superseded entities) and client.history().

Raises

  • EntityNotFoundError — if either entity_id or replacement_id does not exist
  • EntityAlreadySupersededError — if entity_id is already superseded
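The all-or-nothing behavior can be sketched with an in-memory store and a snapshot-and-restore transaction (an illustration of the semantics, not the real storage layer):

```python
import copy

# Toy store: two entities, neither yet superseded.
store = {
    "abc-123": {"is_available": True, "superseded_by": None},
    "def-456": {"is_available": True, "superseded_by": None},
}
provenance = []

def supersede_entity(store, provenance, entity_id, replacement_id):
    """Sketch of the five-step atomic supersession with rollback."""
    if entity_id not in store or replacement_id not in store:
        raise KeyError("entity not found")
    if store[entity_id]["superseded_by"] is not None:
        raise ValueError("entity already superseded")
    saved_store, saved_log = copy.deepcopy(store), list(provenance)
    try:
        store[entity_id]["is_available"] = False              # 1. hide the old entity
        store[entity_id]["superseded_by"] = replacement_id    # 2. set the cached pointer
        provenance.append(("EntitySuperseded", entity_id))    # 3. event on the old entity
        # 4. a superseded_by relationship edge would be written here
        provenance.append(("EntityUpdated", replacement_id))  # 5. event on the new entity
    except Exception:
        store.clear()
        store.update(saved_store)  # roll back every write on failure
        provenance[:] = saved_log
        raise

supersede_entity(store, provenance, "abc-123", "def-456")
```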

Provenance and History

Every change to an entity is recorded in the provenance log.

Viewing History

# Get full change history for an entity
history = client.history(entity_id="abc-123")

# Returns list of records in chronological order (oldest first)
# Each record contains:
# - operation_id: Unique identifier
# - entity_id: The entity ID
# - entity_type: The entity type
# - operation_type: CREATE, UPDATE, SOFT_DELETE, EntitySuperseded, etc.
# - timestamp: When the operation occurred
# - user_id: Who performed the operation
# - previous_state_hash: Hash of previous state
# - state_snapshot: Entity state at that point

client.history() accepts superseded (unavailable) entity IDs.

Querying Historical State

# Get entity state at a specific point in time
state = client.state_at(
    entity_id="abc-123",
    timestamp="2024-01-15T10:30:00+00:00"
)

This returns the entity's data as it existed at the specified timestamp.
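Assuming each provenance record carries a full state_snapshot (as the client.history() output above suggests), point-in-time reads reduce to picking the latest snapshot at or before the requested timestamp:

```python
from datetime import datetime, timezone

def state_at(records, timestamp):
    """Return the state_snapshot of the latest record at or before timestamp."""
    state = None
    for record in records:  # chronological order, oldest first
        if record["timestamp"] <= timestamp:
            state = record["state_snapshot"]
    return state

records = [
    {"timestamp": datetime(2024, 1, 1, tzinfo=timezone.utc),
     "state_snapshot": {"tissue_region": "frontal cortex"}},
    {"timestamp": datetime(2024, 2, 1, tzinfo=timezone.utc),
     "state_snapshot": {"tissue_region": "temporal cortex"}},
]

state = state_at(records, datetime(2024, 1, 15, tzinfo=timezone.utc))
```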


Query API

Basic Queries

# Query entities with filters
result = client.query(
    entity_type="Sample",
    filters=[
        {"field": "tissue_type", "operator": "eq", "value": "brain"},
        {"field": "passage", "operator": "gte", "value": 5}
    ]
)

# result is a PaginatedResult
for item in result.items:
    print(item["id"], item["data"])

print(f"Showing {len(result.items)} of {result.total} total")

Filter Operators

Operator     Description
eq           Equal to
ne           Not equal to
gt           Greater than
gte          Greater than or equal
lt           Less than
lte          Less than or equal
in           In list
not_in       Not in list
contains     String contains
starts_with  String starts with
ends_with    String ends with
is_null      Field is null
is_not_null  Field is not null
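A minimal evaluator for these operators, sketched over plain dicts (not the SDK's actual query engine):

```python
def matches(entity, field, operator, value=None):
    """Evaluate a single filter condition against an entity dict."""
    actual = entity.get(field)
    ops = {
        "eq": lambda: actual == value,
        "ne": lambda: actual != value,
        "gt": lambda: actual > value,
        "gte": lambda: actual >= value,
        "lt": lambda: actual < value,
        "lte": lambda: actual <= value,
        "in": lambda: actual in value,
        "not_in": lambda: actual not in value,
        "contains": lambda: value in actual,
        "starts_with": lambda: actual.startswith(value),
        "ends_with": lambda: actual.endswith(value),
        "is_null": lambda: actual is None,
        "is_not_null": lambda: actual is not None,
    }
    return ops[operator]()  # only the selected lambda is evaluated

sample = {"tissue_type": "brain", "passage": 7}
```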

Pagination

# Query with pagination
result = client.query(
    entity_type="Sample",
    limit=50,
    offset=100  # Skip first 100 results
)

# result.total is the count before limit/offset
print(f"Page: {len(result.items)} items, {result.total} total")
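The limit/offset semantics can be sketched over a plain list; note that total is counted before slicing, mirroring PaginatedResult:

```python
def paginate(items, limit, offset):
    """Slice a result set; limit=0 means no limit, per PaginatedResult."""
    end = None if limit == 0 else offset + limit
    return {"items": items[offset:end], "total": len(items)}

result = paginate(list(range(130)), limit=50, offset=100)
```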

Full-Text Search

# Search using FTS5
results = client.search(
    entity_type="Sample",
    query="brain AND cortex",
    limit=20
)

Write Operations

Create

# Create a new entity
sample = client.create(
    entity_type="Sample",
    data={
        "external_id": "SAMPLE-001",
        "tissue_type": "brain",
        "tissue_region": "frontal cortex"
    }
)

Update

# Update an existing entity
sample = client.update(
    entity_type="Sample",
    entity_id="abc-123",
    data={
        "external_id": "SAMPLE-001",
        "tissue_type": "brain",
        "tissue_region": "temporal cortex"  # Corrected value
    }
)

Upsert

# Create or update by ID
sample = client.put(
    entity_type="Sample",
    entity_id="abc-123",  # If provided and exists → update; if not provided → create
    data={...}
)

Delete

# Delete an entity (soft delete - sets is_available=false)
result = client.delete(
    entity_type="Sample",
    entity_id="abc-123"
)

Gap: The design spec describes soft delete via availability transitions, but the current implementation calls storage.delete() directly. The SQLite adapter may implement soft delete internally.


SDK Types Reference

This section documents the user-facing types exported from hippo.core.types.

FilterCondition

A single filter condition for queries.

from hippo.core.types import FilterCondition, FilterOperator

condition = FilterCondition(
    field="tissue_type",
    operator=FilterOperator.EQ,
    value="brain"
)

FilterGroup

A group of conditions combined with a logical operator. Supports nested groups via the groups field.

from hippo.core.types import FilterGroup, FilterCondition, FilterOperator, LogicalOperator

group = FilterGroup(
    conditions=[
        FilterCondition(field="tissue_type", operator=FilterOperator.EQ, value="brain"),
        FilterCondition(field="passage", operator=FilterOperator.GTE, value=5)
    ],
    logical_operator=LogicalOperator.AND
)

Filter

Top-level filter container supporting nested groups.

from hippo.core.types import Filter, FilterGroup

filter_obj = Filter(root=FilterGroup(conditions=[...]))

FilterOperator

Enum of supported comparison operators.

from hippo.core.types import FilterOperator

# EQ, NE, GT, GTE, LT, LTE, IN, NOT_IN, CONTAINS, STARTS_WITH, ENDS_WITH, IS_NULL, IS_NOT_NULL

LogicalOperator

Enum for combining filter conditions.

from hippo.core.types import LogicalOperator

# AND, OR

PaginatedResult

Paginated query result returned by client.query().

Field   Type       Description
items   list[Any]  The entities on this page
total   int        Total matching entities across all pages (ignoring limit/offset)
limit   int        Maximum items per page; 0 means no limit
offset  int        Number of items skipped

from hippo.core.types import PaginatedResult

result = client.query("Sample", limit=50, offset=0)

# result.items — list of entity dicts on this page
# result.total — count before limit/offset was applied
# result.limit — the limit that was passed (50)
# result.offset — the offset that was passed (0)

ScoredMatch

Search result with relevance scoring. Returned by search operations.

Field           Type            Description
score           float           Relevance score (higher is more relevant)
match_data      dict[str, Any]  The matched entity data
matched_fields  list[str]       Fields that matched the query

from hippo.core.types import ScoredMatch

match = ScoredMatch(
    score=0.95,
    match_data={"id": "abc-123", "data": {...}},
    matched_fields=["preferred_label", "description"]
)

WriteOperation

Represents a write operation result.

Field        Type            Description
success      bool            Whether the operation succeeded
operation    str             Type of operation: "insert", "update", or "delete"
entity_type  str             The entity type affected
entity_id    str | None      ID of the affected entity
metadata     dict[str, Any]  Additional operation metadata

from hippo.core.types import WriteOperation

operation = WriteOperation(
    success=True,
    operation="insert",
    entity_type="Sample",
    entity_id="abc-123",
    metadata={}
)

ProvenanceRecord

A single record in the provenance log.

Field         Type            Description
source        str             Origin system or entity
timestamp     datetime        When the operation occurred
operation     str             Type of operation: "create", "update", "read", "delete"
entity_type   str | None      Type of entity affected
entity_id     str | None      ID of the entity
user_context  str | None      User or system context that initiated the operation
payload       dict[str, Any]  Complete entity state as JSON

from hippo.core.types import ProvenanceRecord
from datetime import datetime, timezone

record = ProvenanceRecord(
    source="hippo-sdk",
    timestamp=datetime.now(timezone.utc),  # timestamps are UTC
    operation="create",
    entity_type="Sample",
    entity_id="abc-123",
    user_context="pipeline-run-456",
    payload={"external_id": "SAMPLE-001", ...}
)

IngestStatus

Status enum for bulk ingestion operations.

from hippo.core.types import IngestStatus

# IngestStatus.SUCCESS  — all items processed successfully
# IngestStatus.PARTIAL  — some items failed
# IngestStatus.FAILED   — all items failed
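One plausible way the overall status could be derived from per-item counts (the actual classification logic in the SDK is an assumption here):

```python
# Hypothetical classification from success/failure counts; string values
# stand in for the IngestStatus enum members.
def classify(successful, failed):
    """Map per-item outcomes to an overall ingest status."""
    if failed == 0:
        return "SUCCESS"
    if successful == 0:
        return "FAILED"
    return "PARTIAL"
```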

IngestResult

Result of a bulk ingestion operation.

Field            Type                  Description
status           IngestStatus          Overall ingestion status
total_processed  int                   Total items processed
successful       int                   Number of successfully processed items
failed           int                   Number of failed items
errors           list[dict[str, Any]]  Error details for failed items
metadata         dict[str, Any]        Additional result metadata

from hippo.core.types import IngestResult, IngestStatus

result = IngestResult(
    status=IngestStatus.PARTIAL,
    total_processed=100,
    successful=95,
    failed=5,
    errors=[
        {"index": 5, "message": "Validation failed for field X"}
    ],
    metadata={}
)

Gaps Between Design and Implementation

This section documents known gaps between the design specification and current implementation.

1. Provenance-Computed Temporal Fields

  • Design: created_at, updated_at, schema_version are computed exclusively from the provenance log at read time; never stored on the entity record
  • Implementation: created_at and updated_at are cached directly on the entity row in storage. client.get() reads provenance timestamps when available and falls back to the cached values. schema_version is not yet derived from provenance.

2. Soft Delete Implementation

  • Design: Delete operations set is_available = false via availability transitions
  • Implementation: The delete() method calls storage.delete() directly; soft delete behavior depends on the storage adapter implementation

3. Relationship Properties

  • Design: Relationships can carry typed properties declared in schema
  • Implementation: The RelationshipManager exists but relationship properties are not fully implemented

4. Schema-Declared Search Modes

  • Design: Schema declares search: fts, search: embedding, or search: synonym
  • Implementation: FTS is implemented; embedding and synonym search are adapter-dependent

See Also