Hippo — Metadata Tracking Service
Specification Index
Codename: Hippo
Component: Metadata Tracking Service (MTS)
Version: 0.1 — Implementation Ready
Status: Ready for implementation
Document Map
| File | Section | Status | Notes |
|---|---|---|---|
| sec1_overview.md | 1. Overview & Scope | ✅ Draft v0.1 | Domain-neutral; no deployment-specific types |
| sec2_architecture.md | 2. Architecture | ✅ Draft v0.1 | SDK-first layering, all plugin ABCs, validation infrastructure, search |
| sec3_data_model.md | 3. Data Model | ✅ Draft v0.1 | Polymorphic inheritance, search field declarations, requires: block |
| sec3b_relational_storage.md | 3b. Relational Storage Mapping | ✅ Draft v0.1 | SQLite/PostgreSQL reference impl; §3b.8 polymorphic storage |
| sec4_api_layer.md | 4. API Layer | ✅ Draft v0.1 | HippoClient interface, REST endpoints, pagination, polling |
| sec5_ingestion.md | 5. Ingestion & Integration | ✅ Draft v0.1 | Flat-file, reference data, upsert-by-ExternalID, error handling |
| sec6_provenance.md | 6. Provenance & Audit | ✅ Draft v0.1 | Event model, structured context, storage, history API, retention |
| sec7_nfr.md | 7. Non-Functional Requirements | ✅ Draft v0.1 | Performance targets, scalability tiers, reliability, schema sync roadmap |
| sec8_auth_integration.md | 8. Authentication & Authorization Integration | ✅ Draft v0.1 | Bridge-aware AuthMiddleware impl, actor propagation, audit trail split |
| appendix_a_example_schema_omics.md | Appendix A. Example Schema (Omics) | ✅ Draft v0.1 | Complete LinkML schema with search:, polymorphic extension example |
| appendix_b_implementation_guide.md | Appendix B. Implementation Guide | ✅ Draft v0.1 | Build order, module map, error hierarchy, invariants, test strategy, OpenSpec mapping |
| reference_hippo_yaml.md | Reference: hippo.yaml Config Schema | ✅ Draft v0.1 | All valid keys, types, defaults, env var substitution, minimal configs |
| reference_validators_yaml.md | Reference: validators.yaml Format | ✅ Draft v0.1 | Complete field spec, expand syntax, all built-in presets, execution semantics |
| reference_cel_context.md | Reference: CEL Evaluation Context | ✅ Draft v0.1 | All context variables, CEL types, expanded field shapes, available functions, common patterns |
Key Decisions Log
| Decision | Choice | Section |
|---|---|---|
| Deployment model | SDK-first; REST and GraphQL are independent transport adapters | sec2 |
| Async strategy | Sync SDK for v0.1; revisit at PostgreSQL adapter | sec2 |
| REST deployment | Standalone (hippo serve) wrapping embedded app object | sec2 |
| Plugin system | Entry points (hippo.storage_adapters, hippo.external_adapters, hippo.write_validators, hippo.reference_loaders) | sec2 |
| Storage backend v0.1 | SQLite via stdlib sqlite3 | sec2 |
| Data model approach | Config-driven relational + graph-shaped API; graph DB as future adapter | sec3 |
| Temporal metadata | Provenance log only — not stored on entities; computed at read time | sec3, sec6 |
| Entity lifecycle | is_available boolean; storage adapters optimize for this filter | sec3, sec3b |
| Lifecycle semantics | Reason for unavailability stored in provenance events, not on entities | sec3, sec6 |
| Supersession | System-level superseded_by relationship; atomic SDK operation | sec3 |
| Schema authoring | Direct LinkML (YAML/JSON) — no intermediate DSL or compilation step | sec3 |
| Graph DB | Future adapter option; not v0.1 scope | sec3 |
| Multi-tenancy | Out of scope for v0.1 | sec3 |
| Conceptual/storage separation | Sec3 defines conceptual data model only; relational storage mapping in sec3b | sec3, sec3b |
| Domain-neutral spec | System spec contains no domain-specific schema; omics types in Appendix A only | sec1, sec2, sec3 |
| Schema inheritance semantics | base: creates polymorphic (is-a) inheritance — subtypes are queryable as their parent type | sec3, sec3b |
| Validator entity_types matching | Subtype-aware — declaring a parent type covers all subtypes; redundant entries emit startup warning | sec2 |
| Config-driven validation | validators.yaml with CEL conditions, expand path pre-fetching, when pre-conditions, requires shorthand, existing context | sec2 |
| CEL expression language | Used for config-driven validator conditions; sandboxed, no I/O, cel-python binding | sec2 |
| Expand path syntax | field, field.child, field[], field[].child; batch-fetched; cycle detection; max_expand_list_size cap | sec2 |
| Lazy evaluation rejected | Explicit expand paths preferred — CEL short-circuit makes implicit I/O non-deterministic | sec2 |
| Built-in validator presets | ref_check, count_constraint, immutable_field, field_required_if, no_self_ref as ergonomic shortcuts | sec2 |
| External adapter boundary | STARLIMS, HALO, REDCap concrete implementations in Cappella; ExternalSourceAdapter ABC stays in Hippo | sec2 |
| Reference loader plugin system | hippo.reference_loaders entry point; ReferenceLoader ABC ships schema_fragment(); install auto-migrates | sec2, sec5 |
| Fuzzy search | EntityStore.search() ABC method; per-field search: mode; ScoredMatch core SDK type; adapter capability declaration at startup | sec2, sec3, sec4 |
| Provenance event model | Typed events (EntityCreated/Updated/AvailabilityChanged/Superseded/Relationship/ExternalId/Migration/ReferenceData) | sec6 |
| Provenance context | Unstructured JSON context field on all events; Cappella conventions documented | sec6 |
| Batch ingestion semantics | Per-record transactions by default; --fail-fast flag; --atomic deferred | sec5 |
| Upsert identity resolution | Priority: explicit UUID → ExternalID lookup → new entity | sec5 |
| Pagination | Offset-based (limit/offset); cursor-based deferred | sec4 |
| Multi-instance REST | Fully stateless; all distributed concerns delegated to storage adapter | sec7 |
| Schema sync v0.1 | Restart-on-migrate; three-phase post-v0.1 roadmap documented | sec7 |
| Multi-instance adapter contract | Must implement atomic server-side upsert (not read-then-write) for entity and ExternalID creation | sec2, sec7 |
| Entity namespaces | Optional namespace: key in schema files scopes entities to named namespaces; FQN = namespace.EntityType; root namespace is the only implicit namespace | sec3 |
| NamespaceRegistry | Built by SchemaLoader at load time; maps (namespace, entity_name) → EntityConfig; fully populated before cross-namespace reference validation | sec3 |
| Namespace dependency inference | Dependencies inferred from references.entity_type FQNs — no explicit depends_on: required; topological sort detects cycles | sec3 |
| Root namespace canonicalization | root.Donor normalized to "Donor" at registry ingestion; only unqualified form in SchemaConfig and storage | sec3 |
| Namespace backwards compatibility | Schemas without namespace: key load unchanged; no data migration required for existing deployments | sec3 |
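The namespace decisions above (root canonicalization, dependency inference from reference FQNs, and cycle detection by topological sort) can be sketched roughly as follows. This is a minimal illustration, not Hippo's SchemaLoader: the function names, the `references` input shape, and the literal namespace name `root` as a string constant are all assumptions made for the example.

```python
from graphlib import CycleError, TopologicalSorter

ROOT = "root"  # assumed literal name of the implicit root namespace


def canonicalize(fqn: str) -> str:
    """Normalize 'root.Donor' to 'Donor'; leave other names unchanged."""
    prefix = ROOT + "."
    return fqn[len(prefix):] if fqn.startswith(prefix) else fqn


def namespace_of(fqn: str) -> str:
    """Return the namespace part of an FQN, or ROOT for unqualified names."""
    name = canonicalize(fqn)
    return name.rsplit(".", 1)[0] if "." in name else ROOT


def load_order(references: dict[str, list[str]]) -> list[str]:
    """Order namespaces so each comes after the namespaces it references.

    `references` (hypothetical shape) maps an entity FQN to the FQNs that
    its fields reference. Raises ValueError if the inferred namespace
    dependencies form a cycle.
    """
    graph: dict[str, set[str]] = {}
    for source, targets in references.items():
        source_ns = namespace_of(source)
        deps = graph.setdefault(source_ns, set())
        for target in targets:
            target_ns = namespace_of(target)
            graph.setdefault(target_ns, set())
            if target_ns != source_ns:  # same-namespace references add no edge
                deps.add(target_ns)
    try:
        # TopologicalSorter takes node -> predecessors, so dependencies sort first.
        return list(TopologicalSorter(graph).static_order())
    except CycleError as exc:
        raise ValueError(f"namespace dependency cycle: {exc.args[1]}") from exc
```

With `{"tissue.Sample": ["root.Donor"], "assay.Run": ["tissue.Sample"], "Donor": []}` as input, `load_order` places `root` before `tissue` and `tissue` before `assay`, without any explicit `depends_on:` declaration.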
Open Questions
Items requiring a future design decision before implementation. Deferred items are explicitly out of scope for v0.1 and documented here for tracking.
| Question | Section | Priority | Status |
|---|---|---|---|
| Where does the example omics schema ultimately live? | — | Low | Open — config repo, schemas/omics/, or community hippo-reference-omics package |
| Entity type remapping / namespace migration path (OQ1) | sec3 §3.12 | High | Open — no migration path exists for moving Sample (root) to tissue.Sample; must be resolved before any production namespace adoption of existing entities. Options: hippo migrate --remap, namespace aliasing, or out-of-band migration script |
| Canonical form for root-namespace entities in storage (OQ2) | sec3 §3.11 | Low | Decided — store as "Donor" (unqualified); registry normalizes root.* to unqualified at load time; documented as a firm invariant |
| Ingestion idempotency for live webhook integrations | sec5 | High | Deferred — ExternalID upsert is the stable foundation; webhook retry deduplication needs a dedicated design session when live integrations are scoped |
| Explicit update endpoint (PUT, 404-on-missing) | sec4 | Medium | Planned v0.5 (Phase 1) — PUT /entities/{type}/{id}; partial update semantics, 404 if entity absent; specced in sec4 §4.3 |
| Bulk availability change endpoint | sec4 | Medium | Planned v0.5 (Phase 1) — POST /entities/{type}/bulk-availability; specced in sec4 §4.3 |
| OR filter composition in query API | sec4 | Low | Planned v0.5 (Phase 1) — multi-value params (same-field OR) + ?filter= CEL (cross-field); specced in sec4 §4.3 |
| Per-validator timeout | sec2 | Low | Deferred — plugin validators should complete in <100ms; configurable timeout post-v0.1 |
| hippo_poll efficiency at scale | sec3b/sec6 | Medium | Deferred — provenance timestamp index handles current workload; denormalised updated_at column if needed at scale |
| Schema version check on writes (503 on mismatch) | sec7 | Medium | Planned for v0.2 — roadmap documented in sec7 §7.3 |
| Dynamic schema reload via polling | sec7 | Medium | Planned for v0.3 — roadmap documented in sec7 §7.3 |
| Expand-contract convention enforcement in hippo migrate | sec2/sec3 | Medium | Planned for v0.3+ — roadmap documented in sec7 §7.3 |
| Cursor-based pagination | sec4 | Low | Planned v0.5 (Phase 1) — cursor-based mode specced in sec4 §4.4; offset mode remains the default |
| GraphQL transport | sec2 | Low | Reserved in hippo/graphql/; deferred post-v0.1 |
| Provenance system vs. entity events table split | sec6 | Low | Open — MigrationApplied and ReferenceDataInstalled stored with entity_id = null; separate system_events table is an alternative |
| Auth / RBAC | sec8 | ✅ Resolved (Phase 3) | BridgeAuthMiddleware spec in sec8; Bridge owns auth, Hippo trusts injected headers; actor flows into provenance |
| GA4GH DRS server (read-only) | sec4 | High | v0.4.0 target — GET /ga4gh/drs/v1/objects/{entity_id} resolves entity UUID → access_methods (s3/https/file). Thin read-only router over existing entity storage. No new data model; entity URI is the download target. Checksums from provenance if available. Enables external tools (Terra, Galaxy, other Canon instances) to resolve drs:// URIs. Canon DRS client for consuming external DRS URIs deferred to Canon v0.3. |
| HippoClient.schema_references(entity_type) | — | ✅ Implemented (Hippo v0.4) | Reads FieldDefinition.references from already-loaded schema. Returns [{field, target_entity_type}] for each field with references: {entity_type: <name>} declared in schema YAML. REST endpoint: GET /schemas/{entity_type}/references. Works today — caller schemas must declare references: on foreign-key fields. |
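The return shape described for HippoClient.schema_references can be illustrated with plain dictionaries. The sketch below is not Hippo's implementation: it assumes the loaded schema for one entity type is available as a mapping of field name to definition dict, and `list_references` and the sample field names are hypothetical.

```python
def list_references(fields: dict[str, dict]) -> list[dict[str, str]]:
    """Collect {field, target_entity_type} for every field declaring references.

    `fields` (assumed shape) maps a field name to its definition; only
    definitions carrying references: {entity_type: <name>} are reported.
    """
    found = []
    for name, definition in fields.items():
        ref = definition.get("references")
        if isinstance(ref, dict) and "entity_type" in ref:
            found.append({"field": name, "target_entity_type": ref["entity_type"]})
    return found


# Hypothetical field definitions for a single entity type.
sample_fields = {
    "donor": {"type": "string", "references": {"entity_type": "Donor"}},
    "label": {"type": "string"},
    "parent": {"references": {"entity_type": "tissue.Sample"}},
}
result = list_references(sample_fields)
```

Fields without a `references:` declaration, such as `label` here, are simply absent from the result, which is why caller schemas must declare `references:` on foreign-key fields for the lookup to be useful.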
How to Use This Spec
Each section document is self-contained and includes Depends on and Feeds into headers
that make inter-document dependencies explicit. Before starting a section, read the documents
it depends on.
This spec is designed to feed into the openplan pipeline: each completed section maps to one or more epics in the openplan roadmap.