6. Provenance & Audit

Document status: Draft v0.1
Depends on: sec3_data_model.md
Feeds into: sec4_api_layer.md, sec5_ingestion.md

6.1 Design Philosophy

Provenance is a first-class feature of Hippo, not an afterthought. Every mutation to the system — entity writes, availability changes, relationship operations, schema migrations, reference data installs — produces a structured, immutable provenance event. The provenance log is:

The authoritative source of temporal metadata — created_at, updated_at, and schema_version are derived from it at read time; they are not stored on entity tables
The audit trail — every change carries an actor, timestamp, and reason
The basis for history queries — callers can retrieve the full change history of any entity
Permanent — provenance records are never deleted or modified; there is no purge mechanism in v0.1

6.2 Provenance Event Model

Every provenance event shares a common structure:

Field	Type	Description
`id`	UUID	Unique event identifier
`event_type`	string	One of the event types in §6.3
`entity_id`	UUID	The entity this event pertains to
`entity_type`	string	The entity type name
`actor`	string	Identity of the caller who triggered the change
`timestamp`	datetime (UTC)	When the event was recorded
`schema_version`	string	The schema config version at the time of the event
`context`	JSON	Structured context from the caller (see §6.5)
`payload`	JSON	Event-type-specific data (see §6.3)

Immutability: Provenance records are written once and never modified. The storage adapter must enforce this at the database level (e.g., no UPDATE or DELETE on the provenance table).

Actor: A free-form string identity supplied by the caller. The SDK never validates or interprets the actor value — it is passed through and stored verbatim. In v0.1, auth is out of scope; the REST transport sets actor = "anonymous" by default. In future auth-enabled deployments, the transport layer will resolve the authenticated user and pass their identity as actor.
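
As an illustration of the common structure above, the record can be modelled in application code as a frozen dataclass. This is a minimal sketch, not Hippo's actual class: the name ProvenanceEvent and the default factories are assumptions made for the example. Freezing only prevents attribute reassignment in Python; it does not replace the database-level enforcement the storage adapter must provide.

```python
import uuid
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any, Optional


@dataclass(frozen=True)  # frozen: attributes cannot be reassigned after creation
class ProvenanceEvent:
    """Hypothetical in-memory mirror of the event fields listed in the table."""
    event_type: str
    actor: str                            # stored verbatim, never interpreted
    schema_version: str
    payload: dict[str, Any]
    entity_id: Optional[str] = None       # None for instance-level events
    entity_type: Optional[str] = None
    context: Optional[dict[str, Any]] = None
    id: str = field(default_factory=lambda: str(uuid.uuid4()))
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


event = ProvenanceEvent(
    event_type="ExternalIdAdded",
    actor="anonymous",                    # the v0.1 REST default
    schema_version="1.0",
    entity_id="abc-123",
    entity_type="Sample",
    payload={"system": "starlims", "external_id": "SL-12345"},
)
```

Note that the payload and context dictionaries remain mutable in this sketch; a stricter version would copy or freeze them on construction.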

6.3 Event Types

EntityCreated

Fired when a new entity is written for the first time.

{
  "event_type": "EntityCreated",
  "payload": {
    "initial_state": { ...entity fields... }
  }
}

EntityUpdated

Fired when an existing entity's fields are changed.

{
  "event_type": "EntityUpdated",
  "payload": {
    "previous_state": { ...fields before change... },
    "new_state":      { ...fields after change... },
    "changed_fields": ["tissue_type", "collection_date"]
  }
}

Only changed fields are listed in changed_fields. previous_state and new_state are full entity snapshots (excluding system fields). Callers can reconstruct the state of an entity at any point in time by replaying events in order.
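
One way to derive changed_fields from the two snapshots is a key-by-key comparison. The sketch below is illustrative only; the helper names changed_fields() and build_entity_updated_payload() are hypothetical and not part of the Hippo SDK.

```python
from typing import Any


def changed_fields(previous_state: dict[str, Any], new_state: dict[str, Any]) -> list[str]:
    """Return the sorted names of fields whose values differ between snapshots."""
    missing = object()  # sentinel: distinguishes an absent key from an explicit None
    keys = previous_state.keys() | new_state.keys()
    return sorted(
        key for key in keys
        if previous_state.get(key, missing) != new_state.get(key, missing)
    )


def build_entity_updated_payload(previous_state: dict[str, Any],
                                 new_state: dict[str, Any]) -> dict[str, Any]:
    """Assemble a payload with the three keys shown in the EntityUpdated example."""
    return {
        "previous_state": previous_state,
        "new_state": new_state,
        "changed_fields": changed_fields(previous_state, new_state),
    }


before = {"tissue_type": "cortex", "collection_date": "2024-01-05", "donor": "S-1"}
after = {"tissue_type": "hippocampus", "collection_date": "2024-01-06", "donor": "S-1"}
payload = build_entity_updated_payload(before, after)
```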

AvailabilityChanged

Fired when an entity's is_available flag changes.

{
  "event_type": "AvailabilityChanged",
  "payload": {
    "previous": true,
    "current": false,
    "reason": "Sample quality insufficient for sequencing"
  }
}

The reason field is the primary mechanism for recording why an entity became unavailable. It is required when current = false and optional when current = true (re-activation).
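
The reason rule can be checked before the event is written. The following is a minimal sketch of such a check; availability_payload() is a hypothetical helper, and where Hippo actually performs this validation is not specified here.

```python
from typing import Any, Optional


def availability_payload(previous: bool, current: bool,
                         reason: Optional[str] = None) -> dict[str, Any]:
    """Build an AvailabilityChanged payload, enforcing the reason rule."""
    # reason is mandatory when the entity becomes unavailable
    if not current and not (reason and reason.strip()):
        raise ValueError("reason is required when current is False")
    payload: dict[str, Any] = {"previous": previous, "current": current}
    if reason is not None:  # optional on re-activation
        payload["reason"] = reason
    return payload


deactivated = availability_payload(
    True, False, reason="Sample quality insufficient for sequencing"
)
reactivated = availability_payload(False, True)
```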

EntitySuperseded

Fired on the old entity when it is superseded via client.supersede_entity(). The event type recorded in the provenance record (stored in the operation_type column of the v0.1 SQLite table; see §6.6) is "EntitySuperseded".

A companion EntityUpdated event is fired on the replacement entity in the same transaction, making the audit trail bidirectional: both the old and new entities carry a provenance record documenting the supersession event. A superseded_by relationship edge is also created from the old entity to the replacement entity in the same transaction.

All five writes (availability change, superseded_by column update, EntitySuperseded event, relationship edge, EntityUpdated event on replacement) are atomic — they either all succeed or all roll back.
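
The all-or-nothing behaviour can be sketched with Python's sqlite3 module, whose connection context manager commits on success and rolls back if any statement raises. The schema below is a hypothetical, heavily reduced stand-in for Hippo's real tables, and supersede() and record_event() are illustrative names, not the SDK's implementation.

```python
import json
import sqlite3
import uuid
from datetime import datetime, timezone

# Hypothetical minimal schema, for illustration only (not Hippo's real tables).
SCHEMA = """
CREATE TABLE entities (
    id            TEXT PRIMARY KEY,
    is_available  INTEGER NOT NULL DEFAULT 1,
    superseded_by TEXT
);
CREATE TABLE relationships (
    id           TEXT PRIMARY KEY,
    relationship TEXT NOT NULL,
    from_id      TEXT NOT NULL REFERENCES entities(id),
    to_id        TEXT NOT NULL REFERENCES entities(id)
);
CREATE TABLE provenance (
    id         TEXT PRIMARY KEY,
    event_type TEXT NOT NULL,
    entity_id  TEXT NOT NULL,
    actor      TEXT NOT NULL,
    timestamp  TEXT NOT NULL,
    payload    TEXT NOT NULL
);
"""


def record_event(conn, event_type, entity_id, actor, payload):
    """Append one provenance row (does not commit)."""
    conn.execute(
        "INSERT INTO provenance VALUES (?, ?, ?, ?, ?, ?)",
        (str(uuid.uuid4()), event_type, entity_id, actor,
         datetime.now(timezone.utc).isoformat(), json.dumps(payload)),
    )


def supersede(conn, old_id, new_id, actor, reason):
    """Perform the five supersession writes in one transaction."""
    with conn:  # commits on success; rolls back everything if a statement raises
        conn.execute("UPDATE entities SET is_available = 0 WHERE id = ?", (old_id,))
        conn.execute("UPDATE entities SET superseded_by = ? WHERE id = ?",
                     (new_id, old_id))
        record_event(conn, "EntitySuperseded", old_id, actor,
                     {"superseded_by_id": new_id, "reason": reason})
        conn.execute(
            "INSERT INTO relationships VALUES (?, 'superseded_by', ?, ?)",
            (str(uuid.uuid4()), old_id, new_id),
        )
        record_event(conn, "EntityUpdated", new_id, actor, {"supersedes": old_id})


conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # needed for the REFERENCES checks
conn.executescript(SCHEMA)
conn.executemany("INSERT INTO entities (id) VALUES (?)", [("old",), ("new",)])
conn.commit()

supersede(conn, "old", "new", actor="cappella",
          reason="Corrected tissue region annotation")

try:
    # "missing" is not an entity, so the relationship insert violates the
    # foreign key and the earlier writes in this call are rolled back.
    supersede(conn, "new", "missing", actor="cappella", reason="demo failure")
except sqlite3.IntegrityError:
    pass
```

After the failed second call, the "new" entity is still available and no extra provenance rows exist, which is the behaviour the atomicity guarantee describes.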

{
  "event_type": "EntitySuperseded",
  "payload": {
    "superseded_by_id": "uuid-of-new-entity",
    "reason": "Corrected tissue region annotation"
  }
}

The companion event on the replacement entity:

{
  "event_type": "EntityUpdated",
  "payload": {
    "note": "Now the active replacement for superseded entity <old-entity-id>",
    "supersedes": "<old-entity-id>"
  }
}

RelationshipCreated

Fired when a relationship edge is created between two entities.

{
  "event_type": "RelationshipCreated",
  "payload": {
    "relationship": "donated",
    "from_id": "subject-uuid",
    "from_type": "Subject",
    "to_id": "sample-uuid",
    "to_type": "Sample",
    "properties": {}
  }
}

RelationshipRemoved

Fired when a relationship edge is soft-deleted (status → removed).

{
  "event_type": "RelationshipRemoved",
  "payload": {
    "relationship_id": "edge-uuid",
    "relationship": "donated",
    "reason": "Incorrectly linked"
  }
}

ExternalIdAdded

Fired when an external ID is registered for an entity.

{
  "event_type": "ExternalIdAdded",
  "payload": {
    "system": "starlims",
    "external_id": "SL-12345"
  }
}

ExternalIdSuperseded

Fired when an existing external ID mapping is corrected (old mapping invalidated, new one added).

{
  "event_type": "ExternalIdSuperseded",
  "payload": {
    "old_external_id_record_id": "uuid",
    "new_external_id_record_id": "uuid",
    "system": "starlims",
    "old_value": "SL-12345",
    "new_value": "SL-12346",
    "reason": "Transcription error in source LIMS"
  }
}

MigrationApplied

Fired once per hippo migrate run. Recorded at the instance level (not entity-specific). entity_id and entity_type are null for this event type.

{
  "event_type": "MigrationApplied",
  "payload": {
    "from_version": "1.0",
    "to_version": "1.1",
    "changes_applied": ["Added field Sample.passage", "New entity type CellLine"]
  }
}

ReferenceDataInstalled

Fired when a reference loader completes installation or update. entity_id and entity_type are null for this event type.

{
  "event_type": "ReferenceDataInstalled",
  "payload": {
    "loader": "hippo-reference-fma",
    "version": "3.3",
    "entities_created": 15234,
    "entities_updated": 0
  }
}

6.4 Computed Temporal Fields

The system fields created_at, updated_at, and schema_version are derived from the provenance log at read time. The derivation is:

Field	Derivation
`created_at`	Timestamp of the `EntityCreated` event for the entity
`updated_at`	Timestamp of the most recent provenance event for the entity (any type)
`schema_version`	`schema_version` from the most recent provenance event for the entity

Performance: Deriving these fields for individual entity reads (single MAX(timestamp) query against the provenance table) is acceptable for low-volume use. For high-volume batch reads (e.g. client.query() returning hundreds of entities), the adapter should use a provenance summary view or denormalized cache. See §6.6 for the recommended SQLite implementation.

6.5 Provenance Context

The context field carries structured caller-supplied metadata that enriches the audit trail without requiring new event types. It is particularly important for Cappella, which needs to associate write operations with specific workflow runs and sync jobs.

Context schema (all fields optional):

{
  "workflow_run_id": "uuid-of-cappella-workflow-run",
  "workflow_name": "qc_pipeline",
  "sync_run_id": "uuid-of-cappella-sync-run",
  "adapter": "starlims",
  "trigger": "nightly_redcap_sync",
  "notes": "free-form annotation string"
}

Callers supply context via the provenance_context parameter on write operations:

client.put(
    "Sample", sample_data,
    actor="cappella",
    provenance_context={
        "sync_run_id": "abc-123",
        "adapter": "starlims",
        "trigger": "starlims_new_specimen"
    }
)

The context is stored verbatim in the context JSON column of the provenance record. Hippo does not validate or interpret context keys — it is the caller's responsibility to use consistent key names. The keys above are the recommended conventions for Cappella integrations.
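
Because Hippo stores the context verbatim, a misspelled key is accepted silently. A caller can guard against this on its own side. The sketch below is a caller-side convenience, not a Hippo feature; unconventional_keys() is a hypothetical helper that flags keys outside the recommended conventions without rejecting them.

```python
# The recommended context keys from the schema shown earlier in this section.
RECOMMENDED_CONTEXT_KEYS = frozenset({
    "workflow_run_id", "workflow_name", "sync_run_id",
    "adapter", "trigger", "notes",
})


def unconventional_keys(context: dict) -> list[str]:
    """Return context keys that are not among the recommended conventions."""
    return sorted(set(context) - RECOMMENDED_CONTEXT_KEYS)


context = {"sync_run_id": "abc-123", "adaptor": "starlims"}  # note the typo
suspect = unconventional_keys(context)
```

A caller might log the returned keys as a warning before passing the context to a write operation.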

Opinionated decision: Context is free-form JSON rather than a typed schema, to avoid tight coupling between Hippo and the systems that write to it. The context keys are documented as conventions, not enforced constraints.

6.6 Relational Storage for Provenance

Provenance records are stored in a single provenance_events table:

CREATE TABLE provenance_events (
    id              TEXT PRIMARY KEY,
    event_type      TEXT NOT NULL,
    entity_id       TEXT,                -- null for instance-level events
    entity_type     TEXT,                -- null for instance-level events
    actor           TEXT NOT NULL,
    timestamp       TEXT NOT NULL,       -- ISO 8601 UTC
    schema_version  TEXT NOT NULL,
    context         TEXT,                -- JSON string, nullable
    payload         TEXT NOT NULL        -- JSON string
);

-- Core lookup index: all events for a given entity, chronological
CREATE INDEX idx_provenance_entity
ON provenance_events (entity_id, entity_type, timestamp);

-- Actor + time range queries (audit use case)
CREATE INDEX idx_provenance_actor_time
ON provenance_events (actor, timestamp);

-- Event type filter (e.g. "show all MigrationApplied events")
CREATE INDEX idx_provenance_event_type
ON provenance_events (event_type, timestamp);

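No UPDATE or DELETE is permitted on provenance records. The SQLite adapter enforces this via triggers on the provenance table (the table's name in the v0.1 implementation), which block DELETE and block UPDATE of the entity_id, timestamp, user_context, and payload columns: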

-- Block entity_id updates (the trigger name and message refer to this as the primary key)
CREATE TRIGGER IF NOT EXISTS prevent_provenance_pk_update
BEFORE UPDATE OF entity_id ON provenance
BEGIN
    SELECT RAISE(ABORT, 'Cannot update primary key of provenance record');
END;

-- Block timestamp field updates
CREATE TRIGGER IF NOT EXISTS prevent_provenance_timestamp_update
BEFORE UPDATE OF timestamp ON provenance
BEGIN
    SELECT RAISE(ABORT, 'Cannot update timestamp of provenance record');
END;

-- Block user_context (metadata) field updates
CREATE TRIGGER IF NOT EXISTS prevent_provenance_metadata_update
BEFORE UPDATE OF user_context ON provenance
BEGIN
    SELECT RAISE(ABORT, 'Cannot update user_context field of provenance record');
END;

-- Block payload (content) field updates
CREATE TRIGGER IF NOT EXISTS prevent_provenance_content_update
BEFORE UPDATE OF payload ON provenance
BEGIN
    SELECT RAISE(ABORT, 'Cannot update payload field of provenance record');
END;

-- Block DELETE operations
CREATE TRIGGER IF NOT EXISTS prevent_provenance_delete
BEFORE DELETE ON provenance
BEGIN
    SELECT RAISE(ABORT, 'Cannot delete provenance record');
END;

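The triggers use CREATE TRIGGER IF NOT EXISTS for idempotent initialization. Each is a BEFORE trigger; SQLite triggers are always row-level (FOR EACH ROW), so RAISE(ABORT) fires before the first affected row is changed, aborts the statement, and backs out any changes that statement made. This provides database-level enforcement that complements any application-level checks.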
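
The effect of these triggers can be observed with Python's sqlite3 module. The sketch below uses a reduced, hypothetical provenance table containing only the columns needed for the demonstration, plus two of the triggers copied from above; attempt() is a helper defined for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE provenance (          -- reduced table, for illustration only
    id        TEXT PRIMARY KEY,
    entity_id TEXT,
    timestamp TEXT NOT NULL,
    payload   TEXT NOT NULL
);
CREATE TRIGGER IF NOT EXISTS prevent_provenance_timestamp_update
BEFORE UPDATE OF timestamp ON provenance
BEGIN
    SELECT RAISE(ABORT, 'Cannot update timestamp of provenance record');
END;
CREATE TRIGGER IF NOT EXISTS prevent_provenance_delete
BEFORE DELETE ON provenance
BEGIN
    SELECT RAISE(ABORT, 'Cannot delete provenance record');
END;
""")
conn.execute(
    "INSERT INTO provenance VALUES ('e1', 'abc-123', '2024-06-01T00:00:00Z', '{}')"
)
conn.commit()


def attempt(sql: str) -> str:
    """Run a statement; return 'ok' or the database error message."""
    try:
        conn.execute(sql)
        conn.commit()
        return "ok"
    except sqlite3.DatabaseError as exc:
        conn.rollback()
        return str(exc)


update_result = attempt("UPDATE provenance SET timestamp = '2030-01-01T00:00:00Z'")
delete_result = attempt("DELETE FROM provenance")
remaining = conn.execute("SELECT COUNT(*) FROM provenance").fetchone()[0]
```

Both statements fail with the message given in the corresponding RAISE(ABORT, ...) call, and the original row is left untouched.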

Provenance summary view (REQUIRED — not optional):

The entity_provenance_summary view is required for correct operation of client.query() with provenance-derived created_at and updated_at fields. hippo migrate creates this view before any entity table migrations so it is always available.

Expected columns and derivation logic:

Column	Derivation
`entity_id`	The entity UUID
`entity_type`	The entity type name
`created_at`	`MIN(timestamp)` — timestamp of first provenance event (any `operation_type`)
`updated_at`	`MAX(timestamp)` for non-`SOFT_DELETE` events — timestamp of most recent write
`schema_version`	`NULL` in v0.1 (not stored on provenance records)

CREATE VIEW IF NOT EXISTS entity_provenance_summary AS
SELECT
    entity_id,
    entity_type,
    MIN(timestamp) AS created_at,
    MAX(CASE WHEN operation_type != 'SOFT_DELETE' THEN timestamp ELSE NULL END) AS updated_at,
    NULL AS schema_version
FROM provenance
WHERE entity_id IS NOT NULL
GROUP BY entity_id, entity_type;

Note: the actual provenance table is named provenance (not provenance_events) in the v0.1 SQLite implementation. As the triggers and view above show, it also uses the column names operation_type and user_context where the logical model in §6.2 uses event_type and context. The SDK's query() implementation JOINs against this view to resolve created_at and updated_at in a single query rather than N+1 provenance lookups. The get() implementation uses a direct provenance subquery for individual entity reads.
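
To show how the view behaves, the sketch below builds it over a reduced provenance table and joins it to an entity table in a single query. The samples table, the 'CREATE', 'UPDATE', and 'MIGRATION' operation_type values, and the JOIN itself are assumptions made for illustration; only 'SOFT_DELETE' and the view definition come from this section.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE provenance (          -- reduced to the columns the view reads
    id             TEXT PRIMARY KEY,
    operation_type TEXT NOT NULL,
    entity_id      TEXT,
    entity_type    TEXT,
    timestamp      TEXT NOT NULL
);
CREATE TABLE samples (             -- hypothetical entity table
    id          TEXT PRIMARY KEY,
    tissue_type TEXT
);
CREATE VIEW IF NOT EXISTS entity_provenance_summary AS
SELECT
    entity_id,
    entity_type,
    MIN(timestamp) AS created_at,
    MAX(CASE WHEN operation_type != 'SOFT_DELETE' THEN timestamp ELSE NULL END) AS updated_at,
    NULL AS schema_version
FROM provenance
WHERE entity_id IS NOT NULL
GROUP BY entity_id, entity_type;
""")

conn.execute("INSERT INTO samples VALUES ('abc-123', 'cortex')")
conn.executemany(
    "INSERT INTO provenance VALUES (?, ?, ?, ?, ?)",
    [
        ("e1", "CREATE", "abc-123", "Sample", "2024-01-01T00:00:00Z"),
        ("e2", "UPDATE", "abc-123", "Sample", "2024-03-01T00:00:00Z"),
        ("e3", "SOFT_DELETE", "abc-123", "Sample", "2024-06-01T00:00:00Z"),
        ("e4", "MIGRATION", None, None, "2024-02-01T00:00:00Z"),  # instance-level
    ],
)
conn.commit()

# One JOIN resolves the temporal fields for every returned entity.
rows = conn.execute("""
    SELECT s.id, s.tissue_type, p.created_at, p.updated_at
    FROM samples AS s
    LEFT JOIN entity_provenance_summary AS p
           ON p.entity_id = s.id AND p.entity_type = 'Sample'
""").fetchall()
```

The soft-delete row is excluded from updated_at, and the instance-level row with a null entity_id does not appear in the view at all.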

6.7 History API

The SDK exposes entity history as a first-class operation:

# Full provenance history for an entity
events = client.history("Sample", "abc-123")
# Returns list[ProvenanceRecord] in chronological order

# Filtered history
events = client.history("Sample", "abc-123",
    event_types=["EntityUpdated", "AvailabilityChanged"])

# State reconstruction: what did this entity look like at a point in time?
state = client.state_at("Sample", "abc-123", timestamp="2024-06-01T00:00:00Z")

client.state_at() reconstructs entity state by replaying provenance events up to the given timestamp. This is a read-only operation and does not require any additional storage.
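
The replay idea can be sketched in a few lines. This is not the SDK's implementation: replay_state() is a hypothetical function that handles only EntityCreated and EntityUpdated events, represents events as plain dictionaries, and assumes all timestamps are ISO 8601 UTC strings in the same format so that string comparison matches chronological order.

```python
from typing import Any, Optional


def replay_state(events: list[dict[str, Any]],
                 timestamp: str) -> Optional[dict[str, Any]]:
    """Return the entity state as of `timestamp`, or None if not yet created."""
    state: Optional[dict[str, Any]] = None
    for event in sorted(events, key=lambda e: e["timestamp"]):
        if event["timestamp"] > timestamp:
            break  # later events are not part of the requested state
        payload = event["payload"]
        if event["event_type"] == "EntityCreated":
            state = dict(payload["initial_state"])
        elif event["event_type"] == "EntityUpdated" and "new_state" in payload:
            # companion supersession events carry no snapshot and are skipped
            state = dict(payload["new_state"])
    return state


events = [
    {"event_type": "EntityCreated", "timestamp": "2024-01-01T00:00:00Z",
     "payload": {"initial_state": {"tissue_type": "cortex"}}},
    {"event_type": "EntityUpdated", "timestamp": "2024-07-01T00:00:00Z",
     "payload": {"previous_state": {"tissue_type": "cortex"},
                 "new_state": {"tissue_type": "hippocampus"},
                 "changed_fields": ["tissue_type"]}},
]

state_in_june = replay_state(events, "2024-06-01T00:00:00Z")
state_in_august = replay_state(events, "2024-08-01T00:00:00Z")
```

Because EntityUpdated carries a full new_state snapshot, the replay only needs the latest qualifying event rather than a field-by-field merge.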

The REST API exposes history at GET /entities/{entity_type}/{entity_id}/history.

6.8 Retention Policy

v0.1 position: no retention policy. All provenance records are retained indefinitely. There is no archive, purge, or truncation mechanism.

Rationale: For the expected v0.1 workload (small-to-medium research deployments), provenance storage is not a significant concern. The provenance log grows at roughly one row per write operation (compound operations such as supersession write more than one event); typical research deployments will accumulate millions, not billions, of rows.

Future: A configurable retention policy (e.g. compress or archive events older than N years while retaining the most recent snapshot per entity) is a reasonable future addition. This is flagged as an open question.

Open question: Should MigrationApplied and ReferenceDataInstalled events be stored separately from entity-level events, given that they are instance-level rather than entity-level events? The current design stores them in the same table with entity_id = null. An alternative is a separate system_events table. Deferred to a future revision.