Section 2: Architecture

Status: Draft v0.1
Last updated: 2026-03-25


2.1 Architectural Overview

Cappella is structured as five cooperating layers, all stateless, all writing exclusively to Hippo:

┌─────────────────────────────────────────────────────────────┐
│  REST API / CLI                                             │
│  /resolve  /ingest  /triggers/{name}/run  /status          │
└──────────────────────┬──────────────────────────────────────┘
          ┌────────────┼────────────┐
          │            │            │
          ▼            ▼            ▼
   ┌─────────────┐ ┌──────────┐ ┌──────────────────┐
   │  Collection │ │  Ingest  │ │  Trigger Engine  │
   │  Resolver   │ │ Pipeline │ │ (schedule/manual/│
   │             │ │          │ │  webhook/event)  │
   └──────┬──────┘ └────┬─────┘ └────────┬─────────┘
          │             │                │
          ▼             ▼                ▼
   ┌─────────────────────────────────────────────────┐
   │  Adapter Registry                               │
   │  ExternalSourceAdapter plugins (STARLIMS, HALO) │
   │  Canon client (artifact resolution)             │
   └──────────────────────┬──────────────────────────┘
   ┌─────────────────────────────────────────────────┐
   │  HippoClient                                    │
   │  All reads and writes go through here           │
   └─────────────────────────────────────────────────┘

2.2 Adapter Registry

ExternalSourceAdapter ABC

The adapter base class extends Hippo's EntityLoader (hippo.core.loaders), so the adapter contract is versioned together with Hippo. Cappella's ExternalSourceAdapter lives in cappella.adapters.base and adds adapter-specific fields. Built-in generic adapters (CSV, JSON, XML, SQL) extend the corresponding Hippo loaders (CSVLoader, JSONLoader, SQLLoader) directly, inheriting ConfigurableLoader's field mapping and vocabulary normalization.

from typing import Any

from hippo.core.loaders import EntityLoader
# TransformedRecord is Cappella's typed record (import path omitted here)

class ExternalSourceAdapter(EntityLoader):
    name: str                          # "starlims", "halo", "redcap"
    entity_types: list[str]            # ["Donor", "Sample"] — what this adapter produces
    trust_level: int = 50              # Trust level for conflict resolution
    supports_incremental: bool = False # True if adapter can pull only changed records

    def validate(self, record: TransformedRecord, hippo_client: Any = None) -> list[str]:
        """Optional cross-record validation. Returns list of error messages."""
        return []

    def health_check(self) -> dict[str, Any]:
        """Return adapter health status."""
        return {"status": "unknown", "detail": "health_check not implemented"}

The fetch() and transform() methods are inherited from EntityLoader. Each built-in adapter overrides them with Cappella's typed RawRecord/TransformedRecord. The transform() step is where field mapping, vocabulary normalization, and schema conformance happen — for generic adapters this is config-driven via field_map and vocabulary_map; for custom adapters it's implemented in code.
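As a minimal sketch of what a config-driven transform() might do for a generic adapter: the field_map and vocabulary_map config keys come from the text above, but the record shapes and function signature here are simplified stand-ins, not Cappella's actual code.

```python
# Illustrative config-driven transform: rename source fields via field_map,
# then normalize controlled-vocabulary values via vocabulary_map.
# (Simplified stand-in for RawRecord -> TransformedRecord.)

def transform(raw: dict, field_map: dict[str, str],
              vocabulary_map: dict[str, dict[str, str]]) -> dict:
    data = {}
    # 1. Field mapping: source column name -> schema field name.
    for src_field, dest_field in field_map.items():
        if src_field in raw:
            data[dest_field] = raw[src_field]
    # 2. Vocabulary normalization: map raw values onto controlled terms,
    #    passing unknown values through unchanged.
    for fld, mapping in vocabulary_map.items():
        if fld in data:
            data[fld] = mapping.get(data[fld], data[fld])
    return data

mapped = transform(
    {"TISSUE_CD": "dlpfc", "DONOR_ID": "D-001"},
    field_map={"TISSUE_CD": "tissue", "DONOR_ID": "donor_id"},
    vocabulary_map={"tissue": {"dlpfc": "DLPFC"}},
)
# mapped == {"tissue": "DLPFC", "donor_id": "D-001"}
```

Custom adapters implement the same two steps in code rather than reading them from config.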

Adapter Discovery

Adapters are discovered via the cappella.adapters entry point group:

# pyproject.toml of cappella-adapter-starlims:
[project.entry-points."cappella.adapters"]
starlims = "cappella_starlims:STARLIMSAdapter"

Cappella's AdapterRegistry discovers and instantiates adapters at startup.

Canon Client

The Canon client is registered in the adapter registry as a special non-ingestion adapter — it resolves artifacts rather than ingesting structured records. The CollectionResolver accesses it directly (see §2.4).


2.3 Ingest Pipeline

Each ingest run follows a fixed pipeline:

fetch(since) → [RawRecord, ...]
transform(record) → TransformedRecord{entity_type, data, external_id, source}
validate(record, hippo) → [] or [errors...]
    │ (stop if errors)
upsert(entity_type, data, external_id, source)
    → HippoClient.create() if new
    → HippoClient.update() if changed
    → skip if identical
record provenance event on entity
    {source: "starlims", sync_run_id: "...", fetched_at: "..."}

Upsert Identity Resolution

Priority:

  1. Explicit UUID — if the external record carries a Hippo entity UUID
  2. ExternalID lookup — hippo.get_entity_by_external_id(system, external_id)
  3. Create new — if no match is found

This is the same identity resolution Hippo's ingestion pipeline uses. Cappella delegates to HippoClient for all writes.

Conflict Detection

When transform() produces data that differs from the existing Hippo entity for a matched ExternalID, Cappella applies conflict resolution:

  • Trusted source wins — each adapter declares trust_level: int; higher trust overwrites lower
  • Last-write wins — if trust levels are equal, most recent sync wins
  • Manual review flag — if declared fields conflict across same-trust sources, flag the entity for reconciliation review

Conflicts are recorded as structured HarmonizationConflict provenance events on the entity, not silently overwritten.
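The three rules combine into a small decision function, sketched below under assumptions: the SourceValue shape and synced_at field are illustrative, and recording the HarmonizationConflict event is elided.

```python
# Sketch of conflict resolution: trust level first, then last-write-wins,
# flagging same-trust disagreements for manual review.
from dataclasses import dataclass
from datetime import datetime

@dataclass
class SourceValue:
    value: object
    trust_level: int
    synced_at: datetime

def resolve_conflict(existing: SourceValue, incoming: SourceValue):
    """Return (winning value, needs_review flag)."""
    if incoming.trust_level > existing.trust_level:
        return incoming.value, False        # trusted source wins
    if incoming.trust_level < existing.trust_level:
        return existing.value, False
    if incoming.value == existing.value:
        return existing.value, False        # no conflict
    # Equal trust, differing values: last write wins, flagged for review.
    winner = incoming if incoming.synced_at >= existing.synced_at else existing
    return winner.value, True
```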


2.4 Collection Resolver

The collection resolver is Cappella's highest-value capability — translating a high-level user request into a fully-resolved set of entities.

Resolution Request

POST /resolve
{
  "entity_type": "GeneCounts",
  "criteria": {
    "donor.diagnosis": "CTE",
    "sample.tissue": "DLPFC",
    "dataset.assay": "RNASeq"
  },
  "parameters": {
    "genome": "GRCh38",
    "annotation": "ref:GeneAnnotation{source=ensembl, release=110}"
  },
  "selection": {
    "strategy": "most_recent",
    "filters": {"dataset.qc.min_reads": 1000000}
  }
}

Resolution Steps

Step 1: Entity traversal (Hippo queries)

Cappella walks the entity graph bottom-up using the criteria filters:

Donor{diagnosis=CTE}
  → Sample{tissue=DLPFC, donor_id ∈ matching_donors}
    → SequencingDataset{assay=RNASeq, sample_id ∈ matching_samples}

The traversal path is inferred from the Hippo schema's references: declarations. Cappella doesn't need hardcoded traversal logic — it reads the schema graph.
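Inferring the path from references: declarations amounts to a graph search. The sketch below assumes the schema graph has been flattened into a plain dict of {entity_type: [referenced entity types]}; the function name and return shape are illustrative.

```python
# Sketch: breadth-first search over the references graph to find the
# traversal path between two entity types. Cappella would build `refs`
# from the schema's references: declarations at runtime.
from collections import deque

def traversal_path(refs: dict[str, list[str]], start: str, target: str) -> list[str]:
    """Return the chain of entity types from start to target, or [] if none."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == target:
            return path
        for ref_type in refs.get(path[-1], []):
            if ref_type not in seen:
                seen.add(ref_type)
                queue.append(path + [ref_type])
    return []
```

For the example above, the path SequencingDataset → Sample → Donor is found from the schema alone, then walked bottom-up with the criteria filters applied at each step.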

Step 2: Selection logic

When a sample has multiple candidate datasets (multiple sequencing runs, replicates), selection logic picks one per sample:

from abc import ABC, abstractmethod

# Entity is Hippo's entity type (import omitted here)
class SelectionStrategy(ABC):
    @abstractmethod
    def select(self, candidates: list[Entity], filters: dict) -> Entity | None:
        ...

Built-in strategies:

  • most_recent — highest created_at after applying filters
  • highest_quality — sort by declared quality field (configurable per entity type)
  • explicit — caller provides an explicit list of entity IDs to use

Selection strategies are pluggable via cappella.selection_strategies entry point. Labs can implement custom strategies (e.g., "prefer datasets from core facility X over resubmissions").

Step 3: Canon delegation

For each selected dataset, Cappella calls canon.resolve() for the requested entity_type:

for dataset in selected_datasets:
    try:
        uri = canon_client.resolve(
            entity_type=request.entity_type,
            params={**request.parameters, "dataset_id": dataset.id}
        )
        resolved.append(ResolvedItem(sample=dataset.sample_id, uri=uri, status="resolved"))
    except CanonNoRuleError:
        unresolved.append(UnresolvedItem(sample=dataset.sample_id, reason="no_rule"))
    except CanonResolveError as e:
        unresolved.append(UnresolvedItem(sample=dataset.sample_id, reason="canon_error", detail=str(e)))

Cappella never raises on partial failure — it collects all resolved and all unresolved items and returns both.

Step 4: HarmonizedCollection response

See §1.5 for the full format. The collection includes:

  • All resolved entities with URIs and Canon decision (REUSE/FETCH/BUILD)
  • All unresolved items with structured reasons
  • Full provenance (versions, genome build entity, selection criteria)


2.5 Trigger Engine

Triggers are the mechanism by which Cappella executes ingest and resolution operations automatically.

Trigger Types (v0.1)

Type            Mechanism                              Use case
schedule        Cron expression                        Nightly STARLIMS sync, weekly full reconciliation
manual          API call                               User-initiated ingest or resolution
internal_event  Named event emitted by another action  Chain: sample_created → trigger alignment resolution

Trigger Configuration (cappella.yaml)

triggers:
  - name: nightly_starlims_sync
    type: schedule
    schedule: "0 2 * * *"    # 2 AM daily
    action:
      type: ingest
      adapter: starlims
      incremental: true
    on_success:
      emit: starlims_sync_complete

  - name: resolve_on_new_sample
    type: internal_event
    event: starlims_sync_complete
    action:
      type: resolve
      entity_type: AlignmentFile
      criteria:
        sample.tissue: DLPFC
      parameters:
        genome: GRCh38

Trigger Action Types

Action     Description
ingest     Run an adapter's fetch/transform/upsert pipeline
resolve    Run collection resolution, optionally store result as ResolutionRun entity
reconcile  Run inconsistency detection for specified entity types
notify     Send a notification (Hippo event, webhook, email)

Action Chaining

Actions emit named internal events (emit:) that other triggers subscribe to. This is simple event-driven composition — not a DAG scheduler. Cycles are detected at config validation time and rejected with an error.


2.6 Reconciliation Engine

Reconciliation detects and surfaces inconsistencies across sources without automatically resolving them (resolution is a human decision for ambiguous conflicts).

Checks (v0.1)

Check             Description
missing_entity    Entity referenced in external system has no Hippo record
stale_entity      Hippo entity not updated in external system within expected window
field_conflict    Same entity field has different values in two trusted sources
broken_reference  Entity has a references: field pointing to a nonexistent entity
missing_artifact  Entity has no associated file artifact where one is expected

Each check produces a structured ReconciliationFinding — not an error, not an automatic fix. Findings are queryable from Hippo and surfaced in Aperture.

Reconciliation Run

POST /reconcile
{
  "entity_types": ["Donor", "Sample"],
  "adapters": ["starlims", "redcap"],
  "checks": ["field_conflict", "missing_entity"]
}

Returns a list of ReconciliationFinding objects. Each finding includes the entity ID, field, source A value, source B value, and a suggested resolution action (human review, trust source A, trust source B).
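A field_conflict check producing those finding fields might look like the following sketch; the ReconciliationFinding shape here is a simplified stand-in, and the per-source value dicts are an assumed input format.

```python
# Sketch of the field_conflict check: compare each shared field across every
# pair of sources and emit one finding per disagreement (no automatic fix).
from dataclasses import dataclass

@dataclass
class ReconciliationFinding:
    check: str
    entity_id: str
    field: str
    source_a: tuple[str, object]      # (adapter name, value)
    source_b: tuple[str, object]
    suggested_action: str             # e.g. "human_review"

def field_conflicts(entity_id: str,
                    values_by_source: dict[str, dict]) -> list[ReconciliationFinding]:
    findings = []
    sources = sorted(values_by_source)
    for i, a in enumerate(sources):
        for b in sources[i + 1:]:
            shared = values_by_source[a].keys() & values_by_source[b].keys()
            for f in shared:
                if values_by_source[a][f] != values_by_source[b][f]:
                    findings.append(ReconciliationFinding(
                        "field_conflict", entity_id, f,
                        (a, values_by_source[a][f]),
                        (b, values_by_source[b][f]),
                        "human_review"))
    return findings
```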


2.7 Provenance Model

Every Cappella write to Hippo carries a structured context on the provenance event:

{
  "cappella_version": "0.1.0",
  "source": "starlims",
  "sync_run_id": "uuid-run-123",
  "adapter_version": "1.2.0",
  "fetched_at": "2026-03-25T17:30:00Z",
  "trigger": "nightly_starlims_sync",
  "selection_strategy": "most_recent"
}

This mirrors the pattern established in Hippo's provenance event model and is consistent with the context Cappella receives from Canon for artifact entities.


2.8 API Surface (v0.1)

Endpoint                   Description
POST /resolve              Submit a collection resolution request
GET /resolve/{run_id}      Get status/result of a resolution run
POST /ingest               Trigger an immediate ingest for a named adapter
GET /ingest/{run_id}       Get status of an ingest run
POST /triggers/{name}/run  Manually fire a named trigger
GET /triggers              List all configured triggers and their last-run status
POST /reconcile            Run reconciliation checks
GET /findings              Query reconciliation findings
GET /status                Cappella health, connected adapters, Hippo version

2.9 Deployment Model

Cappella is a stateless Python service. It connects to:

  • A running Hippo instance (via HippoClient)
  • A running Canon instance (via CanonClient HTTP or in-process)
  • External systems as configured in adapters

No Cappella-local database. All persistent state — sync history, reconciliation findings, resolution runs — is stored as Hippo entities.

Cappella can be run as:

  • A standalone CLI tool (cappella resolve, cappella ingest starlims)
  • A REST service (cappella serve)
  • Embedded as a Python library in Composer or other tools


2.10 Open Questions for v0.1

  • Schema-driven traversal — Resolved ✅. HippoClient.schema_references(entity_type) implemented in Hippo v0.4; reads FieldDefinition.references from the schema. REST: GET /schemas/{entity_type}/references. Cappella's EntityTraversal calls it at runtime. Schema YAML must declare references: {entity_type: <name>} on foreign-key fields.
  • Selection strategy config syntax — High. How are per-entity-type quality fields declared? In cappella.yaml or in the Hippo schema?
  • Canon client transport — Resolved ✅. Both modes implemented. In-process mode imports canon.resolve() directly. HTTP mode calls POST {canon_url}/resolve with {"entity_type": ..., "params": ...}, returning {"decision": ..., "uri": ...}. Default for v0.1 is HTTP (cappella.yaml: canon.mode: http); in-process mode available via canon.mode: in_process. Canon API exposes /resolve alongside /api/v1/rules.
  • ResolutionRun entity storage — Medium. Should every POST /resolve create a ResolutionRun entity in Hippo? Useful for audit but adds write overhead. Deferred to v0.2.
  • Webhook triggers — Medium. Deferred to v0.2; requires endpoint registration, signature verification, retry logic.
  • Hippo poll triggers — Medium. Deferred to v0.2; requires efficient change detection (polling an updated_at index).