Section 2: Architecture¶
Status: Draft v0.1
Last updated: 2026-03-25
2.1 Architectural Overview¶
Cappella is structured as five cooperating components (Collection Resolver, Ingest Pipeline, Trigger Engine, Adapter Registry, and HippoClient) behind a single REST API/CLI surface, all stateless and all writing exclusively to Hippo:
```text
┌─────────────────────────────────────────────────────────────┐
│                       REST API / CLI                        │
│     /resolve   /ingest   /triggers/{name}/run   /status     │
└──────────────────────┬──────────────────────────────────────┘
                       │
          ┌────────────┼────────────┐
          │            │            │
          ▼            ▼            ▼
   ┌─────────────┐ ┌──────────┐  ┌──────────────────┐
   │ Collection  │ │  Ingest  │  │  Trigger Engine  │
   │  Resolver   │ │ Pipeline │  │ (schedule/manual/│
   │             │ │          │  │  webhook/event)  │
   └──────┬──────┘ └────┬─────┘  └────────┬─────────┘
          │             │                 │
          ▼             ▼                 ▼
┌─────────────────────────────────────────────────┐
│                Adapter Registry                 │
│ ExternalSourceAdapter plugins (STARLIMS, HALO)  │
│ Canon client (artifact resolution)              │
└──────────────────────┬──────────────────────────┘
                       │
                       ▼
┌─────────────────────────────────────────────────┐
│                   HippoClient                   │
│      All reads and writes go through here       │
└─────────────────────────────────────────────────┘
```
2.2 Adapter Registry¶
ExternalSourceAdapter ABC¶
The adapter base class extends Hippo's EntityLoader (hippo.core.loaders), so the adapter contract is versioned together with Hippo's loader interface. Cappella's ExternalSourceAdapter lives in cappella.adapters.base and adds adapter-specific fields. Built-in generic adapters (CSV, JSON, XML, SQL) extend the corresponding Hippo loaders (CSVLoader, JSONLoader, SQLLoader) directly, inheriting ConfigurableLoader's field mapping and vocabulary normalization.
```python
from typing import Any

from hippo.core.loaders import EntityLoader


class ExternalSourceAdapter(EntityLoader):
    name: str                            # "starlims", "halo", "redcap"
    entity_types: list[str]              # ["Donor", "Sample"] — what this adapter produces
    trust_level: int = 50                # Trust level for conflict resolution
    supports_incremental: bool = False   # True if adapter can pull only changed records

    def validate(self, record: TransformedRecord, hippo_client: Any = None) -> list[str]:
        """Optional cross-record validation. Returns list of error messages."""
        return []

    def health_check(self) -> dict[str, Any]:
        """Return adapter health status."""
        return {"status": "unknown", "detail": "health_check not implemented"}
```
The fetch() and transform() methods are inherited from EntityLoader. Each built-in adapter overrides them with Cappella's typed RawRecord/TransformedRecord. The transform() step is where field mapping, vocabulary normalization, and schema conformance happen — for generic adapters this is config-driven via field_map and vocabulary_map; for custom adapters it's implemented in code.
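For generic adapters, the config-driven transform step can be pictured with a small sketch. The field_map and vocabulary_map names come from the text above, but the exact structure shown here is an assumption, not Cappella's actual config schema:

```python
def apply_mapping(raw: dict,
                  field_map: dict[str, str],
                  vocabulary_map: dict[str, dict[str, str]]) -> dict:
    """Rename source fields via field_map, then normalize values via vocabulary_map."""
    data = {field_map.get(key, key): value for key, value in raw.items()}
    for field, vocab in vocabulary_map.items():
        if field in data:
            data[field] = vocab.get(data[field], data[field])
    return data

# Hypothetical STARLIMS-style record, mapped into schema-conformant fields:
record = apply_mapping(
    {"TISSUE_CODE": "dlpfc", "DX": "CTE"},
    field_map={"TISSUE_CODE": "tissue", "DX": "diagnosis"},
    vocabulary_map={"tissue": {"dlpfc": "DLPFC"}},
)
```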
Adapter Discovery¶
Adapters are discovered via the cappella.adapters entry point group:
```toml
# pyproject.toml of cappella-adapter-starlims:
[project.entry-points."cappella.adapters"]
starlims = "cappella_starlims:STARLIMSAdapter"
```
Cappella's AdapterRegistry discovers and instantiates adapters at startup.
Canon Client¶
The Canon client is registered in the adapter registry as a special non-ingestion adapter — it resolves artifacts rather than ingesting structured records. The CollectionResolver accesses it directly (see §2.4).
2.3 Ingest Pipeline¶
Each ingest run follows a fixed pipeline:
```text
fetch(since) → [RawRecord, ...]
      │
      ▼
transform(record) → TransformedRecord{entity_type, data, external_id, source}
      │
      ▼
validate(record, hippo) → [] or [errors...]
      │   (stop if errors)
      ▼
upsert(entity_type, data, external_id, source)
      → HippoClient.create() if new
      → HippoClient.update() if changed
      → skip if identical
      │
      ▼
record provenance event on entity
      {source: "starlims", sync_run_id: "...", fetched_at: "..."}
```
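The steps above can be sketched as a driver loop. Everything here is illustrative: the TransformedRecord stand-in and the hippo.upsert() call are assumptions about shape, not Cappella's actual API.

```python
from dataclasses import dataclass

@dataclass
class TransformedRecord:            # simplified stand-in for Cappella's typed record
    entity_type: str
    data: dict
    external_id: str
    source: str

def run_ingest(adapter, hippo, since=None) -> dict:
    """Minimal sketch of the fetch → transform → validate → upsert pipeline."""
    stats = {"upserted": 0, "failed": 0}
    for raw in adapter.fetch(since):
        record = adapter.transform(raw)
        errors = adapter.validate(record, hippo)
        if errors:                  # stop this record if validation fails
            stats["failed"] += 1
            continue
        hippo.upsert(record.entity_type, record.data, record.external_id, record.source)
        stats["upserted"] += 1
    return stats
```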
Upsert Identity Resolution¶
Priority:
1. Explicit UUID — if the external record carries a Hippo entity UUID
2. ExternalID lookup — hippo.get_entity_by_external_id(system, external_id)
3. Create new — if no match found
This is the same identity resolution Hippo's ingestion pipeline uses. Cappella delegates to HippoClient for all writes.
Conflict Detection¶
When transform() produces data that differs from the existing Hippo entity for a matched ExternalID, Cappella applies conflict resolution:
- Trusted source wins — each adapter declares trust_level: int; higher trust overwrites lower
- Last-write wins — if trust levels are equal, the most recent sync wins
- Manual review flag — if declared fields conflict across same-trust sources, flag the entity for reconciliation review
Conflicts are recorded as structured HarmonizationConflict provenance events on the entity, not silently overwritten.
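The three rules can be condensed into one decision function; the return labels and the declared_field flag are illustrative, not Cappella's actual types:

```python
def resolve_conflict(existing_trust: int, incoming_trust: int,
                     declared_field: bool = False) -> str:
    """Decide the outcome for one conflicting field between two sources."""
    if incoming_trust > existing_trust:
        return "overwrite"            # trusted source wins
    if incoming_trust < existing_trust:
        return "keep_existing"        # existing value came from a more trusted source
    if declared_field:
        return "flag_for_review"      # same-trust conflict on a declared field
    return "last_write_wins"          # equal trust: most recent sync wins
```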
2.4 Collection Resolver¶
The collection resolver is Cappella's highest-value capability — translating a high-level user request into a fully-resolved set of entities.
Resolution Request¶
POST /resolve

```json
{
  "entity_type": "GeneCounts",
  "criteria": {
    "donor.diagnosis": "CTE",
    "sample.tissue": "DLPFC",
    "dataset.assay": "RNASeq"
  },
  "parameters": {
    "genome": "GRCh38",
    "annotation": "ref:GeneAnnotation{source=ensembl, release=110}"
  },
  "selection": {
    "strategy": "most_recent",
    "filters": {"dataset.qc.min_reads": 1000000}
  }
}
```
Resolution Steps¶
Step 1: Entity traversal (Hippo queries)
Cappella walks the entity graph bottom-up using the criteria filters:
```text
Donor{diagnosis=CTE}
  → Sample{tissue=DLPFC, donor_id ∈ matching_donors}
    → SequencingDataset{assay=RNASeq, sample_id ∈ matching_samples}
```
The traversal path is inferred from the Hippo schema's references: declarations. Cappella doesn't need hardcoded traversal logic — it reads the schema graph.
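One plausible first step is splitting the dotted criteria keys into per-entity filters before walking the schema graph; a sketch, not Cappella's actual code:

```python
def group_criteria(criteria: dict[str, str]) -> dict[str, dict[str, str]]:
    """Split dotted criteria keys ("donor.diagnosis") into per-entity filter dicts."""
    grouped: dict[str, dict[str, str]] = {}
    for key, value in criteria.items():
        entity, _, field = key.partition(".")
        grouped.setdefault(entity, {})[field] = value
    return grouped

filters = group_criteria({
    "donor.diagnosis": "CTE",
    "sample.tissue": "DLPFC",
    "dataset.assay": "RNASeq",
})
# filters["donor"], filters["sample"], filters["dataset"] each hold one entity's filters
```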
Step 2: Selection logic
When a sample has multiple candidate datasets (multiple sequencing runs, replicates), selection logic picks one per sample:
```python
from abc import ABC, abstractmethod

class SelectionStrategy(ABC):
    @abstractmethod
    def select(self, candidates: list[Entity], filters: dict) -> Entity | None:
        ...
```
Built-in strategies:
- most_recent — highest created_at after applying filters
- highest_quality — sort by declared quality field (configurable per entity type)
- explicit — caller provides an explicit list of entity IDs to use
Selection strategies are pluggable via cappella.selection_strategies entry point. Labs can implement custom strategies (e.g., "prefer datasets from core facility X over resubmissions").
Step 3: Canon delegation
For each selected dataset, Cappella calls canon.resolve() for the requested entity_type:
```python
for dataset in selected_datasets:
    try:
        uri = canon_client.resolve(
            entity_type=request.entity_type,
            params={**request.parameters, "dataset_id": dataset.id},
        )
        resolved.append(ResolvedItem(sample=dataset.sample_id, uri=uri, status="resolved"))
    except CanonNoRuleError:
        unresolved.append(UnresolvedItem(sample=dataset.sample_id, reason="no_rule"))
    except CanonResolveError as e:
        unresolved.append(UnresolvedItem(sample=dataset.sample_id, reason="canon_error", detail=str(e)))
```
Cappella never raises on partial failure — it collects all resolved and all unresolved items and returns both.
Step 4: HarmonizedCollection response
See §1.5 for the full format. The collection includes:
- All resolved entities with URIs and Canon decision (REUSE/FETCH/BUILD)
- All unresolved items with structured reasons
- Full provenance (versions, genome build entity, selection criteria)
2.5 Trigger Engine¶
Triggers are the mechanism by which Cappella executes ingest and resolution operations automatically.
Trigger Types (v0.1)¶
| Type | Mechanism | Use case |
|---|---|---|
| schedule | Cron expression | Nightly STARLIMS sync, weekly full reconciliation |
| manual | API call | User-initiated ingest or resolution |
| internal_event | Named event emitted by another action | Chain: sample_created → trigger alignment resolution |
Trigger Configuration (cappella.yaml)¶
```yaml
triggers:
  - name: nightly_starlims_sync
    type: schedule
    schedule: "0 2 * * *"   # 2 AM daily
    action:
      type: ingest
      adapter: starlims
      incremental: true
    on_success:
      emit: starlims_sync_complete

  - name: resolve_on_new_sample
    type: internal_event
    event: starlims_sync_complete
    action:
      type: resolve
      entity_type: AlignmentFile
      criteria:
        sample.tissue: DLPFC
      parameters:
        genome: GRCh38
```
Trigger Action Types¶
| Action | Description |
|---|---|
| ingest | Run an adapter's fetch/transform/upsert pipeline |
| resolve | Run collection resolution, optionally store result as ResolutionRun entity |
| reconcile | Run inconsistency detection for specified entity types |
| notify | Send a notification (Hippo event, webhook, email) |
Action Chaining¶
Actions emit named internal events (emit:) that other triggers subscribe to. This is simple event-driven composition — not a DAG scheduler. Cycles are detected at config validation time and rejected with an error.
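Cycle detection over the emit/event graph can be a short DFS; a sketch against the cappella.yaml trigger shape shown above (the config keys event and on_success.emit come from that example):

```python
def has_cycle(triggers: list[dict]) -> bool:
    """DFS over the event graph: edge from the event a trigger listens on
    to the event it emits on success."""
    edges: dict[str, set[str]] = {}
    for t in triggers:
        listens = t.get("event")
        emits = t.get("on_success", {}).get("emit")
        if listens and emits:
            edges.setdefault(listens, set()).add(emits)

    def dfs(event: str, seen: frozenset) -> bool:
        if event in seen:
            return True
        return any(dfs(nxt, seen | {event}) for nxt in edges.get(event, ()))

    return any(dfs(start, frozenset()) for start in edges)
```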
2.6 Reconciliation Engine¶
Reconciliation detects and surfaces inconsistencies across sources without automatically resolving them (resolution is a human decision for ambiguous conflicts).
Checks (v0.1)¶
| Check | Description |
|---|---|
| missing_entity | Entity referenced in external system has no Hippo record |
| stale_entity | Hippo entity not updated in external system within expected window |
| field_conflict | Same entity field has different values in two trusted sources |
| broken_reference | Entity has a references: field pointing to a nonexistent entity |
| missing_artifact | Entity has no associated file artifact where one is expected |
Each check produces a structured ReconciliationFinding — not an error, not an automatic fix. Findings are queryable from Hippo and surfaced in Aperture.
Reconciliation Run¶
POST /reconcile

```json
{
  "entity_types": ["Donor", "Sample"],
  "adapters": ["starlims", "redcap"],
  "checks": ["field_conflict", "missing_entity"]
}
```
Returns a list of ReconciliationFinding objects. Each finding includes the entity ID, field, source A value, source B value, and a suggested resolution action (human review, trust source A, trust source B).
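A field_conflict check might be sketched as follows; the finding dict mirrors the fields listed above, but its exact shape is an assumption:

```python
from itertools import combinations

def field_conflict_findings(entity_id: str, field: str,
                            values_by_source: dict[str, object]) -> list[dict]:
    """Emit one finding per pair of sources that disagree on a field's value."""
    findings = []
    for (src_a, val_a), (src_b, val_b) in combinations(values_by_source.items(), 2):
        if val_a != val_b:
            findings.append({
                "check": "field_conflict",
                "entity_id": entity_id,
                "field": field,
                "source_a": {src_a: val_a},
                "source_b": {src_b: val_b},
                "suggested_resolution": "human_review",
            })
    return findings
```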
2.7 Provenance Model¶
Every Cappella write to Hippo carries a structured context on the provenance event:
```json
{
  "cappella_version": "0.1.0",
  "source": "starlims",
  "sync_run_id": "uuid-run-123",
  "adapter_version": "1.2.0",
  "fetched_at": "2026-03-25T17:30:00Z",
  "trigger": "nightly_starlims_sync",
  "selection_strategy": "most_recent"
}
```
This mirrors the pattern established in Hippo's provenance event model and is consistent with the context Cappella receives from Canon for artifact entities.
2.8 API Surface (v0.1)¶
| Endpoint | Description |
|---|---|
| POST /resolve | Submit a collection resolution request |
| GET /resolve/{run_id} | Get status/result of a resolution run |
| POST /ingest | Trigger an immediate ingest for a named adapter |
| GET /ingest/{run_id} | Get status of an ingest run |
| POST /triggers/{name}/run | Manually fire a named trigger |
| GET /triggers | List all configured triggers and their last-run status |
| POST /reconcile | Run reconciliation checks |
| GET /findings | Query reconciliation findings |
| GET /status | Cappella health, connected adapters, Hippo version |
2.9 Deployment Model¶
Cappella is a stateless Python service. It connects to:
- A running Hippo instance (via HippoClient)
- A running Canon instance (via CanonClient HTTP or in-process)
- External systems as configured in adapters
No Cappella-local database. All persistent state — sync history, reconciliation findings, resolution runs — is stored as Hippo entities.
Cappella can be run as:
- A standalone CLI tool (cappella resolve, cappella ingest starlims)
- A REST service (cappella serve)
- Embedded as a Python library in Composer or other tools
2.10 Open Questions for v0.1¶
| Question | Priority | Notes |
|---|---|---|
| Schema-driven traversal | Resolved ✅ | HippoClient.schema_references(entity_type) implemented in Hippo v0.4. Reads FieldDefinition.references from schema. REST: GET /schemas/{entity_type}/references. Cappella's EntityTraversal calls it at runtime. Schema YAML must declare references: {entity_type: <name>} on foreign-key fields. |
| Selection strategy config syntax | High | How are per-entity-type quality fields declared? In cappella.yaml or in the Hippo schema? |
| Canon client transport | Resolved ✅ | Both modes implemented. In-process mode imports canon.resolve() directly. HTTP mode calls POST {canon_url}/resolve with {"entity_type": ..., "params": ...}, returns {"decision": ..., "uri": ...}. Default for v0.1 is HTTP (cappella.yaml: canon.mode: http); in-process mode available via canon.mode: in_process. Canon API exposes /resolve alongside /api/v1/rules. |
| ResolutionRun entity storage | Medium | Should every POST /resolve create a ResolutionRun entity in Hippo? Useful for audit but adds write overhead. Deferred to v0.2. |
| Webhook triggers | Medium | Deferred to v0.2 — requires endpoint registration, signature verification, retry logic. |
| Hippo poll triggers | Medium | Deferred to v0.2 — requires efficient change detection (polling updated_at index). |