# Cappella — Integration & Harmonization Engine

## Specification Index

- Codename: Cappella
- Component: Integration & Harmonization Engine
- Version: 0.1 (implemented)

## Document Map
| File | Section | Status | Notes |
|---|---|---|---|
| sec1_overview.md | 1. Overview & Scope | ✅ Draft v0.1 | Harmonization engine, collection resolution, HarmonizedCollection format, v0.1 scope |
| sec2_architecture.md | 2. Architecture | ✅ Draft v0.1 | Adapter registry, ingest pipeline, collection resolver, trigger engine, reconciliation, API surface |
| sec3_adapters.md | 3. Adapter System | ✅ Draft v0.1 | ExternalSourceAdapter ABC, field mapping config, vocabulary normalization, built-in stubs, error handling |
| sec4_audit.md | 4. Audit & Observability | ✅ Draft v0.1 | Structured log events, HarmonizationConflict provenance, health endpoint |
| sec5_workflows.md | 5. Collection Resolution Workflow | ✅ Draft v0.1 | Schema-driven traversal, selection strategies, partial failure, async resolution, CLI |
| sec6_nfr.md | 6. Non-Functional Requirements | ✅ Draft v0.1 | Performance targets, reliability, scalability tiers, extensibility, deployment, full cappella.yaml |
| sec7_testing.md | 7. Trigger Engine Test Strategy | ✅ Draft v0.1 | Behavior matrix for schedule/manual/internal_event (v0.1) and webhook/hippo_poll (v0.2); STARLIMS and HALO scenarios; test file map |
## Key Decisions (from Platform Design Sessions)
See platform/design/INDEX.md for full rationale and config format examples.
| Decision | Choice |
|---|---|
| Primary role | Integration and harmonization engine — the "conductor" coordinating all data sources into Hippo |
| Storage | Stateless — Hippo is the sole persistent store; Cappella owns no data |
| Hippo dependency | Required — Cappella cannot function without a Hippo instance |
| External adapter implementations | Live in Cappella (STARLIMS, HALO, REDCap, partner portals) — ExternalSourceAdapter ABC stays in Hippo |
| Field mapping and transformation config | Lives in Cappella adapter config — separate from Hippo's canonical schema |
| Trigger model | Unified source + action format: webhook, schedule, hippo_poll, manual, internal_event |
| Action chaining | Actions emit named internal events (emits:); triggers subscribe with type: internal_event |
| Event payload | Entity IDs only — downstream actions query Hippo fresh at execution time |
| Query-fresh-at-invocation | All actions read current Hippo state at execution time; never rely on stale trigger context |
| Hippo reactive triggers (MVP) | Polling (hippo_poll) — sufficient because Cappella drives most Hippo writes in normal operation |
| Hippo reactive triggers (future) | Hippo event hook plugin system — deferred |
| Idempotency (MVP) | Upsert by ExternalID — look up by source system ID, update if changed, create if absent |
| Idempotency (future) | Short-window digest deduplication for webhook retries — deferred to when live integrations are scoped |
| Out-of-order delivery | Deferred — not in MVP scope |
| Operational audit (MVP) | Structured JSON logs per trigger execution (run_id, trigger, adapter, status, entity counts, errors) |
| Operational audit (future) | SyncRun entities in Hippo — log schema mirrors Hippo entity shape for easy migration |
| Conditional triggers | when: CEL condition on trigger source — same expression language as validators |
| Tooling | cappella trigger explain <name> — shows full trigger chain, subscriptions, conditions |
| Artifact resolution | Delegated entirely to Canon — Cappella calls canon.resolve(entity_type, params) for per-sample/per-artifact work; Canon handles REUSE/FETCH/BUILD/FAIL internally |
| Workflow execution | NOT Cappella's concern — Canon handles all CWL execution; Cappella calls canon.resolve() for per-artifact work |
| Aggregate analysis | NOT Cappella's concern — Composer receives HarmonizedCollection and runs aggregate steps (DESeq2, CountsMatrix merge, etc.) |
| HarmonizedCollection | Cappella's primary output — resolved/unresolved entities with URIs, Canon decisions, and provenance in structured JSON |
| Canon relationship | Canon is Cappella's artifact resolution engine, NOT an external data source adapter. Structurally different from STARLIMS/REDCap adapters. |
| Partial failure | Never abort; collect unresolved items with structured reasons; caller decides acceptability |
| Generic adapters | CSVAdapter, JSONAdapter, XMLAdapter, SQLAdapter bundled in core — config-driven field/vocab mapping; SQLAdapter uses SQLAlchemy, query in config; covers most use cases without custom code |
| Custom adapter plugins | cappella.adapters entry points; handle complex auth/pagination/protocols; field mapping in code |
| Field mapping | In adapter config (for generic adapters) or adapter code (for custom plugins); not in cappella.yaml |
| Vocabulary normalization | In adapter config (for generic adapters) or adapter code (for custom plugins); not in cappella.yaml |
| Schema-driven traversal | Entity graph traversal inferred from Hippo schema references declarations; fallback to explicit paths in cappella.yaml for v0.1 |
| Canon transport (v0.1) | In-process (import canon directly); HTTP mode available for distributed deployment |
| Resolution API | Always async — POST /resolve validates immediately (400/422/202); client polls GET /resolve/{run_id}; CLI blocks with progress display; live samples_resolved counter for Aperture/Composer |
| CLI | In scope for v0.1 — cappella resolve/ingest/trigger/status/findings |
| Workflow executor strategy | Resolved — Canon owns per-artifact CWL execution; Cappella delegates to canon.resolve() |
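The trigger decisions above (unified source + action format, named internal events via emits:, CEL conditions via when:, entity-ID-only payloads) might look like the following cappella.yaml fragment. This is an illustrative sketch only: the trigger names, adapter name, and field keys such as cron, event, and payload.entity_count are assumptions, not the finalized config schema.

```yaml
# Hypothetical cappella.yaml fragment illustrating the unified trigger model.
triggers:
  nightly-starlims-sync:
    type: schedule             # one of: webhook, schedule, hippo_poll, manual, internal_event
    cron: "0 2 * * *"          # assumed schedule syntax
    action: ingest
    adapter: starlims
    emits: starlims-synced     # named internal event for action chaining

  resolve-after-sync:
    type: internal_event
    event: starlims-synced               # subscribes to the event emitted above
    when: "payload.entity_count > 0"     # CEL condition, same language as validators
    action: resolve
    # The event payload carries entity IDs only; the action queries Hippo
    # fresh at execution time rather than trusting stale trigger context.
```

Under this model, `cappella trigger explain resolve-after-sync` would show the full chain: the schedule trigger, the event it emits, and the condition gating the downstream action.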
## Open Questions
| Question | Priority | Notes |
|---|---|---|
| Workflow executor strategy | ~~High~~ Resolved | Canon owns per-artifact CWL execution. Cappella delegates artifact resolution to canon.resolve(). For v0.1 aggregate/cohort-level analyses (multi-sample inputs), Cappella invokes Canon's CWLExecutorAdapter directly and ingests results via OutputIngestionPipeline. Cappella does NOT wrap Nextflow/Snakemake — it reuses Canon's executor adapter layer. Aggregate Canon rules (v0.3) will eventually replace direct Cappella CWL execution. |
| Cappella adapter config format | Resolved ✅ | Adapter config format is finalized and implemented in v0.1. Each built-in adapter (CSV, JSON, XML, SQL) accepts entity_type, external_id_field, field_map, vocabulary_map, trust_level, and adapter-specific fields (source, url, records_path, records_xpath, connection_string, query, incremental_query) via a flat config: dict in cappella.yaml. Vocabulary normalization and field renaming are handled at the adapter level. Custom adapters implement their own config parsing. See sec3_adapters.md for full reference. |
| Idempotency for live synchronous integrations | High (deferred) | Full design when live integrations are scoped: webhook digest deduplication, out-of-order delivery, source timestamp handling. ExternalID upsert is the stable foundation. |
| Execution backend model | Medium | Local process vs. queue-backed worker vs. HPC submission. Must scale from laptop to cloud without code changes — consistent with Hippo's deployment tier model. |
| Trigger failure handling | Medium | Retry logic, dead-letter queue, alerting strategy for failed trigger executions. |
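The ExternalID upsert that anchors the MVP idempotency story (look up by source-system ID, update if changed, create if absent) can be sketched as follows. The `HippoStub` class and its dict-backed store are stand-ins invented for illustration; Hippo's actual API is not shown here.

```python
# Minimal sketch of upsert-by-ExternalID idempotency: replaying the same
# payload is a no-op, so webhook retries and repeated syncs are safe.
from dataclasses import dataclass, field


@dataclass
class HippoStub:
    """In-memory stand-in for Hippo, keyed by (source, external_id)."""
    records: dict = field(default_factory=dict)

    def upsert_by_external_id(self, source: str, external_id: str, data: dict) -> str:
        key = (source, external_id)
        existing = self.records.get(key)
        if existing is None:
            self.records[key] = dict(data)   # create if absent
            return "created"
        if existing != data:
            existing.update(data)            # update if changed
            return "updated"
        return "unchanged"                   # identical payload: no-op


store = HippoStub()
print(store.upsert_by_external_id("starlims", "S-001", {"status": "received"}))   # created
print(store.upsert_by_external_id("starlims", "S-001", {"status": "received"}))   # unchanged
print(store.upsert_by_external_id("starlims", "S-001", {"status": "sequenced"}))  # updated
```

Digest deduplication and out-of-order delivery (deferred above) would layer on top of this foundation without changing the upsert contract.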
✅ User docs have been updated to align with the v0.1 design and implementation. See cappella/docs/ for the current documentation.
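For orientation, the HarmonizedCollection output and the partial-failure policy (never abort; collect unresolved items with structured reasons; caller decides acceptability) can be sketched as a JSON shape. The field names, URI, and reason codes below are illustrative assumptions, not the spec's actual schema; see sec1_overview.md for the real format.

```python
# Illustrative shape of a HarmonizedCollection result under the
# partial-failure policy. All concrete values here are hypothetical.
import json

harmonized_collection = {
    "run_id": "resolve-000123",                       # hypothetical run identifier
    "resolved": [
        {
            "entity_id": "sample-42",
            "uri": "s3://bucket/sample-42/counts.h5",  # artifact URI from Canon
            "canon_decision": "REUSE",                 # REUSE / FETCH / BUILD / FAIL
        }
    ],
    "unresolved": [
        {
            "entity_id": "sample-43",
            "reason": {"code": "canon_fail", "detail": "no matching artifact rule"},
        }
    ],
}

# The run completes even with unresolved entities; the caller (e.g. Composer)
# inspects `unresolved` and decides whether the collection is acceptable.
print(json.dumps(harmonized_collection, indent=2))
```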