2. Architecture

Document status: Draft v0.1
Depends on: sec1_overview.md


2.1 Component Overview

Canon is structured as a three-layer pipeline. Each layer has a single responsibility and a clean interface to the layers around it.

┌─────────────────────────────────────────────────────────────────────┐
│                         Canon Pipeline                               │
│                                                                      │
│  ┌──────────────────────────────────────────────────────────────┐   │
│  │  1. Rules Layer                                               │   │
│  │                                                               │   │
│  │  canon_rules.yaml + *.canon.yaml sidecars                    │   │
│  │  RuleRegistry — maps (entity_type, params) → CWL workflow    │   │
│  │  Rule validation at startup                                   │   │
│  └──────────────────────────┬────────────────────────────────────┘   │
│                             │  matching rule                         │
│  ┌──────────────────────────▼────────────────────────────────────┐   │
│  │  2. Resolver Layer                                             │   │
│  │                                                               │   │
│  │  CanonResolver — the core engine                              │   │
│  │    ├── EntityRefResolver: ref:T{...} → Hippo UUID             │   │
│  │    ├── HippoQueryClient: does this entity exist?              │   │
│  │    ├── RecursivePlanner: resolve requires[], detect cycles    │   │
│  │    └── Decision: REUSE (found) or BUILD (not found)           │   │
│  └──────────────────────────┬────────────────────────────────────┘   │
│                             │  BUILD: resolved inputs + CWL path     │
│  ┌──────────────────────────▼────────────────────────────────────┐   │
│  │  3. Executor Layer                                             │   │
│  │                                                               │   │
│  │  CWLExecutorAdapter (ABC)                                     │   │
│  │    ├── CwltoolAdapter (bundled — only built-in adapter)       │   │
│  │    ├── canon-executor-toil, canon-executor-nextflow (plugins) │   │
│  │    └── (any canon.executor_adapters entry point)              │   │
│  │                                                               │   │
│  │  OutputIngestionPipeline                                      │   │
│  │    ├── Parse CWL output + .canon.yaml sidecar                 │   │
│  │    ├── Ingest output entity into Hippo                        │   │
│  │    └── Record WorkflowRun provenance entity                   │   │
│  └──────────────────────────┬────────────────────────────────────┘   │
│                             │  Hippo entity UUID + URI               │
└─────────────────────────────┼────────────────────────────────────────┘
                    ┌─────────▼──────────┐
                    │       Hippo        │
                    │  (entity registry) │
                    └────────────────────┘

Every canon get call enters the same pipeline. A REUSE decision short-circuits at the Resolver Layer, with no execution and no ingestion. A BUILD decision continues through to the Executor Layer.


2.2 Layer 1: Rules Layer

The Rules Layer is Canon's knowledge base — it knows how to produce artifacts.

canon_rules.yaml is the registry of production rules. Each rule declares:

- What it produces: entity type + identity parameters (scalars and entity references)
- What it requires: named input bindings, each resolved by a recursive canon get
- How to execute: path to a CWL workflow file + input parameter mappings

canon/
  canon_rules.yaml          # rule registry
  workflows/
    star_align.cwl           # CWL workflow
    star_align.canon.yaml    # Canon sidecar: output → Hippo entity mapping
    cutadapt.cwl
    cutadapt.canon.yaml
    htseq_count.cwl
    htseq_count.canon.yaml

Canon sidecar (.canon.yaml) — a small YAML file alongside each CWL workflow declaring what Hippo entity types the CWL outputs map to and which CWL output values become which Hippo entity fields:

# star_align.canon.yaml
outputs:
  bam:
    entity_type: AlignmentFile
    identity_fields:          # these fields uniquely identify the artifact
      - aligner
      - genome_build
      - sample
    hippo_fields:             # CWL output value → Hippo entity field mappings
      uri: "{outputs.bam.location}"
      aligner: "{inputs.aligner}"
      genome_build: "{inputs.genome_build}"
      sample: "{inputs.sample}"
      read_count: "{outputs.bam.stats.mapped_reads}"  # optional scalar

The sidecar keeps CWL files entirely standard and valid — no Canon-specific extensions to the CWL format itself.
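To make the hippo_fields mapping concrete, here is a minimal sketch of how placeholders such as "{outputs.bam.location}" could be rendered against the CWL inputs and outputs. The helper names (lookup, render_fields) and the nested-dict layout of the context are assumptions for illustration, not Canon's actual implementation.

```python
import re
from typing import Any

# Matches a whole-string placeholder such as "{outputs.bam.location}".
_PLACEHOLDER = re.compile(r"^\{([A-Za-z0-9_.]+)\}$")


def lookup(context: dict[str, Any], dotted_path: str) -> Any:
    """Walk a dotted path like 'outputs.bam.location' through nested dicts."""
    value: Any = context
    for key in dotted_path.split("."):
        value = value[key]  # raises KeyError if the path does not exist
    return value


def render_fields(hippo_fields: dict[str, str], context: dict[str, Any]) -> dict[str, Any]:
    """Resolve every placeholder in a sidecar's hippo_fields mapping (hypothetical helper)."""
    rendered = {}
    for field, template in hippo_fields.items():
        match = _PLACEHOLDER.match(template)
        # Strings that are not placeholders are passed through as literals.
        rendered[field] = lookup(context, match.group(1)) if match else template
    return rendered


context = {
    "inputs": {"aligner": "toolv-456", "genome_build": "gbuild-123", "sample": "sample-789"},
    "outputs": {"bam": {"location": "s3://bucket/alignments/AD001_GRCh38_STAR.bam"}},
}
fields = render_fields(
    {"uri": "{outputs.bam.location}", "aligner": "{inputs.aligner}"},
    context,
)
print(fields)
```

A missing path raises KeyError here; a real pipeline would probably want to distinguish required fields from optional scalars such as read_count.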

RuleRegistry loads canon_rules.yaml at startup, validates all rules (no duplicate produces specs, all referenced CWL files exist, all sidecars present), and provides find_rule(entity_type, params) → Rule | None.
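The duplicate check and the lookup can be sketched as follows. This is a simplified model under stated assumptions: the Rule dataclass, the use of ValueError, and matching on the set of identity parameter names are all illustrative choices, and the file-existence checks for CWL files and sidecars are omitted.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Rule:
    # Hypothetical, simplified rule model: the real one also carries
    # requires[] bindings and the CWL workflow path.
    name: str
    entity_type: str
    param_names: frozenset


class RuleRegistry:
    def __init__(self, rules: list):
        self._by_spec = {}
        for rule in rules:
            spec = (rule.entity_type, rule.param_names)
            if spec in self._by_spec:
                # Two rules claiming the same produces spec is a startup error.
                raise ValueError(
                    f"duplicate produces spec: {rule.name} and {self._by_spec[spec].name}"
                )
            self._by_spec[spec] = rule

    def find_rule(self, entity_type: str, params: dict) -> Optional[Rule]:
        """Return the rule whose produces spec matches, or None."""
        # frozenset(params) collects the parameter names (the dict keys).
        return self._by_spec.get((entity_type, frozenset(params)))


registry = RuleRegistry([
    Rule("align_reads", "AlignmentFile", frozenset({"aligner", "genome_build", "sample"})),
])
rule = registry.find_rule(
    "AlignmentFile",
    {"aligner": "toolv-456", "genome_build": "gbuild-123", "sample": "sample-789"},
)
print(rule.name if rule else None)
```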


2.3 Layer 2: Resolver Layer

The Resolver Layer is the core of Canon. It is the only layer that understands the difference between REUSE and BUILD.

EntityRefResolver resolves entity reference parameters before any Hippo query. ref:ToolVersion{tool.name=STAR, version=2.7.11a} becomes a Hippo UUID by querying the entity registry. Dot notation traverses reference fields. Each reference must match exactly one entity: zero matches or multiple matches raise CanonResolutionError.
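As an illustration of the reference syntax and the exactly-one-match rule, the sketch below parses a ref:T{...} string and resolves it through a caller-supplied search function. The parser grammar is inferred from the examples in this section, and parse_ref, resolve_ref, and the search callable are hypothetical names; CanonResolutionError is redefined locally as a stand-in.

```python
import re

_REF = re.compile(r"^ref:(?P<type>\w+)\{(?P<body>[^{}]*)\}$")


class CanonResolutionError(Exception):
    """Stand-in for Canon's exception of the same name."""


def parse_ref(text: str) -> tuple:
    """Split 'ref:Type{k=v, ...}' into (entity_type, {field_path: value})."""
    match = _REF.match(text.strip())
    if match is None:
        raise ValueError(f"not an entity reference: {text!r}")
    criteria = {}
    for pair in match.group("body").split(","):
        key, sep, value = pair.partition("=")
        if not sep:
            raise ValueError(f"malformed criterion: {pair!r}")
        criteria[key.strip()] = value.strip()
    return match.group("type"), criteria


def resolve_ref(text: str, search) -> str:
    """Resolve a reference to exactly one UUID.

    `search(entity_type, criteria)` is a hypothetical callable returning a
    list of matching UUIDs; in Canon this would be a Hippo query.
    """
    entity_type, criteria = parse_ref(text)
    matches = search(entity_type, criteria)
    if len(matches) != 1:
        raise CanonResolutionError(
            f"{text}: expected exactly 1 match, found {len(matches)}"
        )
    return matches[0]


parsed = parse_ref("ref:ToolVersion{tool.name=STAR, version=2.7.11a}")
print(parsed)
```

Dotted keys such as tool.name are kept as plain strings here; traversing the referenced Tool entity is left to the search function.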

HippoQueryClient wraps the Hippo REST API for Canon's two read operations:

- find_entity(entity_type, params) → Entity | None (does this artifact exist?)
- get_entity(entity_id) → Entity (fetch by UUID)

RecursivePlanner is the central algorithm (described in full in sec4):

def resolve(entity_type, params) -> URI:
    # 1. Resolve all entity references in params to UUIDs
    resolved_params = entity_ref_resolver.resolve(params)

    # 2. Check Hippo: does this artifact already exist?
    entity = hippo.find_entity(entity_type, resolved_params)
    if entity:
        return entity.uri  # REUSE

    # 3. Find a rule that can produce it
    rule = rule_registry.find_rule(entity_type, resolved_params)
    if not rule:
        raise CanonNoRuleError(entity_type, resolved_params)

    # 4. Resolve all required inputs (recursive)
    inputs = {}
    for binding in rule.requires:
        inputs[binding.name] = resolve(binding.entity_type, binding.params)

    # 5. Execute (Layer 3)
    output_uri = executor.run(rule.cwl_workflow, inputs, resolved_params)
    return output_uri

Cycle detection runs during recursive resolution using a grey-set (in-progress) tracker. If resolve(A) triggers resolve(B) which triggers resolve(A), Canon raises CanonCycleError with the full cycle path before any execution begins.
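The grey-set idea can be sketched as a depth-first walk over requires[] edges. This is an illustration under simplifying assumptions: the requires graph is a flat dict rather than a rule lookup, the in-progress set is kept as a list so the cycle path can be reported in order, and CanonCycleError is a local stand-in.

```python
class CanonCycleError(Exception):
    """Stand-in for Canon's exception; carries the full cycle path."""

    def __init__(self, path):
        self.path = path
        super().__init__(" -> ".join(path))


def plan_order(target: str, requires: dict) -> list:
    """Walk requires[] edges depth-first and return a build order.

    `requires` maps an artifact key to the keys it needs (hypothetical,
    simplified form). Dependencies come before the artifacts that need them.
    """
    grey = []      # artifacts currently being resolved, in call order
    done = set()   # artifacts fully resolved
    order = []

    def visit(node):
        if node in done:
            return
        if node in grey:
            # Re-entering an in-progress node closes a cycle.
            raise CanonCycleError(grey[grey.index(node):] + [node])
        grey.append(node)
        for dep in requires.get(node, []):
            visit(dep)
        grey.pop()
        done.add(node)
        order.append(node)

    visit(target)
    return order


order = plan_order("AlignmentFile", {"AlignmentFile": ["FastqFile", "StarIndex"]})
print(order)
```

Because the walk completes before anything is returned, a caller that only executes after plan_order succeeds never starts a workflow in a cyclic plan.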


2.4 Layer 3: Executor Layer

The Executor Layer runs CWL workflows and ingests their outputs. It has no knowledge of Canon rules or entity references — it receives concrete file paths and parameter values.

CWLExecutorAdapter (ABC):

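from __future__ import annotations  # CWLRunResult lives alongside this class in executors/base.py

from abc import ABC, abstractmethod
from pathlib import Path
from typing import Any


class CWLExecutorAdapter(ABC):
    @abstractmethod
    def run(
        self,
        cwl_path: Path,
        inputs: dict[str, Any],   # concrete values, no entity refs
        work_dir: Path,
    ) -> CWLRunResult:
        """Execute the CWL workflow. Returns output file locations and metadata."""

    @abstractmethod
    def version(self) -> str:
        """CWL runner version string, recorded in WorkflowRun provenance."""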

Adapters:

- CwltoolAdapter (bundled, the only built-in adapter): invokes cwltool <workflow.cwl> <inputs.json> as a subprocess, captures stdout/stderr, and parses the JSON output object.
- All other adapters (Toil, Nextflow, etc.) are installable plugin packages, not bundled (pip install canon-executor-toil, pip install canon-executor-nextflow, etc.).
- Plugins are discovered via the entry point group canon.executor_adapters.

OutputIngestionPipeline runs after a successful CWL execution:

1. Reads the CWL JSON output (file locations, checksums, any scalar outputs)
2. Reads the .canon.yaml sidecar to determine the Hippo entity type and field mappings
3. Constructs the Hippo entity payload and sends it to POST /entities
4. Creates a WorkflowRun entity in Hippo recording the full execution provenance

WorkflowRun entity (recorded for every BUILD execution):

entity_type: WorkflowRun
data:
  rule_name: align_reads
  cwl_workflow: "workflows/star_align.cwl"
  cwl_workflow_hash: "sha256:abc123..."
  cwl_runner: cwltool
  cwl_runner_version: "3.1.20240112164112"
  execution_environment:
    type: docker                                    # or singularity, conda, module, local
    image: "quay.io/biocontainers/star:2.7.11a"
    digest: "sha256:def456..."
  inputs:                                           # resolved concrete values
    fastq: "drs://bass.lab.org/fastq-uuid"
    genome_index: "s3://bucket/star_index_GRCh38"
  output_entity_id: "alignment-uuid"
  started_at: "2026-03-24T09:00:00Z"
  completed_at: "2026-03-24T09:45:00Z"
  status: completed                                 # completed | failed
  exit_code: 0
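A record of this shape could be assembled roughly as below. This is a sketch, not Canon's provenance code: build_workflow_run is a hypothetical helper, the hash is assumed to be SHA-256 over the workflow file's bytes, the status is assumed to follow the exit code, and the execution_environment block is left out for brevity.

```python
import hashlib
from datetime import datetime, timezone


def content_hash(data: bytes) -> str:
    """Return 'sha256:<hex digest>' for the given bytes (assumed hashing scheme)."""
    return "sha256:" + hashlib.sha256(data).hexdigest()


def utc_timestamp(moment: datetime) -> str:
    """Format an aware datetime like the started_at / completed_at fields."""
    return moment.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


def build_workflow_run(rule_name, cwl_workflow, cwl_bytes, runner, runner_version,
                       inputs, output_entity_id, started_at, completed_at, exit_code):
    """Assemble a WorkflowRun payload as a plain dict (hypothetical helper)."""
    return {
        "entity_type": "WorkflowRun",
        "data": {
            "rule_name": rule_name,
            "cwl_workflow": cwl_workflow,
            "cwl_workflow_hash": content_hash(cwl_bytes),
            "cwl_runner": runner,
            "cwl_runner_version": runner_version,
            "inputs": dict(inputs),
            "output_entity_id": output_entity_id,
            "started_at": utc_timestamp(started_at),
            "completed_at": utc_timestamp(completed_at),
            # Assumption: a zero exit code means "completed", anything else "failed".
            "status": "completed" if exit_code == 0 else "failed",
            "exit_code": exit_code,
        },
    }


record = build_workflow_run(
    "align_reads", "workflows/star_align.cwl", b"cwlVersion: v1.2\n",
    "cwltool", "3.1.20240112164112",
    {"fastq": "drs://bass.lab.org/fastq-uuid"}, "alignment-uuid",
    datetime(2026, 3, 24, 9, 0, tzinfo=timezone.utc),
    datetime(2026, 3, 24, 9, 45, tzinfo=timezone.utc),
    0,
)
print(record["data"]["status"], record["data"]["started_at"])
```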

2.5 Package Structure

canon/
├── config.py                  # CanonConfig: load from canon.yaml
├── exceptions.py              # Exception hierarchy
├── types.py                   # Spec, EntityRef, ResolvedInput, WildcardBinding
├── rules/
│   ├── models.py              # ProductionRule, InputBinding, ProducesSpec, ExecuteSpec
│   ├── loader.py              # RulesLoader: parse + validate canon_rules.yaml
│   └── registry.py            # RuleRegistry: find_rule(entity_type, params)
├── resolver/
│   ├── entity_ref.py          # EntityRefResolver: ref:T{...} → Hippo UUID
│   ├── hippo_client.py        # HippoQueryClient: find_entity, get_entity, ingest_entity
│   └── planner.py             # RecursivePlanner: resolve(), cycle detection
├── executors/
│   ├── base.py                # CWLExecutorAdapter ABC, CWLRunResult
│   └── cwltool.py             # CwltoolAdapter (bundled, only built-in adapter)
├── ingestion/
│   ├── sidecar.py             # Parse .canon.yaml sidecar files
│   ├── pipeline.py            # OutputIngestionPipeline: CWL output → Hippo entity
│   └── provenance.py          # WorkflowRun entity construction + POST
└── cli/
    ├── main.py                # Typer app entry point
    └── commands/
        ├── get.py             # canon get — resolve one artifact
        ├── plan.py            # canon plan — dry run, show REUSE/BUILD decisions
        ├── rules.py           # canon rules list/validate
        └── status.py          # canon status — recent WorkflowRun entities from Hippo

2.6 Data Flow: REUSE path

User: canon get AlignmentFile \
        --param genome_build="ref:GenomeBuild{name=GRCh38}" \
        --param aligner="ref:ToolVersion{tool.name=STAR,version=2.7.11a}" \
        --param sample="ref:Sample{id=AD001}"

1. EntityRefResolver resolves each ref:... to a Hippo UUID
   genome_build → uuid:gbuild-123
   aligner      → uuid:toolv-456
   sample       → uuid:sample-789

2. HippoQueryClient queries Hippo:
   GET /entities?entity_type=AlignmentFile
     &genome_build=gbuild-123&aligner=toolv-456&sample=sample-789

3. Found: AlignmentFile entity uuid:align-abc
   URI: s3://bucket/alignments/AD001_GRCh38_STAR.bam

4. Return: s3://bucket/alignments/AD001_GRCh38_STAR.bam
   (No execution performed)
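The lookup in step 2 amounts to building a filtered query against GET /entities. A minimal sketch, assuming the query parameters are exactly the entity type plus the resolved identity parameters as shown in the trace above (find_entity_url is a hypothetical helper):

```python
from urllib.parse import urlencode


def find_entity_url(base_url: str, entity_type: str, resolved_params: dict) -> str:
    """Build the GET /entities URL for an existence check."""
    query = {"entity_type": entity_type, **resolved_params}
    return f"{base_url.rstrip('/')}/entities?{urlencode(query)}"


url = find_entity_url(
    "http://127.0.0.1:8000",
    "AlignmentFile",
    {"genome_build": "gbuild-123", "aligner": "toolv-456", "sample": "sample-789"},
)
print(url)
```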

2.7 Data Flow: BUILD path

User: canon get AlignmentFile \
        --param genome_build="ref:GenomeBuild{name=GRCh38}" \
        --param aligner="ref:ToolVersion{tool.name=STAR,version=2.7.11a}" \
        --param sample="ref:Sample{id=AD002}"    ← new sample, not yet aligned

1. EntityRefResolver resolves refs → UUIDs (as above)

2. HippoQueryClient queries Hippo → Not found

3. RuleRegistry.find_rule("AlignmentFile", {...}) → align_reads rule

4. RecursivePlanner resolves requires[]:
   a. canon get FastqFile{sample=sample-002}
      → Found in Hippo: s3://bucket/fastq/AD002_R1.fastq.gz  (REUSE)
   b. canon get StarIndex{genome_build=gbuild-123, aligner=toolv-456}
      → Found in Hippo: s3://bucket/indices/GRCh38_STAR_2.7.11a/  (REUSE)

5. CwltoolAdapter.run(
     cwl_path="workflows/star_align.cwl",
     inputs={
       fastq: "s3://bucket/fastq/AD002_R1.fastq.gz",
       genome_index: "s3://bucket/indices/GRCh38_STAR_2.7.11a/",
       genome_build: "GRCh38",
       aligner_version: "2.7.11a",
       sample_id: "AD002"
     },
     work_dir=".canon/work/align-runs/20260324-090000"
   )
   → CWL runs STAR, produces AD002_GRCh38_STAR.bam

6. OutputIngestionPipeline:
   a. Parse star_align.canon.yaml sidecar
   b. POST /entities → AlignmentFile{uri=s3://..., genome_build=gbuild-123, ...}
      → uuid:align-def

7. ProvenanceRecorder:
   POST /entities → WorkflowRun{rule=align_reads, cwl=star_align.cwl,
                                runner=cwltool/3.1, env=docker/sha256:...,
                                output=align-def, status=completed}

8. Return: s3://bucket/alignments/AD002_GRCh38_STAR.bam

2.8 Hippo Entity Types Used by Canon

Canon relies on the following entity types being present in the Hippo deployment. These are defined in the Canon Hippo reference schema, which is bundled inside the canon package and registered as a hippo.reference_loaders entry point.

Installing Canon's schema into a Hippo deployment:

pip install canon          # installs canon + bundles the reference schema loader
hippo reference install canon   # writes Tool, ToolVersion, etc. into Hippo's schema + migrates

hippo reference install canon only needs to be run once per Hippo deployment (or after upgrading Canon to a new version). Canon raises CanonConfigError at startup if these types are not found in the Hippo instance it is configured to use.
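The startup check can be pictured as a set difference between the entity types Canon needs and the types the Hippo instance reports. In this sketch, check_schema is a hypothetical helper, how the available types are fetched from Hippo is left open, and CanonConfigError is a local stand-in for Canon's exception.

```python
# The five Canon entity types listed in this section.
REQUIRED_TYPES = frozenset(
    {"Tool", "ToolVersion", "GenomeBuild", "GeneAnnotation", "WorkflowRun"}
)


class CanonConfigError(Exception):
    """Stand-in for Canon's exception of the same name."""


def check_schema(available_types) -> None:
    """Raise CanonConfigError if any required entity type is absent."""
    missing = sorted(REQUIRED_TYPES - set(available_types))
    if missing:
        raise CanonConfigError(
            "Hippo schema is missing Canon entity types: "
            + ", ".join(missing)
            + " (run: hippo reference install canon)"
        )


# Passes silently when every required type is present, extra types are fine.
check_schema(REQUIRED_TYPES | {"AlignmentFile", "FastqFile"})
```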

The Canon package and its Hippo reference schema are versioned and released together. Schema changes always require a new Canon release — there is no independent schema version.

| Entity Type    | Base Type | Purpose                                                                |
|----------------|-----------|------------------------------------------------------------------------|
| Tool           |           | Software tool identity (name, category, bio.tools ID)                  |
| ToolVersion    | Tool      | Specific version of a tool. Required in all Canon rules.               |
| GenomeBuild    |           | Reference genome assembly (name, patch, species, UCSC/NCBI accession)  |
| GeneAnnotation |           | Gene annotation release (source, version, genome build ref)            |
| WorkflowRun    |           | Canon execution provenance record                                      |

All domain entity types (AlignmentFile, FastqFile, StarIndex, CountsMatrix, etc.) are deployment-specific and defined in the user's Hippo schema configuration — not by Canon.

Hippo's single-inheritance polymorphism means client.query("Tool") returns both Tool and ToolVersion entities. Canon always queries ToolVersion directly — exact version matching is always required.
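The effect of single-inheritance polymorphism on queries can be illustrated with a small in-memory model. This is not Hippo's client API: BASE_TYPE, is_instance_of, and query are hypothetical names used only to show why a query for Tool is too broad when an exact version is needed.

```python
# Child -> parent links for Canon's entity types (only ToolVersion has a base type).
BASE_TYPE = {"ToolVersion": "Tool"}


def is_instance_of(entity_type: str, queried_type: str) -> bool:
    """True if entity_type is queried_type or inherits from it (single inheritance)."""
    current = entity_type
    while current is not None:
        if current == queried_type:
            return True
        current = BASE_TYPE.get(current)
    return False


def query(entities: list, queried_type: str) -> list:
    """In-memory stand-in for a polymorphic type query."""
    return [e for e in entities if is_instance_of(e["entity_type"], queried_type)]


entities = [
    {"entity_type": "Tool", "name": "STAR"},
    {"entity_type": "ToolVersion", "name": "STAR", "version": "2.7.11a"},
]
broad = query(entities, "Tool")          # matches both entities
exact = query(entities, "ToolVersion")   # matches only the versioned one
print(len(broad), len(exact))
```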

See sec6_hippo_integration.md for the full field-level schema of each Canon entity type.


2.9 Configuration

Canon is configured via canon.yaml in the project directory:

# Minimum required configuration
hippo_url: "http://127.0.0.1:8000"
hippo_token: "dev-token"
executor: cwltool
rules_file: canon_rules.yaml

# Optional
work_dir: .canon/work
cwltool_options:
  - "--no-container"     # for local runs without Docker

executor selects the CWL runner adapter. cwltool is the default and is bundled with Canon — no additional install needed. Additional adapters are installed as Python packages and discovered via the canon.executor_adapters entry point group.
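A loader would typically validate the parsed file before anything else runs. The sketch below checks an already-parsed mapping, so it needs no YAML library. It treats the four keys under "Minimum required configuration" as mandatory, which is an assumption drawn from that comment, and validate_config and the local CanonConfigError are hypothetical stand-ins.

```python
REQUIRED_KEYS = ("hippo_url", "hippo_token", "executor", "rules_file")


class CanonConfigError(Exception):
    """Stand-in for Canon's exception of the same name."""


def validate_config(raw: dict) -> dict:
    """Return a copy of the parsed canon.yaml mapping, or raise if keys are missing."""
    missing = [key for key in REQUIRED_KEYS if key not in raw]
    if missing:
        raise CanonConfigError("canon.yaml is missing: " + ", ".join(missing))
    # Optional keys such as work_dir and cwltool_options pass through untouched.
    return dict(raw)


config = validate_config({
    "hippo_url": "http://127.0.0.1:8000",
    "hippo_token": "dev-token",
    "executor": "cwltool",
    "rules_file": "canon_rules.yaml",
    "cwltool_options": ["--no-container"],
})
print(config["executor"])
```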

See reference_canon_yaml.md for the complete configuration schema.