Non-Functional Requirements

7. Non-Functional Requirements

Document status: Draft v0.1
Depends on: sec1_overview.md, sec2_architecture.md

7.1 Reproducibility

Reproducibility is Canon's primary non-functional requirement. Every artifact Canon produces must be reproducible from its provenance record alone.

Requirements:

Every produced artifact has a WorkflowRun entity in Hippo recording: CWL workflow file path + SHA256 hash, CWL runner name + version, execution environment type + image digest/hash, all input entity UUIDs, all parameters
CWL workflow file hashes are captured at execution time — if the workflow file changes after execution, the hash in WorkflowRun still reflects what was actually run
Container image digests are captured (not just tags) — sha256:abc123 is reproducible; latest is not
Canon raises CanonRuleValidationError at startup if any tool reference in a rule lacks a version: a bare reference such as "STAR" with no version is rejected
All entity reference parameters are stored as Hippo UUIDs in the produced entity's metadata — UUIDs are stable identifiers; names can change
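
The hash and digest requirements above can be sketched as follows. This is a minimal illustration, not Canon's implementation: `file_sha256` and `is_digest_pinned` are hypothetical helper names, and the digest pattern assumes the usual `name@sha256:<64 hex characters>` image reference form.

```python
import hashlib
import re
from pathlib import Path

# Hypothetical helpers for illustration; Canon's internals may differ.
# Assumes a pinned image reference looks like name@sha256:<64 hex chars>.
_DIGEST_RE = re.compile(r"^[^@\s]+@sha256:[0-9a-f]{64}$")


def file_sha256(path: Path) -> str:
    """Return the hex SHA256 of a file, read in chunks to bound memory use."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_digest_pinned(image_ref: str) -> bool:
    """True only when the image is referenced by digest, never by tag alone."""
    return _DIGEST_RE.match(image_ref) is not None
```

Computing the hash at execution time, rather than reading it from the rule file later, is what keeps the record accurate if the workflow file is edited afterwards.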

Achieving reproducibility from a WorkflowRun record:

Given a WorkflowRun entity, a researcher can reproduce the execution by:

1. Obtaining the CWL file at the recorded hash (from git history or a CWL registry)
2. Pulling the exact container image by digest
3. Resolving all input entity UUIDs to their current URIs
4. Running cwltool (or the recorded runner) with the same inputs

This is possible without Canon itself — the WorkflowRun is a self-contained recipe.
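
As a rough sketch of the first and last reproduction steps, the code below checks a local CWL file against the recorded hash and then builds the runner command. The `WorkflowRunRecord` field names are assumptions for illustration and do not reflect the actual WorkflowRun schema; pulling the image and resolving UUIDs are left out.

```python
import hashlib
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class WorkflowRunRecord:
    # Field names are illustrative, not the real WorkflowRun schema.
    workflow_sha256: str
    runner: str  # e.g. "cwltool"
    image_digest: str
    input_uuids: tuple[str, ...]


def check_workflow_matches(record: WorkflowRunRecord, workflow_path: Path) -> None:
    """Raise ValueError if the local CWL file is not the one that was run."""
    actual = hashlib.sha256(workflow_path.read_bytes()).hexdigest()
    if actual != record.workflow_sha256:
        raise ValueError(
            f"{workflow_path} has hash {actual}, expected {record.workflow_sha256}"
        )


def replay_command(
    record: WorkflowRunRecord, workflow_path: Path, job_path: Path
) -> list[str]:
    """Build the runner invocation: <runner> <workflow file> <job file>."""
    check_workflow_matches(record, workflow_path)
    return [record.runner, str(workflow_path), str(job_path)]
```

The returned list can be passed to `subprocess.run` once the job file has been written with the resolved input URIs and recorded parameters.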

7.2 Idempotency

Canon's canon get operation is idempotent: calling it multiple times with the same specification produces the same result without duplicating computation or storage.

Requirements:

A canon get call that finds an existing Hippo entity always returns that entity's URI without running any computation
A canon get call that finds an in-progress WorkflowRun raises CanonExecutorError rather than launching a duplicate execution
A canon get call after a failed WorkflowRun re-runs the workflow (failed results are not cached)
Two concurrent canon get calls for the same spec are safe in a single-instance deployment: one runs the workflow, and the other either finds the completed result (REUSE) or finds the in-progress run and raises CanonExecutorError
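
The requirements above amount to a three-way decision, sketched below. The `decide` function and `Decision` enum are hypothetical; the exception class is a stand-in for Canon's CanonExecutorError, and the state strings follow the running/completed/failed states listed for canon status.

```python
from enum import Enum
from typing import Optional


class CanonExecutorError(RuntimeError):
    """Stand-in for Canon's exception of the same name."""


class Decision(Enum):
    REUSE = "REUSE"
    BUILD = "BUILD"


def decide(existing_uri: Optional[str], run_states: list[str]) -> Decision:
    """Choose REUSE or BUILD for one spec, refusing to duplicate a live run.

    existing_uri: URI of a matching Hippo entity, or None if there is none.
    run_states:   states of WorkflowRuns recorded for this spec.
    """
    if existing_uri is not None:
        return Decision.REUSE  # an existing entity always wins, no computation
    if "running" in run_states:
        raise CanonExecutorError("a WorkflowRun for this spec is in progress")
    # No entity and no live run. Failed runs are not cached, so rebuild.
    return Decision.BUILD
```

For example, `decide(None, ["failed"])` yields `Decision.BUILD`, while `decide(None, ["running"])` raises.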

Idempotency is not guaranteed in concurrent deployments without a distributed lock. In v0.1, Canon relies on Hippo's atomic write semantics for single-instance deployments. Multi-instance Canon deployments (e.g. two Cappella workers calling Canon in parallel for the same spec) may produce duplicate executions. Distributed locking is deferred to v0.2.

7.3 Correctness

Canon must never silently return an artifact that does not match the requested specification.

Requirements:

All field comparisons in Hippo queries are exact match — Canon never uses fuzzy matching or range queries in v0.1
Entity reference resolution raises CanonResolutionError on zero matches or multiple matches — Canon never silently picks one from many
Tool version is always required in rules — Canon never resolves "any STAR" to a specific version without explicit declaration
Startup validation of unpropagated wildcards ensures that upstream parameters are always preserved in produced artifact metadata; Canon never silently drops provenance
Sidecar identity_fields are validated against the rule's produces.match at startup — a mismatch between what Canon queries and what it stores is a startup error
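
The exact-match and exactly-one rules can be illustrated with a small sketch. It filters in-memory dictionaries rather than querying Hippo, `exact_matches` and `resolve_exactly_one` are hypothetical names, and the exception class is a stand-in for Canon's CanonResolutionError.

```python
class CanonResolutionError(LookupError):
    """Stand-in for Canon's exception of the same name."""


def exact_matches(entities: list[dict], criteria: dict) -> list[dict]:
    """Keep entities whose fields equal every criterion; no fuzzy or range logic."""
    return [
        entity
        for entity in entities
        if all(key in entity and entity[key] == value for key, value in criteria.items())
    ]


def resolve_exactly_one(entities: list[dict], criteria: dict) -> dict:
    """Return the single matching entity, or raise on zero or several matches."""
    matches = exact_matches(entities, criteria)
    if not matches:
        raise CanonResolutionError(f"no entity matches {criteria!r}")
    if len(matches) > 1:
        raise CanonResolutionError(
            f"{len(matches)} entities match {criteria!r}; refusing to pick one"
        )
    return matches[0]
```

Raising on ambiguity, instead of returning the first hit, is what prevents a caller from silently receiving the wrong artifact.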

7.4 Observability

Canon provides sufficient visibility into its operations for debugging and monitoring.

Requirements:

canon plan provides a dry-run view of the full REUSE/BUILD decision tree before any execution
canon status shows recent WorkflowRun entities from Hippo (running, completed, failed) with timing information
All Canon operations are logged at configurable levels (DEBUG, INFO, WARNING, ERROR) to stderr
DEBUG logging includes: Hippo queries + response times, rule matching decisions, wildcard bindings, entity ref resolutions
INFO logging includes: REUSE/BUILD decisions, CWL execution start/end, ingestion confirmation
CWL runner stderr is captured in WorkflowRun.stderr (truncated to 64KB) for post-execution debugging
Failed WorkflowRun entities in Hippo are queryable — canon status --failed shows all failures
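
A minimal sketch of the logging and stderr-capture requirements, using Python's standard `logging` module. Two points are assumptions, since the requirements do not specify them: "64KB" is read as 65,536 bytes, and truncation keeps the tail of the stream on the grounds that errors usually appear last. The logger name `canon` and both function names are also illustrative.

```python
import logging
import sys

MAX_STDERR_BYTES = 64 * 1024  # assumption: "64KB" means 65,536 bytes

_LEVELS = {
    "DEBUG": logging.DEBUG,
    "INFO": logging.INFO,
    "WARNING": logging.WARNING,
    "ERROR": logging.ERROR,
}


def configure_logging(level_name: str) -> logging.Logger:
    """Attach one stderr handler to the 'canon' logger at the requested level."""
    try:
        level = _LEVELS[level_name.upper()]
    except KeyError:
        raise ValueError(f"unsupported log level: {level_name!r}") from None
    logger = logging.getLogger("canon")
    logger.setLevel(level)
    if not logger.handlers:  # avoid duplicate handlers on repeated calls
        handler = logging.StreamHandler(sys.stderr)
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")
        )
        logger.addHandler(handler)
    return logger


def truncate_stderr(data: bytes, limit: int = MAX_STDERR_BYTES) -> bytes:
    """Keep at most `limit` bytes, dropping the oldest output (assumed policy)."""
    return data if len(data) <= limit else data[-limit:]
```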

7.5 Performance

Canon is not a high-throughput system. Its performance envelope is appropriate for research workloads, not real-time services.

Targets:

Operation	Target latency	Notes
REUSE (Hippo query hit)	< 500ms	Dominated by Hippo network roundtrip
Entity ref resolution	< 200ms per ref	One Hippo query per `ref:T{...}` expression
Rule matching	< 10ms	In-memory after startup
CWL execution	Minutes to hours	Determined by the workflow, not Canon
Canon startup (rule validation)	< 5s for 100 rules	CWL file validation + Hippo schema check

Canon does not cache Hippo query results within a single canon get call. Each entity ref and each registry lookup is a fresh Hippo query. This ensures correctness — the registry may have been updated by another process between queries — at the cost of additional network roundtrips. Caching may be added in v0.2 with a configurable TTL.

Sequential input resolution in v0.1. Required inputs are resolved one at a time in the order declared in requires:. Parallel resolution (resolving independent inputs concurrently) is deferred to v0.2. For most pipelines with 2–4 inputs per rule, the overhead is negligible compared to CWL execution time.
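
The v0.1 behaviour described above (declared order, one query per input, no cache) can be sketched as a plain loop. `resolve_inputs` and the `query` callable are hypothetical; `query` stands in for whatever performs the Hippo lookup.

```python
from typing import Callable


def resolve_inputs(requires: list[str], query: Callable[[str], str]) -> dict[str, str]:
    """Resolve each required input in declared order with a fresh query.

    Nothing is cached between iterations, so a registry update made by
    another process between two lookups is always observed.
    """
    resolved: dict[str, str] = {}
    for ref in requires:
        resolved[ref] = query(ref)  # one network roundtrip per input
    return resolved
```

With 2 to 4 inputs and a sub-second roundtrip each, the total stays small next to a CWL execution that runs for minutes or hours.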

7.6 Security

Canon inherits Hippo's security model for all data access.

Requirements:

The hippo_token in canon.yaml must have read+write access to all entity types used by Canon rules — Canon does not perform partial-permission graceful degradation
canon.yaml should not be committed to version control if it contains credentials — use environment variable substitution: hippo_token: "${HIPPO_TOKEN}"
Canon does not execute arbitrary code from canon_rules.yaml — rules are data, not code; wildcard values are never evaluated as expressions
CWL ExpressionTool (JavaScript) is supported but discouraged — JavaScript execution in CWL steps is outside Canon's security perimeter; restrict with cwltool_options: ["--disable-ext"] if needed
Staging directory (work_dir/staging) should be on a filesystem inaccessible to other users if input data is sensitive
Canon does not implement authentication beyond forwarding hippo_token — access control for produced artifacts is Hippo's responsibility
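
Environment variable substitution of the `${HIPPO_TOKEN}` form could work along these lines. This is a sketch under assumptions: `substitute_env` is a hypothetical name, only the `${NAME}` syntax is handled, and an unset variable is treated as an error instead of expanding to an empty string.

```python
import os
import re
from typing import Mapping, Optional

_VAR_RE = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def substitute_env(value: str, env: Optional[Mapping[str, str]] = None) -> str:
    """Replace each ${NAME} in a config value with the variable's content."""
    source = os.environ if env is None else env

    def replace(match: re.Match) -> str:
        name = match.group(1)
        if name not in source:
            # Failing loudly avoids sending an empty token to Hippo.
            raise KeyError(f"environment variable {name} is not set")
        return source[name]

    return _VAR_RE.sub(replace, value)
```

The substituted text is inserted as a literal string and never evaluated, in keeping with the rule that configuration values are data, not code.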

7.7 Deployment and Operations

Requirements:

Canon runs on any system with Python 3.11+ and Docker (or another container runtime)
No server process required for local use — canon get is a CLI command that runs and exits
One canon.yaml per working directory: to use multiple Canon configurations (for different Hippo deployments or rule sets), run Canon from different directories
Canon is stateless except for ephemeral CWL work directories and the optional ~/.canon/runs.db SQLite database for canon status (rebuildable from Hippo)
canon rules validate should be run in CI/CD alongside CWL file changes to catch rule errors before deployment
Canon version and the Canon Hippo reference schema version are always identical — upgrading Canon requires re-running hippo reference install canon to apply any schema changes

7.8 Extensibility

Canon is designed to be extended without modifying the core package.

Extension points:

Extension point	Mechanism	Example
New CWL executor backend	`canon.executor_adapters` entry point	`canon-executor-toil`
Community workflow packages	`canon.workflow_packages` entry point	`canon-workflows-rnaseq`
Domain entity types	`hippo.reference_loaders` entry point	bundled in workflow packages
Custom Canon entity types	`hippo.reference_loaders` entry point	lab-specific entity extensions

All extension points use standard Python entry points — no Canon-specific plugin API or registration step beyond pip install.

7.9 Versioning and Compatibility

Canon follows semantic versioning (semver)
The Canon Hippo reference schema version always matches the Canon package version
Breaking changes to canon_rules.yaml syntax increment the major version
CWL v1.2 is the minimum supported version — older CWL files are not supported
Canon guarantees that existing rules continue to work across minor version upgrades
The WorkflowRun entity schema is stable within a major version — fields may be added (additive) but not renamed or removed within a major release
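
The compatibility guarantees above can be expressed as two small checks. This is an illustrative sketch, not part of Canon: all three function names are hypothetical, and versions are assumed to be plain `MAJOR.MINOR.PATCH` strings without pre-release or build suffixes.

```python
def parse_version(text: str) -> tuple[int, int, int]:
    """Parse 'MAJOR.MINOR.PATCH'; raises ValueError on any other shape."""
    major, minor, patch = (int(part) for part in text.split("."))
    return major, minor, patch


def rules_still_supported(rules_written_for: str, canon_version: str) -> bool:
    """Rules keep working on the same major version at an equal or newer release."""
    old = parse_version(rules_written_for)
    new = parse_version(canon_version)
    return new[0] == old[0] and new >= old


def is_additive_schema_change(old_fields: set[str], new_fields: set[str]) -> bool:
    """Within a major release, fields may be added but not renamed or removed."""
    return old_fields <= new_fields
```

A rename shows up here as one removed field plus one added field, so it fails the additive check just as a removal does.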