Canon — User Guide: RNA-seq Analysis Walkthrough¶
This guide walks through a complete bulk RNA-seq analysis using Canon, from raw FASTQ files to a count matrix, using three DLPFC (dorsolateral prefrontal cortex) samples from a small Alzheimer's disease study.
Pipeline: raw FASTQ → trim (cutadapt) → align (STAR) → count (HTSeq) → count matrix
By the end of this guide you will have:
- All required Hippo reference entities registered
- A complete canon_rules.yaml with three rules (trim → align → count)
- Resolved count matrices for three samples, with full provenance in Hippo
- Hands-on experience with canon get, canon plan, and canon status
Prerequisites¶
- Canon and cwltool installed (pip install canon cwltool)
- A running Hippo instance with Canon's reference schema installed: hippo reference install canon
- FASTQ files accessible at S3 or local paths
- See Quickstart for the initial setup steps
Part 1: Register Reference Data in Hippo¶
Before Canon can resolve any artifacts, the reference entities (tools, genome builds, annotations, samples, and raw data files) must exist in Hippo.
1.1 Tools and Tool Versions¶
# Tools (the software itself)
hippo entity create Tool --name STAR \
  --category aligner --biotools-id STAR --bioconda-name star
hippo entity create Tool --name cutadapt \
  --category trimmer --biotools-id cutadapt --bioconda-name cutadapt
hippo entity create Tool --name HTSeq \
  --category counter --biotools-id HTSeq --bioconda-name htseq

# Tool versions (pinned releases used in this study)
hippo entity create ToolVersion --name STAR --version 2.7.11a \
  --bioconda-build "2.7.11a--h9ee0642_0" --release-date 2023-11-01
hippo entity create ToolVersion --name cutadapt --version 4.6 \
  --bioconda-build "4.6--py311h38fbfac_1" --release-date 2023-10-15
hippo entity create ToolVersion --name HTSeq --version 2.0.5 \
  --bioconda-build "2.0.5--py311h38fbfac_0" --release-date 2023-09-01
1.2 Genome Build and Annotation¶
hippo entity create GenomeBuild --name GRCh38 --patch p14 \
  --species "Homo sapiens" --ucsc-name hg38 \
  --ncbi-accession GCA_000001405.29 \
  --fasta-uri "s3://lab-references/genomes/GRCh38/GRCh38.primary_assembly.genome.fa" \
  --fai-uri "s3://lab-references/genomes/GRCh38/GRCh38.primary_assembly.genome.fa.fai"

hippo entity create GeneAnnotation --source GENCODE --version 44 \
  --genome-build "ref:GenomeBuild{name=GRCh38}" \
  --release-date 2023-06-01 \
  --gtf-uri "s3://lab-references/annotations/gencode.v44.primary_assembly.annotation.gtf.gz" \
  --gene-count 61852
1.3 Samples¶
for id in DLPFC_001 DLPFC_002 DLPFC_003; do
  hippo entity create Sample --id "$id" \
    --tissue "dorsolateral prefrontal cortex" \
    --species "Homo sapiens" \
    --library-type "bulk RNA-seq" \
    --library-prep "TruSeq Stranded Total RNA"
done
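Registering many samples by hand is error-prone. If your sample metadata lives in a TSV sheet, commands like the ones above can be generated programmatically. A minimal Python sketch; the sheet layout and column names here are assumptions for illustration, not a Canon or Hippo convention:

```python
import csv
import io
import shlex

def registration_commands(sample_sheet_tsv):
    """Yield one `hippo entity create Sample` command per row of a TSV
    with columns: id, tissue, species, library_type, library_prep."""
    reader = csv.DictReader(io.StringIO(sample_sheet_tsv), delimiter="\t")
    for row in reader:
        yield (
            "hippo entity create Sample"
            f" --id {shlex.quote(row['id'])}"
            f" --tissue {shlex.quote(row['tissue'])}"
            f" --species {shlex.quote(row['species'])}"
            f" --library-type {shlex.quote(row['library_type'])}"
            f" --library-prep {shlex.quote(row['library_prep'])}"
        )

# Hypothetical one-row sample sheet for demonstration.
sheet = (
    "id\ttissue\tspecies\tlibrary_type\tlibrary_prep\n"
    "DLPFC_001\tdorsolateral prefrontal cortex\tHomo sapiens\t"
    "bulk RNA-seq\tTruSeq Stranded Total RNA\n"
)
for cmd in registration_commands(sheet):
    print(cmd)
```

shlex.quote keeps multi-word values like the tissue name safe to paste into a shell.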
1.4 Raw FASTQ Files¶
hippo entity create FastqFile \
  --sample "ref:Sample{id=DLPFC_001}" \
  --uri "s3://lab-rnaseq-data/raw/DLPFC_001_R1.fastq.gz" \
  --read-number 1 --read-count 45218903

hippo entity create FastqFile \
  --sample "ref:Sample{id=DLPFC_002}" \
  --uri "s3://lab-rnaseq-data/raw/DLPFC_002_R1.fastq.gz" \
  --read-number 1 --read-count 52441287

hippo entity create FastqFile \
  --sample "ref:Sample{id=DLPFC_003}" \
  --uri "s3://lab-rnaseq-data/raw/DLPFC_003_R1.fastq.gz" \
  --read-number 1 --read-count 41887652
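The --read-count values should come from the files themselves rather than be typed in by hand. A stdlib-only sketch that counts records in a gzipped FASTQ (one record is exactly four lines); the demo file it builds is synthetic:

```python
import gzip
import tempfile

def fastq_read_count(path):
    """Count records in a gzipped FASTQ file: one record per four lines."""
    lines = 0
    with gzip.open(path, "rt") as fh:
        for _ in fh:
            lines += 1
    if lines % 4 != 0:
        raise ValueError(f"{path}: truncated FASTQ ({lines} lines)")
    return lines // 4

# Demo: two synthetic records written to a temporary gzipped FASTQ.
demo = b"@r1\nACGT\n+\nFFFF\n@r2\nTTGA\n+\nFFFF\n"
with tempfile.NamedTemporaryFile(suffix=".fastq.gz", delete=False) as tmp:
    tmp.write(gzip.compress(demo))
print(fastq_read_count(tmp.name))  # → 2
```

The divisibility check also doubles as a cheap sanity test for truncated downloads.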
Part 2: Define Rules in canon_rules.yaml¶
Canon rules map artifact types to CWL workflows. Rules are matched by entity type and parameter set. Here are the three rules for this pipeline:
# canon_rules.yaml
rules:
  # Rule 1: Adapter and quality trimming with cutadapt
  - name: trim_reads
    produces:
      entity_type: TrimmedFastqFile
      match:
        sample: "{sample}"
        trimmer: "ref:ToolVersion{tool.name=cutadapt, version={cutadapt_version}}"
        quality_cutoff: "{quality_cutoff}"
        min_length: "{min_length}"
    requires:
      - bind: raw_fastq
        entity_type: FastqFile
        match:
          sample: "{sample}"
    execute:
      workflow: workflows/cutadapt.cwl
      inputs:
        fastq: "{raw_fastq.uri}"
        quality_cutoff: "{quality_cutoff}"
        min_length: "{min_length}"
        adapter: "AGATCGGAAGAGCACACGTCTGAACTCCAGTCA"
        sample_id: "{sample}"

  # Rule 2: STAR alignment
  - name: align_reads
    produces:
      entity_type: AlignmentFile
      match:
        sample: "{sample}"
        genome_build: "ref:GenomeBuild{name={genome_build}}"
        aligner: "ref:ToolVersion{tool.name=STAR, version={star_version}}"
        trimmer: "ref:ToolVersion{tool.name=cutadapt, version={cutadapt_version}}"
        quality_cutoff: "{quality_cutoff}"
        min_length: "{min_length}"
    requires:
      - bind: trimmed_fastq
        entity_type: TrimmedFastqFile
        match:
          sample: "{sample}"
          trimmer: "ref:ToolVersion{tool.name=cutadapt, version={cutadapt_version}}"
          quality_cutoff: "{quality_cutoff}"
          min_length: "{min_length}"
    execute:
      workflow: workflows/star_align.cwl
      inputs:
        fastq: "{trimmed_fastq.uri}"
        genome_build: "{genome_build}"
        star_version: "{star_version}"
        sample_id: "{sample}"
        threads: 16

  # Rule 3: Gene counting with HTSeq
  - name: count_reads
    produces:
      entity_type: CountMatrix
      match:
        sample: "{sample}"
        genome_build: "ref:GenomeBuild{name={genome_build}}"
        annotation: "ref:GeneAnnotation{source=GENCODE, version={gencode_version}}"
        aligner: "ref:ToolVersion{tool.name=STAR, version={star_version}}"
        counter: "ref:ToolVersion{tool.name=HTSeq, version={htseq_version}}"
        trimmer: "ref:ToolVersion{tool.name=cutadapt, version={cutadapt_version}}"
        quality_cutoff: "{quality_cutoff}"
        min_length: "{min_length}"
    requires:
      - bind: alignment
        entity_type: AlignmentFile
        match:
          sample: "{sample}"
          genome_build: "ref:GenomeBuild{name={genome_build}}"
          aligner: "ref:ToolVersion{tool.name=STAR, version={star_version}}"
          trimmer: "ref:ToolVersion{tool.name=cutadapt, version={cutadapt_version}}"
          quality_cutoff: "{quality_cutoff}"
          min_length: "{min_length}"
      - bind: annotation_file
        entity_type: GeneAnnotationFile
        match:
          annotation: "ref:GeneAnnotation{source=GENCODE, version={gencode_version}}"
    execute:
      workflow: workflows/htseq_count.cwl
      inputs:
        bam: "{alignment.uri}"
        gtf: "{annotation_file.uri}"
        sample_id: "{sample}"
        strand: "reverse"
        mode: "union"
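YAML indentation mistakes are the most common way a rules file breaks. A quick structural check over the parsed document can catch them before Canon does; this sketch is not part of Canon and only asserts the minimal shape used above (every rule has a name, produces, and execute, and every requires entry has bind and entity_type):

```python
def validate_rules(config):
    """Return a list of error strings for rules missing the keys
    the canon_rules.yaml shape above relies on; empty means OK."""
    errors = []
    for i, rule in enumerate(config.get("rules", [])):
        where = f"rules[{i}] ({rule.get('name', '?')})"
        if "name" not in rule:
            errors.append(f"{where}: missing name")
        produces = rule.get("produces", {})
        if "entity_type" not in produces or "match" not in produces:
            errors.append(f"{where}: produces needs entity_type and match")
        for j, req in enumerate(rule.get("requires", [])):
            if "bind" not in req or "entity_type" not in req:
                errors.append(f"{where}: requires[{j}] needs bind and entity_type")
        execute = rule.get("execute", {})
        if "workflow" not in execute or "inputs" not in execute:
            errors.append(f"{where}: execute needs workflow and inputs")
    return errors

# Parsed form of a trimmed-down trim_reads rule (load the real file
# with a YAML parser such as PyYAML before calling validate_rules).
example = {
    "rules": [{
        "name": "trim_reads",
        "produces": {"entity_type": "TrimmedFastqFile",
                     "match": {"sample": "{sample}"}},
        "requires": [{"bind": "raw_fastq", "entity_type": "FastqFile",
                      "match": {"sample": "{sample}"}}],
        "execute": {"workflow": "workflows/cutadapt.cwl",
                    "inputs": {"fastq": "{raw_fastq.uri}"}},
    }],
}
print(validate_rules(example))  # → []
```

A rule stripped down to just a name would report two errors here: the missing produces block and the missing execute block.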
Part 3: Plan and Execute¶
3.1 Dry Run with canon plan¶
Before submitting any workflows, inspect Canon's resolution plan for one sample:
canon plan CountMatrix \
  sample="ref:Sample{id=DLPFC_001}" \
  genome_build=GRCh38 \
  gencode_version=44 \
  star_version=2.7.11a \
  htseq_version=2.0.5 \
  cutadapt_version=4.6 \
  quality_cutoff=20 \
  min_length=20
First-run output:
Canon plan — CountMatrix for DLPFC_001

Step 1: TrimmedFastqFile (DLPFC_001, cutadapt 4.6, q20, l20)
  → BUILD (rule: trim_reads)
  Input: FastqFile DLPFC_001 FOUND (uuid: fa01...)

Step 2: AlignmentFile (DLPFC_001, GRCh38, STAR 2.7.11a)
  → BUILD (rule: align_reads)
  Input: TrimmedFastqFile will be built in Step 1

Step 3: CountMatrix (DLPFC_001, GRCh38, GENCODE v44, STAR 2.7.11a, HTSeq 2.0.5)
  → BUILD (rule: count_reads)
  Input: AlignmentFile will be built in Step 2
  Input: GeneAnnotationFile FOUND (uuid: ge44...)

Total steps: 3 BUILD, 0 REUSE
Estimated runtime: ~90min (cwltool, sequential)
3.2 Resolve Artifacts for All Three Samples¶
Run canon get for each sample. Canon will resolve each pipeline end-to-end:
for sample in DLPFC_001 DLPFC_002 DLPFC_003; do
  echo "=== Processing $sample ==="
  canon get CountMatrix \
    sample="ref:Sample{id=$sample}" \
    genome_build=GRCh38 \
    gencode_version=44 \
    star_version=2.7.11a \
    htseq_version=2.0.5 \
    cutadapt_version=4.6 \
    quality_cutoff=20 \
    min_length=20
done
Canon automatically reuses the STAR genome index (built once, shared across all samples):
=== Processing DLPFC_001 ===
TrimmedFastqFile DLPFC_001 → BUILD (cutadapt, 14m)
AlignmentFile DLPFC_001 → BUILD (STAR, 31m)
CountMatrix DLPFC_001 → BUILD (HTSeq, 8m)
✓ CountMatrix/5e6f7a8b-... file:///data/canon-outputs/DLPFC_001_counts.tsv
=== Processing DLPFC_002 ===
TrimmedFastqFile DLPFC_002 → BUILD (cutadapt, 16m)
AlignmentFile DLPFC_002 → BUILD (STAR, 36m) [StarIndex REUSE]
CountMatrix DLPFC_002 → BUILD (HTSeq, 9m)
✓ CountMatrix/6f7a8b9c-... file:///data/canon-outputs/DLPFC_002_counts.tsv
=== Processing DLPFC_003 ===
TrimmedFastqFile DLPFC_003 → BUILD (cutadapt, 13m)
AlignmentFile DLPFC_003 → BUILD (STAR, 29m) [StarIndex REUSE]
CountMatrix DLPFC_003 → BUILD (HTSeq, 7m)
✓ CountMatrix/7a8b9c0d-... file:///data/canon-outputs/DLPFC_003_counts.tsv
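With all three samples resolved, downstream analysis (DESeq2, limma, and the like) usually wants a single genes × samples table. A stdlib-only sketch that merges two-column HTSeq outputs (gene_id, count) and drops HTSeq's __no_feature-style summary rows; the inline data is synthetic:

```python
import csv
import io

def merge_counts(sample_tsvs):
    """Merge {sample_id: tsv_text} two-column HTSeq outputs into one
    genes x samples table, skipping HTSeq's __-prefixed summary rows."""
    samples = sorted(sample_tsvs)
    matrix = {}
    for sample in samples:
        for gene, count in csv.reader(io.StringIO(sample_tsvs[sample]),
                                      delimiter="\t"):
            if gene.startswith("__"):   # __no_feature, __ambiguous, ...
                continue
            matrix.setdefault(gene, {})[sample] = int(count)
    rows = [["gene_id"] + samples]
    for gene in sorted(matrix):
        rows.append([gene] + [matrix[gene].get(s, 0) for s in samples])
    return rows

# Synthetic two-sample demo (real inputs would be the resolved *_counts.tsv files).
demo = {
    "DLPFC_001": "ENSG0001\t10\nENSG0002\t0\n__no_feature\t500\n",
    "DLPFC_002": "ENSG0001\t7\nENSG0002\t3\n__no_feature\t420\n",
}
for row in merge_counts(demo):
    print("\t".join(map(str, row)))
```

Genes absent from a sample's file are filled with 0, so the merged table stays rectangular.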
3.3 Re-Run Safety¶
Running the same commands a second time is safe. Every artifact is found in Hippo and returned immediately — no workflows are submitted:
canon get CountMatrix \
  sample="ref:Sample{id=DLPFC_001}" \
  genome_build=GRCh38 gencode_version=44 \
  star_version=2.7.11a htseq_version=2.0.5 \
  cutadapt_version=4.6 quality_cutoff=20 min_length=20
Part 4: Inspect Provenance¶
4.1 View Recent Runs¶
canon status lists recent workflow runs:

canon status

Recent WorkflowRuns (last 10):
UUID          Entity          Rule         Sample     Started               Runtime  Status
7a8b9c0d-...  CountMatrix     count_reads  DLPFC_003  2026-09-15T16:41:00Z  7m22s    ✓
69ab0c1d-...  AlignmentFile   align_reads  DLPFC_003  2026-09-15T16:12:00Z  29m14s   ✓
58ba0b1c-...  TrimmedFastqFl  trim_reads   DLPFC_003  2026-09-15T15:59:00Z  12m55s   ✓
6f7a8b9c-...  CountMatrix     count_reads  DLPFC_002  2026-09-15T15:21:00Z  9m03s    ✓
...
4.2 Full Provenance for a Count Matrix¶
Use a Hippo history query directly to inspect the complete provenance chain for any entity:

hippo history CountMatrix/5e6f7a8b
Provenance chain: CountMatrix/5e6f7a8b (DLPFC_001)

EntityCreated 2026-09-15T14:32:00Z
  actor: service:canon-runner
  source_rule: count_reads
  workflow_run: WorkflowRun/6f7a8b9c-...
  ← input: AlignmentFile/4d5e6f7a (DLPFC_001 / GRCh38 / STAR 2.7.11a)
      EntityCreated 2026-09-15T13:51:00Z
      source_rule: align_reads
      workflow_run: WorkflowRun/5e6f7a8b-...
      ← input: TrimmedFastqFile/3c4d5e6f (DLPFC_001 / cutadapt 4.6 / q20 / l20)
          EntityCreated 2026-09-15T13:20:00Z
          source_rule: trim_reads
          workflow_run: WorkflowRun/4d5e6f7a-...
          ← input: FastqFile/0a1b2c3d (DLPFC_001 / R1)
              EntityCreated 2026-09-01T09:14:00Z
              actor: data-team
The full lineage — from raw FASTQ through trimming and alignment to the count matrix — is traceable in a single Hippo history query.
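Programmatically, that chain is just a linked list: each derived entity points back at its input. A sketch of walking such a chain; the dict layout below mirrors the history output for DLPFC_001 and is purely illustrative, not Hippo's actual API:

```python
def lineage(entity_id, entities):
    """Follow `input` links from an entity back to its root FastqFile,
    returning the chain of entity IDs, newest first."""
    chain = []
    current = entity_id
    while current is not None:
        chain.append(current)
        current = entities[current].get("input")
    return chain

# Illustrative in-memory copy of the DLPFC_001 chain shown above.
entities = {
    "CountMatrix/5e6f7a8b":      {"rule": "count_reads", "input": "AlignmentFile/4d5e6f7a"},
    "AlignmentFile/4d5e6f7a":    {"rule": "align_reads", "input": "TrimmedFastqFile/3c4d5e6f"},
    "TrimmedFastqFile/3c4d5e6f": {"rule": "trim_reads",  "input": "FastqFile/0a1b2c3d"},
    "FastqFile/0a1b2c3d":        {"rule": None,          "input": None},
}
print(lineage("CountMatrix/5e6f7a8b", entities))
```

The walk terminates at the raw FastqFile, which has no input link of its own.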
4.3 Query All Count Matrices Across Samples¶
hippo query CountMatrix \
  genome_build="ref:GenomeBuild{name=GRCh38}" \
  annotation="ref:GeneAnnotation{source=GENCODE, version=44}"
3 entities found:
CountMatrix/5e6f7a8b DLPFC_001 GRCh38 GENCODE v44 STAR 2.7.11a HTSeq 2.0.5
CountMatrix/6f7a8b9c DLPFC_002 GRCh38 GENCODE v44 STAR 2.7.11a HTSeq 2.0.5
CountMatrix/7a8b9c0d DLPFC_003 GRCh38 GENCODE v44 STAR 2.7.11a HTSeq 2.0.5
Part 5: Parameter Changes and Re-Analysis¶
A key strength of Canon's semantic identity model is that changing any parameter automatically distinguishes new artifacts from existing ones — no manual file management required.
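The idea can be sketched as hashing a canonical serialization of the full parameter set, so changing any one parameter yields a different identity key. The hashing scheme below is illustrative only, not Canon's actual implementation:

```python
import hashlib
import json

def identity_key(entity_type, params):
    """Hash the entity type plus a canonically serialized parameter set.
    Key-sorted JSON makes the digest independent of argument order."""
    canonical = json.dumps({"entity_type": entity_type, "params": params},
                           sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()[:12]

base = {"sample": "DLPFC_001", "genome_build": "GRCh38", "quality_cutoff": 20}
q30 = dict(base, quality_cutoff=30)

# Identical parameters collapse to one key; a changed cutoff gets a new one.
print(identity_key("TrimmedFastqFile", base) == identity_key("TrimmedFastqFile", dict(base)))
print(identity_key("TrimmedFastqFile", base) == identity_key("TrimmedFastqFile", q30))
```

This is why both the q20 and q30 artifacts can coexist: they simply have different identities, and lookups only ever match on the exact parameter set requested.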
5.1 Re-run with a Different Quality Cutoff¶
canon plan CountMatrix \
  sample="ref:Sample{id=DLPFC_001}" \
  genome_build=GRCh38 gencode_version=44 \
  star_version=2.7.11a htseq_version=2.0.5 \
  cutadapt_version=4.6 quality_cutoff=30 min_length=20
Because quality_cutoff is part of the artifact's semantic identity, Canon correctly
identifies that no TrimmedFastqFile with quality_cutoff=30 exists and plans a full BUILD:
Step 1: TrimmedFastqFile (DLPFC_001, cutadapt 4.6, q30, l20) → BUILD
Step 2: AlignmentFile (DLPFC_001, GRCh38, STAR 2.7.11a) → BUILD [StarIndex REUSE]
Step 3: CountMatrix (DLPFC_001, GENCODE v44, q30, l20) → BUILD
The original quality_cutoff=20 entities remain in Hippo unchanged. Both parameter sets
coexist in the registry — Canon never overwrites results.
5.2 Upgrade a Tool Version¶
To re-run all samples with STAR 2.7.11b:
hippo entity create ToolVersion --name STAR --version 2.7.11b \
  --bioconda-build "2.7.11b--h9ee0642_0" --release-date 2024-02-01

for sample in DLPFC_001 DLPFC_002 DLPFC_003; do
  canon get CountMatrix \
    sample="ref:Sample{id=$sample}" \
    genome_build=GRCh38 gencode_version=44 \
    star_version=2.7.11b htseq_version=2.0.5 \
    cutadapt_version=4.6 quality_cutoff=20 min_length=20
done
Canon builds new AlignmentFile and CountMatrix entities for the new tool version while
leaving the original 2.7.11a results intact.
Troubleshooting¶
- EntityRefNotFound: A ref:T{...} expression matched no Hippo entities. Check that the referenced entity exists with hippo query <EntityType>.
- RuleNotFound: No rule in canon_rules.yaml produces the requested entity type with the given parameters. Check the rule's produces.match keys against your get parameters.
- MissingRequiredInput: A rule's requires block could not be satisfied. Run canon plan to see which input step is failing and why.
- CwltoolError: The CWL workflow failed during execution. Check the workflow run log at $work_dir/<run-id>/cwltool.log and the WorkflowRun entity in Hippo for the failure context.