The Track-1 low-latency variant (history 014): same 4-instrument net trained at a
128-sample hop (375 Hz), halving the algorithmic latency floor (21.3 -> 10.7 ms) at
identical onset timing and comparable note-F1 (violin 0.515 vs 0.500, piano 0.648 vs
0.656), for ~2x compute. Commercial-safe (CC0/CC-BY training data only), like
continuo_stream.
Adds (LFS): continuo_stream_hop128.pt (hop stamped), continuo_stream_hop128.onnx
(block-streaming, all instruments), and continuo_stream_hop128_perlayer_{violin,
eguitar,piano,aguitar}.onnx (per-layer cached, instrument baked). All N_EMIT=1, ORT
validated; per-layer == offline torch at hop128 (3.3e-6). README documents the
DAWMODEL_HOP=128 load requirement (the hop is stamped in the checkpoint too).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Continuo — representation, data generators, and audio→Continuo model
An audio-transcription research stack built around Continuo, a continuous,
microtonal, technique-aware performance representation. Five packages implement
the specs in docs/ai-history/:
| package | role | README |
|---|---|---|
| continuo/ | the representation: Envelope/Note/Stream/Document, instrument profiles, .cont serialization, validation | continuo/README |
| tier1gen/ | Tier-1 data: Faust physical-model synthesis (DawDreamer) → exact continuous technique-channel labels | tier1gen/README |
| tier2gen/ | Tier-2 data: real sample libraries (VSCO-2 CE) via sfizz keyswitches → real timbre + discrete articulation labels | tier2gen/README |
| model/ | low-latency streaming audio→Continuo model (front end → causal TCN+GRU → multi-task heads), training, decoder | model/README |
| augment/ | curve-aware (audio_fn, label_fn) augmentation + license-filtered asset registry | augment/README |
Performance layer only — no symbolic/notation layer (by design). CLAUDE.md
is the working guide; the docs/ai-history/00{3,5,7} entries are the
implementation logs (decisions, deviations, results) — the authoritative record.
Repository layout
continuo/ tier1gen/ tier2gen/ model/ augment/ # the five packages (+ per-dir README)
setup/ # idempotent env scripts (run in order; see below)
tests/ # test_continuo.py
docs/ai-history/ # specs (000,001,002,004,006) + implementation logs (003,005,007,009)
/mnt/train/ # NOT in git: datasets, asset banks, sfizz build, VSCO-2, runs/<name>
Environment
Target box: Ubuntu 26.04, AMD Ryzen 9950X3D, RTX 5090 (Blackwell/sm_120),
/mnt/train fast storage. The original briefs targeted 22.04/24.04 + Python 3.11;
see docs/ai-history/2026-06-13 003 … for the 26.04 adaptations.
Idempotent setup scripts, run in order as needed:
bash setup/setup_env.sh # base: uv + Python 3.12, torch-cu128, DawDreamer, Faust, deps
bash setup/setup_tier2.sh # builds sfizz_render from source, fetches VSCO-2 CE (Tier 2)
bash setup/setup_aug.sh # pyroomacoustics + generates the CC0 procedural asset bank
bash setup/download_ccby_assets.sh # optional: real CC-BY rooms/noise (HOMULA-RIR, DEMAND)
source .venv/bin/activate
Key choices: Python 3.12 (newest with DawDreamer + Blackwell-PyTorch wheels;
26.04 ships only 3.14), provisioned with uv; PyTorch from the cu128 index;
DawDreamer installs as a plain wheel. See docs/ai-history/003 for the full
24.04→26.04 adaptation record.
Generate data
Tier 1 — physical-modelling synthesis (Faust via DawDreamer); supplies continuous technique-channel labels:
python -m tier1gen.cli --profile violin --count 4000 \
--out /mnt/train/violin_tier1 --seed-base 0 --workers 22
Deterministic per seed; writes (wav, .cont) pairs + a dataset_manifest.json
with QA status. A second profile (--profile guitar) works by adding a profile
module only — no core changes (acceptance #5).
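Per-seed determinism of this kind is commonly obtained by giving every item its own RNG, seeded from the base seed plus the item index, so the result does not depend on worker count or scheduling. The following is a minimal sketch under that assumption; `sample_params`, `plan`, the parameter names, and their ranges are hypothetical and not taken from tier1gen.

```python
import random
from typing import Dict, List, Tuple


def sample_params(seed: int) -> Dict[str, float]:
    """Draw hypothetical render parameters from an RNG owned by this seed alone."""
    rng = random.Random(seed)
    return {
        "pitch_midi": rng.uniform(55.0, 100.0),
        "duration_s": rng.uniform(0.2, 2.0),
    }


def plan(seed_base: int, count: int) -> List[Tuple[int, Dict[str, float]]]:
    """Assumed mapping: item i uses seed (seed_base + i)."""
    return [(seed_base + i, sample_params(seed_base + i)) for i in range(count)]


items = plan(seed_base=0, count=4)
```

Because each item depends only on its own seed, regenerating a single item (or running with a different `--workers` value) would reproduce the same parameters in this scheme.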
Tier 2 — real recorded sample libraries (VSCO-2 CE, CC0) rendered through
sfizz with articulation keyswitches; supplies real timbre + exact discrete
articulation labels (continuous technique channels masked). Run
bash setup/setup_tier2.sh first (builds sfizz_render, fetches VSCO-2):
python -m tier2gen.cli --library vsco2_solo_violin --count 4000 \
--out /mnt/train/vsco2_tier2_solo --seed-base 0 --workers 20
Pitch/dynamics are extracted from the rendered audio (provenance=analysis);
mode/articulation come from the keyswitch (provenance=synthetic). A second
library (--library vsco2_violin_section) works by adding a YAML map only
(acceptance #6). See history entry 005.
Combined Tier 1 + Tier 2 training
Pass multiple dataset dirs; the dataloader masks each .cont's declared
masked_outputs (Tier 2 masks technique_channels), and the articulation head
spans both tiers (8 classes):
python -m model.train --data /mnt/train/violin_tier1 /mnt/train/vsco2_tier2_solo \
--out /mnt/train/runs/combined_v3 --epochs 30 --batch-size 16 --workers 14
Tier 1 metrics are preserved while Tier 2 adds near-perfect articulation/mode and dynamics on real timbre; per-tier results in history entry 005 §7.
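The masking used for combined training can be pictured as a per-head loss that skips any output a sample declares as masked. This is a minimal plain-Python sketch, not the repository's dataloader: the helper `masked_mean_loss`, the loss values, and the dictionary layout are hypothetical; only the `masked_outputs` and `technique_channels` names come from the text above.

```python
from typing import Dict, List, Set


def masked_mean_loss(per_head_loss: List[Dict[str, float]],
                     masked_outputs: List[Set[str]]) -> Dict[str, float]:
    """Average each head's loss over the samples that do NOT mask that head."""
    totals: Dict[str, float] = {}
    counts: Dict[str, int] = {}
    for losses, masked in zip(per_head_loss, masked_outputs):
        for head, value in losses.items():
            if head in masked:
                continue  # this sample carries no label for this head
            totals[head] = totals.get(head, 0.0) + value
            counts[head] = counts.get(head, 0) + 1
    return {head: totals[head] / counts[head] for head in totals}


# Hypothetical batch: one Tier-1 sample (nothing masked) and one Tier-2
# sample that masks its technique channels, as its .cont would declare.
batch_losses = [
    {"articulation": 0.4, "technique_channels": 0.2},
    {"articulation": 0.2, "technique_channels": 9.9},  # ignored: masked
]
batch_masks = [set(), {"technique_channels"}]

result = masked_mean_loss(batch_losses, batch_masks)
```

In this sketch the articulation head averages over both samples, while the technique-channel head sees only the Tier-1 sample, so a masked output never contributes gradient signal.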
Curve-aware augmentation (CC0)
augment/ transforms (audio, .cont) pairs as paired (audio_fn, label_fn) ops
before rasterization (noise/RIR/EQ/codec, gain/compression, time-stretch,
pitch-shift, stem-mixing, SpecAugment), keeping labels analytically in sync. The
Group-A asset pool (room IRs + noise) is procedurally generated — CC0-by-
construction, commercial-safe — by augment.genassets. Run setup/setup_aug.sh
once, then training augments train-only and reports a seeded robustness val
beside the clean val:
python -m model.train --data /mnt/train/violin_tier1 /mnt/train/vsco2_tier2_solo \
--out /mnt/train/runs/combined_aug --epochs 30 --batch-size 16 --workers 20
Augmentation collapses the clean→robust generalization gap (e.g. articulation gap
0.30→0.06). Real CC-BY corpora (HOMULA-RIR rooms + DEMAND noise,
commercial-OK with attribution) can be added alongside the CC0 bank via
setup/download_ccby_assets.sh; on a real-conditions robustness val the ordering
is no-aug < procedural-CC0 < CC0+CC-BY on every metric. See history entry 007.
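To illustrate the paired (audio_fn, label_fn) idea, the sketch below implements a gain op whose label function applies the analytically matching change to a dynamics curve. It is an illustration only: the `PairedOp` class, `make_gain`, the breakpoint-list curve form, and the assumption that dynamics are stored in dB are all hypothetical and may not match augment/.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Audio = List[float]
Curve = List[Tuple[float, float]]  # (time_s, value) breakpoints; hypothetical label form


@dataclass(frozen=True)
class PairedOp:
    """An augmentation as two functions that must be applied together."""
    audio_fn: Callable[[Audio], Audio]
    label_fn: Callable[[Curve], Curve]

    def __call__(self, audio: Audio, labels: Curve) -> Tuple[Audio, Curve]:
        return self.audio_fn(audio), self.label_fn(labels)


def make_gain(db: float) -> PairedOp:
    """Gain of `db` decibels: scale samples, shift a dB dynamics curve by the same amount."""
    scale = 10.0 ** (db / 20.0)
    return PairedOp(
        audio_fn=lambda audio: [s * scale for s in audio],
        label_fn=lambda curve: [(t, v + db) for t, v in curve],
    )


op = make_gain(-6.0)
audio_out, dyn_out = op([0.5, -0.5, 0.25], [(0.0, -12.0), (0.1, -9.0)])
```

Keeping the two functions in one object makes it hard to apply an audio change without its label counterpart, which is the property the text calls keeping labels analytically in sync.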
Multiple instruments
The model supports violin, electric guitar, and piano via an instrument-id
FiLM embedding and union heads masked per-channel (a violin sample trains only its
bow channels, a piano only its pedals). Data: Tier-1 (Faust pm.elecGuitar) +
Tier-2 electric guitar (Karoryfer Emilyguitar, CC-BY) and piano (VSCO-2
Upright Piano, CC0). Train across all datasets at once; evaluate per instrument:
python -m model.train --data /mnt/train/violin_tier1 /mnt/train/vsco2_tier2_solo \
/mnt/train/eguitar_tier1 /mnt/train/eguitar_tier2 /mnt/train/piano_tier2 \
--out /mnt/train/runs/multi_crepe16 --epochs 28 --batch-size 16 --workers 22
python -m model.eval_per_instrument --ckpt /mnt/train/runs/multi_crepe16/best.pt --data <dirs...>
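For readers unfamiliar with FiLM, the conditioning named above is, in its generic form, a per-feature scale and shift selected by the instrument id. The sketch below shows that generic form only; the `FILM_PARAMS` values and the `film` helper are made up for illustration, and the layer in model/ may place or parameterise the modulation differently.

```python
from typing import Dict, List, Tuple

# Hypothetical per-instrument (gamma, beta) vectors; in a trained model these
# would come from a learned embedding indexed by instrument id.
FILM_PARAMS: Dict[str, Tuple[List[float], List[float]]] = {
    "violin": ([1.0, 0.5], [0.0, 0.1]),
    "piano": ([2.0, 1.0], [-0.5, 0.0]),
}


def film(features: List[float], instrument: str) -> List[float]:
    """Feature-wise linear modulation: y_i = gamma_i * x_i + beta_i."""
    gamma, beta = FILM_PARAMS[instrument]
    return [g * x + b for g, x, b in zip(gamma, features, beta)]


shared = [2.0, 4.0]                 # the same backbone features...
as_violin = film(shared, "violin")  # ...modulated for violin
as_piano = film(shared, "piano")    # ...and for piano
```

The same backbone features yield different activations per instrument, which is what lets a single network serve several instruments.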
The pitch grid extends to C2–C8 (360 bins) with a dual-resolution front end (~32 ms).
Adding an instrument = a Continuo profile + a Tier-1 profile / Tier-2 library YAML.
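If the 360 bins are spread evenly over C2–C8, the arithmetic works out to 72 semitones × 5 bins, that is 20 cents per bin. The sketch below converts between bin index and frequency under that assumption; the even spacing, bin 0 sitting exactly on C2, A4 = 440 Hz, and the helper names are assumptions rather than details confirmed by the repository.

```python
import math

A4_HZ = 440.0
C2_MIDI = 36           # C2 in the common C4 = MIDI 60 convention
N_BINS = 360
BINS_PER_SEMITONE = 5  # assumption: 72 semitones (C2..C8) * 5 = 360 bins, 20 cents each


def bin_to_hz(b: float) -> float:
    """Frequency of (possibly fractional) bin index b on the assumed grid."""
    midi = C2_MIDI + b / BINS_PER_SEMITONE
    return A4_HZ * 2.0 ** ((midi - 69.0) / 12.0)


def hz_to_bin(hz: float) -> float:
    """Fractional bin index of a frequency on the assumed grid."""
    midi = 69.0 + 12.0 * math.log2(hz / A4_HZ)
    return (midi - C2_MIDI) * BINS_PER_SEMITONE


low_hz = bin_to_hz(0)         # C2
high_hz = bin_to_hz(N_BINS)   # C8, the upper edge of the assumed range
a4_bin = hz_to_bin(A4_HZ)     # A4 is 33 semitones above C2
```

A 20-cent step leaves room for microtonal pitch to be read off as a fractional bin, for instance by a weighted average around the peak.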
Pitch uses a CREPE-style head (model/crepe.py); the synthetic guitar's
pathological pitch is masked so guitar pitch is learned from real Emily samples.
Per-instrument pitch acc: violin 0.79, electric guitar 0.52 (was 0.35 with the
old harmonic-stack head), piano 0.55 — guitar/piano still trail bowed violin and
cost latency (~48 ms). See history 008 (plan), 009 (results), 010 (CREPE + fixes).
Train + evaluate
python -m model.train --data /mnt/train/violin_tier1 \
--out /mnt/train/runs/violin_v1 --epochs 30 --batch-size 16 --workers 12
Writes history.json, best.pt/last.pt, and final_metrics.json (val + test,
split by generator seed). Streaming decode of frame outputs into Continuo notes is
in model/decoder.py.
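Splitting by generator seed means every sample produced from one seed lands in exactly one split. One way to get such an assignment, shown purely as a sketch, is to hash the seed into a stable number in [0, 1). The function `split_for_seed`, the 10%/10% fractions, and the use of SHA-256 are hypothetical; model.train may instead use seed ranges or a stored manifest.

```python
import hashlib


def split_for_seed(seed: int, val_frac: float = 0.1, test_frac: float = 0.1) -> str:
    """Map a generator seed to 'train', 'val', or 'test', stably across runs."""
    digest = hashlib.sha256(str(seed).encode("utf-8")).digest()
    u = int.from_bytes(digest[:8], "big") / 2 ** 64  # uniform-ish value in [0, 1)
    if u < test_frac:
        return "test"
    if u < test_frac + val_frac:
        return "val"
    return "train"


assignments = {seed: split_for_seed(seed) for seed in range(1000)}
```

Unlike Python's built-in `hash`, a cryptographic digest is not randomised per process, so the assignment stays fixed between runs and machines.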
The front end is a swappable seam (model/frontend.py). The default is a
learned short-STFT front end (ModelConfig.frontend="stft", stft_win=512,
lookahead=3) → ~26.7 ms total algorithmic latency, the best onset/dynamics
point of the window×lookahead sweep. The analytic HCQT ("hcqt", ~244 ms) and
other points (--stft-win/--lookahead) are available. At matched accuracy the
STFT cuts front-end latency ~15× — see history entry 003 §8.
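The ~26.7 ms figure can be reproduced with simple arithmetic if the latency is taken as one analysis window plus the lookahead frames. That model is an assumption, as are the 48 kHz sample rate (implied by a 128-sample hop corresponding to 375 Hz) and the 256-sample default hop (inferred from the hop128 variant being described as a halving); none of these values is read from ModelConfig.

```python
SAMPLE_RATE = 48_000  # assumption: 128 samples per hop at a 375 Hz frame rate


def latency_ms(stft_win: int, lookahead: int, hop: int, sr: int = SAMPLE_RATE) -> float:
    """Assumed model: one full analysis window plus `lookahead` future hops."""
    return 1000.0 * (stft_win + lookahead * hop) / sr


def frame_rate_hz(hop: int, sr: int = SAMPLE_RATE) -> float:
    """Output frames per second for a given hop size."""
    return sr / hop


default_ms = latency_ms(stft_win=512, lookahead=3, hop=256)  # assumed default hop
hop128_rate = frame_rate_hz(128)
```

Under these assumptions the default configuration comes to (512 + 3 × 256) / 48000 s, about 26.7 ms, and other points of the sweep can be estimated the same way before training them.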
Tests
python tests/test_continuo.py # envelope lowering, .cont round-trip, validation