music/idea.md
2026-02-19 20:24:22 +00:00

# artdag Audio Primitives
## A Universal Foundation for Generative Music
artdag's video pipeline demonstrates that complex visual output can emerge from the composition of simple, declarative primitives expressed as s-expressions. The same principle applies to music. This document describes a minimal set of audio primitives that, when composed, can produce any genre of music — from techno to orchestral, from ambient to gabber — without baking any genre-specific assumptions into the engine.
The core insight is that music decomposes into a surprisingly small number of fundamental operations. Everything else — sequencers, arpeggiators, chord generators, scales, tuning systems, song structures — is userland composition of these primitives, shareable and forkable on ActivityPub.
## Architecture
```
Emotion arc (what you want the listener to feel)
↓ emotion-to-parameter maps (userland, shareable, forkable)
Score / pattern layer (userland, shareable, forkable)
↓ compile
S-expressions (universal intermediate representation)
↓ evaluate
DAG of base primitives (osc, filter, env, mul, mix, delay)
↓ render (GPU, audio + video unified)
Audiovisual output
↓ share
ActivityPub (the s-expressions, not the rendered output)
```
Each layer compiles down to the one below it. The base primitives are the only things the engine needs to implement. Everything above them is expressed in s-expressions and lives in userland.
## Base Primitives
The engine implements nine primitives. Every musical sound that has ever existed can be constructed from compositions of these.
### osc — Periodic Waveform Generator
Generates a repeating waveform at a given frequency. This is the fundamental building block of pitched sound.
```
(osc :type sine :freq 440)
(osc :type saw :freq 110)
(osc :type square :freq 110 :pw 0.5)
(osc :type triangle :freq 220)
```
A sine wave is a pure tone — one frequency, no harmonics. A saw wave contains all harmonics at decreasing amplitude — it sounds bright and buzzy. A square wave contains odd harmonics — it sounds hollow. A triangle wave is like a softer square. The `:pw` parameter on square waves controls pulse width, which changes the harmonic content.
Every pitched instrument in existence is some combination of these waveforms shaped by the other primitives.
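To make the waveform maths concrete, here is a minimal Python sketch (the `osc` function and its parameters are illustrative, not the engine's API) of naive, non-band-limited versions of the four waveforms:

```python
import math

def osc(kind, freq, sample_rate=44100, n=100, pw=0.5):
    """Naive (non-band-limited) periodic waveform generator -- a sketch only."""
    out = []
    for i in range(n):
        phase = (freq * i / sample_rate) % 1.0       # position in the cycle, 0..1
        if kind == "sine":
            out.append(math.sin(2 * math.pi * phase))
        elif kind == "saw":
            out.append(2.0 * phase - 1.0)            # ramp from -1 up to +1
        elif kind == "square":
            out.append(1.0 if phase < pw else -1.0)  # :pw sets the pulse width
        elif kind == "triangle":
            out.append(1.0 - 4.0 * abs(phase - 0.5))
    return out
```

A production oscillator would band-limit these shapes (for example with polyBLEP or additive synthesis) to avoid aliasing; the naive forms are enough to illustrate the harmonic differences described above.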
### noise — Aperiodic Signal Generator
Generates random signal with no discernible pitch. Essential for percussion, texture, and breath sounds.
```
(noise :type white)
(noise :type pink)
(noise :type brown)
```
White noise has equal energy at all frequencies. Pink noise has equal energy per octave (more natural sounding). Brown noise rolls off more steeply at high frequencies (deep rumble). Hi-hats, snare rattle, wind, rain, vinyl crackle — all built from filtered noise.
### sample — Buffer Playback
Plays back a buffer of recorded audio. Bridges the gap between synthesis and sampling.
```
(sample :path "kick.wav")
(sample :path "vocal.wav" :speed 0.5 :loop true)
(sample :buffer :some-ref :start 0.1 :end 0.4)
```
### filter — Frequency Domain Shaping
Removes or emphasises frequencies from a signal. This is arguably the single most important effect in electronic music. The entire acid techno genre is built on sweeping a resonant low-pass filter over a saw wave.
```
(lpf :cutoff 800 :resonance 0.7) ; low-pass — removes highs
(hpf :cutoff 200 :resonance 0.3) ; high-pass — removes lows
(bpf :center 1000 :q 5) ; band-pass — isolates a band
(notch :center 1000 :q 2) ; notch — removes a band
```
The `:resonance` (or `:q`) parameter creates a peak at the cutoff frequency. High resonance produces the screaming, squelchy sound characteristic of the TB-303. At very high resonance, the filter begins to self-oscillate and becomes a sine wave generator — blurring the line between filtering and sound generation.
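The resonant peak falls naturally out of a standard two-pole design. Below is a hedged Python sketch of a Chamberlin state-variable low-pass filter, one common way to implement a `:resonance` control; the engine's actual filter topology may differ:

```python
import math

def svf_lowpass(signal, cutoff, resonance, sample_rate=44100):
    """Chamberlin state-variable low-pass -- a sketch, stable for cutoff
    well below sample_rate / 4. resonance in [0, 1): higher = sharper peak."""
    f = 2.0 * math.sin(math.pi * cutoff / sample_rate)  # frequency coefficient
    q = 1.0 - resonance                                  # damping term
    low = band = 0.0
    out = []
    for x in signal:
        low += f * band          # integrate band-pass into low-pass state
        high = x - low - q * band
        band += f * high         # integrate high-pass into band-pass state
        out.append(low)
    return out
```

As `resonance` approaches 1 the damping term approaches zero, which is exactly the self-oscillation regime described above.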
### env — Envelope Generator
A function from time to a scalar value. Shapes how parameters change over the duration of a note or event.
```
(adsr :a 0.01 :d 0.1 :s 0.0 :r 0.05)
(env :points [[0 0] [0.01 1.0] [0.1 0.3] [0.5 0]])
```
ADSR (attack, decay, sustain, release) is a convenience shorthand. The general form is an arbitrary list of time-value points with interpolation between them. Envelopes can control any parameter — amplitude, filter cutoff, pitch, panning, effect depth.
A techno kick drum is defined almost entirely by its envelopes: a very fast pitch envelope sweeping from ~200 Hz down to ~50 Hz, combined with a tight amplitude envelope. Change the envelope and the kick changes character completely.
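As a sketch of how the ADSR shorthand reduces to the general time-to-value form, here is a Python function (names hypothetical; `gate_time` is how long the note is held, assumed to be at least `a + d`) evaluating the envelope at time `t`:

```python
def adsr(t, a=0.01, d=0.1, s=0.0, r=0.05, gate_time=0.25):
    """Value of an ADSR envelope at time t (seconds) -- illustrative sketch.
    Assumes gate_time >= a + d; a real implementation handles early release."""
    if t < 0:
        return 0.0
    if t < a:                       # attack: ramp 0 -> 1
        return t / a
    if t < a + d:                   # decay: ramp 1 -> sustain level
        return 1.0 - (1.0 - s) * (t - a) / d
    if t < gate_time:               # sustain: hold
        return s
    if t < gate_time + r:           # release: ramp sustain -> 0
        return s * (1.0 - (t - gate_time) / r)
    return 0.0
```

Sampling this function per block and feeding the result into `mul` is all an amplitude envelope is; pointing it at cutoff or pitch instead gives the other uses described above.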
### mix — Additive Combination
Sums multiple signals together.
```
(mix :sources [signal-a signal-b signal-c] :levels [0.8 0.5 0.3])
```
This is how you layer sounds — multiple oscillators for a richer timbre, multiple instruments for a full mix, multiple tracks for a complete piece.
### mul — Multiplicative Combination
Multiplies two signals together. This single primitive covers an enormous range of musical operations.
```
(mul :a signal :b envelope) ; amplitude shaping (VCA)
(mul :a signal :b gate-pattern) ; rhythmic gating
(mul :a osc-a :b osc-b) ; ring modulation
```
When one signal is an envelope, `mul` acts as a VCA (voltage-controlled amplifier) — it shapes the volume. When one signal is a binary gate pattern (0s and 1s), it acts as a rhythmic sequencer. When both signals are oscillators, it produces ring modulation — metallic, bell-like tones. Same primitive, different uses depending on what you feed it.
### delay-line — Delayed Signal Copy
Produces a time-delayed copy of a signal, optionally fed back into itself.
```
(delay-line :time 0.375 :feedback 0.4 :mix 0.3)
```
This single primitive is the foundation of an enormous family of effects. Echo is a delay line with moderate delay time. Chorus is multiple short delay lines with modulated times. Flanging is a very short delay line with modulated time and high feedback. Reverb is a network of delay lines with varying times and cross-feedback. Comb filtering is a delay line with very short time and high feedback.
All of these effects can be built in userland from the base `delay-line` primitive.
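A sketch of the primitive itself, in Python with a circular buffer (parameter names mirror the s-expression above; this is illustrative, not the engine code):

```python
def delay_line(signal, time_s, feedback=0.4, mix=0.3, sample_rate=44100):
    """Feedback delay -- the basis of echo, chorus, flanging, and reverb."""
    d = max(1, int(time_s * sample_rate))  # delay length in samples
    buf = [0.0] * d                        # circular buffer of past samples
    out = []
    for i, x in enumerate(signal):
        delayed = buf[i % d]
        out.append((1.0 - mix) * x + mix * delayed)  # dry/wet blend
        buf[i % d] = x + feedback * delayed          # write input + feedback
    return out
```

Each of the effects listed above is this loop with a different `time_s` regime: tens of milliseconds and modulated for chorus and flanging, hundreds of milliseconds for echo, a handful of cross-fed instances for reverb.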
### compress — Dynamic Range Control
Reduces the difference between loud and quiet parts of a signal.
```
(compress :threshold -12 :ratio 4 :attack 0.01 :release 0.1)
```
In techno, sidechain compression — where the kick drum's amplitude controls the compression of other elements — creates the pumping, breathing effect that defines the genre. This is just `compress` with a sidechain input, which is itself just the kick's amplitude envelope controlling the compressor's gain reduction.
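The static part of a compressor is just a gain curve. A Python sketch of the dB-domain maths (the `:attack` and `:release` parameters would smooth this gain over time; that smoothing is omitted here):

```python
def compressor_gain_db(level_db, threshold=-12.0, ratio=4.0):
    """Static compressor curve: above threshold, output level rises only
    1 dB for every `ratio` dB of input. Returns gain reduction in dB."""
    if level_db <= threshold:
        return 0.0                      # below threshold: unity gain
    over = level_db - threshold         # dB above the threshold
    return -(over - over / ratio)       # negative = gain reduction
```

Sidechaining is then just a question of which signal's level feeds `level_db`: feed it the kick's envelope instead of the compressed signal's own level and the pumping effect described above falls out.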
## Why Only Nine Primitives
Every sound in music reduces to some combination of:
- **Periodic vibration** (osc) and **aperiodic vibration** (noise) — these are the only two categories of sound that exist
- **Recorded sound** (sample) — for when synthesis isn't practical
- **Spectral shaping** (filter) — sculpting which frequencies are present
- **Temporal shaping** (env) — controlling how things change over time
- **Additive combination** (mix) — layering signals
- **Multiplicative combination** (mul) — modulating signals with other signals
- **Time displacement** (delay-line) — creating echoes, space, and interference
- **Dynamic control** (compress) — managing loudness relationships
A violin is `osc` (strings vibrating) → `filter` (body resonance shaping the harmonics) → `env` (the bow attack and decay), with a slow `osc` modulating the pitch (vibrato from the left hand).
A techno kick is `osc :type sine` → `env` on pitch (fast sweep down) → `env` on amplitude (tight punch) → `compress` (glue and impact).
A reverberant cathedral is a network of `delay-line` primitives with cross-feedback and filtering.
The primitives are genre-agnostic. Genre emerges from how you compose them.
## Userland Layers
Everything above the base primitives is expressed as s-expressions and lives in userland. This means it's composable, shareable, and forkable.
### Time and Pattern
Time and pattern are not base primitives. They are control flow — userland functions that determine when and how the base primitives are invoked.
The engine provides one thing: a monotonically increasing sample count. Everything else derives from this:
```
sample_count → seconds (sample_count / sample_rate)
seconds → beats (seconds * bpm / 60)
beats → subdivisions (beats * steps_per_beat)
```
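The three conversions above are one-liners; a Python sketch (function names hypothetical):

```python
def samples_to_beats(sample_count, sample_rate=44100, bpm=130):
    """Derive musical time from the engine's only clock: the sample count."""
    seconds = sample_count / sample_rate  # sample_count -> seconds
    return seconds * bpm / 60.0           # seconds -> beats

def beats_to_steps(beats, steps_per_beat=4):
    """beats -> subdivisions (4 steps per beat = sixteenth notes)."""
    return beats * steps_per_beat
```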
A sequencer is just a function that maps a list of values onto time divisions and produces a control signal:
```
(defn step-seq [pattern step-duration]
  "Maps a pattern onto time, producing a gate signal"
  (let [step (mod (floor (/ current-beat step-duration)) (length pattern))]
    (nth pattern step)))
```
A four-on-the-floor kick pattern:
```
(step-seq [1 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0] 0.25)
```
This is just data parameterising when `mul` switches between 0 and 1. The "sequencer" dissolves — it's not a primitive, it's a userland function.
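The same function as a pure Python sketch, taking the current beat explicitly (the s-expression version reads `current-beat` from its environment):

```python
def step_seq(pattern, step_duration, current_beat):
    """Which value in `pattern` is active at `current_beat`?
    step_duration is in beats (0.25 = sixteenth notes)."""
    step = int(current_beat / step_duration) % len(pattern)  # wraps around
    return pattern[step]
```

Evaluated against the four-on-the-floor pattern above, it returns 1 exactly on beats 0, 1, 2, 3, and 0 on every sixteenth in between.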
Euclidean rhythms — which generate a huge range of useful percussion patterns from just two parameters — are a userland function:
```
(defn euclidean [steps hits]
  "Distributes hits as evenly as possible across steps"
  ...)
(euclidean 16 5) ; → [1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 0]
```
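A minimal Python sketch that reproduces the example above using floor division (the canonical Bjorklund algorithm can differ from this by a rotation for some inputs):

```python
def euclidean(steps, hits):
    """Spread `hits` onsets as evenly as possible across `steps` slots.
    Floor-based sketch; true Bjorklund output may differ by rotation."""
    pattern = [0] * steps
    for i in range(hits):
        pattern[(i * steps) // hits] = 1  # i-th onset at floor(i * steps / hits)
    return pattern
```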
Anyone can share alternative pattern generators on ActivityPub: probabilistic triggers, cellular automata, L-systems, Markov chains — all producing gate patterns that feed into `mul`.
### Pitch, Scales, and Tuning
The `osc` primitive takes frequency in Hz. A "note" is just a lookup from a table:
```
(defn note->freq [note octave tuning-table]
  (lookup tuning-table note octave))

; Western 12-tone equal temperament
(def 12tet
  {:C4 261.63 :Cs4 277.18 :D4 293.66 ...})

; Just intonation
(def just-intonation
  {:C4 261.63 :D4 294.33 :E4 327.03 ...})

; Javanese slendro (5-tone)
(def slendro
  {:1 250.0 :2 281.0 :3 316.0 :4 356.0 :5 397.0 ...})
```
A scale is a subset of a tuning table. A chord is multiple frequencies sounding simultaneously. None of this needs to be baked into the engine. Someone can share a microtonal tuning system, a historical temperament, or an entirely invented scale — all as userland s-expressions.
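For 12-tone equal temperament specifically, the whole table collapses to one formula: each semitone multiplies frequency by 2^(1/12). A Python sketch using MIDI note numbers (note 69 = A4 = 440 Hz):

```python
def midi_to_freq(note, a4=440.0):
    """12-TET: frequency doubles every 12 semitones; MIDI note 69 is A4."""
    return a4 * 2.0 ** ((note - 69) / 12.0)
```

Other tuning systems generally do not reduce to a closed form like this, which is why the general `note->freq` interface is a table lookup.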
### Instrument Definitions
A score says "violin" but the primitives don't know what a violin is. Instrument definitions bridge this gap:
```
(def-instrument :violin
  (fn [pitch velocity duration]
    (-> (mix
          (osc :type saw :freq pitch)
          (osc :type saw :freq (* pitch 1.002))) ; slight detune
        (lpf :cutoff (* pitch 4) :resonance 0.3)
        (mul (adsr :a 0.08 :d 0.1 :s 0.6 :r 0.15))
        (mul velocity))))
```
This is a crude approximation. Someone else could share a more sophisticated definition using additive synthesis with 30 harmonics matched to spectral analysis of a real instrument. Someone else might use granular synthesis from recorded samples. All interchangeable, all forkable, all expressed in the same base primitives.
### Arrangement and Structure
Song structure is a higher-level pattern over time:
```
(arrange :bpm 130 :sections [
  (section :name "intro" :bars 16 :tracks [kick hat])
  (section :name "build" :bars 16 :tracks [kick hat bass])
  (section :name "drop" :bars 32 :tracks [kick hat bass lead])
  (section :name "breakdown" :bars 16 :tracks [pad])
  (section :name "drop-2" :bars 32 :tracks [kick hat bass lead pad])
  (section :name "outro" :bars 16 :tracks [kick hat])])
```
This is just a higher-level pattern that controls which tracks are active at which times — which is just more gate signals feeding into `mul`. The entire arrangement layer dissolves into compositions of the base primitives.
## Score Compilation
Musical scores are declarative descriptions of music — a different serialisation format for the same underlying information that s-expressions express. artdag can compile from standard score formats into its s-expression IR.
### Supported Input Formats
**MusicXML** — The standard interchange format. MuseScore, Sibelius, and Finale all export it. Every note has explicit pitch, duration, voice, and position.
**MIDI** — Note-on/note-off events with pitch, velocity, and timing. Less expressive than MusicXML but universally supported.
**ABC notation** — Compact text format popular in folk traditions. Already close to a DSL.
**Lilypond** — Text-based notation that is essentially a music programming language.
### Compilation Mapping
The compiler translates score elements to s-expression compositions:
| Score element | Compiles to |
|---|---|
| Note | `osc` at frequency + `env` for duration + `mul` for velocity |
| Dynamic marking (pp, ff) | Gain scalar on `mul` |
| Articulation (staccato, legato) | Different `env` shapes |
| Instrument name | Instrument definition lookup (userland) |
| Tempo and time signature | Clock division parameters |
| Simultaneous parts | Parallel DAG branches → `mix` |
| Crescendo / diminuendo | `env` controlling gain over time |
| Slurs and ties | Modified `env` — sustained rather than re-triggered |
The s-expression layer becomes a universal intermediate representation for music. Any input format compiles down to it. The source doesn't matter — the DAG is the canonical form.
## Emotion Layer
Emotion in music is not mystical. Decades of music psychology research have quantified how specific musical properties correlate with emotional responses. These correlations are themselves functions — which means they can be userland layers in artdag.
### Dimensional Model
The emotion layer uses a dimensional model derived from Russell's circumplex model and Thayer's model of mood, with music-specific refinements from Juslin & Sloboda's research and the Geneva Emotional Music Scale (GEMS):
**Valence** — Positive to negative mood. Ranges from -1.0 (grief, despair) to +1.0 (joy, euphoria).
**Arousal** — Energy level. Ranges from 0.0 (calm, meditative) to 1.0 (frantic, explosive).
**Tension** — Unresolved expectation. Ranges from 0.0 (completely resolved, at rest) to 1.0 (maximum suspense, anticipation).
### Known Parameter Correlations
These are statistically validated correlations between acoustic features and reported emotional responses:
| Emotion dimension | Musical parameter | Correlation |
|---|---|---|
| Arousal | Tempo | Higher tempo → higher arousal |
| Arousal | Rhythmic density | More events per beat → higher arousal |
| Arousal | Amplitude envelope | Shorter attack → higher arousal |
| Valence | Mode / scale | Major → positive, minor → negative |
| Valence | Register | Higher pitch → brighter affect |
| Valence | Filter cutoff | Brighter timbre → more positive |
| Tension | Dissonance | More dissonant intervals → more tension |
| Tension | Resonance | Higher filter resonance → more tension |
| Tension | Harmonic rhythm | Unresolved chords → sustained tension |
| Tension | Rhythmic regularity | Less regular → more tension |
### Emotion-to-Parameter Maps
These mappings are userland s-expressions — shareable and forkable. Different maps for different contexts:
```
(def-emotion-map :dark-techno
  {:valence→scale      (fn [v] (if (< v 0) :minor :major))
   :valence→cutoff     (fn [v] (lerp 400 8000 (/ (+ v 1) 2))) ; normalise -1..1 to 0..1
   :arousal→bpm        (fn [a] (lerp 90 160 a))
   :arousal→density    (fn [a] (lerp 4 16 a))
   :tension→resonance  (fn [t] (lerp 0.1 0.95 t))
   :tension→dissonance (fn [t] (lerp 0 0.5 t))})
```
Someone scoring a film would use a different map. Someone making ambient would use another. Someone working in a non-Western musical tradition might create maps reflecting entirely different emotional associations. The emotion layer is culturally aware because it's parameterised, not hardcoded.
### Emotion Arcs
A track isn't one static emotion. It has a narrative arc — a trajectory through emotional space over time:
```
(emotion-arc :duration 360 ; 6 minutes
  :map :dark-techno
  :keyframes [
    [0:00 {:arousal 0.3  :valence 0.0  :tension 0.2}]  ; intro — neutral, calm
    [1:00 {:arousal 0.5  :valence -0.1 :tension 0.4}]  ; building
    [2:00 {:arousal 0.8  :valence -0.3 :tension 0.8}]  ; pre-drop tension
    [2:15 {:arousal 0.9  :valence 0.2  :tension 0.3}]  ; drop — release, euphoria
    [4:00 {:arousal 0.7  :valence -0.2 :tension 0.6}]  ; second build
    [4:30 {:arousal 0.95 :valence 0.3  :tension 0.2}]  ; peak
    [5:30 {:arousal 0.4  :valence 0.0  :tension 0.1}]  ; outro — winding down
  ]
  :interpolation :smooth)
```
The system interpolates between keyframes. The emotion-to-parameter maps translate the curves into concrete filter sweeps, tempo shifts, arrangement changes, and harmonic movement. The emotion arc is the compositional intent. The primitives are the realisation.
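Interpolation between keyframes is ordinary piecewise lerp per dimension. A Python sketch (linear only; a `:smooth` mode would substitute an easing curve for the linear fraction):

```python
def lerp(a, b, t):
    """Linear interpolation; t expected in [0, 1]."""
    return a + (b - a) * t

def arc_value(keyframes, dim, t):
    """Evaluate one emotion dimension at time t.
    keyframes: time-sorted [(time, {dim: value, ...}), ...]."""
    if t <= keyframes[0][0]:
        return keyframes[0][1][dim]      # clamp before the first keyframe
    for (t0, k0), (t1, k1) in zip(keyframes, keyframes[1:]):
        if t0 <= t <= t1:
            return lerp(k0[dim], k1[dim], (t - t0) / (t1 - t0))
    return keyframes[-1][1][dim]         # clamp after the last keyframe
```

Sampling `arc_value` once per block for each dimension, then pushing the results through an emotion-to-parameter map, yields the concrete filter sweeps and tempo shifts.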
### Audio-Visual Emotion Unification
Since artdag already has video primitives, the same emotion parameters that drive audio can simultaneously drive video:
```
(def-emotion-visual-map
  {:arousal→brightness    (fn [a] (lerp 0.2 1.0 a))
   :tension→contrast      (fn [t] (lerp 1.0 2.5 t))
   :valence→hue-shift     (fn [v] (lerp 240 60 (/ (+ v 1) 2))) ; normalise -1..1 to 0..1
   :tension→glitch-amount (fn [t] (lerp 0 0.8 t))})
```
High tension means more contrast, more glitch, more red. Euphoria means brighter, warmer, smoother. The emotion layer becomes a unified control surface for the entire audiovisual experience, with audio and video driven by the same DAG.
## GPU Execution
artdag's DAG architecture maps naturally to GPU computation.
### Why GPU for Audio
Conventional DAWs run serial plugin chains on the CPU. artdag's architecture is different — it has a DAG of primitives where nodes at the same level are independent and can execute in parallel.
The base primitives are embarrassingly parallel internally. Generating 128 samples of a sine wave is 128 independent `sin()` calculations. Applying an envelope to a block is 128 independent multiplications. Mixing two signals is 128 independent additions. This is exactly what GPUs are designed for.
### Specific Advantages
**Oscillator banks** — Generating thousands of samples across multiple oscillators simultaneously.
**Additive synthesis** — Summing hundreds of sine partials. Each partial is independent.
**FFT-based processing** — Spectral filtering, convolution reverb, spectral analysis. FFTs are inherently parallel.
**Granular synthesis** — Hundreds of tiny overlapping grains, each needing independent processing.
**Unified audio-visual pipeline** — Since artdag already runs video on the GPU, audio primitives on the same GPU eliminate data transfer overhead. A spectrum analysis of the bassline feeding into a video shader is just two nodes in the same DAG running on the same memory. Audio-reactive visuals come for free.
### Execution Model
The DAG is evaluated per-block (64 or 128 samples at a time). At each level of the DAG, all nodes execute in parallel on the GPU. The base primitives operate on blocks internally using compute shaders (Vulkan compute or OpenCL).
The per-block approach balances parallelism with latency. For generative rendering (non-realtime), block size can be increased for maximum throughput. For live performance, smaller blocks reduce latency at the cost of some GPU efficiency.
## Examples
### Techno Kick Drum
A techno kick is a sine wave with a fast pitch sweep downward and a tight amplitude envelope:
```
(-> (osc :type sine :freq 60)
    (pitch-env :start 200 :end 60 :time 0.05)
    (mul (adsr :a 0.001 :d 0.15 :s 0.0 :r 0.1))
    (compress :threshold -6 :ratio 8))
```
### Acid Bassline
The TB-303 sound — a saw wave through a resonant low-pass filter with per-step cutoff modulation:
```
(let [notes   (step-seq [C2 C2 _ C2 _ _ Eb2 _ C2 _ _ _ F2 _ C2 _] 0.25)
      cutoffs (step-seq [800 2000 400 3000 200 600 4000 300
                         800 2000 400 3000 200 600 4000 300] 0.25)]
  (-> (osc :type saw :freq (note->freq notes :12tet))
      (lpf :cutoff cutoffs :resonance 0.85)
      (mul (adsr :a 0.01 :d 0.2 :s 0.3 :r 0.05))
      (distortion :drive 0.5)))
```
### Ambient Pad
Multiple detuned oscillators with slow filter movement and reverb:
```
(-> (mix
      (osc :type saw :freq 220)
      (osc :type saw :freq 220.5)
      (osc :type saw :freq 219.3)
      (osc :type saw :freq 440.8))
    (lpf :cutoff (mod :source (lfo :type sine :freq 0.05)
                      :range [400 2000])
         :resonance 0.3)
    (mul (adsr :a 2.0 :d 0.5 :s 0.7 :r 3.0))
    (delay-line :time 0.375 :feedback 0.5 :mix 0.3)
    (delay-line :time 0.5 :feedback 0.6 :mix 0.4))
```
### Complete Minimal Techno Track (Structural Sketch)
```
(let [bpm  132
      kick (-> (osc :type sine :freq 55)
               (pitch-env :start 180 :end 55 :time 0.04)
               (mul (adsr :a 0.001 :d 0.2 :s 0.0 :r 0.1))
               (compress :threshold -8 :ratio 6))
      hat  (-> (noise :type white)
               (hpf :cutoff 8000)
               (mul (adsr :a 0.001 :d 0.05 :s 0.0 :r 0.02))
               (mul 0.3))
      bass (-> (osc :type saw :freq (step-seq [C1 _ _ C1 _ _ C1 _ _ _ C1 _ C1 _ _ _] 0.25))
               (lpf :cutoff 600 :resonance 0.6)
               (mul (adsr :a 0.01 :d 0.15 :s 0.4 :r 0.1)))
      perc (-> (noise :type pink)
               (bpf :center 1200 :q 8)
               (mul (adsr :a 0.001 :d 0.08 :s 0.0 :r 0.05))
               (mul (euclidean 16 5))
               (mul 0.4))]
  (arrange :bpm bpm :sections [
    (section :bars 16 :tracks [kick])
    (section :bars 16 :tracks [kick hat])
    (section :bars 32 :tracks [kick hat bass])
    (section :bars 16 :tracks [kick hat bass perc])
    (section :bars 32 :tracks [kick hat bass perc])
    (section :bars 16 :tracks [kick hat])
    (section :bars 8 :tracks [kick])]))
```
## ActivityPub Federation
On ActivityPub, what gets shared is the s-expression — the compositional description, not the rendered audio. This enables:
**Forking** — Someone takes a track, swaps the kick synthesis for their own, pushes the filter cutoff higher, changes the pattern from euclidean to probabilistic, and shares their version. The DAG preserves lineage back to the original.
**Component reuse** — Someone shares a particularly good kick drum synthesis as a standalone s-expression. Others incorporate it into their tracks with attribution.
**Emotion arc reuse** — Someone shares an emotion arc that creates a compelling tension-and-release narrative. Others apply it to completely different sounds. The emotional journey is the same but the sonic palette is different.
**Instrument definition sharing** — Someone builds a detailed instrument model using additive synthesis matched to spectral analysis of a real instrument. The entire community benefits.
**Tuning system sharing** — Someone encodes a historical temperament or a microtonal system. Others use it in their compositions.
**Collaborative composition** — Multiple people contribute different layers. The DAG makes it clear who contributed what. Attribution is structural, not metadata.
Because the shared object is a declarative description rather than rendered audio, it is:
- **Editable** — change any parameter and the whole thing recomputes
- **Decomposable** — pull out any component and reuse it
- **Inspectable** — read the s-expression and understand what it does
- **Attributable** — the DAG records the provenance of every component
- **Lightweight** — an entire track description is kilobytes of text, not megabytes of audio
## Comparison with AI Music Generation
AI music generators (Suno, Udio, MusicLM, MusicGen) train neural networks on audio data and produce rendered waveforms from text prompts. The output is an opaque audio file — there is no structure underneath, no way to reach in and tweak the filter resonance or change the kick pattern.
artdag's approach is fundamentally different:
| | AI generation | artdag |
|---|---|---|
| Output | Rendered audio (opaque) | S-expression DAG (transparent) |
| Editable | No | Yes, at any level |
| Decomposable | No | Yes, any node is reusable |
| Inspectable | No (black box) | Yes, human-readable |
| Forkable | No | Yes, with attribution |
| Deterministic | No (stochastic) | Yes (same input → same output) |
AI could complement artdag by generating s-expressions rather than audio — suggesting patterns, parameter values, or arrangement structures that remain within the system's grammar and are therefore transparent, editable, and shareable.
## Accessibility
artdag's architecture has specific advantages for deaf and hard-of-hearing users:
**Visual feedback** — The DAG is inherently visual. Spectrum analysers, waveform displays, and pattern visualisations are just additional nodes in the same graph. Because audio and video run on the same GPU, visual representations of the audio have zero overhead.
**Tactile monitoring** — Devices like the SubPac (wearable subwoofer) allow users to physically feel bass and sub-bass frequencies. The emotion layer's arousal and tension parameters can drive both audio intensity and tactile feedback intensity simultaneously.
**Declarative composition** — Because music is described as s-expressions rather than performed in real-time, the feedback loop doesn't require hearing in the traditional sense. A user can inspect the DAG, read the parameters, view the spectrum, feel the vibration, and adjust — all without relying solely on auditory perception.
**Visual reference matching** — A professional reference track can be loaded as a spectrum profile. The user composes by matching their track's visual spectral shape to the reference, a technique used by hearing producers as well.
## Summary
Nine base primitives — `osc`, `noise`, `sample`, `filter`, `env`, `mix`, `mul`, `delay-line`, `compress` — form a complete foundation for all music. Pattern, time, scales, tuning, instruments, arrangement, and emotion are all userland compositions of these primitives, expressed as s-expressions, shareable on ActivityPub.
Genre is not a property of the engine. It is an emergent property of how the primitives are composed. The same system that produces acid techno produces orchestral scores produces ambient soundscapes produces things that don't have names yet.
The DAG is the universal intermediate representation. Scores compile to it. Emotion arcs compile to it. Humans write it directly. What gets shared on the fediverse is the s-expression — not rendered audio but the recipe, forkable and attributable, a living document rather than a frozen artifact.