Squashed 'core/' content from commit 4957443

git-subtree-dir: core
git-subtree-split: 4957443184ae0eb6323635a90a19acffb3e01d07

docs/EXECUTION_MODEL.md (new file, 384 lines)
# Art DAG 3-Phase Execution Model

## Overview

The execution model separates DAG processing into three distinct phases:

```
Recipe + Inputs → ANALYZE → Analysis Results
                                 ↓
Analysis + Recipe → PLAN → Execution Plan (with cache IDs)
                                 ↓
Execution Plan → EXECUTE → Cached Results
```

This separation enables:

1. **Incremental development** - Re-run recipes without reprocessing unchanged steps
2. **Parallel execution** - Independent steps run concurrently via Celery
3. **Deterministic caching** - Same inputs always produce same cache IDs
4. **Cost estimation** - Plan phase can estimate work before executing
## Phase 1: Analysis

### Purpose

Extract features from input media that inform downstream processing decisions.

### Inputs

- Recipe YAML with input references
- Input media files (by content hash)

### Outputs

Analysis results stored as JSON, keyed by input hash:

```python
@dataclass
class AnalysisResult:
    input_hash: str
    features: Dict[str, Any]

    # Audio features
    beats: Optional[List[float]]      # Beat times in seconds
    downbeats: Optional[List[float]]  # Bar-start times
    tempo: Optional[float]            # BPM
    energy: Optional[List[Tuple[float, float]]]               # (time, value) envelope
    spectrum: Optional[Dict[str, List[Tuple[float, float]]]]  # Band envelopes

    # Video features
    duration: float
    frame_rate: float
    dimensions: Tuple[int, int]
    motion_tempo: Optional[float]     # Estimated BPM from motion
```

### Implementation

```python
class Analyzer:
    def analyze(self, input_hash: str, features: List[str]) -> AnalysisResult:
        """Extract requested features from input."""

    def analyze_audio(self, path: Path) -> AudioFeatures:
        """Extract all audio features using librosa/essentia."""

    def analyze_video(self, path: Path) -> VideoFeatures:
        """Extract video metadata and motion analysis."""
```

### Caching

Analysis results are cached by:

```
analysis_cache_id = SHA3-256(input_hash + sorted(feature_names))
```
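The formula above can be sketched in Python. This is a minimal illustration of the sorted-feature-name idea only; the real implementation and its exact string encoding are not shown in this document:

```python
import hashlib

def analysis_cache_id(input_hash: str, feature_names: list) -> str:
    """SHA3-256 over the input hash plus the sorted feature names."""
    payload = input_hash + "".join(sorted(feature_names))
    return hashlib.sha3_256(payload.encode()).hexdigest()

# Feature order does not matter: sorting gives a stable ID.
a = analysis_cache_id("abc123", ["beats", "tempo"])
b = analysis_cache_id("abc123", ["tempo", "beats"])
assert a == b
```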
## Phase 2: Planning

### Purpose

Convert recipe + analysis into a complete execution plan with pre-computed cache IDs.

### Inputs

- Recipe YAML (parsed)
- Analysis results for all inputs
- Recipe parameters (user-supplied values)

### Outputs

An ExecutionPlan containing ordered steps, each with a pre-computed cache ID:

```python
@dataclass
class ExecutionStep:
    step_id: str               # Unique identifier
    node_type: str             # Primitive type (SOURCE, SEQUENCE, etc.)
    config: Dict[str, Any]     # Node configuration
    input_steps: List[str]     # IDs of steps this depends on
    cache_id: str              # Pre-computed: hash(inputs + config)
    estimated_duration: float  # Optional: for progress reporting

@dataclass
class ExecutionPlan:
    plan_id: str                # Hash of entire plan
    recipe_id: str              # Source recipe
    steps: List[ExecutionStep]  # Topologically sorted
    analysis: Dict[str, AnalysisResult]
    output_step: str            # Final step ID

    def compute_cache_ids(self):
        """Compute all cache IDs in dependency order."""
```

### Cache ID Computation

Cache IDs are computed in topological order so each step's cache ID incorporates its inputs' cache IDs:

```python
def compute_cache_id(step: ExecutionStep, resolved_inputs: Dict[str, str]) -> str:
    """
    Cache ID = SHA3-256(
        node_type +
        canonical_json(config) +
        sorted([input_cache_ids])
    )
    """
    components = [
        step.node_type,
        json.dumps(step.config, sort_keys=True),
        *sorted(resolved_inputs[s] for s in step.input_steps)
    ]
    return sha3_256('|'.join(components))
```

### Plan Generation

The planner expands recipe nodes into concrete steps:

1. **SOURCE nodes** → Direct step with input hash as cache ID
2. **ANALYZE nodes** → Step that references analysis results
3. **TRANSFORM nodes** → Step with static config
4. **TRANSFORM_DYNAMIC nodes** → Expanded to per-frame steps (or use BIND output)
5. **SEQUENCE nodes** → Tree reduction for parallel composition
6. **MAP nodes** → Expanded to N parallel steps + reduction
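As an illustration of item 6, MAP expansion could look like the sketch below. The `Step` dataclass is a simplified stand-in for `ExecutionStep`, and `expand_map` is a hypothetical helper name, not an API from the codebase:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Step:
    """Simplified stand-in for ExecutionStep."""
    step_id: str
    node_type: str
    input_steps: List[str] = field(default_factory=list)

def expand_map(map_id: str, item_ids: List[str]) -> List[Step]:
    """Expand a MAP node: one step per item, plus a final reduction step."""
    per_item = [Step(f"{map_id}_{i}", "TRANSFORM", [item])
                for i, item in enumerate(item_ids)]
    reduce_step = Step(f"{map_id}_reduce", "SEQUENCE",
                       [s.step_id for s in per_item])
    return per_item + [reduce_step]

steps = expand_map("slices", ["video_0", "video_1", "video_2"])
# 3 parallel per-item steps followed by 1 reduction step
```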
### Tree Reduction for Composition

Instead of sequential pairwise composition:

```
A → B → C → D   (3 sequential steps)
```

Use parallel tree reduction:

```
A ─┬─ AB ─┬─ ABCD
B ─┘      │
C ─┬─ CD ─┘
D ─┘

Level 0: [A, B, C, D]  (4 parallel)
Level 1: [AB, CD]      (2 parallel)
Level 2: [ABCD]        (1 final)
```

This reduces O(N) to O(log N) levels.
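The pairing can be sketched as follows (illustrative only: strings stand in for media, so concatenation stands in for composition):

```python
def reduction_levels(items):
    """Pair adjacent items level by level until one result remains."""
    levels = [items]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        levels.append([prev[i] + prev[i + 1] if i + 1 < len(prev) else prev[i]
                       for i in range(0, len(prev), 2)])
    return levels

# ["A", "B", "C", "D"] reduces in two levels: [AB, CD], then [ABCD].
```

An odd leftover item is carried up unchanged, so the scheme works for any N.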
## Phase 3: Execution

### Purpose

Execute the plan, skipping steps with cached results.

### Inputs

- ExecutionPlan with pre-computed cache IDs
- Cache state (which IDs already exist)

### Process

1. **Claim Check**: For each step, atomically check whether the result is cached
2. **Task Dispatch**: Uncached steps are dispatched to Celery workers
3. **Parallel Execution**: Independent steps run concurrently
4. **Result Storage**: Each step stores its result under its cache ID
5. **Progress Tracking**: Real-time status updates

### Hash-Based Task Claiming

Prevents duplicate work when multiple workers process the same plan:

```lua
-- Redis Lua script for atomic claim
local key = KEYS[1]
local data = redis.call('GET', key)
if data then
    local status = cjson.decode(data)
    if status.status == 'running' or
       status.status == 'completed' or
       status.status == 'cached' then
        return 0  -- Already claimed/done
    end
end
local claim_data = ARGV[1]
local ttl = tonumber(ARGV[2])
redis.call('SETEX', key, ttl, claim_data)
return 1  -- Successfully claimed
```

### Celery Task Structure

```python
@app.task(bind=True)
def execute_step(self, step_json: str, plan_id: str) -> dict:
    """Execute a single step with caching."""
    step = ExecutionStep.from_json(step_json)

    # Check cache first
    if cache.has(step.cache_id):
        return {'status': 'cached', 'cache_id': step.cache_id}

    # Try to claim this work
    if not claim_task(step.cache_id, self.request.id):
        # Another worker is handling it; wait for its result
        return wait_for_result(step.cache_id)

    # Do the work
    executor = get_executor(step.node_type)
    input_paths = [cache.get(s) for s in step.input_steps]
    output_path = cache.get_output_path(step.cache_id)

    result_path = executor.execute(step.config, input_paths, output_path)
    cache.put(step.cache_id, result_path)

    return {'status': 'completed', 'cache_id': step.cache_id}
```

### Execution Orchestration

```python
class PlanExecutor:
    def execute(self, plan: ExecutionPlan) -> ExecutionResult:
        """Execute plan with parallel Celery tasks."""

        # Group steps by level (steps at the same level can run in parallel)
        levels = self.compute_dependency_levels(plan.steps)

        for level_steps in levels:
            # Dispatch all uncached steps at this level
            tasks = [
                execute_step.delay(step.to_json(), plan.plan_id)
                for step in level_steps
                if not self.cache.has(step.cache_id)
            ]

            # Wait for level completion
            results = [task.get() for task in tasks]

        return self.collect_results(plan)
```
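`compute_dependency_levels` is not shown in this document; a plain topological-depth grouping consistent with the level-by-level dispatch above could be sketched like this (the `deps` mapping is a simplified input, not the real `plan.steps` structure):

```python
def compute_dependency_levels(deps):
    """Group step IDs by dependency depth.

    `deps` maps step_id -> list of input step IDs. A step's level is
    0 if it has no inputs, else 1 + the max level of its inputs, so
    all steps within one level are mutually independent.
    """
    level_of = {}

    def level(sid):
        if sid not in level_of:
            inputs = deps[sid]
            level_of[sid] = 0 if not inputs else 1 + max(level(d) for d in inputs)
        return level_of[sid]

    for sid in deps:
        level(sid)
    levels = [[] for _ in range(max(level_of.values()) + 1)]
    for sid, lvl in level_of.items():
        levels[lvl].append(sid)
    return levels

dag = {"a": [], "b": [], "ab": ["a", "b"], "final": ["ab"]}
# → [["a", "b"], ["ab"], ["final"]]
```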
## Data Flow Example

### Recipe: beat-cuts

```yaml
nodes:
  - id: music
    type: SOURCE
    config: { input: true }

  - id: beats
    type: ANALYZE
    config: { feature: beats }
    inputs: [music]

  - id: videos
    type: SOURCE_LIST
    config: { input: true }

  - id: slices
    type: MAP
    config: { operation: RANDOM_SLICE }
    inputs:
      items: videos
      timing: beats

  - id: final
    type: SEQUENCE
    inputs: [slices]
```

### Phase 1: Analysis

```python
# Input: music file with hash abc123
analysis = {
    'abc123': AnalysisResult(
        beats=[0.0, 0.48, 0.96, 1.44, ...],
        tempo=125.0,
        duration=180.0
    )
}
```

### Phase 2: Planning

```python
# Expands MAP into concrete steps
plan = ExecutionPlan(
    steps=[
        # Source steps
        ExecutionStep(step_id='music', cache_id='abc123', ...),
        ExecutionStep(step_id='video_0', cache_id='def456', ...),
        ExecutionStep(step_id='video_1', cache_id='ghi789', ...),

        # Slice steps (one per beat group)
        ExecutionStep(step_id='slice_0', cache_id='hash(video_0+timing)', ...),
        ExecutionStep(step_id='slice_1', cache_id='hash(video_1+timing)', ...),
        ...

        # Tree reduction for the sequence
        ExecutionStep(step_id='seq_0_1', input_steps=['slice_0', 'slice_1'], ...),
        ExecutionStep(step_id='seq_2_3', input_steps=['slice_2', 'slice_3'], ...),
        ExecutionStep(step_id='seq_final', input_steps=['seq_0_1', 'seq_2_3'], ...),
    ]
)
```

### Phase 3: Execution

```
Level 0: [music, video_0, video_1]            → all cached (SOURCE)
Level 1: [slice_0, slice_1, slice_2, slice_3] → 4 parallel tasks
Level 2: [seq_0_1, seq_2_3]                   → 2 parallel SEQUENCE tasks
Level 3: [seq_final]                          → 1 final SEQUENCE task
```
## File Structure

```
artdag/
├── artdag/
│   ├── analysis/
│   │   ├── __init__.py
│   │   ├── analyzer.py        # Main Analyzer class
│   │   ├── audio.py           # Audio feature extraction
│   │   └── video.py           # Video feature extraction
│   ├── planning/
│   │   ├── __init__.py
│   │   ├── planner.py         # RecipePlanner class
│   │   ├── schema.py          # ExecutionPlan, ExecutionStep
│   │   └── tree_reduction.py  # Parallel composition optimizer
│   └── execution/
│       ├── __init__.py
│       ├── executor.py        # PlanExecutor class
│       └── claiming.py        # Hash-based task claiming

art-celery/
├── tasks/
│   ├── __init__.py
│   ├── analyze.py             # analyze_inputs task
│   ├── plan.py                # generate_plan task
│   ├── execute.py             # execute_step task
│   └── orchestrate.py         # run_plan (coordinates all)
├── claiming.py                # Redis Lua scripts
└── ...
```

## CLI Interface

```bash
# Full pipeline
artdag run-recipe recipes/beat-cuts/recipe.yaml \
    -i music:abc123 \
    -i videos:def456,ghi789

# Phase by phase
artdag analyze recipes/beat-cuts/recipe.yaml -i music:abc123
# → outputs analysis.json

artdag plan recipes/beat-cuts/recipe.yaml --analysis analysis.json
# → outputs plan.json

artdag execute plan.json
# → runs with caching, skips completed steps

# Dry run (show what would execute)
artdag execute plan.json --dry-run
# → shows which steps are cached vs need execution
```

## Benefits

1. **Development Speed**: Change recipe, re-run → only affected steps execute
2. **Parallelism**: Independent steps run on multiple Celery workers
3. **Reproducibility**: Same inputs + recipe = same cache IDs = same output
4. **Visibility**: Plan shows exactly what will happen before execution
5. **Cost Control**: Estimate compute before committing resources
6. **Fault Tolerance**: Failed runs resume from the last successful step
docs/IPFS_PRIMARY_ARCHITECTURE.md (new file, 443 lines)
# IPFS-Primary Architecture (Sketch)

A simplified L1 architecture for large-scale distributed rendering where IPFS is the primary data store.

## Current vs Simplified

| Component | Current | Simplified |
|-----------|---------|------------|
| Local cache | Custom, per-worker | IPFS node handles it |
| Redis content_index | content_hash → node_id | Eliminated |
| Redis ipfs_index | content_hash → ipfs_cid | Eliminated |
| Step inputs | File paths | IPFS CIDs |
| Step outputs | File path + CID | Just CID |
| Cache lookup | Local → Redis → IPFS | Just IPFS |

## Core Principle

**Steps receive CIDs, produce CIDs. No file paths cross machine boundaries.**

```
Step input:  [cid1, cid2, ...]
Step output: cid_out
```

## Worker Architecture

Each worker runs:

```
┌─────────────────────────────────────┐
│            Worker Node              │
│                                     │
│  ┌───────────┐    ┌──────────────┐  │
│  │  Celery   │────│  IPFS Node   │  │
│  │  Worker   │    │   (local)    │  │
│  └─────┬─────┘    └──────┬───────┘  │
│        │                 │          │
│        │          ┌──────┴─────┐    │
│        │          │   Local    │    │
│        │          │ Blockstore │    │
│        │          └────────────┘    │
│   ┌────┴────┐                       │
│   │  /tmp   │ (ephemeral workspace) │
│   └─────────┘                       │
└──────────────────┬──────────────────┘
                   │ IPFS libp2p
                   ▼
           ┌─────────────┐
           │ Other IPFS  │
           │   Nodes     │
           └─────────────┘
```
## Execution Flow

### 1. Plan Generation (unchanged)

```python
plan = planner.plan(recipe, input_hashes)
# plan.steps[].cache_id = deterministic hash
```

### 2. Input Registration

Before execution, register inputs with IPFS:

```python
input_cids = {}
for name, path in inputs.items():
    cid = ipfs.add(path)
    input_cids[name] = cid

# Plan now carries CIDs
plan.input_cids = input_cids
```

### 3. Step Execution

```python
@celery.task
def execute_step(step_json: str, input_cids: dict[str, str]) -> str:
    """Execute step, return output CID."""
    step = ExecutionStep.from_json(step_json)

    # Check if already computed (by cache_id as IPNS key or DHT lookup)
    existing_cid = ipfs.resolve(f"/ipns/{step.cache_id}")
    if existing_cid:
        return existing_cid

    # Fetch inputs from IPFS → local temp files
    input_paths = []
    for input_step_id in step.input_steps:
        cid = input_cids[input_step_id]
        path = ipfs.get(cid, f"/tmp/{cid}")  # IPFS node caches automatically
        input_paths.append(path)

    # Execute
    output_path = f"/tmp/{step.cache_id}.mkv"
    executor = get_executor(step.node_type)
    executor.execute(step.config, input_paths, output_path)

    # Add output to IPFS
    output_cid = ipfs.add(output_path)

    # Publish cache_id → CID mapping (optional, for cache hits)
    ipfs.name_publish(step.cache_id, output_cid)

    # Clean up temp files
    cleanup_temp(input_paths + [output_path])

    return output_cid
```

### 4. Orchestration

```python
@celery.task
def run_plan(plan_json: str) -> str:
    """Execute plan, return final output CID."""
    plan = ExecutionPlan.from_json(plan_json)

    # CID results accumulate as steps complete
    cid_results = dict(plan.input_cids)

    for level in plan.get_steps_by_level():
        # Parallel execution within the level
        tasks = []
        for step in level:
            step_input_cids = {
                sid: cid_results[sid]
                for sid in step.input_steps
            }
            tasks.append(execute_step.s(step.to_json(), step_input_cids))

        # Wait for level to complete
        results = group(tasks).apply_async().get()

        # Record output CIDs
        for step, cid in zip(level, results):
            cid_results[step.step_id] = cid

    return cid_results[plan.output_step]
```
## What's Eliminated

### No more Redis indexes

```python
# BEFORE: Complex index management
self._set_content_index(content_hash, node_id)   # Redis + local
self._set_ipfs_index(content_hash, ipfs_cid)     # Redis + local
node_id = self._get_content_index(content_hash)  # Check Redis, fall back to local

# AFTER: Just CIDs
output_cid = ipfs.add(output_path)
return output_cid
```

### No more local cache management

```python
# BEFORE: Custom cache with entries, metadata, cleanup
cache.put(node_id, source_path, node_type, execution_time)
cache.get(node_id)
cache.has(node_id)
cache.cleanup_lru()

# AFTER: IPFS handles it
ipfs.add(path)  # Store
ipfs.get(cid)   # Retrieve (cached by IPFS node)
ipfs.pin(cid)   # Keep permanently
ipfs.gc()       # Clean up unpinned content
```

### No more content_hash vs node_id confusion

```python
# BEFORE: Two identifiers
content_hash = sha3_256(file_bytes)  # What the file IS
node_id = cache_id                   # What computation produced it
# Indexes are needed to map between them

# AFTER: One identifier
cid = ipfs.add(file)  # Content-addressed, includes the hash
# The CID IS the identifier
```

## Cache Hit Detection

Three options:

### Option A: IPNS (mutable names)

```python
# Publish: cache_id → CID
ipfs.name_publish(key=cache_id, value=output_cid)

# Look up before executing
existing = ipfs.name_resolve(cache_id)
if existing:
    return existing  # Cache hit
```

### Option B: DHT record

```python
# Store in DHT: cache_id → CID
ipfs.dht_put(cache_id, output_cid)

# Lookup
existing = ipfs.dht_get(cache_id)
```

### Option C: Redis (minimal)

Keep Redis just for the cache_id → CID mapping:

```python
# Store
redis.hset("artdag:cache", cache_id, output_cid)

# Lookup
existing = redis.hget("artdag:cache", cache_id)
```

This is simpler than the current approach: one hash, one mapping, no content_hash/node_id confusion.
## Claiming (Preventing Duplicate Work)

Redis is still needed for atomic claiming:

```python
# Claim before executing
claimed = redis.set(f"artdag:claim:{cache_id}", worker_id, nx=True, ex=300)
if not claimed:
    # Another worker is doing it - wait for its result
    return wait_for_result(cache_id)
```

Or use IPFS pubsub for coordination.
## Data Flow Diagram

```
        ┌─────────────┐
        │   Recipe    │
        │  + Inputs   │
        └──────┬──────┘
               │
               ▼
        ┌─────────────┐
        │   Planner   │
        │  (compute   │
        │  cache_ids) │
        └──────┬──────┘
               │
               ▼
┌─────────────────────────────────┐
│  ExecutionPlan                  │
│  - steps with cache_ids         │
│  - input_cids (from ipfs.add)   │
└────────────────┬────────────────┘
                 │
    ┌────────────┼────────────┐
    ▼            ▼            ▼
┌────────┐   ┌────────┐   ┌────────┐
│Worker 1│   │Worker 2│   │Worker 3│
│        │   │        │   │        │
│  IPFS  │◄──│  IPFS  │◄──│  IPFS  │
│  Node  │──►│  Node  │──►│  Node  │
└───┬────┘   └───┬────┘   └───┬────┘
    │            │            │
    └────────────┼────────────┘
                 │
                 ▼
        ┌─────────────┐
        │  Final CID  │
        │  (output)   │
        └─────────────┘
```

## Benefits

1. **Simpler code** - No custom cache, no dual indexes
2. **Automatic distribution** - IPFS handles replication
3. **Content verification** - CIDs are self-verifying
4. **Scalable** - Adding workers adds IPFS nodes and cache capacity
5. **Resilient** - Any node can serve any content

## Tradeoffs

1. **IPFS dependency** - Every worker needs an IPFS node
2. **Initial fetch latency** - The first fetch may be slower than local disk
3. **IPNS latency** - Name resolution can be slow (Option C avoids this)
## Trust Domains (Cluster Key)

Systems can share work through IPFS, but how do you trust them?

**Problem:** A malicious system could return wrong CIDs for computed steps.

**Solution:** A cluster key creates isolated trust domains:

```bash
export ARTDAG_CLUSTER_KEY="my-secret-shared-key"
```

**How it works:**

- The cluster key is mixed into all cache_id computations
- Systems with the same key produce the same cache_ids
- Systems with different keys have separate cache namespaces
- Only share the key with trusted partners

```
cache_id = SHA3-256(cluster_key + node_type + config + inputs)
```

**Trust model:**

| Scenario | Same Key? | Can Share Work? |
|----------|-----------|-----------------|
| Same organization | Yes | Yes |
| Trusted partner | Yes (shared) | Yes |
| Unknown system | No | No (different cache_ids) |

**Configuration:**

```yaml
# docker-compose.yml
environment:
  - ARTDAG_CLUSTER_KEY=your-secret-key-here
```

**Programmatic:**

```python
from artdag.planning.schema import set_cluster_key
set_cluster_key("my-secret-key")
```
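The key mixing can be sketched as follows. This is a minimal illustration of the namespace-separation idea; the real encoding inside `artdag.planning.schema` is not shown in this document:

```python
import hashlib
import json

def cache_id(cluster_key, node_type, config, input_ids):
    """Mix the cluster key into the cache-ID hash so that different
    keys yield disjoint cache namespaces."""
    payload = "|".join([cluster_key, node_type,
                        json.dumps(config, sort_keys=True), *sorted(input_ids)])
    return hashlib.sha3_256(payload.encode()).hexdigest()

# Same key → same ID; different key → different ID, so no sharing.
same_key = cache_id("key-a", "SEQUENCE", {}, ["x"]) == cache_id("key-a", "SEQUENCE", {}, ["x"])
other_key = cache_id("key-a", "SEQUENCE", {}, ["x"]) == cache_id("key-b", "SEQUENCE", {}, ["x"])
assert same_key and not other_key
```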
## Implementation

The simplified architecture is implemented in `art-celery/`:

| File | Purpose |
|------|---------|
| `hybrid_state.py` | Hybrid state manager (Redis + IPNS) |
| `tasks/execute_cid.py` | Step execution with CIDs |
| `tasks/analyze_cid.py` | Analysis with CIDs |
| `tasks/orchestrate_cid.py` | Full pipeline orchestration |

### Key Functions

**Registration (local → IPFS):**

- `register_input_cid(path)` → `{cid, content_hash}`
- `register_recipe_cid(path)` → `{cid, name, version}`

**Analysis:**

- `analyze_input_cid(input_cid, input_hash, features)` → `{analysis_cid}`

**Planning:**

- `generate_plan_cid(recipe_cid, input_cids, input_hashes, analysis_cids)` → `{plan_cid}`

**Execution:**

- `execute_step_cid(step_json, input_cids)` → `{cid}`
- `execute_plan_from_cid(plan_cid, input_cids)` → `{output_cid}`

**Full Pipeline:**

- `run_recipe_cid(recipe_cid, input_cids, input_hashes)` → `{output_cid, all_cids}`
- `run_from_local(recipe_path, input_paths)` → registers + runs

### Hybrid State Manager

For distributed L1 coordination, use the `HybridStateManager`, which provides:

**Fast path (local Redis):**

- `get_cached_cid(cache_id)` / `set_cached_cid(cache_id, cid)` - microsecond lookups
- `try_claim(cache_id, worker_id)` / `release_claim(cache_id)` - atomic claiming
- `get_analysis_cid()` / `set_analysis_cid()` - analysis cache
- `get_plan_cid()` / `set_plan_cid()` - plan cache
- `get_run_cid()` / `set_run_cid()` - run cache

**Slow path (background IPNS sync):**

- Periodically syncs local state with global IPNS state (default: every 30s)
- Pulls new entries from remote nodes
- Pushes local updates to IPNS

**Configuration:**

```bash
# Enable IPNS sync
export ARTDAG_IPNS_SYNC=true
export ARTDAG_IPNS_SYNC_INTERVAL=30  # seconds
```

**Usage:**

```python
from hybrid_state import get_state_manager

state = get_state_manager()

# Fast local lookup
cid = state.get_cached_cid(cache_id)

# Fast local write (synced in the background)
state.set_cached_cid(cache_id, output_cid)

# Atomic claim
if state.try_claim(cache_id, worker_id):
    # We hold the lock
    ...
```

**Trade-offs:**

- Local Redis: fast (microseconds), single node
- IPNS sync: slow (seconds), eventually consistent across nodes
- Duplicate work: accepted (idempotent - same inputs → same CID)

### Redis Usage (minimal)

| Key | Type | Purpose |
|-----|------|---------|
| `artdag:cid_cache` | Hash | cache_id → output CID |
| `artdag:analysis_cache` | Hash | input_hash:features → analysis CID |
| `artdag:plan_cache` | Hash | plan_id → plan CID |
| `artdag:run_cache` | Hash | run_id → output CID |
| `artdag:claim:{cache_id}` | String | worker_id (TTL 5 min) |

## Migration Path

1. Keep current system working ✓
2. Add CID-based tasks ✓
   - `execute_cid.py` ✓
   - `analyze_cid.py` ✓
   - `orchestrate_cid.py` ✓
3. Add `--ipfs-primary` flag to CLI ✓
4. Add hybrid state manager for L1 coordination ✓
5. Gradually deprecate local cache code
6. Remove old tasks when CID versions are stable

## See Also

- [L1_STORAGE.md](L1_STORAGE.md) - Current L1 architecture
- [EXECUTION_MODEL.md](EXECUTION_MODEL.md) - 3-phase model
docs/L1_STORAGE.md (new file, 181 lines)
# L1 Distributed Storage Architecture

This document describes how data is stored when running artdag on L1 (the distributed rendering layer).

## Overview

L1 uses four storage systems working together:

| System | Purpose | Data Stored |
|--------|---------|-------------|
| **Local Cache** | Hot storage (fast access) | Media files, plans, analysis |
| **IPFS** | Durable content-addressed storage | All media outputs |
| **Redis** | Coordination & indexes | Claims, mappings, run status |
| **PostgreSQL** | Metadata & ownership | User data, provenance |

## Storage Flow

When a step executes on L1:

```
1. Executor produces output file
2. Store in local cache (fast)
3. Compute content_hash = SHA3-256(file)
4. Upload to IPFS → get ipfs_cid
5. Update indexes:
   - content_hash → node_id (Redis + local)
   - content_hash → ipfs_cid (Redis + local)
```

Every intermediate step output (SEGMENT, SEQUENCE, etc.) gets its own IPFS CID.
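The flow above can be sketched with in-memory stand-ins for the cache, the two indexes, and the IPFS client. `FakeIPFS` and `store_output` are hypothetical names for illustration, not artdag APIs:

```python
import hashlib

class FakeIPFS:
    """In-memory stand-in for an IPFS client; the 'CID' is just a hash prefix."""
    def __init__(self):
        self.blocks = {}

    def add(self, data: bytes) -> str:
        cid = "bafy-" + hashlib.sha3_256(data).hexdigest()[:16]
        self.blocks[cid] = data
        return cid

def store_output(data, node_id, local_cache, content_index, ipfs_index, ipfs):
    local_cache[node_id] = data                        # 2. store in local cache
    content_hash = hashlib.sha3_256(data).hexdigest()  # 3. compute content_hash
    cid = ipfs.add(data)                               # 4. upload to IPFS
    content_index[content_hash] = node_id              # 5. content_hash → node_id
    ipfs_index[content_hash] = cid                     #    content_hash → ipfs_cid
    return content_hash, cid
```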
## Local Cache

Hot storage on each worker node:

```
cache_dir/
  index.json           # Cache metadata
  content_index.json   # content_hash → node_id
  ipfs_index.json      # content_hash → ipfs_cid
  plans/
    {plan_id}.json     # Cached execution plans
  analysis/
    {hash}.json        # Analysis results
  {node_id}/
    output.mkv         # Media output
    metadata.json      # CacheEntry metadata
```

## IPFS - Durable Media Storage

All media files are stored in IPFS for durability and content addressing.

**Supported pinning providers:**

- Pinata
- web3.storage
- NFT.Storage
- Infura IPFS
- Filebase (S3-compatible)
- Storj (decentralized)
- Local IPFS node

**Configuration:**

```bash
IPFS_API=/ip4/127.0.0.1/tcp/5001  # Local IPFS daemon
```

## Redis - Coordination

Redis handles distributed coordination across workers.

### Key Patterns

| Key | Type | Purpose |
|-----|------|---------|
| `artdag:run:{run_id}` | String | Run status, timestamps, Celery task ID |
| `artdag:content_index` | Hash | content_hash → node_id mapping |
| `artdag:ipfs_index` | Hash | content_hash → ipfs_cid mapping |
| `artdag:claim:{cache_id}` | String | Task claiming (prevents duplicate work) |

### Task Claiming

Lua scripts ensure atomic claiming across workers:

```
Status flow: PENDING → CLAIMED → RUNNING → COMPLETED/CACHED/FAILED
TTL: 5 minutes for claims, 1 hour for results
```

This prevents two workers from executing the same step.
## PostgreSQL - Metadata

Stores ownership, provenance, and sharing metadata.

### Tables

```sql
-- Core cache (shared)
cache_items (content_hash, ipfs_cid, created_at)

-- Per-user ownership
item_types (content_hash, actor_id, type, metadata)

-- Run cache (deterministic identity)
run_cache (
    run_id,          -- SHA3-256(sorted_inputs + recipe)
    output_hash,
    ipfs_cid,
    provenance_cid,
    recipe, inputs, actor_id
)

-- Storage backends
storage_backends (actor_id, provider_type, config, capacity_gb)

-- What's stored where
storage_pins (content_hash, storage_id, ipfs_cid, pin_type)
```
## Cache Lookup Flow

When a worker needs a file:

```
1. Check local cache by cache_id (fastest)
2. Check Redis content_index: content_hash → node_id
3. Check PostgreSQL cache_items
4. Retrieve from IPFS by CID
5. Store in local cache for the next hit
```
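A toy version of this fallback chain, with plain dicts standing in for each store (the `lookup` function is a hypothetical illustration, not an artdag API):

```python
def lookup(cache_id, content_hash, local_cache, content_index, cache_items, ipfs_blocks):
    """Fallback chain: local cache → Redis index → PostgreSQL → IPFS."""
    if cache_id in local_cache:                # 1. local cache by cache_id
        return local_cache[cache_id]
    node_id = content_index.get(content_hash)  # 2. Redis index (node_id == cache_id here)
    cid = cache_items.get(content_hash)        # 3. PostgreSQL cache_items row
    if cid is None:
        return None
    data = ipfs_blocks[cid]                    # 4. retrieve from IPFS by CID
    local_cache[cache_id] = data               # 5. warm the local cache
    return data
```

A second call with the same `cache_id` then hits step 1 and never touches the slower stores.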
## Local vs L1 Comparison

| Feature | Local Testing | L1 Distributed |
|---------|---------------|----------------|
| Local cache | Yes | Yes |
| IPFS | No | Yes |
| Redis | No | Yes |
| PostgreSQL | No | Yes |
| Multi-worker | No | Yes |
| Task claiming | No | Yes (Lua scripts) |
| Durability | Filesystem only | IPFS + PostgreSQL |
## Content Addressing

All storage uses SHA3-256 (quantum-resistant):

- **Files:** `content_hash = SHA3-256(file_bytes)`
- **Computation:** `cache_id = SHA3-256(type + config + input_hashes)`
- **Run identity:** `run_id = SHA3-256(sorted_inputs + recipe)`
- **Plans:** `plan_id = SHA3-256(recipe + inputs + analysis)`

This ensures:

- Same inputs → same outputs (reproducibility)
- Automatic deduplication across workers
- Content verification (tamper detection)
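For instance, the run-identity hash sorts its inputs first, so input order cannot change the `run_id`. A minimal sketch of that property (the exact string encoding is not specified in this document):

```python
import hashlib

def run_id(input_hashes, recipe):
    """run_id is order-independent in its inputs: they are sorted before hashing."""
    payload = "".join(sorted(input_hashes)) + recipe
    return hashlib.sha3_256(payload.encode()).hexdigest()

assert run_id(["h1", "h2"], "beat-cuts") == run_id(["h2", "h1"], "beat-cuts")
```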
## Configuration
|
||||
|
||||
Default locations:
|
||||
|
||||
```bash
|
||||
# Local cache
|
||||
~/.artdag/cache # Default
|
||||
/data/cache # Docker
|
||||
|
||||
# Redis
|
||||
redis://localhost:6379/5
|
||||
|
||||
# PostgreSQL
|
||||
postgresql://user:pass@host/artdag
|
||||
|
||||
# IPFS
|
||||
/ip4/127.0.0.1/tcp/5001
|
||||
```

## See Also

- [OFFLINE_TESTING.md](OFFLINE_TESTING.md) - Local testing without L1
- [EXECUTION_MODEL.md](EXECUTION_MODEL.md) - 3-phase execution model
211
docs/OFFLINE_TESTING.md
Normal file
@@ -0,0 +1,211 @@
# Offline Testing Strategy

This document describes how to test artdag locally without requiring Redis, IPFS, Celery, or any other external distributed infrastructure.

## Overview

The artdag system uses a **3-Phase Execution Model** that enables complete offline testing:

1. **Analysis** - Extract features from input media
2. **Planning** - Generate a deterministic execution plan with pre-computed cache IDs
3. **Execution** - Run plan steps, skipping cached results

This separation allows testing each phase independently and running full pipelines locally.

## Quick Start

Run a full offline test with a video file:

```bash
./examples/test_local.sh ../artdag-art-source/dog.mkv
```

This will:

1. Compute the SHA3-256 hash of the input video
2. Run the `simple_sequence` recipe
3. Store all outputs in `test_cache/`
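
Step 1, hashing the input file, needs nothing beyond the standard library. A sketch (not the script's exact code; the function name is illustrative):

```python
import hashlib


def hash_file(path: str, chunk_size: int = 1 << 20) -> str:
    """SHA3-256 of a file's bytes, read in 1 MiB chunks to handle large videos."""
    h = hashlib.sha3_256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()
```

Chunked reading keeps memory flat even for multi-gigabyte inputs; the resulting hex digest is the content hash passed to the recipe as `<name>:<hash>@<path>`.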

## Test Scripts

### `test_local.sh` - Full Pipeline Test

Location: `./examples/test_local.sh`

Runs the complete artdag pipeline offline with a real video file.

**Usage:**
```bash
./examples/test_local.sh <video_file>
```

**Example:**
```bash
./examples/test_local.sh ../artdag-art-source/dog.mkv
```

**What it does:**
- Computes the content hash of the input video
- Runs `artdag run-recipe` with `simple_sequence.yaml`
- Stores outputs in the `test_cache/` directory
- No external services required

### `test_plan.py` - Planning Phase Test

Location: `./examples/test_plan.py`

Tests the planning phase without requiring any media files.

**Usage:**
```bash
python3 examples/test_plan.py
```

**What it tests:**
- Recipe loading and YAML parsing
- Execution plan generation
- Cache ID computation (deterministic)
- Multi-level parallel step organization
- Human-readable step names
- Multi-output support

**Output:**
- Prints the plan structure to the console
- Saves the full plan to `test_plan_output.json`

### `simple_sequence.yaml` - Sample Recipe

Location: `./examples/simple_sequence.yaml`

A simple recipe for testing that:
- Takes a video input
- Extracts two segments (0-2s and 5-7s)
- Concatenates them with SEQUENCE

## Test Outputs

All test outputs are stored locally and git-ignored:

| Output | Description |
|--------|-------------|
| `test_cache/` | Cached execution results (media files, analysis, plans) |
| `test_cache/plans/` | Cached execution plans by plan_id |
| `test_cache/analysis/` | Cached analysis results by input hash |
| `test_plan_output.json` | Generated execution plan from `test_plan.py` |
## Unit Tests
|
||||
|
||||
The project includes a comprehensive pytest test suite in `tests/`:
|
||||
|
||||
```bash
|
||||
# Run all unit tests
|
||||
pytest
|
||||
|
||||
# Run specific test file
|
||||
pytest tests/test_dag.py
|
||||
pytest tests/test_engine.py
|
||||
pytest tests/test_cache.py
|
||||
```
|

## Testing Each Phase

### Phase 1: Analysis Only

Extract features without full execution:

```bash
python3 -m artdag.cli analyze <recipe> -i <name>:<hash>@<path> --features beats,energy
```

### Phase 2: Planning Only

Generate an execution plan (no media needed):

```bash
python3 -m artdag.cli plan <recipe> -i <name>:<hash>
```

Or use the test script:

```bash
python3 examples/test_plan.py
```

### Phase 3: Execution Only

Execute a pre-generated plan:

```bash
python3 -m artdag.cli execute plan.json
```

Add `--dry-run` to preview what would execute without running anything:

```bash
python3 -m artdag.cli execute plan.json --dry-run
```

## Key Testing Features

### Content Addressing

All nodes have deterministic IDs computed as:

```
SHA3-256(type + config + sorted(input_IDs))
```

Same inputs always produce the same cache IDs, enabling:
- Reproducibility across runs
- Automatic deduplication
- Incremental execution (only changed steps run)

### Local Caching

The `test_cache/` directory stores:

- `plans/{plan_id}.json` - Execution plans (deterministic hash of recipe + inputs + analysis)
- `analysis/{hash}.json` - Analysis results (audio beats, tempo, energy)
- `{cache_id}/output.mkv` - Media outputs from each step

Subsequent test runs automatically skip cached steps. Plans are cached by their `plan_id`, a SHA3-256 hash of the recipe, input hashes, and analysis results, so the same recipe with the same inputs always produces the same plan.
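
The skip logic amounts to one existence check per step. A minimal sketch, assuming the cache layout above; the function name and `execute_fn` callback are hypothetical, not artdag's API:

```python
import os


def run_step(cache_dir: str, cache_id: str, execute_fn) -> str:
    """Run a plan step only if its output is not already cached."""
    out_path = os.path.join(cache_dir, cache_id, "output.mkv")
    if os.path.exists(out_path):
        return out_path                  # cache hit: skip execution entirely
    os.makedirs(os.path.dirname(out_path), exist_ok=True)
    execute_fn(out_path)                 # cache miss: produce the output
    return out_path
```

Because `cache_id` is deterministic, editing one step of a recipe changes only that step's ID (and its downstream dependents), so a re-run re-executes exactly the changed subgraph.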

### No External Dependencies

Offline testing requires only:

- Python 3.9+
- ffmpeg (for media processing)

No Redis, IPFS, Celery, or network access is needed.
## Debugging Tips
|
||||
|
||||
1. **Check cache contents:**
|
||||
```bash
|
||||
ls -la test_cache/
|
||||
ls -la test_cache/plans/
|
||||
```
|
||||
|
||||
2. **View cached plan:**
|
||||
```bash
|
||||
cat test_cache/plans/*.json | python3 -m json.tool | head -50
|
||||
```
|
||||
|
||||
3. **View execution plan structure:**
|
||||
```bash
|
||||
cat test_plan_output.json | python3 -m json.tool
|
||||
```
|
||||
|
||||
4. **Run with verbose output:**
|
||||
```bash
|
||||
python3 -m artdag.cli run-recipe examples/simple_sequence.yaml \
|
||||
-i "video:HASH@path" \
|
||||
--cache-dir test_cache \
|
||||
-v
|
||||
```
|
||||
|
||||
5. **Dry-run to see what would execute:**
|
||||
```bash
|
||||
python3 -m artdag.cli execute plan.json --dry-run
|
||||
```
|
||||
|

## See Also

- [L1_STORAGE.md](L1_STORAGE.md) - Distributed storage on L1 (IPFS, Redis, PostgreSQL)
- [EXECUTION_MODEL.md](EXECUTION_MODEL.md) - 3-phase execution model