Squashed 'core/' content from commit 4957443

git-subtree-dir: core
git-subtree-split: 4957443184ae0eb6323635a90a19acffb3e01d07

docs/EXECUTION_MODEL.md (new file, 384 lines)
# Art DAG 3-Phase Execution Model

## Overview

The execution model separates DAG processing into three distinct phases:

```
Recipe + Inputs → ANALYZE → Analysis Results
                                 ↓
Analysis + Recipe → PLAN → Execution Plan (with cache IDs)
                                 ↓
Execution Plan → EXECUTE → Cached Results
```

This separation enables:

1. **Incremental development** - Re-run recipes without reprocessing unchanged steps
2. **Parallel execution** - Independent steps run concurrently via Celery
3. **Deterministic caching** - Same inputs always produce same cache IDs
4. **Cost estimation** - Plan phase can estimate work before executing
## Phase 1: Analysis

### Purpose

Extract features from input media that inform downstream processing decisions.

### Inputs

- Recipe YAML with input references
- Input media files (by content hash)

### Outputs

Analysis results stored as JSON, keyed by input hash:

```python
@dataclass
class AnalysisResult:
    input_hash: str
    features: Dict[str, Any]

    # Audio features
    beats: Optional[List[float]]      # Beat times in seconds
    downbeats: Optional[List[float]]  # Bar-start times
    tempo: Optional[float]            # BPM
    energy: Optional[List[Tuple[float, float]]]               # (time, value) envelope
    spectrum: Optional[Dict[str, List[Tuple[float, float]]]]  # Band envelopes

    # Video features
    duration: float
    frame_rate: float
    dimensions: Tuple[int, int]
    motion_tempo: Optional[float]     # Estimated BPM from motion
```

### Implementation

```python
class Analyzer:
    def analyze(self, input_hash: str, features: List[str]) -> AnalysisResult:
        """Extract requested features from input."""

    def analyze_audio(self, path: Path) -> AudioFeatures:
        """Extract all audio features using librosa/essentia."""

    def analyze_video(self, path: Path) -> VideoFeatures:
        """Extract video metadata and motion analysis."""
```

### Caching

Analysis results are cached by:

```
analysis_cache_id = SHA3-256(input_hash + sorted(feature_names))
```
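The formula above can be sketched in Python. This is a minimal illustration of the sorted-feature-name idea only; the real implementation and its exact string encoding are not shown in this document:

```python
import hashlib

def analysis_cache_id(input_hash: str, feature_names: list) -> str:
    """SHA3-256 over the input hash plus the sorted feature names."""
    payload = input_hash + "".join(sorted(feature_names))
    return hashlib.sha3_256(payload.encode()).hexdigest()

# Feature order does not matter: sorting gives a stable ID.
a = analysis_cache_id("abc123", ["beats", "tempo"])
b = analysis_cache_id("abc123", ["tempo", "beats"])
assert a == b
```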
## Phase 2: Planning

### Purpose

Convert recipe + analysis into a complete execution plan with pre-computed cache IDs.

### Inputs

- Recipe YAML (parsed)
- Analysis results for all inputs
- Recipe parameters (user-supplied values)

### Outputs

An ExecutionPlan containing ordered steps, each with a pre-computed cache ID:

```python
@dataclass
class ExecutionStep:
    step_id: str               # Unique identifier
    node_type: str             # Primitive type (SOURCE, SEQUENCE, etc.)
    config: Dict[str, Any]     # Node configuration
    input_steps: List[str]     # IDs of steps this depends on
    cache_id: str              # Pre-computed: hash(inputs + config)
    estimated_duration: float  # Optional: for progress reporting

@dataclass
class ExecutionPlan:
    plan_id: str                # Hash of entire plan
    recipe_id: str              # Source recipe
    steps: List[ExecutionStep]  # Topologically sorted
    analysis: Dict[str, AnalysisResult]
    output_step: str            # Final step ID

    def compute_cache_ids(self):
        """Compute all cache IDs in dependency order."""
```

### Cache ID Computation

Cache IDs are computed in topological order so each step's cache ID incorporates its inputs' cache IDs:

```python
def compute_cache_id(step: ExecutionStep, resolved_inputs: Dict[str, str]) -> str:
    """
    Cache ID = SHA3-256(
        node_type +
        canonical_json(config) +
        sorted([input_cache_ids])
    )
    """
    components = [
        step.node_type,
        json.dumps(step.config, sort_keys=True),
        *sorted(resolved_inputs[s] for s in step.input_steps)
    ]
    return sha3_256('|'.join(components))
```

### Plan Generation

The planner expands recipe nodes into concrete steps:

1. **SOURCE nodes** → Direct step with input hash as cache ID
2. **ANALYZE nodes** → Step that references analysis results
3. **TRANSFORM nodes** → Step with static config
4. **TRANSFORM_DYNAMIC nodes** → Expanded to per-frame steps (or use BIND output)
5. **SEQUENCE nodes** → Tree reduction for parallel composition
6. **MAP nodes** → Expanded to N parallel steps + reduction
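As an illustration of item 6, MAP expansion could look like the sketch below. The `Step` dataclass is a simplified stand-in for `ExecutionStep`, and `expand_map` is a hypothetical helper name, not an API from the codebase:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Step:
    """Simplified stand-in for ExecutionStep."""
    step_id: str
    node_type: str
    input_steps: List[str] = field(default_factory=list)

def expand_map(map_id: str, item_ids: List[str]) -> List[Step]:
    """Expand a MAP node: one step per item, plus a final reduction step."""
    per_item = [Step(f"{map_id}_{i}", "TRANSFORM", [item])
                for i, item in enumerate(item_ids)]
    reduce_step = Step(f"{map_id}_reduce", "SEQUENCE",
                       [s.step_id for s in per_item])
    return per_item + [reduce_step]

steps = expand_map("slices", ["video_0", "video_1", "video_2"])
# 3 parallel per-item steps followed by 1 reduction step
```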
### Tree Reduction for Composition

Instead of sequential pairwise composition:

```
A → B → C → D   (3 sequential steps)
```

Use parallel tree reduction:

```
A ─┬─ AB ─┬─ ABCD
B ─┘      │
C ─┬─ CD ─┘
D ─┘

Level 0: [A, B, C, D]  (4 parallel)
Level 1: [AB, CD]      (2 parallel)
Level 2: [ABCD]        (1 final)
```

This reduces O(N) to O(log N) levels.
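The pairing can be sketched as follows (illustrative only: strings stand in for media, so concatenation stands in for composition):

```python
def reduction_levels(items):
    """Pair adjacent items level by level until one result remains."""
    levels = [items]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        levels.append([prev[i] + prev[i + 1] if i + 1 < len(prev) else prev[i]
                       for i in range(0, len(prev), 2)])
    return levels

# ["A", "B", "C", "D"] reduces in two levels: [AB, CD], then [ABCD].
```

An odd leftover item is carried up unchanged, so the scheme works for any N.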
## Phase 3: Execution

### Purpose

Execute the plan, skipping steps with cached results.

### Inputs

- ExecutionPlan with pre-computed cache IDs
- Cache state (which IDs already exist)

### Process

1. **Claim Check**: For each step, atomically check whether the result is cached
2. **Task Dispatch**: Uncached steps are dispatched to Celery workers
3. **Parallel Execution**: Independent steps run concurrently
4. **Result Storage**: Each step stores its result under its cache ID
5. **Progress Tracking**: Real-time status updates

### Hash-Based Task Claiming

Prevents duplicate work when multiple workers process the same plan:

```lua
-- Redis Lua script for atomic claim
local key = KEYS[1]
local data = redis.call('GET', key)
if data then
    local status = cjson.decode(data)
    if status.status == 'running' or
       status.status == 'completed' or
       status.status == 'cached' then
        return 0  -- Already claimed/done
    end
end
local claim_data = ARGV[1]
local ttl = tonumber(ARGV[2])
redis.call('SETEX', key, ttl, claim_data)
return 1  -- Successfully claimed
```

### Celery Task Structure

```python
@app.task(bind=True)
def execute_step(self, step_json: str, plan_id: str) -> dict:
    """Execute a single step with caching."""
    step = ExecutionStep.from_json(step_json)

    # Check cache first
    if cache.has(step.cache_id):
        return {'status': 'cached', 'cache_id': step.cache_id}

    # Try to claim this work
    if not claim_task(step.cache_id, self.request.id):
        # Another worker is handling it; wait for its result
        return wait_for_result(step.cache_id)

    # Do the work
    executor = get_executor(step.node_type)
    input_paths = [cache.get(s) for s in step.input_steps]
    output_path = cache.get_output_path(step.cache_id)

    result_path = executor.execute(step.config, input_paths, output_path)
    cache.put(step.cache_id, result_path)

    return {'status': 'completed', 'cache_id': step.cache_id}
```

### Execution Orchestration

```python
class PlanExecutor:
    def execute(self, plan: ExecutionPlan) -> ExecutionResult:
        """Execute plan with parallel Celery tasks."""

        # Group steps by level (steps at the same level can run in parallel)
        levels = self.compute_dependency_levels(plan.steps)

        for level_steps in levels:
            # Dispatch all uncached steps at this level
            tasks = [
                execute_step.delay(step.to_json(), plan.plan_id)
                for step in level_steps
                if not self.cache.has(step.cache_id)
            ]

            # Wait for level completion
            results = [task.get() for task in tasks]

        return self.collect_results(plan)
```
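`compute_dependency_levels` is not shown in this document; a plain topological-depth grouping consistent with the level-by-level dispatch above could be sketched like this (the `deps` mapping is a simplified input, not the real `plan.steps` structure):

```python
def compute_dependency_levels(deps):
    """Group step IDs by dependency depth.

    `deps` maps step_id -> list of input step IDs. A step's level is
    0 if it has no inputs, else 1 + the max level of its inputs, so
    all steps within one level are mutually independent.
    """
    level_of = {}

    def level(sid):
        if sid not in level_of:
            inputs = deps[sid]
            level_of[sid] = 0 if not inputs else 1 + max(level(d) for d in inputs)
        return level_of[sid]

    for sid in deps:
        level(sid)
    levels = [[] for _ in range(max(level_of.values()) + 1)]
    for sid, lvl in level_of.items():
        levels[lvl].append(sid)
    return levels

dag = {"a": [], "b": [], "ab": ["a", "b"], "final": ["ab"]}
# → [["a", "b"], ["ab"], ["final"]]
```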
## Data Flow Example

### Recipe: beat-cuts

```yaml
nodes:
  - id: music
    type: SOURCE
    config: { input: true }

  - id: beats
    type: ANALYZE
    config: { feature: beats }
    inputs: [music]

  - id: videos
    type: SOURCE_LIST
    config: { input: true }

  - id: slices
    type: MAP
    config: { operation: RANDOM_SLICE }
    inputs:
      items: videos
      timing: beats

  - id: final
    type: SEQUENCE
    inputs: [slices]
```

### Phase 1: Analysis

```python
# Input: music file with hash abc123
analysis = {
    'abc123': AnalysisResult(
        beats=[0.0, 0.48, 0.96, 1.44, ...],
        tempo=125.0,
        duration=180.0
    )
}
```

### Phase 2: Planning

```python
# Expands MAP into concrete steps
plan = ExecutionPlan(
    steps=[
        # Source steps
        ExecutionStep(step_id='music', cache_id='abc123', ...),
        ExecutionStep(step_id='video_0', cache_id='def456', ...),
        ExecutionStep(step_id='video_1', cache_id='ghi789', ...),

        # Slice steps (one per beat group)
        ExecutionStep(step_id='slice_0', cache_id='hash(video_0+timing)', ...),
        ExecutionStep(step_id='slice_1', cache_id='hash(video_1+timing)', ...),
        ...

        # Tree reduction for the sequence
        ExecutionStep(step_id='seq_0_1', input_steps=['slice_0', 'slice_1'], ...),
        ExecutionStep(step_id='seq_2_3', input_steps=['slice_2', 'slice_3'], ...),
        ExecutionStep(step_id='seq_final', input_steps=['seq_0_1', 'seq_2_3'], ...),
    ]
)
```

### Phase 3: Execution

```
Level 0: [music, video_0, video_1]            → all cached (SOURCE)
Level 1: [slice_0, slice_1, slice_2, slice_3] → 4 parallel tasks
Level 2: [seq_0_1, seq_2_3]                   → 2 parallel SEQUENCE tasks
Level 3: [seq_final]                          → 1 final SEQUENCE task
```
## File Structure

```
artdag/
├── artdag/
│   ├── analysis/
│   │   ├── __init__.py
│   │   ├── analyzer.py        # Main Analyzer class
│   │   ├── audio.py           # Audio feature extraction
│   │   └── video.py           # Video feature extraction
│   ├── planning/
│   │   ├── __init__.py
│   │   ├── planner.py         # RecipePlanner class
│   │   ├── schema.py          # ExecutionPlan, ExecutionStep
│   │   └── tree_reduction.py  # Parallel composition optimizer
│   └── execution/
│       ├── __init__.py
│       ├── executor.py        # PlanExecutor class
│       └── claiming.py        # Hash-based task claiming

art-celery/
├── tasks/
│   ├── __init__.py
│   ├── analyze.py             # analyze_inputs task
│   ├── plan.py                # generate_plan task
│   ├── execute.py             # execute_step task
│   └── orchestrate.py         # run_plan (coordinates all)
├── claiming.py                # Redis Lua scripts
└── ...
```

## CLI Interface

```bash
# Full pipeline
artdag run-recipe recipes/beat-cuts/recipe.yaml \
    -i music:abc123 \
    -i videos:def456,ghi789

# Phase by phase
artdag analyze recipes/beat-cuts/recipe.yaml -i music:abc123
# → outputs analysis.json

artdag plan recipes/beat-cuts/recipe.yaml --analysis analysis.json
# → outputs plan.json

artdag execute plan.json
# → runs with caching, skips completed steps

# Dry run (show what would execute)
artdag execute plan.json --dry-run
# → shows which steps are cached vs need execution
```

## Benefits

1. **Development Speed**: Change recipe, re-run → only affected steps execute
2. **Parallelism**: Independent steps run on multiple Celery workers
3. **Reproducibility**: Same inputs + recipe = same cache IDs = same output
4. **Visibility**: Plan shows exactly what will happen before execution
5. **Cost Control**: Estimate compute before committing resources
6. **Fault Tolerance**: Failed runs resume from the last successful step
docs/IPFS_PRIMARY_ARCHITECTURE.md (new file, 443 lines)
# IPFS-Primary Architecture (Sketch)

A simplified L1 architecture for large-scale distributed rendering where IPFS is the primary data store.

## Current vs Simplified

| Component | Current | Simplified |
|-----------|---------|------------|
| Local cache | Custom, per-worker | IPFS node handles it |
| Redis content_index | content_hash → node_id | Eliminated |
| Redis ipfs_index | content_hash → ipfs_cid | Eliminated |
| Step inputs | File paths | IPFS CIDs |
| Step outputs | File path + CID | Just CID |
| Cache lookup | Local → Redis → IPFS | Just IPFS |

## Core Principle

**Steps receive CIDs, produce CIDs. No file paths cross machine boundaries.**

```
Step input:  [cid1, cid2, ...]
Step output: cid_out
```

## Worker Architecture

Each worker runs:

```
┌─────────────────────────────────────┐
│            Worker Node              │
│                                     │
│  ┌───────────┐    ┌──────────────┐  │
│  │  Celery   │────│  IPFS Node   │  │
│  │  Worker   │    │   (local)    │  │
│  └─────┬─────┘    └──────┬───────┘  │
│        │                 │          │
│        │          ┌──────┴─────┐    │
│        │          │   Local    │    │
│        │          │ Blockstore │    │
│        │          └────────────┘    │
│   ┌────┴────┐                       │
│   │  /tmp   │ (ephemeral workspace) │
│   └─────────┘                       │
└──────────────────┬──────────────────┘
                   │ IPFS libp2p
                   ▼
           ┌─────────────┐
           │ Other IPFS  │
           │   Nodes     │
           └─────────────┘
```
## Execution Flow

### 1. Plan Generation (unchanged)

```python
plan = planner.plan(recipe, input_hashes)
# plan.steps[].cache_id = deterministic hash
```

### 2. Input Registration

Before execution, register inputs with IPFS:

```python
input_cids = {}
for name, path in inputs.items():
    cid = ipfs.add(path)
    input_cids[name] = cid

# Plan now carries CIDs
plan.input_cids = input_cids
```

### 3. Step Execution

```python
@celery.task
def execute_step(step_json: str, input_cids: dict[str, str]) -> str:
    """Execute step, return output CID."""
    step = ExecutionStep.from_json(step_json)

    # Check if already computed (by cache_id as IPNS key or DHT lookup)
    existing_cid = ipfs.resolve(f"/ipns/{step.cache_id}")
    if existing_cid:
        return existing_cid

    # Fetch inputs from IPFS → local temp files
    input_paths = []
    for input_step_id in step.input_steps:
        cid = input_cids[input_step_id]
        path = ipfs.get(cid, f"/tmp/{cid}")  # IPFS node caches automatically
        input_paths.append(path)

    # Execute
    output_path = f"/tmp/{step.cache_id}.mkv"
    executor = get_executor(step.node_type)
    executor.execute(step.config, input_paths, output_path)

    # Add output to IPFS
    output_cid = ipfs.add(output_path)

    # Publish cache_id → CID mapping (optional, for cache hits)
    ipfs.name_publish(step.cache_id, output_cid)

    # Clean up temp files
    cleanup_temp(input_paths + [output_path])

    return output_cid
```

### 4. Orchestration

```python
@celery.task
def run_plan(plan_json: str) -> str:
    """Execute plan, return final output CID."""
    plan = ExecutionPlan.from_json(plan_json)

    # CID results accumulate as steps complete
    cid_results = dict(plan.input_cids)

    for level in plan.get_steps_by_level():
        # Parallel execution within the level
        tasks = []
        for step in level:
            step_input_cids = {
                sid: cid_results[sid]
                for sid in step.input_steps
            }
            tasks.append(execute_step.s(step.to_json(), step_input_cids))

        # Wait for level to complete
        results = group(tasks).apply_async().get()

        # Record output CIDs
        for step, cid in zip(level, results):
            cid_results[step.step_id] = cid

    return cid_results[plan.output_step]
```
## What's Eliminated

### No more Redis indexes

```python
# BEFORE: Complex index management
self._set_content_index(content_hash, node_id)   # Redis + local
self._set_ipfs_index(content_hash, ipfs_cid)     # Redis + local
node_id = self._get_content_index(content_hash)  # Check Redis, fall back to local

# AFTER: Just CIDs
output_cid = ipfs.add(output_path)
return output_cid
```

### No more local cache management

```python
# BEFORE: Custom cache with entries, metadata, cleanup
cache.put(node_id, source_path, node_type, execution_time)
cache.get(node_id)
cache.has(node_id)
cache.cleanup_lru()

# AFTER: IPFS handles it
ipfs.add(path)  # Store
ipfs.get(cid)   # Retrieve (cached by IPFS node)
ipfs.pin(cid)   # Keep permanently
ipfs.gc()       # Clean up unpinned content
```

### No more content_hash vs node_id confusion

```python
# BEFORE: Two identifiers
content_hash = sha3_256(file_bytes)  # What the file IS
node_id = cache_id                   # What computation produced it
# Indexes are needed to map between them

# AFTER: One identifier
cid = ipfs.add(file)  # Content-addressed, includes the hash
# The CID IS the identifier
```

## Cache Hit Detection

Three options:

### Option A: IPNS (mutable names)

```python
# Publish: cache_id → CID
ipfs.name_publish(key=cache_id, value=output_cid)

# Look up before executing
existing = ipfs.name_resolve(cache_id)
if existing:
    return existing  # Cache hit
```

### Option B: DHT record

```python
# Store in DHT: cache_id → CID
ipfs.dht_put(cache_id, output_cid)

# Lookup
existing = ipfs.dht_get(cache_id)
```

### Option C: Redis (minimal)

Keep Redis just for the cache_id → CID mapping:

```python
# Store
redis.hset("artdag:cache", cache_id, output_cid)

# Lookup
existing = redis.hget("artdag:cache", cache_id)
```

This is simpler than the current approach: one hash, one mapping, no content_hash/node_id confusion.
## Claiming (Preventing Duplicate Work)

Redis is still needed for atomic claiming:

```python
# Claim before executing
claimed = redis.set(f"artdag:claim:{cache_id}", worker_id, nx=True, ex=300)
if not claimed:
    # Another worker is doing it - wait for its result
    return wait_for_result(cache_id)
```

Or use IPFS pubsub for coordination.
## Data Flow Diagram

```
        ┌─────────────┐
        │   Recipe    │
        │  + Inputs   │
        └──────┬──────┘
               │
               ▼
        ┌─────────────┐
        │   Planner   │
        │  (compute   │
        │  cache_ids) │
        └──────┬──────┘
               │
               ▼
┌─────────────────────────────────┐
│  ExecutionPlan                  │
│  - steps with cache_ids         │
│  - input_cids (from ipfs.add)   │
└────────────────┬────────────────┘
                 │
    ┌────────────┼────────────┐
    ▼            ▼            ▼
┌────────┐   ┌────────┐   ┌────────┐
│Worker 1│   │Worker 2│   │Worker 3│
│        │   │        │   │        │
│  IPFS  │◄──│  IPFS  │◄──│  IPFS  │
│  Node  │──►│  Node  │──►│  Node  │
└───┬────┘   └───┬────┘   └───┬────┘
    │            │            │
    └────────────┼────────────┘
                 │
                 ▼
        ┌─────────────┐
        │  Final CID  │
        │  (output)   │
        └─────────────┘
```

## Benefits

1. **Simpler code** - No custom cache, no dual indexes
2. **Automatic distribution** - IPFS handles replication
3. **Content verification** - CIDs are self-verifying
4. **Scalable** - Adding workers adds IPFS nodes and cache capacity
5. **Resilient** - Any node can serve any content

## Tradeoffs

1. **IPFS dependency** - Every worker needs an IPFS node
2. **Initial fetch latency** - The first fetch may be slower than local disk
3. **IPNS latency** - Name resolution can be slow (Option C avoids this)
## Trust Domains (Cluster Key)

Systems can share work through IPFS, but how do you trust them?

**Problem:** A malicious system could return wrong CIDs for computed steps.

**Solution:** A cluster key creates isolated trust domains:

```bash
export ARTDAG_CLUSTER_KEY="my-secret-shared-key"
```

**How it works:**

- The cluster key is mixed into all cache_id computations
- Systems with the same key produce the same cache_ids
- Systems with different keys have separate cache namespaces
- Only share the key with trusted partners

```
cache_id = SHA3-256(cluster_key + node_type + config + inputs)
```

**Trust model:**

| Scenario | Same Key? | Can Share Work? |
|----------|-----------|-----------------|
| Same organization | Yes | Yes |
| Trusted partner | Yes (shared) | Yes |
| Unknown system | No | No (different cache_ids) |

**Configuration:**

```yaml
# docker-compose.yml
environment:
  - ARTDAG_CLUSTER_KEY=your-secret-key-here
```

**Programmatic:**

```python
from artdag.planning.schema import set_cluster_key
set_cluster_key("my-secret-key")
```
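The key mixing can be sketched as follows. This is a minimal illustration of the namespace-separation idea; the real encoding inside `artdag.planning.schema` is not shown in this document:

```python
import hashlib
import json

def cache_id(cluster_key, node_type, config, input_ids):
    """Mix the cluster key into the cache-ID hash so that different
    keys yield disjoint cache namespaces."""
    payload = "|".join([cluster_key, node_type,
                        json.dumps(config, sort_keys=True), *sorted(input_ids)])
    return hashlib.sha3_256(payload.encode()).hexdigest()

# Same key → same ID; different key → different ID, so no sharing.
same_key = cache_id("key-a", "SEQUENCE", {}, ["x"]) == cache_id("key-a", "SEQUENCE", {}, ["x"])
other_key = cache_id("key-a", "SEQUENCE", {}, ["x"]) == cache_id("key-b", "SEQUENCE", {}, ["x"])
assert same_key and not other_key
```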
## Implementation

The simplified architecture is implemented in `art-celery/`:

| File | Purpose |
|------|---------|
| `hybrid_state.py` | Hybrid state manager (Redis + IPNS) |
| `tasks/execute_cid.py` | Step execution with CIDs |
| `tasks/analyze_cid.py` | Analysis with CIDs |
| `tasks/orchestrate_cid.py` | Full pipeline orchestration |

### Key Functions

**Registration (local → IPFS):**

- `register_input_cid(path)` → `{cid, content_hash}`
- `register_recipe_cid(path)` → `{cid, name, version}`

**Analysis:**

- `analyze_input_cid(input_cid, input_hash, features)` → `{analysis_cid}`

**Planning:**

- `generate_plan_cid(recipe_cid, input_cids, input_hashes, analysis_cids)` → `{plan_cid}`

**Execution:**

- `execute_step_cid(step_json, input_cids)` → `{cid}`
- `execute_plan_from_cid(plan_cid, input_cids)` → `{output_cid}`

**Full Pipeline:**

- `run_recipe_cid(recipe_cid, input_cids, input_hashes)` → `{output_cid, all_cids}`
- `run_from_local(recipe_path, input_paths)` → registers + runs

### Hybrid State Manager

For distributed L1 coordination, use the `HybridStateManager`, which provides:

**Fast path (local Redis):**

- `get_cached_cid(cache_id)` / `set_cached_cid(cache_id, cid)` - microsecond lookups
- `try_claim(cache_id, worker_id)` / `release_claim(cache_id)` - atomic claiming
- `get_analysis_cid()` / `set_analysis_cid()` - analysis cache
- `get_plan_cid()` / `set_plan_cid()` - plan cache
- `get_run_cid()` / `set_run_cid()` - run cache

**Slow path (background IPNS sync):**

- Periodically syncs local state with global IPNS state (default: every 30s)
- Pulls new entries from remote nodes
- Pushes local updates to IPNS

**Configuration:**

```bash
# Enable IPNS sync
export ARTDAG_IPNS_SYNC=true
export ARTDAG_IPNS_SYNC_INTERVAL=30  # seconds
```

**Usage:**

```python
from hybrid_state import get_state_manager

state = get_state_manager()

# Fast local lookup
cid = state.get_cached_cid(cache_id)

# Fast local write (synced in the background)
state.set_cached_cid(cache_id, output_cid)

# Atomic claim
if state.try_claim(cache_id, worker_id):
    # We hold the lock
    ...
```

**Trade-offs:**

- Local Redis: fast (microseconds), single node
- IPNS sync: slow (seconds), eventually consistent across nodes
- Duplicate work: accepted (idempotent - same inputs → same CID)

### Redis Usage (minimal)

| Key | Type | Purpose |
|-----|------|---------|
| `artdag:cid_cache` | Hash | cache_id → output CID |
| `artdag:analysis_cache` | Hash | input_hash:features → analysis CID |
| `artdag:plan_cache` | Hash | plan_id → plan CID |
| `artdag:run_cache` | Hash | run_id → output CID |
| `artdag:claim:{cache_id}` | String | worker_id (TTL 5 min) |

## Migration Path

1. Keep current system working ✓
2. Add CID-based tasks ✓
   - `execute_cid.py` ✓
   - `analyze_cid.py` ✓
   - `orchestrate_cid.py` ✓
3. Add `--ipfs-primary` flag to CLI ✓
4. Add hybrid state manager for L1 coordination ✓
5. Gradually deprecate local cache code
6. Remove old tasks when CID versions are stable

## See Also

- [L1_STORAGE.md](L1_STORAGE.md) - Current L1 architecture
- [EXECUTION_MODEL.md](EXECUTION_MODEL.md) - 3-phase model
docs/L1_STORAGE.md (new file, 181 lines)
# L1 Distributed Storage Architecture

This document describes how data is stored when running artdag on L1 (the distributed rendering layer).

## Overview

L1 uses four storage systems working together:

| System | Purpose | Data Stored |
|--------|---------|-------------|
| **Local Cache** | Hot storage (fast access) | Media files, plans, analysis |
| **IPFS** | Durable content-addressed storage | All media outputs |
| **Redis** | Coordination & indexes | Claims, mappings, run status |
| **PostgreSQL** | Metadata & ownership | User data, provenance |

## Storage Flow

When a step executes on L1:

```
1. Executor produces output file
2. Store in local cache (fast)
3. Compute content_hash = SHA3-256(file)
4. Upload to IPFS → get ipfs_cid
5. Update indexes:
   - content_hash → node_id (Redis + local)
   - content_hash → ipfs_cid (Redis + local)
```

Every intermediate step output (SEGMENT, SEQUENCE, etc.) gets its own IPFS CID.
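The flow above can be sketched with in-memory stand-ins for the cache, the two indexes, and the IPFS client. `FakeIPFS` and `store_output` are hypothetical names for illustration, not artdag APIs:

```python
import hashlib

class FakeIPFS:
    """In-memory stand-in for an IPFS client; the 'CID' is just a hash prefix."""
    def __init__(self):
        self.blocks = {}

    def add(self, data: bytes) -> str:
        cid = "bafy-" + hashlib.sha3_256(data).hexdigest()[:16]
        self.blocks[cid] = data
        return cid

def store_output(data, node_id, local_cache, content_index, ipfs_index, ipfs):
    local_cache[node_id] = data                        # 2. store in local cache
    content_hash = hashlib.sha3_256(data).hexdigest()  # 3. compute content_hash
    cid = ipfs.add(data)                               # 4. upload to IPFS
    content_index[content_hash] = node_id              # 5. content_hash → node_id
    ipfs_index[content_hash] = cid                     #    content_hash → ipfs_cid
    return content_hash, cid
```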
## Local Cache

Hot storage on each worker node:

```
cache_dir/
  index.json           # Cache metadata
  content_index.json   # content_hash → node_id
  ipfs_index.json      # content_hash → ipfs_cid
  plans/
    {plan_id}.json     # Cached execution plans
  analysis/
    {hash}.json        # Analysis results
  {node_id}/
    output.mkv         # Media output
    metadata.json      # CacheEntry metadata
```

## IPFS - Durable Media Storage

All media files are stored in IPFS for durability and content addressing.

**Supported pinning providers:**

- Pinata
- web3.storage
- NFT.Storage
- Infura IPFS
- Filebase (S3-compatible)
- Storj (decentralized)
- Local IPFS node

**Configuration:**

```bash
IPFS_API=/ip4/127.0.0.1/tcp/5001  # Local IPFS daemon
```

## Redis - Coordination

Redis handles distributed coordination across workers.

### Key Patterns

| Key | Type | Purpose |
|-----|------|---------|
| `artdag:run:{run_id}` | String | Run status, timestamps, Celery task ID |
| `artdag:content_index` | Hash | content_hash → node_id mapping |
| `artdag:ipfs_index` | Hash | content_hash → ipfs_cid mapping |
| `artdag:claim:{cache_id}` | String | Task claiming (prevents duplicate work) |

### Task Claiming

Lua scripts ensure atomic claiming across workers:

```
Status flow: PENDING → CLAIMED → RUNNING → COMPLETED/CACHED/FAILED
TTL: 5 minutes for claims, 1 hour for results
```

This prevents two workers from executing the same step.
## PostgreSQL - Metadata

Stores ownership, provenance, and sharing metadata.

### Tables

```sql
-- Core cache (shared)
cache_items (content_hash, ipfs_cid, created_at)

-- Per-user ownership
item_types (content_hash, actor_id, type, metadata)

-- Run cache (deterministic identity)
run_cache (
    run_id,          -- SHA3-256(sorted_inputs + recipe)
    output_hash,
    ipfs_cid,
    provenance_cid,
    recipe, inputs, actor_id
)

-- Storage backends
storage_backends (actor_id, provider_type, config, capacity_gb)

-- What's stored where
storage_pins (content_hash, storage_id, ipfs_cid, pin_type)
```
## Cache Lookup Flow

When a worker needs a file:

```
1. Check local cache by cache_id (fastest)
2. Check Redis content_index: content_hash → node_id
3. Check PostgreSQL cache_items
4. Retrieve from IPFS by CID
5. Store in local cache for the next hit
```
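A toy version of this fallback chain, with plain dicts standing in for each store (the `lookup` function is a hypothetical illustration, not an artdag API):

```python
def lookup(cache_id, content_hash, local_cache, content_index, cache_items, ipfs_blocks):
    """Fallback chain: local cache → Redis index → PostgreSQL → IPFS."""
    if cache_id in local_cache:                # 1. local cache by cache_id
        return local_cache[cache_id]
    node_id = content_index.get(content_hash)  # 2. Redis index (node_id == cache_id here)
    cid = cache_items.get(content_hash)        # 3. PostgreSQL cache_items row
    if cid is None:
        return None
    data = ipfs_blocks[cid]                    # 4. retrieve from IPFS by CID
    local_cache[cache_id] = data               # 5. warm the local cache
    return data
```

A second call with the same `cache_id` then hits step 1 and never touches the slower stores.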
## Local vs L1 Comparison

| Feature | Local Testing | L1 Distributed |
|---------|---------------|----------------|
| Local cache | Yes | Yes |
| IPFS | No | Yes |
| Redis | No | Yes |
| PostgreSQL | No | Yes |
| Multi-worker | No | Yes |
| Task claiming | No | Yes (Lua scripts) |
| Durability | Filesystem only | IPFS + PostgreSQL |
## Content Addressing

All storage uses SHA3-256 (quantum-resistant):

- **Files:** `content_hash = SHA3-256(file_bytes)`
- **Computation:** `cache_id = SHA3-256(type + config + input_hashes)`
- **Run identity:** `run_id = SHA3-256(sorted_inputs + recipe)`
- **Plans:** `plan_id = SHA3-256(recipe + inputs + analysis)`

This ensures:

- Same inputs → same outputs (reproducibility)
- Automatic deduplication across workers
- Content verification (tamper detection)
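For instance, the run-identity hash sorts its inputs first, so input order cannot change the `run_id`. A minimal sketch of that property (the exact string encoding is not specified in this document):

```python
import hashlib

def run_id(input_hashes, recipe):
    """run_id is order-independent in its inputs: they are sorted before hashing."""
    payload = "".join(sorted(input_hashes)) + recipe
    return hashlib.sha3_256(payload.encode()).hexdigest()

assert run_id(["h1", "h2"], "beat-cuts") == run_id(["h2", "h1"], "beat-cuts")
```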
## Configuration
|
||||
|
||||
Default locations:
|
||||
|
||||
```bash
|
||||
# Local cache
|
||||
~/.artdag/cache # Default
|
||||
/data/cache # Docker
|
||||
|
||||
# Redis
|
||||
redis://localhost:6379/5
|
||||
|
||||
# PostgreSQL
|
||||
postgresql://user:pass@host/artdag
|
||||
|
||||
# IPFS
|
||||
/ip4/127.0.0.1/tcp/5001
|
||||
```

## See Also

- [OFFLINE_TESTING.md](OFFLINE_TESTING.md) - Local testing without L1
- [EXECUTION_MODEL.md](EXECUTION_MODEL.md) - 3-phase execution model
211
docs/OFFLINE_TESTING.md
Normal file
@@ -0,0 +1,211 @@
# Offline Testing Strategy

This document describes how to test artdag locally without requiring Redis, IPFS, Celery, or any other external distributed infrastructure.

## Overview

The artdag system uses a **3-Phase Execution Model** that enables complete offline testing:

1. **Analysis** - Extract features from input media
2. **Planning** - Generate a deterministic execution plan with pre-computed cache IDs
3. **Execution** - Run plan steps, skipping cached results

This separation allows testing each phase independently and running full pipelines locally.

## Quick Start

Run a full offline test with a video file:

```bash
./examples/test_local.sh ../artdag-art-source/dog.mkv
```

This will:

1. Compute the SHA3-256 hash of the input video
2. Run the `simple_sequence` recipe
3. Store all outputs in `test_cache/`
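
Step 1, hashing the input file, needs nothing beyond the standard library. A sketch (not the script's exact code; the function name is illustrative):

```python
import hashlib


def hash_file(path: str, chunk_size: int = 1 << 20) -> str:
    """SHA3-256 of a file's bytes, read in 1 MiB chunks to handle large videos."""
    h = hashlib.sha3_256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()
```

Chunked reading keeps memory flat even for multi-gigabyte inputs; the resulting hex digest is the content hash passed to the recipe as `<name>:<hash>@<path>`.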

## Test Scripts

### `test_local.sh` - Full Pipeline Test

Location: `./examples/test_local.sh`

Runs the complete artdag pipeline offline with a real video file.

**Usage:**
```bash
./examples/test_local.sh <video_file>
```

**Example:**
```bash
./examples/test_local.sh ../artdag-art-source/dog.mkv
```

**What it does:**
- Computes the content hash of the input video
- Runs `artdag run-recipe` with `simple_sequence.yaml`
- Stores outputs in the `test_cache/` directory
- No external services required

### `test_plan.py` - Planning Phase Test

Location: `./examples/test_plan.py`

Tests the planning phase without requiring any media files.

**Usage:**
```bash
python3 examples/test_plan.py
```

**What it tests:**
- Recipe loading and YAML parsing
- Execution plan generation
- Cache ID computation (deterministic)
- Multi-level parallel step organization
- Human-readable step names
- Multi-output support

**Output:**
- Prints the plan structure to the console
- Saves the full plan to `test_plan_output.json`

### `simple_sequence.yaml` - Sample Recipe

Location: `./examples/simple_sequence.yaml`

A simple recipe for testing that:
- Takes a video input
- Extracts two segments (0-2s and 5-7s)
- Concatenates them with SEQUENCE

## Test Outputs

All test outputs are stored locally and git-ignored:

| Output | Description |
|--------|-------------|
| `test_cache/` | Cached execution results (media files, analysis, plans) |
| `test_cache/plans/` | Cached execution plans by plan_id |
| `test_cache/analysis/` | Cached analysis results by input hash |
| `test_plan_output.json` | Generated execution plan from `test_plan.py` |
## Unit Tests
|
||||
|
||||
The project includes a comprehensive pytest test suite in `tests/`:
|
||||
|
||||
```bash
|
||||
# Run all unit tests
|
||||
pytest
|
||||
|
||||
# Run specific test file
|
||||
pytest tests/test_dag.py
|
||||
pytest tests/test_engine.py
|
||||
pytest tests/test_cache.py
|
||||
```
|

## Testing Each Phase

### Phase 1: Analysis Only

Extract features without full execution:

```bash
python3 -m artdag.cli analyze <recipe> -i <name>:<hash>@<path> --features beats,energy
```

### Phase 2: Planning Only

Generate an execution plan (no media needed):

```bash
python3 -m artdag.cli plan <recipe> -i <name>:<hash>
```

Or use the test script:

```bash
python3 examples/test_plan.py
```

### Phase 3: Execution Only

Execute a pre-generated plan:

```bash
python3 -m artdag.cli execute plan.json
```

Add `--dry-run` to preview what would execute without running anything:

```bash
python3 -m artdag.cli execute plan.json --dry-run
```

## Key Testing Features

### Content Addressing

All nodes have deterministic IDs computed as:

```
SHA3-256(type + config + sorted(input_IDs))
```

Same inputs always produce the same cache IDs, enabling:
- Reproducibility across runs
- Automatic deduplication
- Incremental execution (only changed steps run)

### Local Caching

The `test_cache/` directory stores:

- `plans/{plan_id}.json` - Execution plans (deterministic hash of recipe + inputs + analysis)
- `analysis/{hash}.json` - Analysis results (audio beats, tempo, energy)
- `{cache_id}/output.mkv` - Media outputs from each step

Subsequent test runs automatically skip cached steps. Plans are cached by their `plan_id`, a SHA3-256 hash of the recipe, input hashes, and analysis results, so the same recipe with the same inputs always produces the same plan.
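
The skip logic amounts to one existence check per step. A minimal sketch, assuming the cache layout above; the function name and `execute_fn` callback are hypothetical, not artdag's API:

```python
import os


def run_step(cache_dir: str, cache_id: str, execute_fn) -> str:
    """Run a plan step only if its output is not already cached."""
    out_path = os.path.join(cache_dir, cache_id, "output.mkv")
    if os.path.exists(out_path):
        return out_path                  # cache hit: skip execution entirely
    os.makedirs(os.path.dirname(out_path), exist_ok=True)
    execute_fn(out_path)                 # cache miss: produce the output
    return out_path
```

Because `cache_id` is deterministic, editing one step of a recipe changes only that step's ID (and its downstream dependents), so a re-run re-executes exactly the changed subgraph.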

### No External Dependencies

Offline testing requires only:

- Python 3.9+
- ffmpeg (for media processing)

No Redis, IPFS, Celery, or network access is needed.
## Debugging Tips
|
||||
|
||||
1. **Check cache contents:**
|
||||
```bash
|
||||
ls -la test_cache/
|
||||
ls -la test_cache/plans/
|
||||
```
|
||||
|
||||
2. **View cached plan:**
|
||||
```bash
|
||||
cat test_cache/plans/*.json | python3 -m json.tool | head -50
|
||||
```
|
||||
|
||||
3. **View execution plan structure:**
|
||||
```bash
|
||||
cat test_plan_output.json | python3 -m json.tool
|
||||
```
|
||||
|
||||
4. **Run with verbose output:**
|
||||
```bash
|
||||
python3 -m artdag.cli run-recipe examples/simple_sequence.yaml \
|
||||
-i "video:HASH@path" \
|
||||
--cache-dir test_cache \
|
||||
-v
|
||||
```
|
||||
|
||||
5. **Dry-run to see what would execute:**
|
||||
```bash
|
||||
python3 -m artdag.cli execute plan.json --dry-run
|
||||
```
|
||||
|

## See Also

- [L1_STORAGE.md](L1_STORAGE.md) - Distributed storage on L1 (IPFS, Redis, PostgreSQL)
- [EXECUTION_MODEL.md](EXECUTION_MODEL.md) - 3-phase execution model