





I built a 17-stage pipeline that compiles an 8-minute short film from a single JSON schema — no cameras, no crew, no manual editing
The movie is no longer the final video file. The movie is the code that generates it.
The result: The Lone Crab — an 8-minute AI-generated short film about a solitary crab navigating a vast ocean floor. Every shot, every sound effect, every second of silence was governed by a master JSON schema and executed by autonomous AI models.
The idea: I wanted to treat filmmaking the way software engineers treat compilation. You write source code (a structured schema defining story beats, character traits, cinematic specs, director rules), you run a compiler (a 17-phase pipeline of specialized AI "skills"), and out comes a binary (a finished film). If the output fails QA — a shot is too short, the runtime falls below the floor, narration bleeds into a silence zone — the pipeline rejects the compile and regenerates.
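The reject-and-regenerate loop can be sketched in a few lines. This is illustrative only: the function names and the runtime-only gate are my stand-ins, not the pipeline's real code.

```python
# Hedged sketch of the compile-QA loop. generate_cut stands in for the
# whole 17-phase pipeline; passes_qa is reduced to a single gate.
import random

def generate_cut(seed: int) -> dict:
    # Stand-in for the generative pipeline: returns a candidate compile.
    rng = random.Random(seed)
    return {"runtime_s": rng.uniform(400, 500)}

def passes_qa(cut: dict) -> bool:
    # Simplified gate: reject any compile under the 432 s runtime floor.
    return cut["runtime_s"] >= 432.0

def compile_film(max_attempts: int = 10) -> dict:
    # Failed compiles loop back to regeneration with a fresh seed.
    for attempt in range(max_attempts):
        cut = generate_cut(seed=attempt)
        if passes_qa(cut):
            return cut
    raise RuntimeError("QA rejected every compile")

final = compile_film()
```

The point of the pattern: the schema and gates are fixed, while each attempt is a fresh non-deterministic render.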
How it works:
The master schema defines everything:
- Story structure: 7 beats mapped across 480 seconds with an emotional tension curve. Beat 1 (0–60s) is "The Vast and Empty Floor" — wonder/setup. Beat 6 (370–430s) is "The Crevice" — climax of shelter. Each beat has a target duration range and an emotional register.
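A minimal sketch of how a beat map like this could be encoded and queried. Only beats 1 and 6 carry names from the post; the field names and lookup function are assumptions.

```python
# Illustrative beat map: only beats 1 and 6 are quoted from the schema
# described above, and the dict shape is an assumption.
BEATS = [
    {"id": 1, "name": "The Vast and Empty Floor", "range_s": (0, 60),
     "register": "wonder/setup"},
    {"id": 6, "name": "The Crevice", "range_s": (370, 430),
     "register": "climax of shelter"},
]

def beat_for_timestamp(t: float):
    """Return the beat whose time window contains timestamp t (seconds)."""
    for beat in BEATS:
        lo, hi = beat["range_s"]
        if lo <= t < hi:
            return beat
    return None  # gap between beats, or outside the 480 s runtime
```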
- Character locking: The crab's identity is maintained across all 48 shots without a 3D rig. Exact string fragments — "mottled grey-brown-ochre carapace", "compound eyes on mobile eyestalks", "asymmetric claws", "worn larger claw tip" — are injected into every prompt at weight 1.0. A minimum similarity score of 0.85 enforces frame-to-frame coherence.
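The locking mechanism reduces to string injection plus a threshold check. A sketch: the trait fragments are quoted from above, but the weighting syntax and function names are my assumptions (real weight syntax depends on the video model's prompt format).

```python
# Trait fragments quoted from the master schema; everything else here
# (function names, "(text:weight)" syntax) is an illustrative assumption.
LOCKED_TRAITS = [
    "mottled grey-brown-ochre carapace",
    "compound eyes on mobile eyestalks",
    "asymmetric claws",
    "worn larger claw tip",
]
MIN_SIMILARITY = 0.85

def inject_traits(shot_prompt: str, weight: float = 1.0) -> str:
    # Append every locked fragment at the fixed weight to every prompt.
    weighted = ", ".join(f"({t}:{weight})" for t in LOCKED_TRAITS)
    return f"{shot_prompt}, {weighted}"

def frame_ok(similarity: float) -> bool:
    # Frame-to-frame coherence gate.
    return similarity >= MIN_SIMILARITY
```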
- Cinematic spec: Each shot carries a JSON object specifying shot type (EWS, macro, medium), camera angle, focal length in mm, aperture, and camera movement. Example:
  { "shotType": "EWS", "cameraAngle": "high_angle", "focalLengthMm": 18, "aperture": 5.6, "cameraMovement": "static" }
  This translates to extreme wide framing, a high overhead perspective, ultra-wide spatial distortion from the 18 mm lens, deep focus at f/5.6, and absolute locked-off stillness.
- Director rules: A config encoding the auteur's voice. Must-avoid list: anthropomorphism, visible sky/surface, musical crescendos, handheld camera shake. Camera language: static or slow dolly; macro for intimacy (2–5 cm above the floor), extreme wide for existential scale. Performance direction for the voiceover: unhurried warm tenor, pauses earn more than emphasis, max 135 WPM.
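The spec-to-prompt translation can be illustrated with a toy mapping. The vocabulary tables below are invented for the sketch; the real compiler's phrasing isn't public.

```python
# Toy translation of the per-shot JSON spec into prompt language.
# The mapping tables are illustrative assumptions.
SHOT_TYPE_TEXT = {
    "EWS": "extreme wide shot, existential scale",
    "macro": "macro shot, 2-5 cm above the sea floor",
    "medium": "medium shot",
}
MOVEMENT_TEXT = {
    "static": "locked-off static camera",
    "slow_dolly": "slow dolly movement",
}

def spec_to_prompt(spec: dict) -> str:
    parts = [
        SHOT_TYPE_TEXT[spec["shotType"]],
        spec["cameraAngle"].replace("_", " "),
        f'{spec["focalLengthMm"]}mm lens at f/{spec["aperture"]}',
        MOVEMENT_TEXT[spec["cameraMovement"]],
    ]
    return ", ".join(parts)

example = spec_to_prompt({
    "shotType": "EWS", "cameraAngle": "high_angle",
    "focalLengthMm": 18, "aperture": 5.6, "cameraMovement": "static",
})
```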
- Automated rule enforcement: Raw AI outputs pass through three gates before approval. (1) Pacing Filter: rejects cuts shorter than 2.0 s or holds longer than 75.0 s. (2) Runtime Floor: rejects any compile falling below 432 s. (3) The Silence Protocol: forces voiceOver.presenceInRange = false during the sand-crossing scene. Failures loop back to regeneration.
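The three gates reduce to simple predicates. A hedged sketch, assuming shot durations in seconds and voiceover segments with start/end timestamps (the field names are mine, not the pipeline's):

```python
# The three QA gates as predicates. Thresholds come from the post;
# the data shapes (duration lists, segment dicts) are assumptions.
MIN_CUT_S, MAX_HOLD_S = 2.0, 75.0
RUNTIME_FLOOR_S = 432.0

def pacing_ok(shot_durations: list) -> bool:
    # Gate 1: no cut shorter than 2.0 s, no hold longer than 75.0 s.
    return all(MIN_CUT_S <= d <= MAX_HOLD_S for d in shot_durations)

def runtime_ok(shot_durations: list) -> bool:
    # Gate 2: total runtime must clear the 432 s floor.
    return sum(shot_durations) >= RUNTIME_FLOOR_S

def silence_ok(voiceover_segments: list, silence_zone: tuple) -> bool:
    # Gate 3: no narration segment may overlap the enforced-silence window.
    lo, hi = silence_zone
    return all(seg["end_s"] <= lo or seg["start_s"] >= hi
               for seg in voiceover_segments)
```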
The generation stack:
- Video: Runway (s14-vidgen), dispatched via a prompt assembly engine (s15-prompt-composer) that concatenates environment base + character traits + cinematic spec + action context + director's rules into a single optimized string.
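A toy version of that fixed concatenation order. The segment contents below are placeholders, not the engine's real strings.

```python
# Sketch of the s15-prompt-composer assembly order described above:
# environment base + character traits + cinematic spec + action context
# + director's rules, concatenated into one string.
def compose_prompt(environment: str, traits: list, cinematic: str,
                   action: str, director_rules: str) -> str:
    segments = [environment, ", ".join(traits), cinematic, action,
                director_rules]
    return ". ".join(s for s in segments if s)

prompt = compose_prompt(
    environment="vast desaturated kelp forest floor",
    traits=["mottled grey-brown-ochre carapace", "asymmetric claws"],
    cinematic="extreme wide shot, 18mm, f/5.6, static camera",
    action="the crab crosses an open sand plain",
    director_rules="no anthropomorphism, no visible surface, no camera shake",
)
```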
- Voiceover: ElevenLabs. An observational tenor, parsed into precise script segments and capped at 135 WPM.
- Score: Procedural drone tones and processed ocean harmonics. No melodies, no percussion. Target loudness: −22 LUFS for score, −14 LUFS for final master.
- SFX/Foley: 33 audio assets ranging from "Fish School Pass — Water Displacement" to "Crab Claw Touch — Coral Contact" to "Trench Organism Bioluminescent Pulse". Each tagged with emotional descriptors (indifferent, fluid, eerie, alien, tentative, wonder).
The color system:
Three zones tied to narrative arc:
- Zone 1 (Scenes 001–003, The Kelp Forest): desaturated blue-grey with green-gold kelp accents, true blacks. Palette: desaturated aquamarine.
- Zone 2 (Scenes 004–006, The Dark Trench): near-monochrome blue-black, grain and noise embraced, crushed shadows. Palette: near-monochrome deep blue-black.
- Zone 3 (Scenes 007–008, The Coral Crevice): rich bioluminescent violet-cyan-amber, lifted blacks, first unmistakable appearance of warmth. Palette: bioluminescent jewel-toned.
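The zone boundaries map cleanly onto scene numbers. A small illustrative lookup (the function and dict shape are mine):

```python
# Scene-to-color-zone mapping from the three zones described above.
COLOR_ZONES = {
    range(1, 4): "Zone 1: desaturated aquamarine (Kelp Forest)",
    range(4, 7): "Zone 2: near-monochrome deep blue-black (Dark Trench)",
    range(7, 9): "Zone 3: bioluminescent jewel-toned (Coral Crevice)",
}

def zone_for_scene(scene: int) -> str:
    for scenes, palette in COLOR_ZONES.items():
        if scene in scenes:
            return palette
    raise ValueError(f"scene {scene} outside the 8-scene structure")
```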
Pipeline stats:
- 828.5k tokens consumed (594.6k in, 233.9k out)
- 17 skills executed
- 139.7 minutes of compute time
- 48 shots generated
- 33 audio assets
- 70 reference images
- Target runtime: 8:00 (480 s ± 48 s tolerance)
Deliverable specs: 1080p, 24fps, sRGB color space, −14 LUFS (optimized for YouTube playback), minimum consistency score 0.85.
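A hedged sketch of a final deliverable check against these specs, assuming a simple render-metadata dict; the ±1 LUFS tolerance is my assumption, not a stated spec.

```python
# Deliverable gate against the published specs. The render dict shape
# and the loudness tolerance are illustrative assumptions.
SPEC = {"height": 1080, "fps": 24, "color_space": "sRGB",
        "target_lufs": -14.0, "min_consistency": 0.85}

def meets_spec(render: dict, lufs_tol: float = 1.0) -> bool:
    return (render["height"] == SPEC["height"]
            and render["fps"] == SPEC["fps"]
            and render["color_space"] == SPEC["color_space"]
            and abs(render["lufs"] - SPEC["target_lufs"]) <= lufs_tol
            and render["consistency"] >= SPEC["min_consistency"])
```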
The entire thing is deterministic in intent but non-deterministic in execution — every re-compile produces a different film that still obeys the same structural rules. The schema is the movie. The video is just one rendering of it.
I'm happy to answer questions about the schema design, the prompt assembly logic, the QA loop, or anything else. The deck with all the architecture diagrams is in the video description.
----
YouTube - The Lone Crab -> https://youtu.be/da_HKDNIlqA
YouTube - The concept I am building -> https://youtu.be/qDVnLq4027w