App 06 · Video Universe · WAVGEN

AI VIDEO WORLD

The Frontier of the Moving Image

The image that never existed. The frame that cannot exist again.

Diffusion Steps · Latent Space · Guidance Scale · Seed

THE DIFFUSION ENGINE

STEP (0–50) 0

LATENT SPACE

The diffusion process starts from pure Gaussian noise and removes it iteratively, guided by a network trained to predict the noise at each step. Each denoising step refines the latent representation toward a coherent image. The latent space (left) is the compressed representation in which generation occurs before the result is decoded to pixels.
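The loop behind the step slider can be sketched in a few lines. This is a toy under loud assumptions: `toy_denoise` and `distance` are hypothetical names, the "latent" is a short list of numbers, and the denoiser is an oracle that already knows the clean target, where a real model would use a trained network to predict the noise.

```python
import random

def toy_denoise(target, steps=50, seed=0):
    """Toy stand-in for a diffusion sampler (not a real model)."""
    rng = random.Random(seed)
    # Step 0: the latent is pure Gaussian noise.
    latent = [rng.gauss(0.0, 1.0) for _ in target]
    history = [list(latent)]
    for step in range(steps):
        remaining = steps - step
        # ASSUMPTION: an oracle "denoiser" that knows the clean target.
        # A real model predicts this noise with a trained network.
        predicted_noise = [x - t for x, t in zip(latent, target)]
        # Remove an equal share of the remaining noise at each step.
        latent = [x - n / remaining for x, n in zip(latent, predicted_noise)]
        history.append(list(latent))
    return history

def distance(a, b):
    """Euclidean distance between two latents."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

if __name__ == "__main__":
    target = [0.5, -1.0, 2.0]
    for step, latent in enumerate(toy_denoise(target, steps=50)):
        if step % 10 == 0:
            print(step, round(distance(latent, target), 4))
```

Run as a script, it prints the distance to the target every ten steps. The distance shrinks at every step and reaches zero at the last one, which is the behavior the slider visualizes.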

Section 02

Prompt Lab

Professional AI video prompts are structured taxonomies — not sentences. Build one below.

Subject

a lone figure · a burning city · a chrome robot

Action

walking slowly · dissolving into light · rotating

Style / Reference

cinematic, 35mm film grain · Tarkovsky slow cinema · Studio Ghibli watercolor

Camera / Lighting

dolly shot · golden hour · extreme close-up

COMBINED PROMPT OUTPUT

PROMPT QUALITY COMPARISON

VAGUE PROMPT

DETAILED PROMPT

AI video generation responds to specificity. "A man walking" produces generic results. "35mm, slow dolly, golden hour, lone silhouette, Terrence Malick" produces intention. Every field in your prompt is a constraint that narrows the generation space toward your vision.
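Assembling the four fields into one prompt can be sketched as follows. The function name `build_prompt`, the field order, and the comma separator are assumptions for illustration; the page does not state how it joins the fields.

```python
def build_prompt(subject, action, style, camera):
    """Join the four Prompt Lab fields into one comma-separated prompt.

    ASSUMPTION: field order and separator are illustrative choices.
    Empty fields are skipped so they leave no stray commas.
    """
    fields = [subject, action, style, camera]
    return ", ".join(f.strip() for f in fields if f and f.strip())

vague = build_prompt("a man walking", "", "", "")
detailed = build_prompt(
    "a lone figure",
    "walking slowly",
    "cinematic, 35mm film grain",
    "dolly shot",
)
print(vague)     # a man walking
print(detailed)  # a lone figure, walking slowly, cinematic, 35mm film grain, dolly shot
```

Each filled field adds one more constraint to the string the model receives; an empty field contributes nothing.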

Section 03

The Coherence Problem

At the cinema-standard 24 frames per second, AI video must keep every frame consistent: each frame has to agree with every other frame about who, what, and where.

6-FRAME SEQUENCE

COHERENCE 100

Approaches to the Coherence Problem

① Video-to-Video: Inherit Coherence from Source

Start from a real video clip and use AI to restyle it — adding artistic filters, changing the visual style, or altering surface details. Since the source video already handles temporal consistency (real physics, real camera), the AI only needs to maintain style coherence, not structural coherence. Used in tools like Runway ML's Video-to-Video and EbSynth.

② Conditioning Frames: Keyframe Interpolation

Provide explicit keyframes (start frame, middle frame, end frame) and let the AI interpolate the motion between them. The model is constrained to begin and end at your specified images, dramatically reducing drift. Used by Pika Labs and Stable Video Diffusion's keyframe interpolation modes.

③ ControlNet / Pose Guidance: Structural Constraints

Extract depth maps, pose skeletons, or edge maps from each frame and feed them to the model as structural conditioning during generation. The output is steered to match that structure even as surface details vary. Pose-guided generation helps keep a character's body proportions consistent across frames even when appearance changes.

The coherence problem is the central unsolved challenge of AI video. An image generator produces one frame. A video generator must produce 24 per second — all consistent with each other about who the character is, where the light sources are, and what just happened in the previous frame.
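The scale of the problem, and one crude way to put a number on it, can be sketched as below. Both function names are hypothetical, and the score is only a frame-difference proxy: it treats any change between consecutive frames as incoherence, so it penalizes legitimate motion too. Real evaluations typically compensate for motion first; nothing here is the page's own meter.

```python
def frame_count(seconds, fps=24):
    """Number of frames a clip of the given length needs."""
    return round(seconds * fps)

def coherence_score(frames):
    """Hypothetical 0-100 score; 100 means consecutive frames are identical.

    Each frame is a flat list of pixel intensities in [0, 1].
    ASSUMPTION: mean absolute frame-to-frame change stands in for drift.
    """
    if len(frames) < 2:
        return 100.0
    total = 0.0
    for prev, cur in zip(frames, frames[1:]):
        total += sum(abs(a - b) for a, b in zip(prev, cur)) / len(cur)
    mean_change = total / (len(frames) - 1)
    return 100.0 * (1.0 - min(mean_change, 1.0))

print(frame_count(5))                            # 120
print(coherence_score([[0.5, 0.5]] * 6))         # 100.0
print(coherence_score([[0, 0], [1, 1], [0, 0]])) # 0.0
```

A five-second clip already needs 120 mutually consistent frames. A six-frame sequence that never changes scores 100, and one that flips every pixel on every frame scores 0.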

Section 04

Deepfake Detection Clinic

Knowing how deepfakes work is essential for media literacy. This section teaches detection, not creation.

Detection Checklist

01 Blinking: AI faces often don't blink naturally; timing is off or absent
02 Hair edges: strands blur or merge with background at boundaries
03 Ears: frequently inconsistent, warped, or wrong shape
04 Teeth: weird geometry, extra teeth, unnatural gum color
05 Neck lighting: mismatch between face and neck illumination
06 Background bleed: warping or color bleeding at face boundaries
07 Eye highlights: reflections don't match actual light source direction
08 Temporal drift: face subtly changes between video frames

Deepfake Timeline

2017

Reddit user "deepfakes" releases faceswap tool — term becomes generic

2018–19

GAN-based face synthesis reaches photorealism; detection race begins

2020–21

Commercial tools democratize creation; more US states pass deepfake legislation

2022–23

Diffusion models surpass GANs; real-time deepfake on consumer hardware

2024

C2PA provenance standards, AI watermarking, EU AI Act enters into force

Spot the Fake — Quiz

Click the image you think is the deepfake.

Score: 0 / 3

Option A

Option B

Round 1 of 3

Section 05

Motion Brush Simulator

Paint motion vectors onto regions of an image — then preview the directed animation.

■ Red=Right ■ Blue=Left ■ Green=Up ■ Yellow=Down

MOTION BRUSH CANVAS — CLICK & DRAG TO PAINT

WITHOUT Motion Brush

Entire scene moves as one with camera pan — no independent element control.

WITH Motion Brush

Painted regions move independently — person steps forward while trees sway separately.

Motion brush is how tools like Runway and Kling let creators direct AI video generation. Instead of hoping the AI guesses your intent, you explicitly paint where and how motion should occur, which turns motion from something the model guesses into something you direct.
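The color legend above maps directly to a motion field. The sketch below is an illustration only: `BRUSH_VECTORS` and `motion_field` are hypothetical names, the canvas is a small grid of color labels, and image coordinates are assumed, so "up" is negative y. How a real tool feeds such a field into its model is not covered here.

```python
# Legend from the canvas: Red=Right, Blue=Left, Green=Up, Yellow=Down.
# ASSUMPTION: image coordinates, so y grows downward and "up" is -1.
BRUSH_VECTORS = {
    "red": (1, 0),
    "blue": (-1, 0),
    "green": (0, -1),
    "yellow": (0, 1),
}

def motion_field(painted, speed=1.0):
    """Turn a grid of brush colors (or None) into per-cell (dx, dy) vectors.

    Unpainted cells get (0.0, 0.0): no requested motion.
    """
    field = []
    for row in painted:
        field.append([
            tuple(speed * c for c in BRUSH_VECTORS[color]) if color else (0.0, 0.0)
            for color in row
        ])
    return field

canvas = [
    ["red", None],
    ["green", "blue"],
]
for row in motion_field(canvas, speed=2.0):
    print(row)
```

Each painted region carries its own vector, which is what lets a person step forward while the trees sway separately.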

Section 06

NeRF Explorer

Neural Radiance Fields reconstruct 3D scenes from 2D photographs — enabling novel view synthesis from any angle.

NERF SCENE

CAMERA ANGLE 0°

8 Training Views

NeRF is trained on a fixed set of photographs. The camera ring shows the 8 capture positions. Novel views (between cameras) are synthesized — quality degrades at extreme novel angles.
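How far a requested view sits from the nearest training camera can be sketched as below. The sketch assumes the 8 cameras are evenly spaced around the ring (every 45°), which the page implies but does not state, and both function names are hypothetical. The gap is only a rough proxy for how much the view must be synthesized rather than recalled.

```python
def training_angles(n_views=8):
    """Camera angles in degrees, ASSUMING even spacing around the ring."""
    return [i * 360.0 / n_views for i in range(n_views)]

def nearest_view_gap(angle, n_views=8):
    """Smallest angular distance (degrees) from `angle` to any training camera."""
    gaps = []
    for cam in training_angles(n_views):
        diff = abs(angle - cam) % 360.0
        # Take the short way around the circle.
        gaps.append(min(diff, 360.0 - diff))
    return min(gaps)

print(training_angles())       # [0.0, 45.0, 90.0, 135.0, 180.0, 225.0, 270.0, 315.0]
print(nearest_view_gap(0))     # 0.0
print(nearest_view_gap(22.5))  # 22.5
print(nearest_view_gap(350))   # 10.0
```

With 8 evenly spaced cameras, no view on the ring is ever more than 22.5° from a training photograph.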

NeRF → Gaussian Splatting (2023)

3D Gaussian Splatting (2023, Kerbl et al.) replaced NeRF for most real-time applications by using explicit 3D Gaussian primitives instead of an implicit neural field. Each Gaussian has a 3D position, a shape (scale and rotation), an opacity, and a view-dependent color. Rendering projects ("splats") the Gaussians onto the image and alpha-blends them in depth order, with no per-sample network queries, which makes it on the order of 100× faster than the original NeRF. Reconstruction quality is comparable, with real-time playback on a consumer GPU. NeRF remains relevant for research; Gaussian Splatting dominates production.

NeRF captured research imagination in 2020 by synthesizing photorealistic novel views from as few as 20 photographs. It represented a paradigm shift: a neural network is the 3D scene — not a mesh or point cloud, but a function that maps 3D position + view direction to color and density.
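The "scene as a function" idea can be sketched by marching a ray through such a function and compositing what it returns. Everything named here is hypothetical: `toy_field` is a hand-written stand-in for the trained network (a solid red unit sphere), and `render_ray` uses the standard sample-and-accumulate quadrature with evenly spaced samples, a simplification of what real implementations do.

```python
import math

def toy_field(position, view_dir):
    """ASSUMPTION: stand-in for the trained network, a solid red unit sphere.

    Maps 3D position + view direction to (rgb, density).
    This toy ignores view_dir; a real field uses it for reflections.
    """
    x, y, z = position
    inside = x * x + y * y + z * z <= 1.0
    density = 5.0 if inside else 0.0
    return (1.0, 0.0, 0.0), density

def render_ray(origin, direction, field, near=0.0, far=6.0, n_samples=64):
    """Composite color along one ray; returns (rgb, opacity)."""
    delta = (far - near) / n_samples
    color = [0.0, 0.0, 0.0]
    transmittance = 1.0  # fraction of light not yet absorbed
    for i in range(n_samples):
        t = near + (i + 0.5) * delta
        point = tuple(o + t * d for o, d in zip(origin, direction))
        rgb, density = field(point, direction)
        # Opacity contributed by this short segment of the ray.
        alpha = 1.0 - math.exp(-density * delta)
        for c in range(3):
            color[c] += transmittance * alpha * rgb[c]
        transmittance *= 1.0 - alpha
    return tuple(color), 1.0 - transmittance

hit = render_ray((0.0, 0.0, -3.0), (0.0, 0.0, 1.0), toy_field)
miss = render_ray((3.0, 0.0, -3.0), (0.0, 0.0, 1.0), toy_field)
print(hit)
print(miss)
```

The ray through the sphere comes back almost fully red and opaque; the ray that misses returns black with zero opacity. Rendering a novel view means repeating this for every pixel of the new camera.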

Section 07

The Ethics Board

Every synthetic media decision is a moral choice. Four realistic scenarios — no single correct answer provided. Make your choice, see the consequences.

Card 01 · The Historical Speech

You have high-quality audio of a politician from 20 years ago saying things that now seem out of context. You could use AI video to create a plausible-looking speech of them saying those words in a modern setting to "restore context" for a documentary.

Card 02 · The Memorial

A family lost a loved one. They have home video footage and want an AI to recreate the person's voice and appearance to say goodbye at a private memorial service. The person left no instructions about posthumous AI use.

Card 03 · The Visual Effects

A studio wants to use an actor's likeness from 10 years ago to de-age them digitally in a new film. The actor has personally agreed and is paid. However, their performers' union has not yet approved AI likeness use in contracts.

Card 04 · The Satire

You want to create clearly-labeled satirical AI video of a public figure saying absurd things for a comedy show. The content would be labeled "PARODY — AI GENERATED" on screen throughout. The figure is a living elected official.

"Every synthetic media decision is a moral choice. The tools are neutral. The intention and the disclosure are everything."

Section 08

Artists & Studios

The people and organizations defining the frontier of AI video.

Runway ML

AI Video Pioneer Studio

Creators of Gen-1, Gen-2, and Gen-3 Alpha. Among the first to offer commercial text-to-video and video-to-video tools at production quality. Acts as the bridge between AI research and working filmmakers.

"We're building the next generation of storytelling tools."

Pika Labs

Creative AI Video Startup

Rapid iteration on consumer-accessible AI video. Known for Pika 1.0's motion brush and lip sync features. Made high-quality text-to-video available to non-technical creators via Discord and web app.

Democratization as design principle.

Sora / OpenAI

Text-to-Video Research

Sora (2024) demonstrated video generation that OpenAI framed as a step toward world simulation: largely coherent physics, camera motion, and long-form temporal consistency. Trained on video represented as spacetime patches rather than as sequences of independent frames.

Video generation as physics simulation.

Google DeepMind

Lumiere · Veo Research

Lumiere introduced space-time diffusion for full-clip generation. Veo (2024) extended to high-resolution, multi-shot cinematic sequences with director-level prompt adherence and reference-image conditioning.

Research at cinema resolution.

Alexander Mordvintsev

DeepDream Creator · Google

Created DeepDream (2015), the first widely seen neural network visual art. Showed that the features a trained CNN has learned can be amplified into hallucinatory imagery, planting the seed for the generative AI art movement that followed.

The network dreams what it has learned to see.

Holly Herndon & Mat Dryhurst

AI Ethics in Art · Spawn AI

Created Spawn (2019), an AI trained only on collaborative community data with explicit consent. Championed "data dignity": the principle that artists must have rights over how their work trains AI systems.

"The dataset is the politics."

Refik Anadol

AI Data Sculpture · MoMA

Creates large-scale architectural AI installations — "Machine Hallucinations" trained on city datasets. Brought AI generative art to major institutions including MoMA. Frames AI as a medium for collective memory.

"Data is the new pigment."

Grimes

AI Voice Licensing · Aurora AI

Publicly licensed her voice model for fan use (elf.tech), splitting royalties 50/50. Pioneer of voluntary AI licensing as an artist business model. Aurora AI is her synthetic voice released as creative commons for non-commercial use.

Consent + compensation as the framework.

Glossary

22 essential terms in AI video and generative media.

Conditioning

Providing reference signals (images, poses, depth) to constrain AI generation toward a specific output.

ControlNet

Neural network architecture that adds structural constraints (edges, pose, depth) to diffusion model outputs.

Deepfake

AI-synthesized video of a real person saying or doing something they did not say or do.

Diffusion Model

Generative model that learns to reverse a noise-addition process, producing images by iterative denoising.

GAN

Generative Adversarial Network — two competing networks: a generator creating images and a discriminator judging them.

Gaussian Splatting

3D scene representation using explicit Gaussian primitives; enables real-time novel view synthesis.

Guidance Scale

Parameter controlling how strongly the model follows the text prompt vs. producing varied outputs.

Image-to-Video

Generating a video clip from a single still image, animating it with plausible motion.

Inpainting

Filling a masked region of an image or video with AI-generated content that matches the surrounding context.

Latent Space

Compressed abstract representation space where diffusion models generate before decoding to pixel space.

LoRA

Low-Rank Adaptation: efficient fine-tuning method that trains small low-rank weight updates on top of a frozen model, letting it learn a new style or subject from a small dataset.

Motion Brush

Tool for painting directional motion vectors onto image regions to direct AI video generation.

NeRF

Neural Radiance Field — implicit neural representation of a 3D scene enabling novel view synthesis.

Noise Schedule

The function controlling how much noise is added at each diffusion timestep during training and inference.

Outpainting

Extending an image beyond its original borders using AI generation that matches the existing content.

Prompt Engineering

The practice of crafting text inputs to AI models to produce desired outputs reliably.

Seed

Random number initializing the noise used in generation. Same seed + same prompt + same model and settings = same output.

Temporal Consistency

The property of a video where all frames agree on the scene's contents, lighting, and character appearance.

Text-to-Video

Generating video directly from a text description, without a source image or video.

Training Data

The dataset of images, videos, or text used to train a generative model.

VAE

Variational Autoencoder — encoder/decoder architecture used to compress images into latent space and back.

Video-to-Video

Using AI to restyle or transform an existing video clip, inheriting its temporal coherence.
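Two of the glossary entries, Seed and Guidance Scale, reduce to a few lines of arithmetic. The sketch below is a hedged illustration: the function names are hypothetical, Python's `random` module stands in for a real framework's generator, and the guidance formula shown is classifier-free guidance, a common formulation that this page does not itself name.

```python
import random

def initial_noise(seed, size=4):
    """Seed: the same seed always yields the same starting noise."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(size)]

def guided_prediction(uncond, cond, guidance_scale):
    """Guidance Scale, ASSUMING classifier-free guidance.

    Starts from the prompt-free prediction and moves toward (and, for
    scales above 1, past) the prompt-conditioned one.
    """
    return [u + guidance_scale * (c - u) for u, c in zip(uncond, cond)]

print(initial_noise(42) == initial_noise(42))  # True
print(initial_noise(42) == initial_noise(43))  # False

uncond = [0.0, 1.0]  # hypothetical prediction without the prompt
cond = [1.0, 1.0]    # hypothetical prediction with the prompt
print(guided_prediction(uncond, cond, 0.0))  # [0.0, 1.0]
print(guided_prediction(uncond, cond, 1.0))  # [1.0, 1.0]
print(guided_prediction(uncond, cond, 7.5))  # [7.5, 1.0]
```

A scale of 0 ignores the prompt, 1 follows the conditioned prediction as is, and larger values exaggerate whatever the prompt changed, which is why high guidance trades variety for prompt adherence.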