Overview
Kling 3.0 is a unified multimodal video model. It understands cinematic direction, not keyword lists. Write prompts like a director β describe what the audience sees, hears, and feels over time.
Core shift: Description β Direction. Think "direct a scene" not "describe an image."
Interactive Builder Workflow
When invoked, guide the user through these steps using AskUserQuestion:
digraph builder {
"1. Generation mode?" [shape=diamond];
"Text-to-Video" [shape=box];
"Image-to-Video" [shape=box];
"Multi-Shot Sequence" [shape=box];
"Keyframe Transition" [shape=box];
"2. Gather scene details" [shape=box];
"3. Assemble prompt" [shape=box];
"4. Present & refine" [shape=box];
"1. Generation mode?" -> "Text-to-Video";
"1. Generation mode?" -> "Image-to-Video";
"1. Generation mode?" -> "Multi-Shot Sequence";
"1. Generation mode?" -> "Keyframe Transition";
"Text-to-Video" -> "2. Gather scene details";
"Image-to-Video" -> "2. Gather scene details";
"Multi-Shot Sequence" -> "2. Gather scene details";
"Keyframe Transition" -> "2. Gather scene details";
"2. Gather scene details" -> "3. Assemble prompt";
"3. Assemble prompt" -> "4. Present & refine";
}
Step 1: Determine Generation Mode
Ask the user which mode:
- Text-to-Video β prompt from scratch
- Image-to-Video β animate a reference image
- Multi-Shot Sequence β 2-6 shot storyboard (up to 15s)
- Keyframe Transition β start frame β end frame with interpolated motion
Step 2: Gather Scene Details
Ask about each element (adapt questions to mode):
| Element |
Question |
Why it matters |
| Subject |
Who/what is the focus? Specific appearance details? |
Anchors consistency β define distinguishing traits early |
| Action |
What happens? Describe the timeline (first β then β finally) |
Kling 3.0 excels at sequential action over 15s arcs |
| Environment |
Where? Be specific (not "a street" but "narrow Tokyo alley, steam from grates") |
Grounds the scene physically |
| Camera |
Shot type and movement? (See camera reference below) |
Cinematic language produces far better results |
| Lighting |
What light sources? Name them specifically |
"Flickering neon" beats "dramatic lighting" |
| Mood/Emotion |
What should the audience feel? |
Drives color grade, pacing, music |
| Audio |
Dialogue? Ambient sound? Music? |
Kling 3.0 generates native audio + lip-sync |
| Duration |
How long? (3-15s) |
Longer = describe progression over time |
| Aspect Ratio |
16:9 / 9:16 / 1:1 / 21:9? |
16:9 cinematic, 9:16 social, 21:9 ultra-wide |
Image-to-Video: Focus on how the scene evolves from the image β movement, camera motion, environmental change. The model preserves identity/layout from the source.
Keyframes: Ask for start and end frame descriptions. Frames should match in color, style, and lighting. Prompt sparingly β Kling infers motion well.
Multi-Shot: Define each shot separately with its own framing, subject, action, and duration. Label shots explicitly.
Step 3: Assemble the Prompt
Use the Master Formula:
[Scene/Environment] + [Subject & Appearance] + [Action Timeline] + [Camera Movement] + [Audio & Atmosphere] + [Technical Specs]
Writing rules:
- Use cinematic motion verbs: dolly push, whip-pan, crash zoom, rack focus, tracking shot β NOT "moves" or "goes"
- Name real light sources: neon signs, candlelight, golden hour, LED panels β NOT "dramatic lighting"
- Include texture for credibility: grain, lens flares, condensation, fabric sheen, smoke, sweat
- Describe temporal flow: beginning β middle β end
- Keep to 1-3 rich sentences per shot (specificity > length)
- For dialogue: use character labels, assign voice tone/emotion, use transitional words ("Immediately," "Pause")
Step 4: Present & Refine
Present the assembled prompt. Ask if they want to:
- Adjust any element
- Add a negative prompt
- Generate variations (different duration, different camera, different mood)
Quick Reference
Camera Movements
| Movement |
Effect |
Example phrase |
| Dolly push-in |
Builds intimacy/tension |
"slow dolly push-in toward her face" |
| Dolly zoom |
Vertigo/dramatic reveal |
"dolly zoom creating disorienting depth shift" |
| Tracking shot |
Follows subject laterally |
"camera tracks alongside as she walks" |
| Whip-pan |
Energy/surprise |
"whip-pan to reveal the door" |
| Crash zoom |
Shock/emphasis |
"sudden crash zoom on the object" |
| Rack focus |
Shift attention |
"rack focus from foreground hand to background figure" |
| Handheld/shoulder-cam |
Raw/documentary feel |
"handheld shoulder-cam with subtle sway" |
| Static tripod |
Composed/observational |
"locked-off static tripod, wide shot" |
| FPV drone |
High-energy immersion |
"dynamic FPV drone shot chasing through corridor" |
| Low-angle tracking |
Heroic/imposing |
"low-angle tracking shot, subject towers above" |
| Truck left/right |
Lateral reveal |
"camera trucks right revealing the cityscape" |
| Tilt up/down |
Vertical reveal |
"slow tilt up from boots to face" |
Lens & Film Stock
| Phrase |
Effect |
| "Shot on 35mm film" |
Warm grain, organic texture |
| "Macro 85mm lens" |
Tight detail, shallow depth of field |
| "Wide-angle steadicam" |
Smooth, immersive, spatial |
| "Handheld camcorder" |
Raw VHS energy, nostalgic |
| "Anamorphic lens flare" |
Cinematic horizontal streaks |
Lighting
Use specific sources, not adjectives:
- "Golden hour sun cutting through dusty warehouse windows"
- "Flickering neon casting magenta/cyan across wet pavement"
- "Single bare bulb swinging, casting moving shadows"
- "Cool blue LED panels reflecting off glass surfaces"
- "Candlelight warming skin tones, deep shadows beyond"
Color & Grade
- "Desaturated teal grade, crushed blacks"
- "Amber nightclub strobe cutting through smoke"
- "Cool blue haze filling the corridor"
- "Magenta neon reflecting off wet asphalt"
- "Overexposed highlights, blown-out whites"
Multi-Character Dialogue
| Rule |
Do |
Don't |
| Name characters |
[Character A: Silver-haired CEO] |
[Man] says... |
| Anchor to action |
Agent slams table. [Agent, angrily]: "Where is it?" |
Just dialogue without visual action |
| Assign voice tone |
[CEO, deep authoritative gravelly voice] |
Generic "says" |
| Control timing |
"Immediately," "Pause," "After a beat" |
Back-to-back dialogue without transitions |
Multi-Shot Structure
Shot 1 (0-5s): [Wide establishing shot description]
Shot 2 (5-10s): [Medium/close-up with action progression]
Shot 3 (10-15s): [Resolution/reaction with camera payoff]
Atmosphere: [Overall mood, color grade]
Audio: [Sound design, music, dialogue]
Label every shot. Assign durations. Describe framing + subject + motion per shot.
Start & End Frame Tips
- Frames should match in color palette, style, and lighting
- Identical start/end frames = seamless loop
- Prompt sparingly β Kling infers motion between frames well
- Simple camera directions: zoom in/out, pan left/right, tilt up/down
- 5s for dynamic transitions, 10s for complex transformations
- Start frame aspect ratio drives the whole clip
Negative Prompts
Use to prevent common AI defaults:
smiling, laughing, cartoonish, bright saturated colors, low resolution,
morphing, blurry text, disfigured hands, extra fingers, static pose,
frozen expression, stock photo aesthetic
Customize based on scene β remove items that conflict with your intent.
Weak β Strong
| Element |
Weak |
Strong |
| Camera |
"Camera follows person" |
"Handheld shoulder-cam drifts behind subject with subtle sway" |
| Subject |
"A woman walking" |
"Woman in red dress, heels clicking wet cobblestone" |
| Environment |
"In a city" |
"Narrow Tokyo alley, steam from grates, glowing vending machines" |
| Lighting |
"Dramatic lighting" |
"Flickering neon casting magenta/cyan across wet pavement" |
| Texture |
"It looks realistic" |
"Rain beading on leather jacket, condensation on glass, visible breath" |
| Motion |
"She walks away" |
"She turns slowly, hair catches light, disappears around corner" |
Common Mistakes
| Mistake |
Fix |
| Keyword lists instead of scene direction |
Write like directing a shot: subject + action + camera + environment |
| Vague motion ("moves," "goes") |
Use cinematic verbs: dolly, track, whip-pan, crash zoom |
| Generic lighting ("dramatic") |
Name the source: neon, candle, golden hour, LED panel |
| Overlong prompts |
1-3 rich sentences per shot; specificity > length |
| No temporal progression |
Describe beginning β middle β end of the shot |
| Mismatched keyframes |
Match color, lighting, and style between start/end frames |
| Unattributed dialogue |
Label every speaker with name, tone, and emotion |
| Cramming multi-shot into one paragraph |
Separate and label each shot with duration |