Veo 3/3.1: pro JSON prompting, step-by-step

1) Start with intent + constraints (front-load the “what”). Veo weights early words heavily; say the core shot, subject, and action first, then style and camera. ([Reddit][1])
2) Keep scenes atomic. Use short, single-action scenes (micro-beats) and control continuity with explicit carry-overs (lighting, costume, props). Google’s latest guide emphasizes directing scenes and consistency. ([Google Cloud][2])
3) Unify style & tone. Avoid mixed styles; declare one visual grammar (e.g., “photorealistic, cinematic tone”). Keep the prompt concise (3–6 sentences per scene) to prevent dilution. ([Reddit][3])
4) Be explicit about camera & framing. Name the shot type, lens feel, motion, and framing (“medium shot… slow push-in… keep full subject in frame”). ([Reddit][1])
5) Lock continuity with references (“ingredients”). Use start/end frames or reference images to stabilize characters/props; Veo 3.1 strengthens this workflow. ([The Verge][4])
6) Describe audio beats (if your toolchain supports it). Add succinct sound cues to match emotional beats (e.g., “subtle HVAC hum, soft shoe squeak”). ([The Verge][4])
7) Keep it short. Overlong prompts get partially ignored; structure, don’t stuff. ([Reddit][3])
8) Cross-check with Google’s official prompt guides. They show what Veo expects and how to tweak for Vertex/Flow; a worked one-scene example follows this list. ([Google Cloud][5])
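
Applied together, the steps above compress into a single-scene prompt like this (an illustrative example, not taken from the cited guides):

```
Medium shot of a late-20s professional at a minimal studio desk, setting a
small aluminum cube down gently. Photorealistic, cinematic tone; soft key
light from screen-left. Slow push-in; keep the full subject in frame.
Audio: quiet room tone, a soft placement thud.
```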
A reusable Veo JSON “meta-prompt” blueprint
Paste this and fill the blanks for each project. It mirrors how pros on Vertex/Flow break direction into controllable blocks. ([Google Cloud][5])
```json
{
  "version": "veo-3.1",
  "seed": 42,
  "output": { "duration_sec": 10, "fps": 24, "resolution": "1080p" },
  "global_style": {
    "look": "photorealistic with cinematic tone",
    "color": "rich but natural contrast, soft highlight rolloff",
    "mood": "focused, confident, modern",
    "safety": "no celebrities, no trademarks"
  },
  "continuity": {
    "characters": [
      {
        "id": "host",
        "description": "late-20s professional, charcoal blazer, clean desk setup",
        "reference_images": ["host_front.jpg", "host_threequarter.jpg"]
      }
    ],
    "props": ["laptop: matte black", "product: polished aluminum cube"],
    "lighting": "consistent key from screen-left, 5600K, softbox feel"
  },
  "scenes": [
    {
      "id": "s1",
      "duration_sec": 3.5,
      "shot": {
        "type": "medium shot",
        "framing": "keep full subject shoulders-up centered",
        "camera": "slow push-in 5% over duration, slight parallax"
      },
      "action": "host sets product cube gently on desk, dust motes visible",
      "environment": "minimal studio office; bokeh city lights in background",
      "lighting": "soft key from left, dim rim from right, subtle reflections on cube",
      "style": "clean tech aesthetic",
      "audio": "quiet HVAC hum, soft object placement thud"
    },
    {
      "id": "s2",
      "duration_sec": 3.5,
      "shot": {
        "type": "macro insert",
        "framing": "product centered; maintain logo-free surfaces",
        "camera": "linear left-to-right slider 10cm, micro-rack-focus to front edge"
      },
      "action": "cube’s edge catches light; internal details sparkle",
      "lighting": "specular highlight sweep across bevel",
      "audio": "gentle high-frequency shimmer"
    },
    {
      "id": "s3",
      "duration_sec": 3,
      "shot": {
        "type": "wide establishing",
        "framing": "host at desk, product mid-foreground",
        "camera": "locked-off; add subtle handheld micro-jitter 1%"
      },
      "action": "host nods; UI-like reflections glide on product surface",
      "lighting": "same setup as s1 for continuity",
      "audio": "soft whoosh as reflections pass; room tone continues"
    }
  ],
  "notes": [
    "No text overlays or watermarks.",
    "Avoid brand marks; use generic industrial design.",
    "If conflict: obey scene shot type and framing first."
  ]
}
```
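
One payoff of keeping the blueprint as data: you can lock the "what" (subject, action) and programmatically vary the "how" (camera, seed) across takes. A minimal Python sketch, assuming the blueprint is saved as `veo_blueprint.json` (a hypothetical file name); the generated variants still get pasted into Vertex/Flow by hand, since Veo itself is driven through those tools:

```python
import copy
import json

# Load the blueprint above and generate per-take variants that lock the
# "what" (subject, action) while varying only the "how" (camera, seed).
with open("veo_blueprint.json") as f:  # hypothetical path to the JSON above
    base = json.load(f)

camera_variants = [
    "slow push-in 5% over duration, slight parallax",
    "locked-off tripod, no movement",
    "slow 10% lateral dolly left-to-right",
]

takes = []
for i, cam in enumerate(camera_variants):
    take = copy.deepcopy(base)
    take["seed"] = base["seed"] + i            # fresh seed per take
    take["scenes"][0]["shot"]["camera"] = cam  # vary only the camera move
    takes.append(take)

for t in takes:
    print(json.dumps(t, indent=2))  # paste each into Vertex/Flow
```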
Why this structure works
- Front-loads the core directive and keeps each scene to one action. ([Reddit][1])
- The continuity block stabilizes identity across shots (stronger in 3.1). ([The Verge][4])
- The shot object forces you to decide type, framing, and motion: exactly what Veo responds to. ([Google Cloud][5])
Most-shared Veo tips from Reddit (condensed)
| Tip (TL;DR) | Why it helps | Source |
| --- | --- | --- |
| Front-load the important bits; one action per scene; use a fixed shot grammar: [SHOT] + [SUBJECT] + [ACTION] + [STYLE] + [CAMERA] + [AUDIO]. | Veo weights early tokens more; a single action per scene reduces conflicts. | ([Reddit][1]) |
| Keep it concise (3–6 sentences), unify style (“photorealistic, cinematic”), avoid mixing multiple visual grammars. | Long/mixed prompts fragment the output; brevity improves adherence. | ([Reddit][3]) |
| Use Flow/Vadoo/Vertex scene-by-scene with JSON “meta prompts.” | Atomic scenes + JSON fields give repeatability and control across shots. | ([Reddit][6]) |
| Camera literacy wins: specify shot type, motion, lens feel; add “keep full subject in frame.” | Reduces unwanted reframing/cropping; more “director-like” control. | ([Reddit][1]) |
| Stabilize characters with start/end frames or “ingredients” (reference images). | Boosts identity/prop consistency; aligns with Veo 3.1 features. | ([Reddit][3]) |
| Iterate the how, not the what (lock subject/action; vary lighting, lens, motion). | Faster convergence to desired look. | ([Reddit][1]) |
| Don’t overstuff brands/celebs/copyrighted cues. | Avoids safety rejections and weird artifacts. | ([Reddit][7]) |
| Use official guides for prompt anatomy & Vertex specifics. | Aligns with model expectations and latest capabilities. | ([Google Cloud][5]) |
How you can “let AI watch a video” and turn it into a Veo-style JSON prompt
A) Upload the actual video to a model that natively understands video
Gemini 1.5 Pro (Vertex AI / Gemini API) accepts video files directly and can summarize, chapterize with timestamps, and reason over visuals+audio. You pass the video as an input part and ask it to output a scene-by-scene JSON. ([Google AI for Developers][1])
OpenAI GPT-4o / Realtime API: supports multimodal (text+vision+audio) and can handle streams/clips; you can feed frames/streams and ask for structured output. Availability details live in OpenAI’s docs and Realtime updates. ([OpenAI][2])
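
For path A, a minimal sketch using the `google-generativeai` Python package (the file name and API key are placeholders, and polling details can differ across SDK versions):

```python
import time
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # placeholder key

# Upload the clip; the Files API processes video asynchronously,
# so poll until the file is ready before prompting against it.
video = genai.upload_file("clip.mp4")  # placeholder path
while video.state.name == "PROCESSING":
    time.sleep(5)
    video = genai.get_file(video.name)

model = genai.GenerativeModel("gemini-1.5-pro")
prompt = (
    "Segment this video into scenes. For each scene return start/end "
    "seconds, shot type, framing, camera motion, action, environment, "
    "lighting, and audio, as Veo-ready JSON. One action per scene."
)
response = model.generate_content([video, prompt])
print(response.text)  # the scene-by-scene JSON draft
```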
B) Give a YouTube URL → pull transcript → generate prompt
Many workflows just fetch the YouTube transcript (when available) and drive the prompt from that text (cheap + fast). Libraries and services expose transcripts by video ID/URL. ([PyPI][3])
This misses purely visual info (framing, camera moves). To compensate, pair transcript beats with a few thumbnails/frame grabs you upload.
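
A minimal sketch of the transcript route, assuming the `youtube-transcript-api` package (likely the library behind the PyPI citation above; its method names have changed between versions, so check the docs for your installed release):

```python
from youtube_transcript_api import YouTubeTranscriptApi

# Fetch the transcript by video ID (the part after "v=" in the URL).
# In classic versions this returns a list of {"text", "start", "duration"}
# entries; newer releases expose an instance-based API instead.
segments = YouTubeTranscriptApi.get_transcript("VIDEO_ID")  # placeholder ID

# Collapse into timestamped beats you can paste into the analysis prompt.
beats = [f"[{s['start']:.1f}s] {s['text']}" for s in segments]
print("\n".join(beats[:20]))
```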
C) No native video support? Extract frames and feed as images
- Sample frames (e.g., 1–2 fps), upload them, and ask the model to detect shots, actions, and camera moves, then return your Veo JSON. This is a standard workaround discussed for vision models. ([OpenAI Developer Community][4])
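
A sketch of that frame-sampling workaround with OpenCV plus the OpenAI Python SDK (the 1 fps rate, 20-frame cap, and file name are illustrative choices, not requirements):

```python
import base64

import cv2  # pip install opencv-python
from openai import OpenAI


def sample_frames(path: str, every_sec: float = 1.0) -> list[str]:
    """Grab one JPEG frame per `every_sec` seconds, base64-encoded."""
    cap = cv2.VideoCapture(path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30
    step = max(1, int(fps * every_sec))
    frames, i = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if i % step == 0:
            ok, buf = cv2.imencode(".jpg", frame)
            if ok:
                frames.append(base64.b64encode(buf.tobytes()).decode())
        i += 1
    cap.release()
    return frames


client = OpenAI()  # reads OPENAI_API_KEY from the environment
images = [
    {"type": "image_url",
     "image_url": {"url": f"data:image/jpeg;base64,{b64}"}}
    for b64 in sample_frames("clip.mp4")[:20]  # cap the frame count
]
resp = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [{
            "type": "text",
            "text": ("These are frames sampled at ~1 fps. Detect shot "
                     "changes, actions, and camera moves; return "
                     "Veo-ready JSON, one action per scene."),
        }] + images,
    }],
)
print(resp.choices[0].message.content)
```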
A ready-to-use “analysis prompt” you can give the AI (works with A/B/C)
System goal (tell the model what to return): “From the provided video (and transcript if present), produce a Veo-ready JSON with: version, output, global_style, continuity, and scenes[] where each scene has id, start, end, shot{type, framing, camera}, action, environment, lighting, and optional audio. Keep one action per scene.”
User message template:
```
Inputs:
- Video (or frames): <attached / accessible>
- Transcript (optional): <text if available>

Tasks:
1) Segment into scenes. Use shot/angle changes and major action shifts. Return start/end (s).
2) For each scene, infer camera language (shot type, motion, framing), key action, lighting, env.
3) Output concise Veo JSON (<= 10 scenes, 3-6 sentences per scene). No brand marks/celebs.
```
JSON schema:
```json
{
  "version": "veo-3.1",
  "output": { "duration_sec": <int>, "fps": 24, "resolution": "1080p" },
  "global_style": { "look": "...", "color": "...", "mood": "...", "safety": "..." },
  "continuity": { "characters": [...], "props": [...], "lighting": "..." },
  "scenes": [
    {
      "id": "s1",
      "start": 0.0,
      "end": 3.5,
      "shot": { "type": "...", "framing": "...", "camera": "..." },
      "action": "...",
      "environment": "...",
      "lighting": "...",
      "audio": "..."
    }
  ],
  "notes": ["..."]
}
```
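
Before pasting the model's reply into Veo, it's worth a mechanical sanity check. A minimal sketch using the `jsonschema` package; the schema below is a loose hand translation of the template above, and `model_reply` is a stand-in for the AI's raw output:

```python
import json

from jsonschema import ValidationError, validate  # pip install jsonschema

# Loose JSON Schema mirroring the template above; tighten as needed.
VEO_SCHEMA = {
    "type": "object",
    "required": ["version", "output", "scenes"],
    "properties": {
        "scenes": {
            "type": "array",
            "maxItems": 10,
            "items": {
                "type": "object",
                "required": ["id", "start", "end", "shot", "action"],
            },
        }
    },
}


def check(raw: str) -> dict:
    """Parse and validate the model's reply before sending it to Veo."""
    data = json.loads(raw)       # fails loudly on non-JSON replies
    validate(data, VEO_SCHEMA)   # fails loudly on missing fields
    return data


# Stand-in for the raw text the analysis model returned.
model_reply = '{"version": "veo-3.1", "output": {}, "scenes": []}'
try:
    prompt_json = check(model_reply)
except (json.JSONDecodeError, ValidationError) as e:
    print("Regenerate; invalid Veo JSON:", e)
```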
Popular practical tips (from the field)
| What to do | Why it works | Where this is documented / discussed |
| --- | --- | --- |
| Prefer native video models for best fidelity | They see motion, audio, and lighting across time, so scene detection + JSON is cleaner | Gemini video understanding + samples. ([Google AI for Developers][1]) |
| If using YouTube links, pull the transcript first | Fast structure for beats; then add a few key frames for visual specifics | Transcript tools/APIs. ([PyPI][3]) |
| For vision-only models, extract frames (~1-2 fps) | Common workaround to "analyze video"; then ask for a shot list → Veo JSON | OpenAI community guidance on video via frames. ([OpenAI Developer Community][4]) |
| Ask the model for timestamps per scene | Lets you map JSON back to the cut and edit precisely | Google's Vertex sample returns timestamped chapters. ([Google Cloud][5]) |
| Keep JSON short and atomic | Veo follows concise, single-action scenes better | General prompting best practices for these models. ([Google AI for Developers][1]) |