
Veo 3/3.1: pro JSON prompting, step-by-step


1) Start with intent + constraints (front-load the “what”). Veo weights early words heavily; say the core shot, subject, and action first, then style and camera. ([Reddit][1])

2) Keep scenes atomic. Use short, single-action scenes (micro-beats) and control continuity with explicit carry-overs (lighting, costume, props). Google’s latest guide emphasizes directing scenes and consistency. ([Google Cloud][2])

3) Unify style & tone. Avoid mixed styles; declare one visual grammar (e.g., “photorealistic, cinematic tone”). Keep the prompt concise (3–6 sentences per scene) to prevent dilution. ([Reddit][3])

4) Be explicit about camera & framing. Name the shot type, lens feel, motion, and framing (“medium shot… slow push-in… keep full subject in frame”). ([Reddit][1])

5) Lock continuity with references (“ingredients”). Use start/end frames or reference images to stabilize characters/props; Veo 3.1 strengthens this workflow. ([The Verge][4])

6) Describe audio beats (if your toolchain supports it). Add succinct sound cues to match emotional beats (e.g., “subtle HVAC hum, soft shoe squeak”). ([The Verge][4])

7) Keep it short. Overlong prompts get partially ignored; structure, don’t stuff. ([Reddit][3])

8) Cross-check with Google’s official prompt guides. They show what Veo expects and how to tweak for Vertex/Flow. ([Google Cloud][5])
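The eight tips above boil down to a fixed prompt grammar: shot first, then subject, one action, one style, explicit camera, optional audio. As a sketch (the function name and field set are our own convention, not a Veo API), composing a scene prompt that way might look like:

```python
# Sketch: compose an atomic scene prompt front-loading the "what".
# Grammar: [SHOT] + [SUBJECT] + [ACTION] + [STYLE] + [CAMERA] + [AUDIO].

def build_scene_prompt(shot, subject, action, style, camera, audio=None):
    """One scene, one action; core directive stated first."""
    parts = [
        f"{shot} of {subject}",  # what the viewer sees, stated first
        action,                  # exactly one action per scene
        style,                   # a single unified visual grammar
        camera,                  # explicit motion and framing
    ]
    if audio:
        parts.append(f"Audio: {audio}")
    return ". ".join(p.rstrip(".") for p in parts) + "."

prompt = build_scene_prompt(
    shot="Medium shot",
    subject="a late-20s host at a clean desk",
    action="she sets an aluminum cube gently on the desk",
    style="photorealistic, cinematic tone",
    camera="slow push-in, keep full subject in frame",
    audio="quiet HVAC hum, soft placement thud",
)
```

Keeping the grammar in code makes it trivial to lock the subject/action and iterate only on style or camera between runs.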

A reusable Veo JSON “meta-prompt” blueprint

Paste this and fill the blanks for each project. It mirrors how pros on Vertex/Flow break direction into controllable blocks. ([Google Cloud][5])

{
  "version": "veo-3.1",
  "seed": 42,
  "output": { "duration_sec": 10, "fps": 24, "resolution": "1080p" },

  "global_style": {
    "look": "photorealistic with cinematic tone",
    "color": "rich but natural contrast, soft highlight rolloff",
    "mood": "focused, confident, modern",
    "safety": "no celebrities, no trademarks"
  },

  "continuity": {
    "characters": [
      {
        "id": "host",
        "description": "late-20s professional, charcoal blazer, clean desk setup",
        "reference_images": ["host_front.jpg", "host_threequarter.jpg"]
      }
    ],
    "props": ["laptop: matte black", "product: polished aluminum cube"],
    "lighting": "consistent key from screen-left, 5600K, softbox feel"
  },

  "scenes": [
    {
      "id": "s1",
      "duration_sec": 3.5,
      "shot": {
        "type": "medium shot",
        "framing": "keep full subject shoulders-up centered",
        "camera": "slow push-in 5% over duration, slight parallax"
      },
      "action": "host sets product cube gently on desk, dust motes visible",
      "environment": "minimal studio office; bokeh city lights in background",
      "lighting": "soft key from left, dim rim from right, subtle reflections on cube",
      "style": "clean tech aesthetic",
      "audio": "quiet HVAC hum, soft object placement thud"
    },
    {
      "id": "s2",
      "duration_sec": 3.5,
      "shot": {
        "type": "macro insert",
        "framing": "product centered; maintain logo-free surfaces",
        "camera": "linear left-to-right slider 10cm, micro-rack-focus to front edge"
      },
      "action": "cube’s edge catches the light; internal details sparkle",
      "lighting": "specular highlight sweep across bevel",
      "audio": "gentle high-frequency shimmer"
    },
    {
      "id": "s3",
      "duration_sec": 3,
      "shot": {
        "type": "wide establishing",
        "framing": "host at desk, product mid-foreground",
        "camera": "locked-off; add subtle handheld micro-jitter 1%"
      },
      "action": "host nods; UI-like reflections glide on product surface",
      "lighting": "same setup as s1 for continuity",
      "audio": "soft whoosh as reflections pass; room tone continues"
    }
  ],

  "notes": [
    "No text overlays or watermarks.",
    "Avoid brand marks; use generic industrial design.",
    "If conflict: obey scene shot type and framing first."
  ]
}
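Because the blueprint is plain JSON, it's worth a sanity check before each run. A minimal sketch (the field names mirror the blueprint above; the checks themselves are our own convention, not part of Veo):

```python
# Sketch: sanity-check a Veo-style blueprint before submitting it.

def check_blueprint(blueprint: dict) -> list[str]:
    """Return a list of problems; empty list means the blueprint looks sane."""
    problems = []
    scenes = blueprint.get("scenes", [])
    # Scene durations should fit inside the declared output duration.
    total = sum(s.get("duration_sec", 0) for s in scenes)
    budget = blueprint.get("output", {}).get("duration_sec", 0)
    if total > budget:
        problems.append(f"scene durations ({total}s) exceed output budget ({budget}s)")
    # Every scene needs an id, a shot block, and exactly one described action.
    for s in scenes:
        for key in ("id", "shot", "action"):
            if key not in s:
                problems.append(f"scene {s.get('id', '?')} missing '{key}'")
    return problems
```

Running this over the blueprint above (3.5 + 3.5 + 3 = 10s against a 10s budget) returns an empty list; trimming the output duration or dropping a scene's `shot` block would flag it.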

Why this structure works

  • Front-loads the core directive and keeps each scene to one action. ([Reddit][1])

  • Continuity block stabilizes identity across shots (stronger in 3.1). ([The Verge][4])

  • Shot object forces you to decide type, framing, and motion—exactly what Veo responds to. ([Google Cloud][5])

Most-shared Veo tips from Reddit (condensed)

| Tip (TL;DR) | Why it helps | Source |
| --- | --- | --- |
| Front-load the important bits; one action per scene; use a fixed shot grammar: [SHOT] + [SUBJECT] + [ACTION] + [STYLE] + [CAMERA] + [AUDIO]. | Veo weights early tokens more; a single action per scene reduces conflicts. | ([Reddit][1]) |
| Keep it concise (3–6 sentences), unify style (“photorealistic, cinematic”), avoid mixing multiple visual grammars. | Long/mixed prompts fragment the output; brevity improves adherence. | ([Reddit][3]) |
| Use Flow/Vadoo/Vertex scene-by-scene with JSON “meta prompts.” | Atomic scenes + JSON fields give repeatability and control across shots. | ([Reddit][6]) |
| Camera literacy wins: specify shot type, motion, lens feel; add “keep full subject in frame.” | Reduces unwanted reframing/cropping; more “director-like” control. | ([Reddit][1]) |
| Stabilize characters with start/end frames or “ingredients” (reference images). | Boosts identity/prop consistency; aligns with Veo 3.1 features. | ([Reddit][3]) |
| Iterate the how, not the what (lock subject/action; vary lighting, lens, motion). | Faster convergence to desired look. | ([Reddit][1]) |
| Don’t overstuff brands/celebs/copyrighted cues. | Avoids safety rejections and weird artifacts. | ([Reddit][7]) |
| Use official guides for prompt anatomy & Vertex specifics. | Aligns with model expectations and latest capabilities. | ([Google Cloud][5]) |

How you can “let AI watch a video” and turn it into a Veo-style JSON prompt

A) Upload the actual video to a model that natively understands video

  • Gemini 1.5 Pro (Vertex AI / Gemini API) accepts video files directly and can summarize, chapterize with timestamps, and reason over visuals+audio. You pass the video as an input part and ask it to output a scene-by-scene JSON. ([Google AI for Developers][1])

  • OpenAI GPT-4o / Realtime API: supports multimodal (text+vision+audio) and can handle streams/clips; you can feed frames/streams and ask for structured output. Availability details live in OpenAI’s docs and Realtime updates. ([OpenAI][2])
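For the Gemini route, a minimal sketch using the `google-generativeai` Python SDK's documented upload-then-generate pattern might look like the following (the API key placeholder and the exact prompt wording are ours; check the current SDK docs, as call signatures evolve; network calls are kept inside `main()` so nothing runs on import):

```python
# Sketch of workflow A: upload a video to Gemini 1.5 Pro and ask for Veo JSON.

ANALYSIS_PROMPT = (
    "Watch this video and return a Veo-ready JSON with version, output, "
    "global_style, continuity, and scenes[] where each scene has id, start, "
    "end, shot{type, framing, camera}, action, environment, lighting, and "
    "optional audio. One action per scene."
)

def main(path: str) -> str:
    import time
    import google.generativeai as genai  # pip install google-generativeai

    genai.configure(api_key="YOUR_API_KEY")  # placeholder
    video = genai.upload_file(path=path)
    # Uploaded videos are processed server-side before they can be used.
    while video.state.name == "PROCESSING":
        time.sleep(5)
        video = genai.get_file(video.name)
    model = genai.GenerativeModel("gemini-1.5-pro")
    resp = model.generate_content([video, ANALYSIS_PROMPT])
    return resp.text  # should contain the scene-by-scene JSON
```

Parse the returned text as JSON and validate it before feeding it to Veo.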

B) Give a YouTube URL → pull transcript → generate prompt

  • Many workflows just fetch the YouTube transcript (when available) and drive the prompt from that text (cheap + fast). Libraries and services expose transcripts by video ID/URL. ([PyPI][3])

  • This misses purely visual info (framing, camera moves). To compensate, pair transcript beats with a few thumbnails/frame grabs you upload.
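A sketch of the transcript route, using `youtube-transcript-api` from PyPI (one of the libraries alluded to above; its API has changed across versions, so treat the call shape as an assumption to verify; the URL helper is our own convenience):

```python
# Sketch of workflow B: YouTube URL -> transcript text -> prompt material.
from urllib.parse import urlparse, parse_qs

def video_id_from_url(url: str) -> str:
    """Extract the YouTube video ID from watch?v=... or youtu.be/... URLs."""
    parsed = urlparse(url)
    if parsed.hostname == "youtu.be":
        return parsed.path.lstrip("/")
    return parse_qs(parsed.query).get("v", [""])[0]

def transcript_text(url: str) -> str:
    # Network call kept inside the function; pip install youtube-transcript-api
    from youtube_transcript_api import YouTubeTranscriptApi
    entries = YouTubeTranscriptApi.get_transcript(video_id_from_url(url))
    return " ".join(e["text"] for e in entries)  # entries carry text + timing
```

Feed the joined text (or the timestamped entries, if you want scene boundaries) into the analysis prompt below alongside a few frame grabs.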

C) No native video support? Extract frames and feed as images

  • Sample frames (e.g., 1–2 fps), upload them, and ask the model to detect shots, actions, and camera moves, then return your Veo JSON. This is a standard workaround discussed for vision models. ([OpenAI Developer Community][4])
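A sketch of the frame-sampling step, shelling out to `ffmpeg` with its `fps` filter (assumes `ffmpeg` is on your PATH; the output naming pattern is our own choice):

```python
# Sketch of workflow C: sample ~1-2 fps of frames, then upload them as images.
import subprocess

def frame_sample_cmd(src: str, out_dir: str, fps: float = 1.0) -> list[str]:
    """Build an ffmpeg command that writes ~fps frames per second as JPEGs."""
    return [
        "ffmpeg", "-i", src,
        "-vf", f"fps={fps}",            # ffmpeg's fps filter resamples frames
        f"{out_dir}/frame_%04d.jpg",    # frame_0001.jpg, frame_0002.jpg, ...
    ]

def extract_frames(src: str, out_dir: str, fps: float = 1.0) -> None:
    subprocess.run(frame_sample_cmd(src, out_dir, fps), check=True)
```

Building the command as a list keeps it inspectable and testable without actually running ffmpeg; `extract_frames` performs the real extraction.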

A ready-to-use “analysis prompt” you can give the AI (works with A/B/C)

System goal (tell the model what to return): “From the provided video (and transcript if present), produce a Veo-ready JSON with: version, output, global_style, continuity, and scenes[] where each scene has id, start, end, shot{type, framing, camera}, action, environment, lighting, and optional audio. Keep one action per scene.”

User message template:

Inputs:
- Video (or frames): <attached / accessible>
- Transcript (optional): <text if available>

Tasks:
1) Segment into scenes. Use shot/angle changes and major action shifts. Return start/end (s).
2) For each scene, infer camera language (shot type, motion, framing), key action, lighting, env.
3) Output concise Veo JSON (<= 10 scenes, 3–6 sentences per scene). No brand marks/celebs.

JSON schema:
{
  "version": "veo-3.1",
  "output": { "duration_sec": <int>, "fps": 24, "resolution": "1080p" },
  "global_style": { "look": "...", "color": "...", "mood": "...", "safety": "..." },
  "continuity": { "characters": [...], "props": [...], "lighting": "..." },
  "scenes": [
    {
      "id": "s1",
      "start": 0.0,
      "end": 3.5,
      "shot": { "type": "...", "framing": "...", "camera": "..." },
      "action": "...",
      "environment": "...",
      "lighting": "...",
      "audio": "..."
    }
  ],
  "notes": ["..."]
}
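Model output rarely matches a schema on the first try, so it pays to validate the reply before handing it to Veo. A minimal sketch (the rules encode the constraints stated in the tasks above; the key set and overlap check are our own convention):

```python
import json

# Sketch: validate the model's JSON reply against the schema above.
REQUIRED_SCENE_KEYS = {"id", "start", "end", "shot", "action"}

def validate_reply(raw: str) -> list[str]:
    """Return a list of problems; empty list means the reply looks usable."""
    problems = []
    scenes = json.loads(raw).get("scenes", [])
    if len(scenes) > 10:
        problems.append("more than 10 scenes")
    prev_end = 0.0
    for s in scenes:
        missing = REQUIRED_SCENE_KEYS - s.keys()
        if missing:
            problems.append(f"{s.get('id', '?')}: missing {sorted(missing)}")
        elif s["start"] < prev_end:  # scenes must not overlap in time
            problems.append(f"{s['id']}: overlaps previous scene")
        else:
            prev_end = s["end"]
    return problems
```

If validation fails, send the problem list back to the model and ask it to re-emit the JSON; that loop usually converges in one or two rounds.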

Popular practical tips (from the field)

| What to do | Why it works | Where this is documented / discussed |
| --- | --- | --- |
| Prefer native video models for best fidelity | They see motion, audio, lighting across time, so scene detection + JSON is cleaner | Gemini video understanding + samples. ([Google AI for Developers][1]) |
| If using YouTube links, pull the transcript first | Fast structure for beats; then add a few key frames for visual specifics | Transcript tools/APIs. ([PyPI][3]) |
| For vision-only models, extract frames (~1–2 fps) | Common workaround to “analyze video”; then ask for shot list → Veo JSON | OpenAI community guidance on video via frames. ([OpenAI Developer Community][4]) |
| Ask the model for timestamps per scene | Lets you map JSON back to the cut and edit precisely | Google’s Vertex sample returns timestamped chapters. ([Google Cloud][5]) |
| Keep JSON short and atomic | Veo follows concise, single-action scenes better | General prompting best practices for these models. ([Google AI for Developers][1]) |
