Sora 2 Image-to-Video Prompting Guide
Critical reality check: Image-to-video (I2V) in Sora 2 offers significantly less control than text-to-video. Users report that the overwhelming majority of outputs come back static or unpredictable. The image anchor constrains motion generation: expect to lock in visual style but struggle with camera movements and complex actions. Text-to-video handles motion far more reliably; use I2V strategically for visual consistency, not general animation.
Technical requirements
Image specifications are non-negotiable. Your image dimensions must exactly match your target video resolution—no flexibility. Supported formats: JPEG, PNG, WebP only. For sora-2 use 1280×720 or 720×1280. For sora-2-pro add 1792×1024 or 1024×1792. Aspect ratios: 16:9 landscape, 9:16 vertical, 1:1 square.
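A minimal pre-flight check in Python, assuming Pillow is installed; the accepted sizes and formats are the ones listed above, and the file name is just an example:

```python
from PIL import Image

# Resolutions accepted per model (taken from the requirements above).
ALLOWED_SIZES = {
    "sora-2": {(1280, 720), (720, 1280)},
    "sora-2-pro": {(1280, 720), (720, 1280), (1792, 1024), (1024, 1792)},
}
ALLOWED_FORMATS = {"JPEG", "PNG", "WEBP"}

def check_anchor_image(path: str, model: str = "sora-2") -> None:
    """Fail fast if the anchor image won't be accepted as-is."""
    with Image.open(path) as img:
        if img.format not in ALLOWED_FORMATS:
            raise ValueError(f"Unsupported format {img.format}; use JPEG, PNG, or WebP")
        if img.size not in ALLOWED_SIZES[model]:
            raise ValueError(
                f"{img.size} does not match an accepted {model} resolution: "
                f"{sorted(ALLOWED_SIZES[model])}"
            )

check_anchor_image("anchor_1280x720.png", model="sora-2")
```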
Duration strategy matters. The model follows instructions more reliably in shorter clips: generate two 4-second clips and stitch them together rather than one 8-second clip. The API supports 4, 8, or 12 seconds; the app allows up to 20 seconds (25 with Pro). Shorter means better results.
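If you take the two-clip route, the clips can be joined losslessly with ffmpeg's concat demuxer. A sketch using Python's subprocess, assuming ffmpeg is on PATH and both clips were generated with identical settings (so stream copy works); file names are illustrative:

```python
import os
import subprocess
import tempfile

def stitch_clips(clip_paths, output_path):
    """Concatenate clips without re-encoding via ffmpeg's concat demuxer."""
    # Use absolute paths, because ffmpeg resolves entries relative to the list file.
    with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
        for path in clip_paths:
            f.write(f"file '{os.path.abspath(path)}'\n")
        list_file = f.name
    subprocess.run(
        ["ffmpeg", "-y", "-f", "concat", "-safe", "0",
         "-i", list_file, "-c", "copy", output_path],
        check=True,
    )
    os.remove(list_file)

stitch_clips(["shot_01.mp4", "shot_02.mp4"], "combined.mp4")
```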
Image content restrictions: You cannot upload photos of real people without using the Cameos consent feature. High-resolution source images with a single clear subject work best. Avoid busy compositions, poor lighting, or multiple competing elements.
Prompt structure that works
Use this cinematography-first template for every prompt:
[Style declaration: "1970s film aesthetic" or "UGC iPhone selfie mode"]
[Scene description: subject, wardrobe, setting, props]
Cinematography:
- Camera: [ONE specific shot: "medium close-up, eye level"]
- Lens: [focal length + depth: "35mm, shallow DOF"]
- Lighting: [quality + source + direction: "soft window light from camera left, warm fill"]
- Palette: [3-5 colors: "amber, cream, walnut brown"]
Mood: [tone: "nostalgic, tender"]
Actions:
- [Beat 1 with timing: "She spins in first 2 seconds"]
- [Beat 2: "Catches light at 0:03"]
- [Beat 3: "Stops motion final second"]
Sound: [specific audio: "fabric flutter, distant traffic"]
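For batch work it can help to assemble this template programmatically. A minimal Python sketch; the field names mirror the sections above and the example values are illustrative:

```python
def build_prompt(style, scene, camera, lens, lighting, palette, mood, actions, sound):
    """Assemble the cinematography-first template into one prompt string."""
    lines = [
        style,
        scene,
        "Cinematography:",
        f"- Camera: {camera}",
        f"- Lens: {lens}",
        f"- Lighting: {lighting}",
        f"- Palette: {', '.join(palette)}",
        f"Mood: {mood}",
        "Actions:",
        *[f"- {beat}" for beat in actions],
        f"Sound: {sound}",
    ]
    return "\n".join(lines)

prompt = build_prompt(
    style="1970s film aesthetic",
    scene="Woman in a flowing amber dress in a sunlit hallway, wooden floors",
    camera="medium close-up, eye level",
    lens="35mm, shallow DOF",
    lighting="soft window light from camera left, warm fill",
    palette=["amber", "cream", "walnut brown"],
    mood="nostalgic, tender",
    actions=[
        "She spins in the first 2 seconds",
        "Catches light at 0:03",
        "Stops motion in the final second",
    ],
    sound="fabric flutter, distant traffic",
)
print(prompt)
```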
The golden rule: ONE camera move + ONE subject action per shot. Multiple simultaneous actions virtually guarantee failure.
Actionable best practices
Describe motion in beats and counts. Never write “person moves quickly”—write “cyclist pedals three times, brakes, stops at crosswalk.” Use temporal markers: “In first 2 seconds,” “At 0:04,” “Final second.” Give the model numbered steps, pauses, time markers.
Specify exact camera framing. Say “wide establishing shot, eye level” or “medium close-up, slight angle from behind” or “slow dolly-in from eye level.” Add lens details: “35mm, shallow depth of field” or “85mm portrait lens.” Never assume the model knows what you want.
Lock your lighting and palette. Describe both quality and source: “soft window light with warm lamp fill, cool rim from hallway.” Name 3-5 specific colors: “teal, sand, rust.” This prevents color drift and maintains consistency with your source image.
Include sound design. Sora 2 generates audio—don’t ignore it. Add ambient sound (“distant traffic rumble”), foley (“footsteps on gravel”), or dialogue with speaker labels. Even simple sounds improve output quality.
Keep prompts matched to duration. 4-second clips: 1-2 dialogue exchanges or single action. 8-second clips: 3-4 exchanges or slightly more complex action sequence. Don’t overload short clips with excessive beats.
Use “same [element]” for consistency. When generating multiple shots, repeat key details: “same red coat,” “same silver earrings,” “same curly ponytail.” This maintains character consistency across generations.
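A trivial way to enforce this is to keep the locked details in one place and prepend them to every shot prompt; the details below are illustrative:

```python
LOCKED_DETAILS = ["same red coat", "same silver earrings", "same curly ponytail"]

def with_consistency(shot_prompt: str) -> str:
    """Prefix every shot with the locked 'same [element]' details."""
    return ", ".join(LOCKED_DETAILS) + ". " + shot_prompt

print(with_consistency("She steps onto the platform as the train doors close."))
```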
Start conservative with image inputs. Use 4-second durations initially. Request minimal, subtle motion first: “gentle breathing,” “fabric flutter from breeze,” “soft eye blink.” Once that works, add complexity incrementally through remix.
Leverage remix strategically. Remix is for nudging, not gambling. Make ONE controlled change per iteration: “same shot, switch to 85mm” or “same lighting, new palette: coral, mint, navy.” Lock what works, adjust one variable at a time.
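If you drive remixes through the API, the loop looks roughly like the sketch below. This assumes the OpenAI Python SDK exposes the video remix endpoint as client.videos.remix; treat the method and parameter names as assumptions and verify them against the current API reference:

```python
from openai import OpenAI

client = OpenAI()

# Hedged sketch: assumes the SDK exposes the video remix endpoint as
# client.videos.remix(video_id=..., prompt=...); verify these names against
# the current API reference before relying on them.
remixed = client.videos.remix(
    video_id="video_123",                   # ID of the shot that already works
    prompt="same shot, switch to 85mm",     # ONE controlled change per iteration
)
print(remixed.id, remixed.status)
```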
Common mistakes to avoid
Don’t expect text-to-video control. Camera movement commands that work reliably in T2V (zoom, pan, tilt, dolly) fail or produce erratic results in I2V. Even basic moves often ignore instructions. The image anchor overrides motion commands.
Don’t request complex actions. Avoid “walks across room, picks up object, opens door.” Even single complex actions often fail, and subjects tend to remain mostly static, lacking even subtle movements. Request one simple action at most.
Don’t upload DALL-E images directly. Users report that DALL-E-generated images consistently produce completely static outputs in I2V. If using AI-generated images as references, test with simple motion first.
Don’t use vague descriptions. Replace “person moves quickly” with “sprints five steps, skids to stop.” Change “brightly lit room” to “soft window light with warm lamp fill.” Swap “nice colors” for “sage green, cream, terracotta.” Concrete beats abstract.
Don’t try long durations initially. Longer clips increase likelihood of style drift where “the scene transitions from your image to a subpar sora recreation.” Start short, verify success, then extend.
Don’t forget frame safety. Add “keep full subject in frame” or “allow 10% padding around character.” Sora loves extreme close-ups that awkwardly crop subjects.
Don’t mix multiple camera movements. Never write “zoom in while panning left and tilting down.” I2V struggles with single movements—multiple movements guarantee chaos or static output.
Don’t skip resolution matching. Uploaded images that don’t exactly match the target video resolution will be rejected or force destructive cropping, which destroys your carefully composed anchor frame.
Specific I2V techniques
Image preparation workflow: Use high-res source with single clear subject and good lighting. Clean composition with distinct foreground/midground/background layers. Verify rights and ownership. For portraits use frontal or 3/4 view with distinctive features highlighted.
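A sketch of the resize-and-crop step with Pillow (9.1 or newer for Image.Resampling): scale the source so it covers the target resolution, then center-crop, which gives an anchor that matches the video size exactly without distortion. File names are illustrative:

```python
import math
from PIL import Image

def prepare_anchor(src_path, dst_path, target=(1280, 720)):
    """Scale the source to cover the target resolution, then center-crop to it."""
    tw, th = target
    with Image.open(src_path) as img:
        img = img.convert("RGB")
        scale = max(tw / img.width, th / img.height)
        new_size = (math.ceil(img.width * scale), math.ceil(img.height * scale))
        img = img.resize(new_size, Image.Resampling.LANCZOS)
        left = (img.width - tw) // 2
        top = (img.height - th) // 2
        img.crop((left, top, left + tw, top + th)).save(dst_path)

prepare_anchor("portrait_source.jpg", "anchor_1280x720.png", target=(1280, 720))
```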
Effective I2V prompts are minimalist: “She turns around and smiles, then slowly walks out of frame” or “Slow zoom in on product, rotating 360 degrees, subtle lighting shift.” Describe what happens after the anchor image—don’t redescribe what’s already visible.
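Submitting such a prompt with an anchor image through the API looks roughly like this sketch. The model name, size, and duration come from this guide; the parameter names (input_reference, size, seconds) are assumptions to verify against the current API reference:

```python
from openai import OpenAI

client = OpenAI()

# Hedged sketch: assumes the video endpoint accepts an image reference roughly
# as shown; verify parameter names and types against the current API reference.
with open("anchor_1280x720.png", "rb") as anchor:
    video = client.videos.create(
        model="sora-2",
        prompt="She turns around and smiles, then slowly walks out of frame",
        size="1280x720",      # must match the anchor image exactly
        seconds="4",          # start conservative, extend later
        input_reference=anchor,
    )
print(video.id, video.status)
```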
The image locks in: character design, wardrobe, set dressing, aesthetic, composition, lighting quality. Your text prompt defines: what happens next, camera movement (if any), audio, timing. Don’t fight the anchor—work with it.
Best I2V use cases: product showcases with subtle rotation, environmental scenes with ambient motion (leaves, water), preserving brand visual consistency, mood pieces where exact action doesn’t matter. Wrong use cases: complex character animation, reliable camera moves, multi-step action sequences.
Generate reference images strategically. Use OpenAI image generation to create starting points, test aesthetics, establish environments. This workflow often produces better results than uploading external photos.
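A hedged sketch of that workflow using the SDK's image generation endpoint; the model name and size are assumptions to check against the current API reference, and the generated frame still needs cropping to the exact video resolution (see the prepare_anchor sketch above):

```python
import base64
from openai import OpenAI

client = OpenAI()

# Hedged sketch: assumes the images endpoint and the gpt-image-1 size shown here;
# check the current API reference. The output is wider than 1280x720, so crop it
# to the exact video resolution before using it as an anchor.
result = client.images.generate(
    model="gpt-image-1",
    prompt="Product shot of a ceramic mug on a walnut table, soft window light, "
           "amber and cream palette, clean background",
    size="1536x1024",
)
with open("reference_raw.png", "wb") as f:
    f.write(base64.b64decode(result.data[0].b64_json))
```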
Key differences from text-to-video
Control inversion: T2V gives high prompt control with reliable cinematography execution. I2V offers minimal control; prompts become suggestions, not commands. Text authority drops dramatically when an image anchor is present.
Motion philosophy: T2V builds motion from scratch with creative freedom. I2V is severely constrained by preserving the input image, creating strong bias toward static outputs. Physics improvements in T2V don’t transfer to I2V.
Success rates diverge: T2V works well when following prompt guides. I2V requires many more attempts for usable results. Users report testing hundreds of images with poor outcomes while T2V delivers consistently.
Prompt structure changes: T2V uses detailed 50-100+ word prompts with complex scene descriptions. I2V needs shorter, simpler prompts focused on single actions after the anchor frame. Complexity decreases effectiveness.
Camera movements: T2V executes dollies, pans, tilts, zooms reliably. I2V fails at basic camera operations—they’re unresponsive to commands. If camera control matters, use T2V.
Use case separation: T2V for narrative storytelling, complex action, camera work, novel scene generation. I2V for visual anchoring, brand consistency, locking specific aesthetics, subtle ambient motion only.
Quick reference checklist
Every prompt needs:
- Style/format upfront (“cinematic commercial,” “90s documentary”)
- Exact camera framing and angle specified
- ONE camera move + ONE subject action maximum
- Motion in beats with timing (“takes four steps,” “pauses at 0:03”)
- Lighting quality, source, direction described
- 3-5 specific color names
- Sound/audio cues included
- Duration appropriate to complexity (4-8 seconds optimal)
For image-to-video specifically:
- High-res source image matching exact video resolution
- Rights/consent verified (no unauthorized faces)
- Single clear subject with good lighting
- Conservative motion expectations (subtle, minimal)
- Simple action description in prompt
- Test 4-second clips before extending duration
Final strategy: Use text-to-video as your primary tool for motion and camera work. Reserve image-to-video for specific visual consistency needs where you must lock exact character designs, brand colors, or aesthetic anchors. Manage expectations—I2V trades motion control for visual precision, and that trade-off is severe.