Can Claude or LLMs Watch a Video? Here's How to Make It Work
LLMs do not natively watch most video files. Gemini samples frames; Claude needs prep. Solutions: scene-aware frame packs (claude-real-video), transcripts (video-use), native APIs, and DIY ffmpeg, compared with costs.
Can Claude watch a video? Not the way humans do. Paste a YouTube URL into ChatGPT and you mostly get captions, not frame-by-frame vision. Upload an MP4 to Claude and you often hit an "unsupported file type" error. Even Gemini, which accepts video natively, typically samples frames at a fixed rate, which is fine for a lecture but weak for a 15-second reel with six cuts.
The workable pattern in 2026 is not "give the model the video." It is to give the model what matters from the video (speech as text, visuals as deduplicated keyframes, metadata as a manifest) and then ask your questions.
Bottom line: For Q&A and research, use frames + transcript + manifest. For editing, use video-use. For the fastest cloud trial, use Gemini video. For privacy and reels, use local prep (crv or DIY).
Why "watching" breaks on most LLM UIs
Multimodal models do not stream 24 fps into context. They receive a bounded set of images and text inside a token budget.
Transcript only: answers dialogue; blind to on-screen code, charts, or lip-sync fraud
Raw frame dump: 1 minute × 30 fps × ~1,500 tokens/image ≈ millions of tokens → slow, expensive, noisy
No video ingest (Claude file upload): you must preprocess before the model sees anything
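To make the raw-frame arithmetic concrete, here is a minimal sketch of the token estimate. The ~1,500 tokens per image is the rough figure quoted above, not a measured value for any specific model, and the function name is illustrative.

```python
def raw_frame_tokens(duration_s: float, fps: float, tokens_per_image: int) -> int:
    """Estimate tokens needed to send every frame of a clip as its own image."""
    frames = int(duration_s * fps)
    return frames * tokens_per_image


# One minute at 30 fps, using the article's rough per-image cost.
every_frame = raw_frame_tokens(60, 30, 1500)

# The same per-image cost applied to a capped keyframe set of 150 images.
capped = 150 * 1500

print(f"every frame: {every_frame:,} tokens")
print(f"150 keyframes: {capped:,} tokens")
```

Under these assumptions the full dump costs twelve times as much as a 150-frame pack, before any transcript text is added.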
video-use's core insight applies broadly: the LLM should read structure, not raw pixels. For open-ended video Q&A, though, you need both a transcript and selectively chosen frames, not a transcript alone.
Google's Gemini models accept video in AI Studio and the Gemini API. Default sampling is often ~1 frame per second, a documented behavior that teams run into when analyzing long files.
Good for: meeting recordings, lectures, slow-paced demos. Weak for: TikTok edits, sports highlights, music videos with fast cuts.
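A small simulation shows why fixed-rate sampling struggles with fast cuts. This is a sketch under stated assumptions: the cut timestamps are invented for illustration, and the sampler simply takes one frame every 1/fps seconds starting at t = 0, which may differ from what any vendor actually does.

```python
from bisect import bisect_right


def missed_shots(cut_times: list[float], duration_s: float, sample_fps: float = 1.0) -> list[int]:
    """Return indices of shots that receive no sampled frame.

    cut_times: sorted timestamps (seconds) where a new shot begins, excluding 0.
    Shot 0 runs from 0 to the first cut, shot 1 from the first cut to the second, and so on.
    """
    seen = set()
    n = 0
    while n / sample_fps < duration_s:
        t = n / sample_fps
        # Number of cuts at or before t equals the index of the shot on screen at t.
        seen.add(bisect_right(cut_times, t))
        n += 1
    total_shots = len(cut_times) + 1
    return [i for i in range(total_shots) if i not in seen]


# Hypothetical 15-second reel with six cuts (seven shots), three of them under a second long.
cuts = [2.2, 2.6, 2.9, 6.1, 6.4, 11.0]
print(missed_shots(cuts, 15.0, sample_fps=1.0))
```

With these made-up timestamps, a 1 fps sampler never lands inside three of the seven shots. Scene-change detection keys off the cuts themselves, so it does not depend on a sample happening to fall inside a short shot.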
See Gemini Omni Flash for Google's video stack direction; generation and understanding are related but not identical products.
ChatGPT + links
Useful when spoken content carries the signal. Weak when the answer is on screen but never spoken: UI walkthroughs, silent demos, visual gags.
Cost: bundled in subscription or API vision pricing; you do not control frame selection.
Solution 2: Scene-aware local prep (claude-real-video)
claude-real-video (crv, MIT, ~250 GitHub stars as of July 2026) is one open-source answer when you want any LLM (Claude, ChatGPT, Gemini, local models) to reason over what changed visually, without uploading the video to a vendor.
How it works (short): yt-dlp or local file → ffmpeg scene-change + fps floor → pixel-diff dedup (not perceptual hash, because hashes miss flat-color hue shifts) → subtitle track if present, else Whisper → MANIFEST.txt for the model.
Tuning flags that matter:
| Flag | Default | Effect |
| --- | --- | --- |
| --scene | 0.30 | Lower = more frames |
| --max-frames | 150 | Hard cap |
| --dedup-threshold | 8 | Higher = fewer frames kept |
| --dedup-window | 4 | Stops A-B-A repeat sends |
| --report | off | HTML report of keep/drop decisions |
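To show how a dedup threshold and a sliding window interact, here is a minimal sketch of pixel-diff deduplication. It is not crv's implementation: the frame representation (flattened grayscale pixel lists), the mean-absolute-difference metric, and the function names are all assumptions chosen to keep the example dependency-free. The defaults only mirror the flag values listed above.

```python
from collections import deque
from typing import Sequence

Frame = Sequence[int]  # flattened grayscale pixels, 0-255 (illustrative representation)


def mean_abs_diff(a: Frame, b: Frame) -> float:
    """Average per-pixel absolute difference between two equally sized frames."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def dedup_frames(frames: Sequence[Frame], threshold: float = 8.0, window: int = 4) -> list[int]:
    """Return indices of frames that differ from each of the last `window` kept frames."""
    recent: deque = deque(maxlen=window)
    kept = []
    for i, frame in enumerate(frames):
        # Keep only if the frame is far enough from every recently kept frame.
        if all(mean_abs_diff(frame, prev) > threshold for prev in recent):
            kept.append(i)
            recent.append(frame)
    return kept


# Toy 4x4 "frames": A, a near-copy of A, B, then a cut back to A and B again.
A, A_NEAR, B = [10] * 16, [12] * 16, [200] * 16
sequence = [A, A_NEAR, B, A, B]

print(dedup_frames(sequence, threshold=8, window=4))  # window remembers both A and B
print(dedup_frames(sequence, threshold=8, window=1))  # window of 1 re-sends the A-B-A cutback
```

In this sketch, raising the threshold drops more near-duplicates, and widening the window is what stops an interview-style A-B-A cut from sending the same shot twice.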
Optional --keep-audio: saves audio.m4a for models that accept audio (Gemini, GPT-4o audio) when tone and music matter, since a transcript alone loses both.
Prompt pattern after crv
You are analyzing a video I preprocessed locally.
Read MANIFEST.txt for metadata, transcript.txt for speech, and frames/*.jpg in order.
Questions:
1. What is the main argument in the first 2 minutes?
2. Which frame shows the pricing table?
3. List every scene change topic in chronological order.
Drop the folder into Claude Projects, paste paths in ChatGPT, or attach images in Gemini: same artifacts, any frontier model.
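If you would rather send the folder through an API than a chat UI, the sketch below packs the same artifacts into a list of content blocks. The block shapes follow the Anthropic Messages API image format as commonly documented; the function name, the frame labels, and the `max_frames` limit are assumptions for illustration, so check the current API reference before relying on them.

```python
import base64
from pathlib import Path


def build_content(folder: str, question: str, max_frames: int = 20) -> list[dict]:
    """Pack MANIFEST.txt, transcript.txt, and frames/*.jpg into message content blocks."""
    root = Path(folder)
    blocks: list[dict] = []

    # Text artifacts first, so the model reads the metadata and speech before the images.
    for name in ("MANIFEST.txt", "transcript.txt"):
        path = root / name
        if path.exists():
            text = path.read_text(encoding="utf-8")
            blocks.append({"type": "text", "text": f"--- {name} ---\n{text}"})

    # Frames in filename order, each preceded by a label the model can cite.
    for frame in sorted((root / "frames").glob("*.jpg"))[:max_frames]:
        data = base64.standard_b64encode(frame.read_bytes()).decode("ascii")
        blocks.append({"type": "text", "text": f"Frame: {frame.name}"})
        blocks.append({
            "type": "image",
            "source": {"type": "base64", "media_type": "image/jpeg", "data": data},
        })

    blocks.append({"type": "text", "text": question})
    return blocks
```

The returned list would go in the `content` field of a single user message. Labeling each image with its filename lets the model answer "which frame shows the pricing table?" with a name you can open.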
If your goal is to produce a new video, not understand an existing one, use video-use:
ElevenLabs Scribe → word-level takes_packed.md
LLM reasons over text, emits ffmpeg EDL
Optional timeline_view composites for cut decisions
The LLM never watches; it reads and edits. Perfect for launch cuts and filler removal (Fable 5 launch pipeline); wrong tool for "summarize this documentary's visual motifs."
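To illustrate the "reads and edits" idea, here is a sketch that turns a list of keep-ranges into an ffmpeg command. The EDL format video-use actually emits is not described here, so the `(start, end)` tuples and function names are hypothetical. The `select`/`aselect` filters with `between(t, ...)` are a common ffmpeg idiom for keeping time ranges.

```python
def keep_expr(segments: list[tuple[float, float]]) -> str:
    """Build an ffmpeg expression that is true inside any (start, end) range, in seconds."""
    for start, end in segments:
        if start >= end:
            raise ValueError(f"segment start must precede end: ({start}, {end})")
    return "+".join(f"between(t,{start:.3f},{end:.3f})" for start, end in segments)


def build_cut_command(src: str, dst: str, segments: list[tuple[float, float]]) -> list[str]:
    """Return an argument list that keeps only the given ranges and closes the gaps."""
    expr = keep_expr(segments)
    return [
        "ffmpeg", "-i", src,
        # Keep matching video frames, then renumber timestamps so playback is continuous.
        "-vf", f"select='{expr}',setpts=N/FRAME_RATE/TB",
        # Do the same for audio samples so sound stays aligned with the picture.
        "-af", f"aselect='{expr}',asetpts=N/SR/TB",
        dst,
    ]


# Hypothetical decisions an LLM might return after reading a transcript.
print(build_cut_command("raw.mp4", "final.mp4", [(0, 4.5), (10, 12.25)]))
```

The list form is meant for `subprocess.run(cmd, check=True)` without a shell, which avoids a second layer of quoting around the filter strings.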
Pairing: Run crv for research Q&A; run video-use when you need final.mp4.
Solution 4: DIY ffmpeg + Whisper
Minimal version without crv:
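# Scene frames: keep a frame when ffmpeg's scene-change score exceeds 0.3.
# showinfo logs each kept frame's timestamp; newer ffmpeg builds prefer -fps_mode vfr over -vsync vfr.
mkdir -p frames
ffmpeg -i input.mp4 -vf "select='gt(scene,0.3)',showinfo" -vsync vfr frames/%04d.jpg
# Audio: transcribe speech to a plain-text file with the Whisper CLI
whisper input.mp4 --model medium --output_format txt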
You lose sliding-window dedup, manifest generation, and URL fetch, but keep air-gapped control. Reasonable for one-off internal compliance reviews.
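If you go the DIY route and still want a manifest for the model, a few lines of Python can approximate one. This is a sketch only: the manifest layout is invented for illustration and is not crv's MANIFEST.txt format, and the default transcript filename `input.txt` is an assumption about where your transcription step wrote its output.

```python
from pathlib import Path


def write_manifest(folder: str, source: str, transcript_name: str = "input.txt") -> Path:
    """Write a simple MANIFEST.txt listing the source, transcript, and frames in order."""
    root = Path(folder)
    frames = sorted(p.name for p in (root / "frames").glob("*.jpg"))
    has_transcript = (root / transcript_name).exists()

    lines = [
        f"source: {source}",
        f"transcript: {transcript_name if has_transcript else 'none'}",
        f"frame_count: {len(frames)}",
        "frames (in order):",
    ]
    lines += [f"  frames/{name}" for name in frames]

    out = root / "MANIFEST.txt"
    out.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return out
```

Calling `write_manifest(".", "input.mp4")` after the two commands above would give the model a single file that says what it is looking at and in what order.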
Solution 5: Video VLMs and structured extraction
For camera feeds and agent loops ("what changed since last frame?"), small video VLMs like Marlin 2B on NemoStation target structured outputs, not consumer YouTube Q&A.
Use when building products, not when a marketer wants a one-off summary of a webinar.