
fal / video-understanding

Ask a natural-language question about a video and get a detailed text answer — scene description, action recognition, on-screen text (OCR), and visual Q&A.

caption
0.03

Model Input

- `video_url`: URL of the video to analyze. Supported formats: mp4, mov, webm, m4v, gif.
- `prompt`: The question or prompt about the video content.
- `detailed_analysis`: Additional setting. Set it to `true` to request a more detailed analysis of the video; it defaults to `false`.
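
As a minimal sketch, the three inputs above can be checked and assembled locally before a request is sent. The field names `video_url`, `prompt`, and `detailed_analysis` are taken from the client example at the end of this page. The helper `buildInput` is hypothetical, and inferring the container from the URL's file extension is an assumption: the service may well accept URLs that carry no extension.

```javascript
// Container formats listed as supported on this page.
const SUPPORTED_EXTENSIONS = ["mp4", "mov", "webm", "m4v", "gif"];

// Hypothetical helper (not part of any client library): validates the
// arguments locally and returns the `input` object for a request.
function buildInput(videoUrl, prompt, detailedAnalysis = false) {
  const url = new URL(videoUrl); // throws TypeError on a malformed URL
  if (url.protocol !== "http:" && url.protocol !== "https:") {
    throw new Error(`Expected an http(s) URL, got ${url.protocol}`);
  }

  // Assumption: the container format can be read from the path's extension.
  const match = url.pathname.toLowerCase().match(/\.([a-z0-9]+)$/);
  const ext = match ? match[1] : "";
  if (!SUPPORTED_EXTENSIONS.includes(ext)) {
    throw new Error(`Unsupported or missing extension: "${ext}"`);
  }

  if (typeof prompt !== "string" || prompt.trim() === "") {
    throw new Error("prompt must be a non-empty string");
  }

  return {
    video_url: url.href,
    prompt: prompt.trim(),
    detailed_analysis: Boolean(detailedAnalysis),
  };
}
```

Failing early on an unsupported extension or an empty prompt avoids spending a request on input the model cannot use; whether the service performs the same checks is not stated here.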


Model Output

There is one lightning flash visible in the video.

Between strikes, the sky is a dark, moody mix of purples and blues, with faint hints of orange and pink on the horizon where the sun is setting or rising. The clouds are heavy and dark, suggesting an ongoing storm.

Generated in 7.207 seconds

Model Details

Video Understanding analyzes a video and answers a natural-language question about it. Pass a video URL (mp4, mov, webm, m4v, or gif) and a prompt, and it returns a text answer describing what is happening — scene and setting, actions and events over time, on-screen text (OCR), objects and their relationships, and open-ended visual question answering. Its strength is temporal understanding: it reasons about how the scene evolves across the clip, not just a single frame, so it can describe sequences, cause-and-effect, and things that only make sense in motion (a flash of lightning followed by crashing waves, a subject entering then leaving frame).

## Best for

- Describing what happens in a video in plain language for summaries, alt text, or accessibility
- Action and event recognition ("what is the person doing?", "does anyone fall?")
- Reading on-screen or environmental text from a clip (OCR over video)
- Open-ended visual Q&A about objects, counts, colors, and scene context
- Moderation and triage triggers where you need a text verdict about video content

## Choose another model when

- You want to generate or edit a video rather than describe one: use a text-to-video or image-to-video model
- Your input is a single still image, not a clip: use an image-to-text / visual-question-answering model
- You need word-level speech transcription with timestamps: use a speech-to-text model

## Tips

- Ask one focused question per call; a specific prompt ("What object is flying, and what is the landscape below it?") yields a sharper answer than "Describe this."
- Set `detailed_analysis` to `true` when you want a longer, more thorough breakdown of the scene; leave it `false` (the default) for a concise answer.
- Keep clips short and in a supported container (mp4, mov, webm, m4v, gif); the video must be reachable at a public URL.
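
One way to follow the first tip when several things need answering about the same clip is to prepare one request input per question rather than a single compound prompt. The sketch below only builds the inputs and sends nothing; `focusedInputs` is a hypothetical helper, and the URL is a placeholder.

```javascript
// Hypothetical helper: returns one `input` object per focused question,
// all pointing at the same clip. Blank questions are dropped.
function focusedInputs(videoUrl, questions, detailedAnalysis = false) {
  return questions
    .map((question) => question.trim())
    .filter((question) => question.length > 0)
    .map((prompt) => ({
      video_url: videoUrl,
      prompt,
      detailed_analysis: detailedAnalysis,
    }));
}

// Placeholder URL; substitute a publicly reachable video.
const inputs = focusedInputs("https://example.com/storm.mp4", [
  "How many lightning flashes are visible?",
  "What does the sky look like between strikes?",
]);
```

Each element of `inputs` can then be passed as the `input` of its own call, using the client call shown at the end of this page.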

## Limitations

- Very long videos may be summarized coarsely; the answer favors salient events over exhaustive frame-by-frame detail.
- It returns text only: no timestamps, bounding boxes, or per-frame labels.

To run via the ModelRunner JavaScript client:

```js
import { modelrunner } from "@modelrunner/client";

// Submit the request and wait for the text answer.
const result = await modelrunner.subscribe("video-understanding", {
  input: {
    video_url: "https://media.modelrunner.ai/JMtPPgalIlURXVMHfWlAE.mp4",
    prompt: "Describe what is happening in this video in detail.",
    detailed_analysis: false,
  },
});
```