fal / video-understanding

Ask a natural-language question about a video and get a detailed text answer — scene description, action recognition, on-screen text (OCR), and visual Q&A.

caption
0.03

Model Input

`video_url`: URL of the video to analyze. Supported formats: mp4, mov, webm, m4v, gif.

`prompt`: The question or prompt about the video content.

Additional Settings

`detailed_analysis`: Request a more detailed analysis of the video. Defaults to `false`.

Model Output

The kite is **red**.

The landscape with golden, mature crops (likely wheat, barley, or similar grains) suggests that it is **late spring or summer**, nearing harvest time.

Generated in 6.687 seconds

Model Details

Video Understanding analyzes a video and answers a natural-language question about it. Pass a video URL (mp4, mov, webm, m4v, or gif) and a prompt, and it returns a text answer describing what is happening — scene and setting, actions and events over time, on-screen text (OCR), objects and their relationships, and open-ended visual question answering. Its strength is temporal understanding: it reasons about how the scene evolves across the clip, not just a single frame, so it can describe sequences, cause-and-effect, and things that only make sense in motion (a flash of lightning followed by crashing waves, a subject entering then leaving frame).

## Best for

- Describing what happens in a video in plain language for summaries, alt text, or accessibility
- Action and event recognition ("what is the person doing?", "does anyone fall?")
- Reading on-screen or environmental text from a clip (OCR over video)
- Open-ended visual Q&A about objects, counts, colors, and scene context
- Moderation and triage triggers where you need a text verdict about video content

## Choose another model when

- You want to generate or edit a video rather than describe one: use a text-to-video or image-to-video model
- Your input is a single still image, not a clip: use an image-to-text / visual-question-answering model
- You need word-level speech transcription with timestamps: use a speech-to-text model

## Tips

- Ask one focused question per call; a specific prompt ("What object is flying, and what is the landscape below it?") yields a sharper answer than "Describe this."
- Set `detailed_analysis` to `true` when you want a longer, more thorough breakdown of the scene; leave it `false` (the default) for a concise answer.
- Keep clips short and in a supported container (mp4, mov, webm, m4v, gif); the video must be reachable at a public URL.
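The tips above could be folded into a small pre-flight check before a request is sent. The following is a minimal sketch, not part of any official client: `buildInput` and `SUPPORTED_EXTENSIONS` are hypothetical names, the field names `video_url`, `prompt`, and `detailed_analysis` are taken from the client example on this page, and checking the file extension is only a heuristic that the service itself may not apply (a valid video could be served from a URL without an extension).

```js
// Hypothetical helper: validates the inputs locally and applies the documented default.
const SUPPORTED_EXTENSIONS = ["mp4", "mov", "webm", "m4v", "gif"];

function buildInput(videoUrl, prompt, detailed = false) {
  // The URL constructor throws a TypeError on malformed input.
  const url = new URL(videoUrl);
  if (url.protocol !== "http:" && url.protocol !== "https:") {
    throw new Error(`Unsupported URL scheme: ${url.protocol}`);
  }

  // Heuristic: look at the extension of the path only, ignoring any query string.
  const match = url.pathname.toLowerCase().match(/\.([a-z0-9]+)$/);
  const ext = match ? match[1] : "";
  if (!SUPPORTED_EXTENSIONS.includes(ext)) {
    throw new Error(`Unsupported or missing video extension: "${ext}"`);
  }

  const question = String(prompt).trim();
  if (question.length === 0) {
    throw new Error("Prompt must not be empty");
  }

  return {
    video_url: videoUrl,
    prompt: question,
    // false (the default) asks for a concise answer; true asks for a longer breakdown.
    detailed_analysis: Boolean(detailed),
  };
}
```

The returned object is shaped to be passed as the `input` of a request. Failing fast on an unsupported container or an empty prompt avoids spending a call on a request that is unlikely to succeed.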

## Limitations

- Very long videos may be summarized coarsely; the answer favors salient events over exhaustive frame-by-frame detail.
- It returns text only: no timestamps, bounding boxes, or per-frame labels.

To run via the ModelRunner JavaScript client:

```js
import { modelrunner } from "@modelrunner/client";

const result = await modelrunner.subscribe("video-understanding", {
  input: {
    // Publicly reachable video URL (mp4, mov, webm, m4v, or gif).
    video_url: "https://media.modelrunner.ai/JMtPPgalIlURXVMHfWlAE.mp4",
    prompt: "Describe what is happening in this video in detail.",
    // false (the default) gives a concise answer; true requests a longer breakdown.
    detailed_analysis: false,
  },
});
```
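Because the tips recommend one focused question per call, several questions about the same clip could be prepared as separate request payloads. This is a rough sketch under that assumption: `buildRequests` is a hypothetical helper, the two questions are illustrative, and the payload shape simply mirrors the `input` object in the client example above.

```js
// Hypothetical helper: builds one request payload per question for the same clip.
function buildRequests(videoUrl, questions, detailed = false) {
  return questions
    .map((q) => q.trim())
    .filter((q) => q.length > 0) // skip blank prompts
    .map((prompt) => ({
      input: { video_url: videoUrl, prompt, detailed_analysis: detailed },
    }));
}

// Two focused questions instead of a single broad "Describe this."
const requests = buildRequests(
  "https://media.modelrunner.ai/JMtPPgalIlURXVMHfWlAE.mp4",
  ["What color is the kite?", "What season does the landscape suggest?"]
);
```

Each element of `requests` could then be sent as its own call, for example as the second argument to `modelrunner.subscribe("video-understanding", ...)`, so that every answer addresses exactly one question.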