fal / video-understanding

Ask a natural-language question about a video and get a detailed text answer — scene description, action recognition, on-screen text (OCR), and visual Q&A.

caption
0.03

Model Input

`video_url`: URL of the video to analyze. Supported formats: mp4, mov, webm, m4v, gif.

`prompt`: The question or prompt about the video content.

Additional Settings

`detailed_analysis`: Request a more detailed analysis of the video. Defaults to `false`.

Model Output

This video captures a dramatic and awe-inspiring scene of a coastal landscape during what appears to be a storm or a powerful weather event.

**Setting:** The primary setting is a rugged coastline dominated by imposing, dark cliffs that rise steeply from the churning sea. The cliffs are dark and appear to be made of rock, with some waterfalls or cascades flowing down their faces into the ocean, adding to the dynamic nature of the scene. The foreground features large, dark rocks partially submerged in the foamy water, suggesting a rather dangerous and unapproachable shoreline. The ocean itself is a powerful force, with large waves crashing against the base of the cliffs and rocks, creating a significant amount of white foam and spray.

**Weather:** The weather is evidently tumultuous and severe. The sky is heavily overcast with dark, brooding clouds that suggest a storm is either ongoing or approaching. The color palette of the sky ranges from deep indigo and gray in the upper parts to a subtle, pale purple and orange hue near the horizon, indicating either dawn, dusk, or the brief breaking through of light during a stormy period. The most striking weather phenomenon is the vivid lightning that flashes multiple times in the distance over the ocean, illuminating the obscured horizon with bright, electric purple-white streaks. This lightning is accompanied by intense thunder, which is heard as a rumbling sound, further emphasizing the tempestuous conditions. The crashing waves and sea spray also point to strong winds and powerful ocean currents.

**Mood:** The overall mood of the video is one of raw power, majesty, and a touch of foreboding. The dark cliffs, stormy sea, and dramatic lightning create a sense of the immense and often terrifying force of nature. There's an atmosphere of wildness and untamed beauty. While the scene is visually stunning, it also evokes feelings of danger and isolation due to its remote and hostile appearance. The sound of the rolling thunder further amplifies the sense of drama and the sheer scale of the natural event unfolding. It's a scene that commands respect and illustrates the overwhelming power of the elements.

Generated in 9.977 seconds
Model Details

Video Understanding analyzes a video and answers a natural-language question about it. Pass a video URL (mp4, mov, webm, m4v, or gif) and a prompt, and it returns a text answer describing what is happening — scene and setting, actions and events over time, on-screen text (OCR), objects and their relationships, and open-ended visual question answering. Its strength is temporal understanding: it reasons about how the scene evolves across the clip, not just a single frame, so it can describe sequences, cause-and-effect, and things that only make sense in motion (a flash of lightning followed by crashing waves, a subject entering then leaving frame).

## Best for

- Describing what happens in a video in plain language for summaries, alt text, or accessibility
- Action and event recognition ("what is the person doing?", "does anyone fall?")
- Reading on-screen or environmental text from a clip (OCR over video)
- Open-ended visual Q&A about objects, counts, colors, and scene context
- Moderation and triage triggers where you need a text verdict about video content

## Choose another model when

- You want to generate or edit a video rather than describe one: use a text-to-video or image-to-video model
- Your input is a single still image, not a clip: use an image-to-text / visual-question-answering model
- You need word-level speech transcription with timestamps: use a speech-to-text model

## Tips

- Ask one focused question per call; a specific prompt ("What object is flying, and what is the landscape below it?") yields a sharper answer than "Describe this."
- Set `detailed_analysis` to `true` when you want a longer, more thorough breakdown of the scene; leave it `false` (the default) for a concise answer.
- Keep clips short and in a supported container (mp4, mov, webm, m4v, gif); the video must be reachable at a public URL.
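
The tips above can be checked before a request is sent. The following is a minimal sketch, not part of any client library: `buildVideoInput` and `SUPPORTED_EXTENSIONS` are hypothetical names, and the extension check assumes the container can be read from the URL path, which the service itself may not require. The field names `video_url`, `prompt`, and `detailed_analysis` are the ones used on this page.

```js
// Containers listed as supported on this page.
const SUPPORTED_EXTENSIONS = ["mp4", "mov", "webm", "m4v", "gif"];

// Hypothetical helper: builds the `input` object for a request and rejects
// URLs whose path does not end in a supported extension.
function buildVideoInput(videoUrl, prompt, detailedAnalysis = false) {
  const { pathname } = new URL(videoUrl); // throws a TypeError on a malformed URL
  const dot = pathname.lastIndexOf(".");
  const extension = dot === -1 ? "" : pathname.slice(dot + 1).toLowerCase();
  if (!SUPPORTED_EXTENSIONS.includes(extension)) {
    throw new Error(`Unsupported or missing video extension: "${extension}"`);
  }
  if (typeof prompt !== "string" || prompt.trim() === "") {
    throw new Error("A non-empty prompt is required");
  }
  return {
    video_url: videoUrl,
    prompt: prompt.trim(),
    detailed_analysis: detailedAnalysis, // false gives the concise answer
  };
}
```

The returned object can be passed as the `input` of a request. The check is a local convenience only; it cannot confirm that the URL is publicly reachable.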

## Limitations

- Very long videos may be summarized coarsely; the answer favors salient events over exhaustive frame-by-frame detail.
- It returns text only: no timestamps, bounding boxes, or per-frame labels.
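
Because the answer is free text, a moderation or triage pipeline has to derive its own verdict from it. One possible approach, sketched below under stated assumptions, is to ask a yes/no question and inspect the first word of the reply. `parseVerdict` and `triagePrompt` are hypothetical, and nothing on this page guarantees the model will follow the requested answer format, so unparseable replies are returned as `null` rather than guessed.

```js
// Hypothetical prompt wording; the model is not guaranteed to comply with it.
const triagePrompt =
  'Does anyone fall in this video? Begin your answer with "Yes" or "No".';

// Hypothetical helper: maps a free-text answer to true, false, or null.
function parseVerdict(answerText) {
  // Drop leading whitespace and markdown emphasis markers such as "**".
  const text = answerText.replace(/^[\s*_]+/, "").toLowerCase();
  const match = text.match(/^(yes|no)\b/);
  if (!match) return null; // ambiguous: re-ask or route to a human reviewer
  return match[1] === "yes";
}
```

Treating `null` as "needs review" keeps an off-format answer from being silently counted as a pass or a fail.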

To run via the ModelRunner JavaScript client:

```js
import { modelrunner } from "@modelrunner/client";

const result = await modelrunner.subscribe("video-understanding", {
  input: {
    video_url: "https://media.modelrunner.ai/JMtPPgalIlURXVMHfWlAE.mp4",
    prompt: "Describe what is happening in this video in detail.",
    detailed_analysis: false,
  },
});
```
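
To follow the "one focused question per call" tip for several questions about the same clip, the request above could be wrapped in a small loop. This is a sketch only: `askEach` is a hypothetical helper, and the `subscribe` function is passed in as a parameter so the code can be exercised without network access. In real use it would presumably be `(id, options) => modelrunner.subscribe(id, options)`. The shape of the returned result is not described on this page, so it is kept as-is.

```js
// Hypothetical helper: asks each prompt about the same video, one call at a time.
async function askEach(subscribe, videoUrl, prompts) {
  const answers = [];
  for (const prompt of prompts) {
    // One focused question per call, issued sequentially.
    const result = await subscribe("video-understanding", {
      input: { video_url: videoUrl, prompt, detailed_analysis: false },
    });
    // The result shape is an unknown here, so store the raw value.
    answers.push({ prompt, result });
  }
  return answers;
}
```

Sequential calls keep the sketch simple; whether parallel requests are allowed or rate-limited is not stated here.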