fal / video-understanding

Ask a natural-language question about a video and get a detailed text answer — scene description, action recognition, on-screen text (OCR), and visual Q&A.

caption

0.03

Output

The kite is **red**.

The landscape with golden, mature crops (likely wheat, barley, or similar grains) suggests that it is **late spring or summer**, nearing harvest time.

{
  "error": "",
  "inferenceTime": 6687,
  "output": "The kite is **red**.\n\nThe landscape with golden, mature crops (likely wheat, barley, or similar grains) suggests that it is **late spring or summer**, nearing harvest time.",
  "input": {
    "prompt": "What color is the kite, and what season does the landscape suggest based on the crops visible?",
    "video_url": "https://media.modelrunner.ai/BITZDql1F9DznEOX9UPPa.mp4",
    "detailed_analysis": false
  },
  "logs": "Generated 1 output(s)"
}

Generated in 6.687 seconds
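
A response like the one above can be unpacked with a small helper. This is a minimal sketch, assuming the response always carries the `error`, `output`, and `inferenceTime` fields shown in the sample, that an empty `error` string means success, and that `inferenceTime` is in milliseconds (which matches 6687 against the reported 6.687 seconds). The `summarizeResponse` name is hypothetical and not part of any client library.

```js
// Hypothetical helper: not part of the ModelRunner client.
// Assumes the response shape shown in the sample above.
function summarizeResponse(response) {
  // Assumption: a non-empty `error` string signals a failed request.
  if (typeof response.error === "string" && response.error.length > 0) {
    throw new Error(`video-understanding failed: ${response.error}`);
  }
  return {
    answer: response.output,
    // Assumption: `inferenceTime` is reported in milliseconds.
    seconds: response.inferenceTime / 1000,
  };
}

// Trimmed copy of the sample response, used only for illustration.
const sample = {
  error: "",
  inferenceTime: 6687,
  output: "The kite is **red**.",
  logs: "Generated 1 output(s)",
};

const summary = summarizeResponse(sample);
console.log(`${summary.answer} (generated in ${summary.seconds} seconds)`);
```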

Examples

This video captures a dramatic and awe-inspiring scene of a coastal landscape during what appears to be a storm or a powerful weather event.

**Setting:** The primary setting is a rugged coastline dominated by imposing, dark cliffs that rise steeply from the churning sea. The cliffs are dark and appear to be made of rock, with some waterfalls or cascades flowing down their faces into the ocean, adding to the dynamic nature of the scene. The foreground features large, dark rocks partially submerged in the foamy water, suggesting a rather dangerous and unapproachable shoreline. The ocean itself is a powerful force, with large waves crashing against the base of the cliffs and rocks, creating a significant amount of white foam and spray.

**Weather:** The weather is evidently tumultuous and severe. The sky is heavily overcast with dark, brooding clouds that suggest a storm is either ongoing or approaching. The color palette of the sky ranges from deep indigo and gray in the upper parts to a subtle, pale purple and orange hue near the horizon, indicating either dawn, dusk, or the brief breaking through of light during a stormy period. The most striking weather phenomenon is the vivid lightning that flashes multiple times in the distance over the ocean, illuminating the obscured horizon with bright, electric purple-white streaks. This lightning is accompanied by intense thunder, which is heard as a rumbling sound, further emphasizing the tempestuous conditions. The crashing waves and sea spray also point to strong winds and powerful ocean currents.

**Mood:** The overall mood of the video is one of raw power, majesty, and a touch of foreboding. The dark cliffs, stormy sea, and dramatic lightning create a sense of the immense and often terrifying force of nature. There's an atmosphere of wildness and untamed beauty. While the scene is visually stunning, it also evokes feelings of danger and isolation due to its remote and hostile appearance. The sound of the rolling thunder further amplifies the sense of drama and the sheer scale of the natural event unfolding. It's a scene that commands respect and illustrates the overwhelming power of the elements.

This video segment, lasting approximately 4 seconds, showcases a beautiful panoramic view of a rural landscape with a distinctive object flying in the sky.

**Analysis:**

* **Main Activities and Actions:** The primary action depicted is the stable flight of a kite against the backdrop of a vast agricultural landscape. The video captures the kite's gentle swaying in what appears to be a light breeze.
* **People and Their Interactions:** No people are visible in the video frame, nor is there any direct interaction with the environment by human subjects. The implication is that someone is flying the kite, but they are off-screen.
* **Objects and Environment:**
  * **Object in the Sky:** The object flying in the sky is a vibrant red kite. It has a distinctive, somewhat curved or "bowed" shape, suggesting it might be a delta kite or a parafoil kite designed to catch the wind efficiently. It is tethered by a visible white string, which extends downwards out of the frame, indicating it's being flown from the ground.
  * **Landscape Below:** The landscape is characterized by rolling hills covered in what appears to be golden, ripe wheat or another cereal crop, indicating late spring or summer. The fields stretch into the distance, showing variations in terrain and light. There are visible tracks or marks within the fields, likely from agricultural machinery. Scattered trees and small clusters of foliage mark the edges of some fields and rise on distant hills, suggesting a mix of cultivated land and natural growth. The overall impression is one of an expansive, rural setting.
  * **Sky:** The sky is a clear, light blue with scattered white, wispy clouds, characteristic of a pleasant day.
* **Temporal Sequence of Events:** The video is very short, essentially a static shot with very subtle movement. The kite gently bobs and shifts slightly due to air currents, but there are no significant changes or events occurring within this brief sequence. The lighting suggests either early morning or late afternoon/early evening, given the golden hues highlighting the crops and the indirect, soft quality of the light.
* **Visual Style and Composition:**
  * **Perspective:** The camera appears to be at an elevated position, possibly flown on a drone or mounted on a high point, offering a broad, panoramic view of the landscape.
  * **Framing:** The red kite is prominently featured in the upper-mid section of the frame, acting as a clear focal point. The expansive landscape fills the rest of the frame, creating a sense of depth and scale.
  * **Color Palette:** The dominant colors are the golden yellow of the fields, the bright red of the kite, and the blue and white of the sky. The warm, natural tones contribute to a calm and serene aesthetic.
  * **Lighting:** The lighting is soft and warm, casting gentle shadows and highlighting the texture of the fields. The sun appears to be low on the horizon, creating a warm, golden hour effect.
* **Any Text or Audio Cues Visible:** There are no visible text overlays or audio cues within this video segment.
* **Overall Mood and Atmosphere:** The mood is peaceful, serene, and idyllic. The combination of the beautiful natural landscape, the clear sky, and the gentle flight of the kite evokes a sense of tranquility, freedom, and nostalgia. It suggests a calm, unhurried moment in nature.

There is one lightning flash visible in the video. Between strikes, the sky is a dark, moody mix of purples and blues, with faint hints of orange and pink on the horizon where the sun is setting or rising. The clouds are heavy and dark, suggesting an ongoing storm.

The kite is **red**. The landscape with golden, mature crops (likely wheat, barley, or similar grains) suggests that it is **late spring or summer**, nearing harvest time.

Model Details

Video Understanding analyzes a video and answers a natural-language question about it. Pass a video URL (mp4, mov, webm, m4v, or gif) and a prompt, and it returns a text answer describing what is happening — scene and setting, actions and events over time, on-screen text (OCR), objects and their relationships, and open-ended visual question answering. Its strength is temporal understanding: it reasons about how the scene evolves across the clip, not just a single frame, so it can describe sequences, cause-and-effect, and things that only make sense in motion (a flash of lightning followed by crashing waves, a subject entering then leaving frame).

## Best for

- Describing what happens in a video in plain language for summaries, alt text, or accessibility
- Action and event recognition ("what is the person doing?", "does anyone fall?")
- Reading on-screen or environmental text from a clip (OCR over video)
- Open-ended visual Q&A about objects, counts, colors, and scene context
- Moderation and triage triggers where you need a text verdict about video content
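
For the moderation and triage use case, the free-text answer usually has to be reduced to a flag. The following is a minimal sketch, assuming the model follows a format instruction appended to the prompt, which is not guaranteed; answers that do not match are returned as `null` so they can be routed to manual review. `buildVerdictPrompt` and `parseVerdict` are hypothetical helpers, not part of the model or any client library.

```js
// Hypothetical helpers for turning a free-text answer into a yes/no flag.
function buildVerdictPrompt(question) {
  // Assumption: the model honors this formatting request.
  return `${question} Answer with YES or NO on the first line, then explain briefly.`;
}

function parseVerdict(answer) {
  // Drop Markdown emphasis markers; the sample outputs use **bold**.
  const firstLine = answer.trim().split("\n")[0].replace(/[*_]/g, "");
  const match = firstLine.match(/^\s*(yes|no)\b/i);
  if (!match) return null; // format not followed: treat as undecided
  return match[1].toLowerCase() === "yes";
}

console.log(buildVerdictPrompt("Does anyone fall in this video?"));
console.log(parseVerdict("**YES**\nA person slips near the end.")); // true
console.log(parseVerdict("The video shows a kite over a field.")); // null
```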

## Choose another model when

- You want to generate or edit a video rather than describe one: use a text-to-video or image-to-video model
- Your input is a single still image, not a clip: use an image-to-text / visual-question-answering model
- You need word-level speech transcription with timestamps: use a speech-to-text model

## Tips

- Ask one focused question per call; a specific prompt ("What object is flying, and what is the landscape below it?") yields a sharper answer than "Describe this."
- Set `detailed_analysis` to `true` when you want a longer, more thorough breakdown of the scene; leave it `false` (the default) for a concise answer.
- Keep clips short and in a supported container (mp4, mov, webm, m4v, gif); the video must be reachable at a public URL.
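
The tips above can be checked on the client before a request is sent. This is a minimal sketch, assuming the container can be inferred from the file extension in the URL path (the service may well accept URLs without an extension, in which case this check is too strict). `buildInput` and `SUPPORTED_EXTENSIONS` are hypothetical names; only the `video_url`, `prompt`, and `detailed_analysis` fields come from the documented input.

```js
// Containers listed in the tips above.
const SUPPORTED_EXTENSIONS = ["mp4", "mov", "webm", "m4v", "gif"];

// Hypothetical helper: validates arguments and returns the input object.
function buildInput({ videoUrl, prompt, detailedAnalysis = false }) {
  const { protocol, pathname } = new URL(videoUrl); // throws on malformed URLs
  if (protocol !== "http:" && protocol !== "https:") {
    throw new Error("video_url must be an http(s) URL");
  }
  // Assumption: the container is visible as the path's file extension.
  const extension = pathname.split(".").pop().toLowerCase();
  if (!SUPPORTED_EXTENSIONS.includes(extension)) {
    throw new Error(`unsupported container: ${extension}`);
  }
  if (typeof prompt !== "string" || prompt.trim() === "") {
    throw new Error("prompt must be a non-empty question");
  }
  return {
    video_url: videoUrl,
    prompt: prompt.trim(),
    detailed_analysis: detailedAnalysis, // false gives the concise answer
  };
}

const input = buildInput({
  videoUrl: "https://media.modelrunner.ai/BITZDql1F9DznEOX9UPPa.mp4",
  prompt: "What object is flying, and what is the landscape below it?",
});
console.log(input);
```

Note that this check cannot tell whether the URL is publicly reachable; that still has to be verified separately.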

## Limitations

- Very long videos may be summarized coarsely; the answer favors salient events over exhaustive frame-by-frame detail.
- It returns text only: no timestamps, bounding boxes, or per-frame labels.

To run via the ModelRunner JavaScript client:

```js
import { modelrunner } from "@modelrunner/client";

const result = await modelrunner.subscribe("video-understanding", {
  input: {
    video_url: "https://media.modelrunner.ai/JMtPPgalIlURXVMHfWlAE.mp4",
    prompt: "Describe what is happening in this video in detail.",
    detailed_analysis: false,
  },
});
```
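
The sample answers contain Markdown emphasis (for example `**red**`) and blank lines, which is awkward for alt text. Below is a minimal sketch of a post-processing step, assuming the answer text is already available as a string; `toPlainText` is a hypothetical helper, and it handles only bold markers and line breaks, not bullet lists.

```js
// Hypothetical helper: flattens a Markdown-flavored answer into one line.
function toPlainText(markdown) {
  return markdown
    .replace(/\*\*(.+?)\*\*/g, "$1") // **bold** -> bold
    .replace(/\s*\n+\s*/g, " ") // collapse line breaks into single spaces
    .trim();
}

// Answer text copied from the sample response.
const answer =
  "The kite is **red**.\n\nThe landscape with golden, mature crops (likely wheat, barley, or similar grains) suggests that it is **late spring or summer**, nearing harvest time.";

console.log(toPlainText(answer));
```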
