fal / video-understanding

Ask a natural-language question about a video and get a detailed text answer — scene description, action recognition, on-screen text (OCR), and visual Q&A.

caption
0.03

Model Input

`video_url`: URL of the video to analyze. Supported formats: mp4, mov, webm, m4v, gif.

`prompt`: The question or prompt about the video content.

Additional Settings

`detailed_analysis`: Request a more detailed analysis of the video. Defaults to `false`.


Model Output

This video segment, lasting approximately 4 seconds, showcases a beautiful panoramic view of a rural landscape with a distinctive object flying in the sky.

**Analysis:**

* **Main Activities and Actions:** The primary action depicted is the stable flight of a kite against the backdrop of a vast agricultural landscape. The video captures the kite's gentle swaying in what appears to be a light breeze.

* **People and Their Interactions:** No people are visible in the video frame, nor is there any direct interaction with the environment by human subjects. The implication is that someone is flying the kite, but they are off-screen.

* **Objects and Environment:**
    * **Object in the Sky:** The object flying in the sky is a vibrant red kite. It has a distinctive, somewhat curved or "bowed" shape, suggesting it might be a delta kite or a parafoil kite designed to catch the wind efficiently. It is tethered by a visible white string, which extends downwards out of the frame, indicating it's being flown from the ground.
    * **Landscape Below:** The landscape is characterized by rolling hills covered in what appears to be golden, ripe wheat or another cereal crop, indicating late spring or summer. The fields stretch into the distance, showing variations in terrain and light. There are visible tracks or marks within the fields, likely from agricultural machinery. Scattered trees and small clusters of foliage mark the edges of some fields and rise on distant hills, suggesting a mix of cultivated land and natural growth. The overall impression is one of an expansive, rural setting.
    * **Sky:** The sky is a clear, light blue with scattered white, wispy clouds, characteristic of a pleasant day.

* **Temporal Sequence of Events:** The video is very short, essentially a static shot with very subtle movement. The kite gently bobs and shifts slightly due to air currents, but there are no significant changes or events occurring within this brief sequence. The lighting suggests either early morning or late afternoon/early evening, given the golden hues highlighting the crops and the indirect, soft quality of the light.

* **Visual Style and Composition:**
    * **Perspective:** The camera appears to be at an elevated position, possibly flown on a drone or mounted on a high point, offering a broad, panoramic view of the landscape.
    * **Framing:** The red kite is prominently featured in the upper-mid section of the frame, acting as a clear focal point. The expansive landscape fills the rest of the frame, creating a sense of depth and scale.
    * **Color Palette:** The dominant colors are the golden yellow of the fields, the bright red of the kite, and the blue and white of the sky. The warm, natural tones contribute to a calm and serene aesthetic.
    * **Lighting:** The lighting is soft and warm, casting gentle shadows and highlighting the texture of the fields. The sun appears to be low on the horizon, creating a warm, golden hour effect.

* **Any Text or Audio Cues Visible:** There are no visible text overlays or audio cues within this video segment.

* **Overall Mood and Atmosphere:** The mood is peaceful, serene, and idyllic. The combination of the beautiful natural landscape, the clear sky, and the gentle flight of the kite evokes a sense of tranquility, freedom, and nostalgia. It suggests a calm, unhurried moment in nature.

Generated in 15.693 seconds

Model Details

Video Understanding analyzes a video and answers a natural-language question about it. Pass a video URL (mp4, mov, webm, m4v, or gif) and a prompt, and it returns a text answer describing what is happening — scene and setting, actions and events over time, on-screen text (OCR), objects and their relationships, and open-ended visual question answering. Its strength is temporal understanding: it reasons about how the scene evolves across the clip, not just a single frame, so it can describe sequences, cause-and-effect, and things that only make sense in motion (a flash of lightning followed by crashing waves, a subject entering then leaving frame).
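The inputs described above can be checked before a request is sent. The following is a minimal sketch, not part of any official client: `buildInput` and `SUPPORTED_EXTENSIONS` are hypothetical names, and judging the format by the file extension in the URL path is an assumption (the service may accept URLs that carry no extension). The field names `video_url`, `prompt`, and `detailed_analysis` are the ones used elsewhere on this page.

```js
// Formats listed on this page. Checking by URL extension is an assumption.
const SUPPORTED_EXTENSIONS = ["mp4", "mov", "webm", "m4v", "gif"];

// Hypothetical helper: builds the `input` object and rejects obvious
// mistakes locally, before any network call is made.
function buildInput(videoUrl, prompt, detailedAnalysis = false) {
  const url = new URL(videoUrl); // throws TypeError on a malformed URL
  const ext = url.pathname.split(".").pop().toLowerCase();
  if (!SUPPORTED_EXTENSIONS.includes(ext)) {
    throw new Error(`Unsupported video format: ${ext}`);
  }
  if (typeof prompt !== "string" || prompt.trim() === "") {
    throw new Error("prompt must be a non-empty string");
  }
  return {
    video_url: videoUrl,
    prompt: prompt.trim(),
    detailed_analysis: detailedAnalysis,
  };
}
```

The returned object has the same shape as the `input` field in the client example on this page, so it could be passed there directly.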

## Best for

- Describing what happens in a video in plain language for summaries, alt text, or accessibility
- Action and event recognition ("what is the person doing?", "does anyone fall?")
- Reading on-screen or environmental text from a clip (OCR over video)
- Open-ended visual Q&A about objects, counts, colors, and scene context
- Moderation and triage triggers where you need a text verdict about video content

## Choose another model when

- You want to generate or edit a video rather than describe one: use a text-to-video or image-to-video model
- Your input is a single still image, not a clip: use an image-to-text / visual-question-answering model
- You need word-level speech transcription with timestamps: use a speech-to-text model

## Tips

- Ask one focused question per call; a specific prompt ("What object is flying, and what is the landscape below it?") yields a sharper answer than "Describe this."
- Set `detailed_analysis` to `true` when you want a longer, more thorough breakdown of the scene; leave it `false` (the default) for a concise answer.
- Keep clips short and in a supported container (mp4, mov, webm, m4v, gif); the video must be reachable at a public URL.
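The "one focused question per call" tip can be sketched as a small helper that turns a list of questions into one request payload each. `buildFocusedRequests` is a hypothetical name used only for illustration; the payload shape mirrors the `input` object in the client example on this page.

```js
// Hypothetical helper: one request payload per non-empty question,
// so each call carries a single focused prompt.
function buildFocusedRequests(videoUrl, questions, detailedAnalysis = false) {
  return questions
    .map((question) => question.trim())
    .filter((question) => question.length > 0) // drop blank entries
    .map((prompt) => ({
      input: {
        video_url: videoUrl,
        prompt,
        detailed_analysis: detailedAnalysis,
      },
    }));
}
```

Each element of the returned array would then be sent as its own call, rather than combining the questions into a single broad prompt.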

## Limitations

- Very long videos may be summarized coarsely; the answer favors salient events over exhaustive frame-by-frame detail.
- It returns text only: no timestamps, bounding boxes, or per-frame labels.
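Because the answer is plain text, any structure has to be recovered by the caller. The sketch below assumes the answer uses top-level bullets of the form `* **Heading:** text`, as in the sample output on this page; that layout is not guaranteed, and `parseSections` is a hypothetical helper. Bullets that are indented or span several lines are not handled.

```js
// Hypothetical helper: collects top-level "* **Heading:** text" bullets
// from an answer string into a { heading: text } object.
function parseSections(answer) {
  const sections = {};
  for (const line of answer.split("\n")) {
    // Matches only lines that start with "* **Heading:**"
    const match = line.match(/^\*\s+\*\*(.+?):\*\*\s*(.*)$/);
    if (match) {
      sections[match[1]] = match[2];
    }
  }
  return sections;
}
```

If the answer does not follow this layout, the function returns an empty object, so callers should fall back to the raw text.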

To run via the ModelRunner JavaScript client:

```js
import { modelrunner } from "@modelrunner/client";

const result = await modelrunner.subscribe("video-understanding", {
  input: {
    video_url: "https://media.modelrunner.ai/JMtPPgalIlURXVMHfWlAE.mp4",
    prompt: "Describe what is happening in this video in detail.",
    detailed_analysis: false,
  },
});
```