microsoft / florence-2-large/caption

Generate a concise one-sentence caption describing any photo — no prompt needed.

Input

URL of the image to caption.

Output

A dog running on the beach at sunset.

Model Details

Florence-2 Large Caption looks at a photo and returns a single concise sentence describing what it shows. Give it one image URL and you get back a short, natural caption of the whole scene — what is in it and roughly what is happening — with no prompt, no question, and nothing to tune. It is zero-shot: one image in, one caption string out. Built on Microsoft's Florence-2 vision-language foundation model, it is a fast, deterministic way to turn an image into searchable, human-readable text for alt-text, indexing, or a quick scene summary.

## Best for

- Generating alt-text or accessibility captions for photos
- Auto-describing images for search, tagging, or content indexing
- Getting a quick one-line summary of what a picture shows
- Bulk-captioning a dataset of images where a short, consistent description per image is enough
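For the alt-text use case, a returned caption usually needs light cleanup and HTML escaping before it goes into markup. The following is a minimal sketch, not part of the model or any client library: `escapeHtmlAttribute` and `captionToImgTag` are hypothetical helper names, and the caption is assumed to arrive as a plain string.

```js
// Hypothetical helper: escape characters that are unsafe inside an HTML attribute value.
function escapeHtmlAttribute(text) {
  return text
    .replace(/&/g, "&amp;")
    .replace(/"/g, "&quot;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;");
}

// Hypothetical helper: build an <img> tag whose alt text is a generated caption.
function captionToImgTag(imageUrl, caption) {
  // Trim the caption and collapse runs of whitespace into single spaces.
  const cleaned = caption.trim().replace(/\s+/g, " ");
  return `<img src="${escapeHtmlAttribute(imageUrl)}" alt="${escapeHtmlAttribute(cleaned)}">`;
}
```

Because the caption is machine-generated, it may be worth having a person review alt text on pages where accuracy matters.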

## Choose another model when

- You want labeled bounding boxes around the objects in the image: use the Florence-2 Large object-detection variant
- You want to read or transcribe text printed in the image: use an OCR model
- You want to ask a specific question about the image or get a longer, detailed answer: use a visual question-answering model such as Moondream
- You need a long, multi-paragraph description rather than one sentence: use a detailed-captioning model

## Tips

- Feed a clear, reasonably high-resolution photo; the caption summarizes the dominant subject and setting, so a clean composition yields a sharper description.
- There is nothing to prompt or configure, and repeated runs on the same image return the same caption, so you can cache results.
- The output is a plain caption sentence (not a list of objects); use the object-detection variant if you need structured detections.
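The caching tip can be sketched as a small in-memory wrapper keyed by image URL. This is an illustration under stated assumptions: `createCachedCaptioner` is a hypothetical name, `captionFn` stands for whatever async function you use to call the model, and the cache assumes the image behind a given URL does not change.

```js
// Hypothetical wrapper: memoize captions by image URL.
// `captionFn` is any function (imageUrl) => caption string or Promise of one.
function createCachedCaptioner(captionFn) {
  const cache = new Map(); // imageUrl -> Promise<string>

  return function getCaption(imageUrl) {
    if (!cache.has(imageUrl)) {
      const pending = Promise.resolve(captionFn(imageUrl)).catch((err) => {
        // Do not keep failed requests in the cache, so a later call can retry.
        cache.delete(imageUrl);
        throw err;
      });
      cache.set(imageUrl, pending);
    }
    return cache.get(imageUrl);
  };
}
```

Storing the promise rather than the resolved string means concurrent requests for the same URL share a single model call. A persistent store would be needed if results should survive a process restart.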

## Limitations

- Produces one short sentence focused on the main subject; fine detail, small objects, and secondary elements are often omitted.
- May misread unusual scenes, rare objects, or text-heavy images; it describes images but does not transcribe text.

To run via the ModelRunner JavaScript client:

```js
import { modelrunner } from "@modelrunner/client";

const result = await modelrunner.subscribe("microsoft/florence-2-large/caption", {
  input: {
    image_url: "https://media.modelrunner.ai/example-scene.png",
  },
});

// result.data is a caption string, e.g.
// "a green volkswagen beetle parked in front of a yellow building"
```
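For bulk captioning, the call above can be wrapped in a loop that records a caption or an error for each image. The sketch below is an assumption-laden illustration: `captionAll` is a hypothetical helper, it processes images one at a time for simplicity, and the caller supplies `captionFn`, for example a function that runs the `modelrunner.subscribe` call shown above and returns `result.data`.

```js
// Hypothetical helper: caption a list of image URLs sequentially.
// `captionFn` is a caller-supplied async function (imageUrl) => caption string,
// e.g. one that wraps the modelrunner.subscribe call and returns result.data.
async function captionAll(imageUrls, captionFn) {
  const results = [];
  for (const imageUrl of imageUrls) {
    try {
      const caption = await captionFn(imageUrl);
      results.push({ imageUrl, caption, error: null });
    } catch (err) {
      // Record the failure and continue with the remaining images.
      const message = err instanceof Error ? err.message : String(err);
      results.push({ imageUrl, caption: null, error: message });
    }
  }
  return results;
}
```

Results come back in the same order as the input URLs, so they can be joined to a dataset by index. Running requests in parallel would be faster, but suitable concurrency limits depend on the service and are not stated here.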