Model Details
Florence-2 Large Caption looks at a photo and returns a single concise sentence describing what it shows. Give it one image URL and you get back a short, natural caption of the whole scene — what is in it and roughly what is happening — with no prompt, no question, and nothing to tune. It is zero-shot: one image in, one caption string out. Built on Microsoft's Florence-2 vision-language foundation model, it is a fast, deterministic way to turn an image into searchable, human-readable text for alt-text, indexing, or a quick scene summary.
## Best for

- Generating alt-text or accessibility captions for photos
- Auto-describing images for search, tagging, or content indexing
- Getting a quick one-line summary of what a picture shows
- Bulk-captioning a dataset of images where a short, consistent description per image is enough

## Choose another model when

- You want labeled bounding boxes around the objects in the image: use the Florence-2 Large object-detection variant
- You want to read or transcribe text printed in the image: use an OCR model
- You want to ask a specific question about the image or get a longer, detailed answer: use a visual question-answering model such as Moondream
- You need a long, multi-paragraph description rather than one sentence: use a detailed-captioning model

## Tips

- Feed a clear, reasonably high-resolution photo; the caption summarizes the dominant subject and setting, so a clean composition yields a sharper description.
- There is nothing to prompt or configure; every run on the same image is deterministic, so you can cache results.
- The output is a plain caption sentence (not a list of objects); use the object-detection variant if you need structured detections.

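Because output is deterministic for a given image, a small cache can avoid repeat calls. The following is a minimal sketch, not part of the ModelRunner client: `createCachedCaptioner` is a hypothetical helper, `client` is assumed to be any object exposing the `subscribe(modelId, { input })` call used in this card's usage example, and the cache is keyed on the image URL, which assumes the image behind a URL does not change.

```js
// Hypothetical helper: wraps a client in an in-memory, URL-keyed cache.
// Assumes client.subscribe(modelId, { input }) resolves to { data: <caption string> }.
function createCachedCaptioner(client, cache = new Map()) {
  const MODEL_ID = "microsoft/florence-2-large/caption";

  return async function caption(imageUrl) {
    // Reuse a stored (or still in-flight) result for this URL.
    if (cache.has(imageUrl)) return cache.get(imageUrl);

    const pending = client
      .subscribe(MODEL_ID, { input: { image_url: imageUrl } })
      .then((result) => result.data);

    // Store the promise so concurrent calls for the same URL share one request.
    cache.set(imageUrl, pending);
    try {
      return await pending;
    } catch (err) {
      cache.delete(imageUrl); // do not cache failures
      throw err;
    }
  };
}
```

With the real client this would be used as `const caption = createCachedCaptioner(modelrunner)` followed by `await caption(url)`. For a cache that survives restarts, the `Map` could be swapped for any store with the same `has`/`get`/`set`/`delete` methods.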
## Limitations

- Produces one short sentence focused on the main subject; fine detail, small objects, and secondary elements are often omitted.
- May misread unusual scenes, rare objects, or text-heavy images; it describes, it does not transcribe.

To run via the ModelRunner JavaScript client:

```js
import { modelrunner } from "@modelrunner/client";

const result = await modelrunner.subscribe("microsoft/florence-2-large/caption", {
  input: {
    image_url: "https://media.modelrunner.ai/example-scene.png",
  },
});
// result.data is a caption string, e.g. "a green volkswagen beetle parked in front of a yellow building"
```
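For the bulk-captioning use case listed under "Best for", one possible approach is a small worker pool that captions a list of URLs with bounded concurrency. This is a sketch under stated assumptions: `captionAll` and its `concurrency` parameter are hypothetical, `client` is assumed to expose the same `subscribe` call as above, and the default of 4 parallel requests is arbitrary rather than a documented limit.

```js
// Hypothetical helper: captions many images, at most `concurrency` at a time.
// Assumes client.subscribe(modelId, { input }) resolves to { data: <caption string> }.
async function captionAll(client, imageUrls, concurrency = 4) {
  const MODEL_ID = "microsoft/florence-2-large/caption";
  const results = new Array(imageUrls.length);
  let next = 0; // index of the next URL to claim

  async function worker() {
    while (next < imageUrls.length) {
      const i = next++;
      const url = imageUrls[i];
      try {
        const result = await client.subscribe(MODEL_ID, {
          input: { image_url: url },
        });
        results[i] = { url, caption: result.data };
      } catch (err) {
        // Record the failure and keep going so one bad image does not stop the batch.
        results[i] = { url, error: String(err) };
      }
    }
  }

  const workerCount = Math.max(1, Math.min(concurrency, imageUrls.length));
  await Promise.all(Array.from({ length: workerCount }, () => worker()));
  return results; // same order as imageUrls
}
```

Results come back in input order, each entry holding either a `caption` or an `error`, so failed images can be retried separately.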
