Today we will take a more detailed look at the image-related aspects. Microsoft recently published a paper titled "The Dawn of LMMs: Preliminary Explorations with GPT-4V(ision)", focusing on the application capabilities of large multimodal models in the field of vision.
Paper Link: https://arxiv.org/abs/2309.17421
GPT-4V supports three major input modes:

- Plain text
- Image with caption: supports image recognition, object localization, image captioning, visual question answering, visual dialogue, and dense captioning
- Interleaved text and images: can process multiple image inputs at once and extract the queried information, effectively match information across images and text, and enable in-context few-shot learning and other advanced prompting techniques (see the sketch below)
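To make the interleaved mode concrete, here is a minimal sketch of such a request. It assumes the OpenAI Python SDK (v1 or later), an `OPENAI_API_KEY` in the environment, and a vision-capable model name such as `gpt-4-vision-preview`; the receipt URLs are placeholders.

```python
# Minimal sketch of an interleaved text-and-image request.
# Assumptions: OpenAI Python SDK >= 1.0, OPENAI_API_KEY set in the
# environment, and a vision-capable model name; URLs are placeholders.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4-vision-preview",  # assumed model name; use whatever your account exposes
    messages=[
        {
            "role": "user",
            # Content is a list of parts, so text and images can interleave freely
            # and several images can be sent in a single turn.
            "content": [
                {"type": "text", "text": "How much did I pay in tax across these two receipts?"},
                {"type": "image_url", "image_url": {"url": "https://example.com/receipt1.jpg"}},
                {"type": "image_url", "image_url": {"url": "https://example.com/receipt2.jpg"}},
            ],
        }
    ],
    max_tokens=300,
)
print(response.choices[0].message.content)
```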
Tips for prompting GPT-4V:
Textual instructions
| Instruction | Response | Note |
|---|---|---|
| Count the number of apples in the image. | An apple | Wrong count |
| Count the apples in the picture row by row. | First row: 4 apples | Correct result, but the process is wrong |
| As a counting expert, please count the apples in the figure below row by row to ensure the answer is correct. | First row: 4 apples | Clear instruction, correct response |
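The progression in the table above is easy to reproduce. Below is a hedged sketch that sends all three instructions against the same image so the answers can be compared side by side; it reuses the `client` and assumed model name from the earlier snippet, and the apple-image URL is a placeholder.

```python
# Compare the three counting instructions from the table above.
# Assumes the `client` and model name from the previous snippet;
# the image URL is a placeholder.
APPLES_URL = "https://example.com/apples.jpg"

prompts = [
    "Count the number of apples in the image.",
    "Count the apples in the picture row by row.",
    "As a counting expert, please count the apples in the figure below "
    "row by row to ensure the answer is correct.",
]

for prompt in prompts:
    response = client.chat.completions.create(
        model="gpt-4-vision-preview",  # assumed model name
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url", "image_url": {"url": APPLES_URL}},
            ],
        }],
        max_tokens=150,
    )
    print(f"{prompt!r} -> {response.choices[0].message.content}")
```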
To accurately label an object in an image, we have six methods to choose from:
- Coordinates
- Cropping
- Arrow
- Rectangle
- Oval
- Hand-drawn marks
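The rectangle method, for instance, can be applied entirely on the client side: draw the box onto the image before uploading, then refer to it in the text. A small sketch using Pillow, with hypothetical file names and coordinates:

```python
# Sketch of the "rectangle" pointing method: burn a red box into the image
# so the prompt can say "the object inside the red rectangle".
# File names and coordinates are hypothetical.
import base64

from PIL import Image, ImageDraw

img = Image.open("scene.jpg")
draw = ImageDraw.Draw(img)
draw.rectangle([(120, 80), (260, 210)], outline="red", width=4)  # box the target object
img.save("scene_annotated.jpg")

# Encode as a data URL so it can be passed as an image_url content part.
with open("scene_annotated.jpg", "rb") as f:
    data_url = "data:image/jpeg;base64," + base64.b64encode(f.read()).decode()
```

The same pattern covers arrows and ovals via `ImageDraw.line` and `ImageDraw.ellipse`.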
The generality and flexibility demonstrated by GPT-4V enable it to understand multimodal instructions in a nearly human-like manner, showcasing unprecedented adaptability.
Few-shot examples
With zero-shot instructions, the result may be incorrect.
With a one-shot example, the result is still incorrect.
But with a few-shot prompt, the result is completely accurate.
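A hedged sketch of that few-shot setup, reusing the apple-counting task from the table above: two solved image-answer pairs precede the query image. The URLs and answers are placeholders, and `client` is the OpenAI client from the earlier snippets.

```python
# Hedged sketch of a few-shot multimodal prompt: two solved examples
# (image + correct answer) precede the query image, mirroring the
# zero-/one-/few-shot progression above. URLs and answers are placeholders.
def image_part(url):
    return {"type": "image_url", "image_url": {"url": url}}

QUESTION = "Count the apples in the picture row by row, then give the total."

messages = [
    # Example 1: image plus the known-correct answer.
    {"role": "user", "content": [{"type": "text", "text": QUESTION},
                                 image_part("https://example.com/apples_a.jpg")]},
    {"role": "assistant", "content": "Row 1: 4, Row 2: 3. Total: 7 apples."},
    # Example 2.
    {"role": "user", "content": [{"type": "text", "text": QUESTION},
                                 image_part("https://example.com/apples_b.jpg")]},
    {"role": "assistant", "content": "Row 1: 5, Row 2: 5. Total: 10 apples."},
    # The actual query image.
    {"role": "user", "content": [{"type": "text", "text": QUESTION},
                                 image_part("https://example.com/apples_query.jpg")]},
]

response = client.chat.completions.create(
    model="gpt-4-vision-preview",  # assumed model name
    messages=messages,
    max_tokens=150,
)
print(response.choices[0].message.content)
```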