
Advanced application capabilities of ChatGPT in the visual field - Level 1

Today we take a more detailed look at its image-related capabilities. Microsoft recently published a paper titled "The Dawn of LMMs: Preliminary Explorations with GPT-4V(ision)", focusing on the application capabilities of large multimodal models in the field of vision.

Paper Link: https://arxiv.org/abs/2309.17421

GPT-4V supports three major input methods:

  1. Plain text

  2. Image-text pair (an image, optionally with a caption), supporting tasks such as:

  • Image recognition
  • Object localization
  • Image captioning
  • Visual question answering
  • Visual dialogue
  • Dense captioning

  3. Interleaved text and images, which:

  • Applies to a wide range of application scenarios
  • Processes multiple image inputs simultaneously and extracts the queried information
  • Effectively matches information between images and text
  • Supports in-context few-shot learning and other advanced instruction techniques
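The interleaved input method above can be sketched as a request payload. This follows the OpenAI Chat Completions "content parts" convention for vision input; the image URLs and the question are illustrative placeholders, not from the paper:

```python
# Sketch: building an interleaved text-and-image message for a vision model.
# URLs and question text are placeholders for illustration.

def build_interleaved_message(parts):
    """Turn a list of ("text", str) / ("image", url) tuples into one user message."""
    content = []
    for kind, value in parts:
        if kind == "text":
            content.append({"type": "text", "text": value})
        elif kind == "image":
            content.append({"type": "image_url", "image_url": {"url": value}})
        else:
            raise ValueError(f"unknown part kind: {kind}")
    return {"role": "user", "content": content}

message = build_interleaved_message([
    ("text", "How much tax do I owe in total, given these two receipts?"),
    ("image", "https://example.com/receipt1.jpg"),
    ("image", "https://example.com/receipt2.jpg"),
])
```

The same message structure would then be sent in the `messages` list of a chat request against a vision-capable model.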

GPT-4V operation prompt tips:

  1. Textual instructions


| Instruction | Response | Note |
| --- | --- | --- |
| Count the number of apples in the image | An apple | Wrong count |
| Count the apples in the picture row by row | First row: 4 apples | Result correct, but the process is wrong |
| As a counting expert, please count the apples in the figure below line by line to ensure the answer is correct | First row: 4 apples | Clear instructions, correct response |
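The progression in the table above — from a bare request to a role plus step-by-step instruction — can be captured in a small prompt-building helper; the exact wording is illustrative, not taken from the paper:

```python
# Sketch: upgrading a bare counting request into a role + step-by-step prompt.
# The role and phrasing are illustrative assumptions.

def make_counting_prompt(obj, role="counting expert", unit="row"):
    """Compose a counting prompt that assigns a role and forces unit-by-unit counting."""
    return (f"As a {role}, please count the {obj} in the image {unit} by {unit}, "
            f"listing the count for each {unit} before giving the total, "
            f"to ensure the answer is correct.")

prompt = make_counting_prompt("apples")
```

Asking the model to enumerate each row before totaling gives it an explicit intermediate step, which is what makes the third instruction in the table more reliable than the first.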
  2. To accurately label an object in an image, we have six methods to choose from:
  • Coordinates
  • Cropping
  • Arrow
  • Rectangle
  • Oval
  • Hand-drawn
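Several of the marking methods above (rectangle, oval, arrow) amount to drawing an overlay on the image before sending it to the model. A minimal sketch of the rectangle case using Pillow, with a placeholder blank image standing in for a real photo:

```python
# Sketch: visually referring to an object by drawing a rectangle on the image
# before sending it to the model. The blank white image is a placeholder.
from PIL import Image, ImageDraw

def mark_object(image, box, color="red", width=4):
    """Return a copy of `image` with a rectangle drawn around `box` = (x0, y0, x1, y1)."""
    annotated = image.copy()
    draw = ImageDraw.Draw(annotated)
    draw.rectangle(box, outline=color, width=width)
    return annotated

img = Image.new("RGB", (200, 200), "white")
marked = mark_object(img, (50, 50, 150, 150))
```

The annotated copy is what gets uploaded; a prompt like "What is the object inside the red rectangle?" then refers to the marked region.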
The generality and flexibility demonstrated by GPT-4V enable it to understand multimodal instructions in a nearly human-like manner, showcasing unprecedented adaptability.


  3. Few-shot example prompts

When given zero-shot instructions, the results may be incorrect.

Under one-shot instructions, the results are still incorrect.

But with a few-shot prompt, the result is completely accurate.
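The zero-shot/one-shot/few-shot contrast above comes down to how many worked image-answer pairs precede the real query in the message list. A sketch of the few-shot case, again using the content-parts message format; the URLs and answers are illustrative placeholders:

```python
# Sketch: constructing a few-shot message list for a visual counting task.
# Image URLs and example answers are placeholders, not real data.

def image_part(url):
    return {"type": "image_url", "image_url": {"url": url}}

def few_shot_messages(examples, query_url, question):
    """examples: list of (image_url, answer) pairs shown before the real query."""
    messages = []
    for url, answer in examples:
        messages.append({"role": "user",
                         "content": [{"type": "text", "text": question}, image_part(url)]})
        messages.append({"role": "assistant", "content": answer})
    # The real query comes last, in the same format as the demonstrations.
    messages.append({"role": "user",
                     "content": [{"type": "text", "text": question}, image_part(query_url)]})
    return messages

msgs = few_shot_messages(
    [("https://example.com/demo1.jpg", "Row 1: 3, row 2: 4. Total: 7 apples."),
     ("https://example.com/demo2.jpg", "Row 1: 5, row 2: 7. Total: 12 apples.")],
    "https://example.com/query.jpg",
    "Count the apples in the image row by row, then give the total.",
)
```

A zero-shot prompt is this list with an empty `examples`; one-shot passes a single pair. Each demonstration shows the model both the task format and the row-by-row reasoning style expected in the answer.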