
ChatGPT's application capabilities in the visual domain - Advanced Level 2

Yesterday we covered input methods and operational techniques. Today, continuing from that sharing, let's explore the boundaries of GPT-4's visual language capabilities; its abilities are broad enough that we are spreading the topic over two days.

GPT-4's Visual Language Capabilities (Part 1)

Following yesterday's sharing, let's take a look at how GPT-4 performs across several visual language tasks:

Image descriptions from different domains

  1. GPT-4 can understand a scene in which the current U.S. President delivers a speech at the 2023 G7 Summit. This demonstrates the model's ability to generalize to new scenarios that were not part of its training data (a minimal sketch of how such an image is submitted to the model appears after this list).


  2. GPT-4 can accurately identify the Space Needle in Seattle, Washington, knows that it was built for the 1962 World's Fair, and notes that it has since become an icon of the city.


  3. GPT-4 can effectively capture complex details in images, enabling it to identify specific ingredients, garnishes, or cooking techniques in dishes.


  4. GPT-4 can recognize common medical conditions in images, such as a Jones fracture.


  5. GPT-4 can provide descriptions of novel or emerging logos and icons, such as the recently released Microsoft 365 Copilot.


  6. GPT-4 can describe a road scene, including the positions and colors of vehicles, and can also read road signs and note the posted speed limit.


  7. GPT-4 can still describe the content of an image correctly even when faced with misleading questions or instructions.
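All of the scenarios above boil down to sending an image plus a text prompt to the model. As a reference point, here is a minimal sketch using the OpenAI Python SDK (v1+); the model name and image URL are placeholder assumptions, not values from the original article.

```python
# Minimal sketch: asking a vision-capable chat model to describe an image.
# Assumes the `openai` Python package (v1+) and OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o",  # assumption: any vision-capable chat model works here
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image in detail."},
                # Placeholder URL; a base64 data URL also works for local files.
                {"type": "image_url", "image_url": {"url": "https://example.com/scene.jpg"}},
            ],
        }
    ],
)

print(response.choices[0].message.content)
```

The same request shape covers every example in the list; only the text prompt changes (e.g., "What ingredients are in this dish?" or "What does the road sign say?").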


Object localization, counting, and dense captioning

  1. GPT-4 can identify the spatial relationships between people and cars in an image and point out that the camera angle may affect the perceived size of objects.


  2. GPT-4 can successfully count the number of objects present in an image.


  3. GPT-4 demonstrates the ability to generate bounding box coordinates directly as plain text, without relying on separate box tokens (see the sketch after this list).


  4. GPT-4 can successfully locate and identify individuals in an image and provide concise descriptions of them.
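Because the bounding boxes arrive as ordinary text, you can ask for them in a format of your choosing and parse the reply yourself. The coordinate convention and reply format below are assumptions for illustration, not a fixed output format of the model.

```python
# Sketch: requesting bounding boxes as plain text and parsing the reply.
import re

def parse_boxes(reply: str) -> list[tuple[float, ...]]:
    """Extract (x1, y1, x2, y2) tuples from lines like 'person: (0.12, 0.30, 0.45, 0.90)'."""
    pattern = r"\(\s*([\d.]+)\s*,\s*([\d.]+)\s*,\s*([\d.]+)\s*,\s*([\d.]+)\s*\)"
    return [tuple(float(v) for v in match) for match in re.findall(pattern, reply)]

# A prompt like this would be sent alongside the image (format is an assumption):
prompt = (
    "Locate every person and vehicle in the image and answer only with lines of the form "
    "'label: (x1, y1, x2, y2)', using coordinates normalized to [0, 1]."
)

# Parsing a hypothetical reply:
reply = "person: (0.12, 0.30, 0.45, 0.90)\ncar: (0.55, 0.40, 0.95, 0.85)"
print(parse_boxes(reply))  # [(0.12, 0.3, 0.45, 0.9), (0.55, 0.4, 0.95, 0.85)]
```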



Multimodal knowledge and common sense

  1. GPT-4 has the remarkable ability to gather information from both the visual and textual modalities and understand the humor embedded in memes.


  2. GPT-4 can identify the average particle velocity of Sample A and Sample B. By considering the relationship between particle velocity, kinetic energy, and temperature, GPT-4 correctly answers the question (a short sketch of this reasoning appears at the end of this section).


", we observe that the generated answers adopt a tutorial format and gradually explain the topic.


  3. GPT-4 can reason about social situations from visual cues: based on the formal attire worn by [person1] and [person2] and the floral decorations present in the scene, it infers that they are attending a wedding.

When asked follow-up questions about the scene, GPT-4V demonstrated its ability to distinguish numerous subtle visual clues in the image and provided a list of reasonable assumptions.
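For the particle-velocity question in item 2, the underlying reasoning is that, for particles of equal mass, average translational kinetic energy is proportional to absolute temperature, so the sample with the faster particles is the warmer one. The sketch below illustrates this; the speeds and particle mass are made-up values, not data from the original example.

```python
# Sketch of the reasoning: (1/2) m <v^2> scales with (3/2) k_B T for equal-mass
# particles, so higher average speed implies higher temperature.
def mean_kinetic_energy(mass: float, speeds: list[float]) -> float:
    """Average translational kinetic energy (1/2) m v^2 over the given speeds."""
    return sum(0.5 * mass * v ** 2 for v in speeds) / len(speeds)

sample_a = [400.0, 420.0, 410.0]  # hypothetical particle speeds, m/s
sample_b = [520.0, 500.0, 530.0]
mass = 4.65e-26                   # roughly the mass of an N2 molecule, kg

warmer = "B" if mean_kinetic_energy(mass, sample_b) > mean_kinetic_energy(mass, sample_a) else "A"
print(f"Sample {warmer} has the higher average kinetic energy, hence the higher temperature.")
```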


Summary: GPT-4 performs strongly on visual language tasks, showing solid understanding in image description across a wide range of domains, as well as in spatial relationship understanding, object counting, object localization, and dense captioning. It also demonstrates remarkable multimodal knowledge and common sense.