OpenAI finally released GPT-4o's native image generation officially early this morning via a livestream, a blog post, and a system card, confirming that it uses an autoregressive model. The most detailed information available so far about how the new model works is an image posted by Allan Jabri, who reportedly worked on the original 4o image generation technology (later taken over by Gabe Goh).

GPT-4o Image Generation Function Fully Launched
OpenAI announced that GPT-4o image generation begins rolling out today to Plus, Pro, Team, and free users in ChatGPT and Sora.
According to Kevin Weil, the update brings significant improvements, particularly in following complex instructions and producing detailed visual layouts. GPT-4o can render legible text and generate images in a variety of styles, including photorealistic ones.

User Feedback: Launch Experience and Actual Performance
In terms of actual generation quality, the new model has received widespread recognition:
Image quality has improved significantly, with legible text and more realistic portraits. Many users have shared high-quality stickers and movie posters they generated, calling the model a "gamechanger." Users have also shown great interest in its ability to handle images of public figures and to produce images suitable for 3D printing. Some worry that the popularity of these new tools may affect traditional design tools like Photoshop.
Some users also complained on Twitter that GPT-4o altered images without being asked, for example by over-beautifying faces (enlarging eyes or adjusting facial proportions) or even changing a person's overall appearance, which they considered "unwelcome edits." Many users also pointed out that even slight changes to a prompt could produce noticeable errors, suggesting that the model is overly sensitive to prompt wording.

Despite the much-praised image quality, users also noted that the new generator ran relatively slowly.
In addition, users shared specific experiences of using the GPT-4o image generator, such as trying it on the Sora.com platform with a Plus subscription. They believed the tool performed well in following prompts and provided creative examples like generating scenes from "Dragon Ball Z" (DBZ).

Official information
OpenAI trained GPT-4o on the joint distribution of online images and text, so the model learns not only how images relate to text but also how images relate to one another. After further post-training, GPT-4o shows surprising visual fluency and can generate useful, consistent, and context-aware images.
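
As a rough illustration of what "autoregressive" means for a model trained on a joint distribution, such a model factorizes the probability of a whole sequence into next-token conditionals. The formula below assumes that text and images are both encoded as tokens in a single sequence x_1, ..., x_N. That tokenization is an assumption made for illustration; OpenAI has not published the details of GPT-4o's architecture.

```latex
p_\theta(x_1, \dots, x_N) = \prod_{t=1}^{N} p_\theta\left(x_t \mid x_1, \dots, x_{t-1}\right)
```

Under this view, generating an image amounts to sampling its tokens one at a time, each conditioned on all the text and image tokens that precede it.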
Precise text rendering
"A picture is worth a thousand words," but sometimes appropriately placing a few key words in an image can further enhance its communication effectiveness. GPT-4o can accurately integrate text with images, turning image generation into an effective visual communication tool.

Create a photorealistic image of two witches in their 20s (one ash balayage, one with long wavy auburn hair) reading a street sign.
Context:
a city street in a random street in Williamsburg, NY with a pole covered entirely by numerous detailed street signs (e.g., street sweeping hours, parking permits required, vehicle classifications, towing rules), including few ridiculous signs at the middle: (paraphrase it to make these legitimate street signs)"Broom Parking for Witches Not Permitted in Zone C" and "Magic Carpet Loading and Unloading Only (15-Minute Limit)" and "Reindeer Parking by Permit Only (Dec 24–25)\n Violators will be placed on Naughty List." The signpost is on the right of a street. Do not repeat signs. Signs must be realistic.
Characters:
one witch is holding a broom and the other has a rolled-up magic carpet. They are in the foreground, back slightly turned towards the camera and head slightly tilted as they scrutinize the signs.
Composition from background to foreground:
streets + parked cars + buildings -> street sign -> witches. Characters must be closest to the camera taking the shot
Multi-turn image generation
Because image generation is native to GPT-4o, users can refine an image over multiple turns of natural conversation. In game character design, for example, the character's appearance stays consistent across many rounds of adjustments while the user tweaks individual details.



Better instruction-following ability
Where other systems typically struggle beyond about 5 to 8 objects, GPT-4o can handle up to 10 to 20 distinct objects. Binding each object more tightly to its attributes gives users finer control over the output.

A square image containing a 4 row by 4 column grid containing 16 objects on a white background. Go from left to right, top to bottom. Here's the list:
1. a blue star
2. red triangle
3. green square
4. pink circle
5. orange hourglass
6. purple infinity sign
7. black and white polka dot bowtie
8. tiedye "42"
9. an orange cat wearing a black baseball cap
10. a map with a treasure chest
11. a pair of googly eyes
12. a thumbs up emoji
13. a pair of scissors
14. a blue and white giraffe
15. the word "OpenAI" written in cursive
16. a rainbow-colored lightning bolt
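
A prompt like the one above follows a regular pattern: a header that states the grid size and reading order, then a numbered list of objects. The Python sketch below assembles such a prompt from a list and computes where each item should land in the grid, which is handy when checking a generated image by eye. The function names `build_grid_prompt` and `grid_position` are hypothetical helpers invented for this illustration, not part of any OpenAI tool, and the code only builds text; it does not call any image generation service.

```python
from typing import List, Tuple


def grid_position(index: int, columns: int) -> Tuple[int, int]:
    """Return the 1-based (row, column) of a 1-based list index,
    filling the grid left to right, top to bottom."""
    if index < 1 or columns < 1:
        raise ValueError("index and columns must be positive")
    row, col = divmod(index - 1, columns)
    return row + 1, col + 1


def build_grid_prompt(objects: List[str], rows: int, columns: int) -> str:
    """Build a grid prompt in the same shape as the example above.

    Hypothetical helper: the wording mirrors the sample prompt, and the
    object count must fill the grid exactly.
    """
    if len(objects) != rows * columns:
        raise ValueError(
            f"expected {rows * columns} objects, got {len(objects)}"
        )
    header = (
        f"A square image containing a {rows} row by {columns} column grid "
        f"containing {len(objects)} objects on a white background. "
        "Go from left to right, top to bottom. Here's the list:"
    )
    # Number the objects starting at 1, one per line.
    numbered = [f"{i}. {name}" for i, name in enumerate(objects, start=1)]
    return "\n".join([header, *numbered])


if __name__ == "__main__":
    shapes = ["a blue star", "red triangle", "green square", "pink circle"]
    print(build_grid_prompt(shapes, rows=2, columns=2))
    # Item 3 ("green square") belongs in row 2, column 1 of a 2x2 grid.
    print(grid_position(3, columns=2))
```

Comparing each cell of the generated image against `grid_position` is a simple way to see whether the model placed all of the requested objects in the requested order.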
In-context learning
GPT-4o can analyze images that users upload and fold their details into its context, so the images it generates match the request more closely.


Rich world knowledge and image styles
Because it was trained on images in a wide range of styles, GPT-4o can generate or transform images convincingly across many styles, including photorealistic ones.

make a very colorful risograph on how to make matcha
Current limitations and future directions for improvement
The model is not yet perfect. GPT-4o image generation has several known limitations at launch, and OpenAI says it will keep improving the model to address them.

The model is known to struggle when asked to render detailed information at a very small size.