"State of AI Report 2024" (3) - AI research in game agents, code testing, and enterprise automation

Today, I will translate 3 research papers in the fields of game agents, code testing, and enterprise automation.

Game Agent

One of the main bottlenecks in training reinforcement learning (RL) agents is the shortage of training data. Common methods, such as converting existing environments (e.g., Atari) or manually constructing new ones, consume significant human resources and are difficult to scale.

The work covered here introduces a world model capable of generating virtual worlds that can be controlled by actions. The research team analyzed 30,000 hours of 2D platform game videos, and the model achieves this result by learning to compress visual information and infer the actions driving changes between frames.

Not only can it imagine entirely new interactive scenarios, but it also demonstrates great flexibility: it can accept various forms of prompts, from text descriptions to hand-drawn sketches, and convert these prompts into interactive game environments.

More notably, the application scope of this method is not limited to games. The research team successfully applied the hyperparameters from the game model to robotics data without fine-tuning, showcasing its broad applicability.

Code Testing

The system covered here combines multiple large language models (LLMs), prompts, and configurations, leveraging the strengths of each model to improve the unit test coverage of Android code in Instagram and Facebook.

It adopts a "guardrails" approach: generated tests are filtered so that only those that build successfully, pass stably, and increase coverage are recommended to developers. This is the first time a solution combining LLMs with verifiable guarantees of code improvement has been deployed at industrial scale, addressing concerns about hallucination and the reliability of LLMs in software engineering.
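The filtering idea can be illustrated with a minimal sketch. This is not the deployed system: the `CandidateTest` record, its field names, and the threshold of five repeated runs are all assumptions made for illustration. The sketch only shows the three checks named above (builds, passes stably, increases coverage) applied in sequence, so that a generated test is recommended only if it clears every one.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class CandidateTest:
    """Hypothetical record: a generated test plus measurements from a test harness."""
    name: str
    builds: bool                     # did the test compile/build?
    run_results: Tuple[bool, ...]    # pass/fail outcome of each repeated run
    coverage_before: float           # coverage of the target class without the test
    coverage_after: float            # coverage of the target class with the test


def passes_stably(test: CandidateTest, min_runs: int = 5) -> bool:
    """Stable means: run at least min_runs times (assumed value) and never failed."""
    return len(test.run_results) >= min_runs and all(test.run_results)


def increases_coverage(test: CandidateTest) -> bool:
    """Keep only tests that strictly raise measured coverage."""
    return test.coverage_after > test.coverage_before


def filter_candidates(candidates: Iterable[CandidateTest]) -> List[CandidateTest]:
    """Apply the guardrails in order; a test failing any check is discarded."""
    return [
        t for t in candidates
        if t.builds and passes_stably(t) and increases_coverage(t)
    ]


candidates = [
    CandidateTest("test_does_not_build", False, (), 0.40, 0.40),
    CandidateTest("test_flaky", True, (True, False, True, True, True), 0.40, 0.55),
    CandidateTest("test_no_new_coverage", True, (True,) * 5, 0.40, 0.40),
    CandidateTest("test_adds_branch", True, (True,) * 5, 0.40, 0.47),
]

recommended = filter_candidates(candidates)
print([t.name for t in recommended])  # ['test_adds_branch']
```

Because every check is an observable property of the test run rather than a judgment about the generated text, a hallucinated test that does not compile, fails intermittently, or adds nothing is dropped before a developer ever sees it.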

It improved about 10% of the test classes it was applied to, and 73% of its recommended tests were accepted and applied by developers.

Enterprise Automation

The approaches covered here, one of them developed by Stanford, leverage foundation models to address the deficiencies of existing enterprise automation.

Reported results include 99.5% accuracy in workflow understanding and an increase in completion rate from 0% to 40%.