
Tencent's AppAgent: A multimodal intelligent agent that automatically interacts with applications

Last month, Tencent released a paper titled "AppAgent: Multimodal Agents as Smartphone Users" (https://arxiv.org/abs/2312.13771).

The agent operates smartphone applications to carry out complex tasks. It works with a simplified action space that mimics human interactions such as tapping and swiping, so it does not need access to an app's backend system. This broadens the range of applications it can be used with.
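To make the idea of a simplified action space concrete, here is a minimal sketch in Python. It assumes the agent emits actions as short text strings such as `tap(3)` or `swipe(5, "up")`, where the number is the tag of a UI element. The names `Tap`, `Swipe`, and `parse_action` are hypothetical and chosen for illustration; they are not taken from the paper or its code.

```python
import re
from dataclasses import dataclass

# Hypothetical action types: only the two interactions named in the text.
@dataclass(frozen=True)
class Tap:
    element: int  # numeric tag of the UI element to tap

@dataclass(frozen=True)
class Swipe:
    element: int    # numeric tag of the UI element to swipe on
    direction: str  # one of DIRECTIONS

DIRECTIONS = {"up", "down", "left", "right"}
_ACTION_RE = re.compile(r"^\s*(tap|swipe)\((.*)\)\s*$")

def parse_action(text):
    """Turn a string like 'tap(3)' into a Tap or Swipe object.

    Raises ValueError for anything outside the small action space.
    """
    match = _ACTION_RE.match(text)
    if match is None:
        raise ValueError(f"unknown action: {text!r}")
    name, raw_args = match.group(1), match.group(2)
    # Split the arguments and drop surrounding quotes and whitespace.
    args = [a.strip().strip("'\"") for a in raw_args.split(",") if a.strip()]
    if name == "tap" and len(args) == 1:
        return Tap(int(args[0]))
    if name == "swipe" and len(args) == 2 and args[1] in DIRECTIONS:
        return Swipe(int(args[0]), args[1])
    raise ValueError(f"bad arguments for {name}: {raw_args!r}")
```

Because every action is one of a handful of typed values, the same agent logic can in principle drive any app that exposes a screen, without app-specific APIs.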

The core of its functionality is its learning approach. The agent learns to navigate and use a new application either through autonomous exploration or by observing human demonstrations. This process produces a knowledge base that the agent then consults to execute complex tasks across different applications.

Method

AppAgent operates in two phases: the exploration phase and the deployment phase.

  1. Exploration phase: AppAgent observes user interface interactions across different applications and learns how each application is used. What it learns is documented in a knowledge base. Once this learning phase is complete, the agent is ready to act.

  2. Deployment phase: Drawing on the documented knowledge, AppAgent handles advanced tasks within any supported application, which lets it complete complex tasks across different applications.
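The two phases can be sketched roughly as follows. This is a toy illustration, not the paper's implementation: `KnowledgeBase`, `explore`, and `deploy` are hypothetical names, and a simple keyword match stands in for the multimodal model that would actually decide which element to use.

```python
from collections import defaultdict

class KnowledgeBase:
    """Hypothetical store of notes about UI elements, grouped by app."""

    def __init__(self):
        self._docs = defaultdict(dict)

    def record(self, app, element, note):
        self._docs[app][element] = note

    def lookup(self, app, element):
        # .get avoids creating an entry for apps that were never explored.
        return self._docs.get(app, {}).get(element)

def explore(kb, app, observations):
    """Exploration phase: document what each observed element does."""
    for element, effect in observations:
        kb.record(app, element, effect)

def deploy(kb, app, task, visible_elements):
    """Deployment phase: choose an element whose documented effect fits the task.

    Assumption: a keyword match replaces the model's reasoning step.
    Returns None if no documented element matches.
    """
    for element in visible_elements:
        note = kb.lookup(app, element)
        if note and task.lower() in note.lower():
            return element
    return None
```

For example, after `explore(kb, "x", [("follow_button", "Follows the user")])`, calling `deploy(kb, "x", "follow", ["follow_button"])` returns `"follow_button"`.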


Demo

Demonstrations of AppAgent's exploration and deployment on applications:

  • A demonstration video shows AppAgent, in the deployment phase, following a user on X (Twitter).

  • An experiment shows AppAgent passing a CAPTCHA.

  • An example of using a grid overlay to locate UI elements not labeled with numerical tags.
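As a rough illustration of the grid overlay idea, the sketch below converts a grid cell number into the screen coordinates of that cell's center, which could then be used as a tap target. The function name `grid_cell_center` and the numbering scheme (1-based, left to right, top to bottom) are assumptions made for this example, not details confirmed by the source.

```python
def grid_cell_center(cell, width, height, rows, cols):
    """Return the (x, y) pixel center of a numbered grid cell.

    Assumes cells are numbered from 1, row by row, starting at the
    top-left corner of a screen that is width x height pixels.
    """
    if not 1 <= cell <= rows * cols:
        raise ValueError(f"cell {cell} is outside a {rows}x{cols} grid")
    row, col = divmod(cell - 1, cols)  # zero-based row and column
    cell_w = width / cols
    cell_h = height / rows
    return (int((col + 0.5) * cell_w), int((row + 0.5) * cell_h))
```

With a 1000 x 2000 pixel screen split into 10 rows and 5 columns, each cell is 200 x 200 pixels, so cell 1 maps to (100, 100) and cell 7 (second row, second column) maps to (300, 300).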