
The Future of Large Language Models - Andrej Karpathy's In-Depth Explanation of LLM (Part 11)

Multimodal Fusion

Large language models are rapidly evolving from text-only systems into multimodal models that handle text, audio, and images simultaneously. This capability rests on converting audio and images into tokens that are trained jointly with text, enabling more natural and richer human-computer interaction.

  • Audio Processing: Audio is handled by slicing its spectrogram into patches and mapping each patch to an audio token, which lets the model both understand and generate audio.
  • Image Processing: Images are split into small patches and converted into image tokens, so the model can process visual information as an ordinary token sequence (see the patchification sketch after this list).
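
To make the patch-to-token idea concrete, here is a minimal PyTorch sketch of ViT-style patch embedding. All sizes (224-pixel images, 16-pixel patches, 768-dimensional embeddings) are illustrative assumptions rather than values from the talk; an audio spectrogram can be patchified the same way by treating it as a one-channel image.

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Split an image into fixed-size patches and project each patch
    to an embedding vector, i.e. an "image token"."""
    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # A strided convolution extracts and projects all patches in one step.
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                      # x: (B, 3, 224, 224)
        x = self.proj(x)                       # (B, 768, 14, 14)
        return x.flatten(2).transpose(1, 2)    # (B, 196, 768): 196 tokens

tokens = PatchEmbedding()(torch.randn(1, 3, 224, 224))
print(tokens.shape)  # torch.Size([1, 196, 768])
```

The resulting token sequence can be concatenated with text tokens and fed to the same transformer, which is what makes joint training across modalities possible.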

AI Agents for Long-term Task Execution

Future AI models will be able to carry out complex tasks autonomously over long horizons without frequent human intervention, with humans acting as supervisors who monitor and adjust the work.

  • Autonomous Planning and Execution: AI agents can independently plan, execute, and correct their own errors during long-running tasks, reducing how often humans must step in (see the sketch after this list).
  • Human-to-Agent Ratio: Much as factory automation is described by a human-to-machine ratio, the digital domain will develop a human-to-agent ratio, with a small number of humans supervising many AI agents.
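
The toy Python sketch below illustrates this supervision pattern. Every function here (plan, execute) is a hypothetical stub standing in for real planning and execution; the point is the control flow, where the agent retries on its own and only escalates to a human when it gets stuck.

```python
import random

def plan(task):
    """Hypothetical planner: break a task into steps (stub)."""
    return [f"{task} - step {i}" for i in range(1, 4)]

def execute(step):
    """Hypothetical executor: returns True on success (randomly stubbed)."""
    return random.random() > 0.2

def run_agent(task, max_retries=2):
    """Supervisory loop: plan, execute, retry on failure, and escalate
    to a human only when retries are exhausted."""
    for step in plan(task):
        for _ in range(max_retries + 1):
            if execute(step):
                break
        else:
            print(f"Escalating to human supervisor: {step!r} keeps failing")
            return False
    print(f"Task {task!r} completed autonomously")
    return True

run_agent("file quarterly report")
```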

Supervision and Unobtrusiveness

Although AI agents are powerful, they are not perfect and still require human supervision and guidance to ensure proper task execution and avoid potential errors.

AI will gradually be woven into everyday tools and platforms, becoming more pervasive yet less visible, so that users benefit from the technology without consciously noticing it.

Realization of Computing Agents

AI models will be able to operate a computer on the user's behalf, for example by controlling the mouse and keyboard, substantially improving user efficiency.
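
The sketch below, with hypothetical stubs in place of a real OS automation layer, shows one plausible shape for this: the model emits structured actions, and a thin dispatcher replays them on the user interface.

```python
def click(x, y):
    """Stub: a real system would move the mouse and click here,
    e.g. via an OS automation library."""
    print(f"click at ({x}, {y})")

def type_text(text):
    """Stub: a real system would send these keystrokes."""
    print(f"type: {text!r}")

# Hypothetical model output: a structured list of UI actions.
model_actions = [
    {"action": "click", "x": 412, "y": 230},
    {"action": "type", "text": "quarterly_report.xlsx"},
]

DISPATCH = {
    "click": lambda a: click(a["x"], a["y"]),
    "type":  lambda a: type_text(a["text"]),
}

for action in model_actions:
    DISPATCH[action["action"]](action)  # replay the model's plan on the UI
```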

Test-time Learning Research

Current AI models are split into distinct training and inference phases: parameters are frozen at inference time, so the model cannot keep learning from new data. Future research will focus on enabling models to learn at test time, much as humans continue to adjust and consolidate what they learn during daily activity and rest (e.g., sleep).
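
As a toy illustration of the general idea (test-time adaptation broadly, not any specific proposal from the talk), the PyTorch sketch below contrasts standard frozen inference with taking a gradient step on feedback gathered during deployment:

```python
import torch
import torch.nn as nn

# A toy regressor standing in for a large model.
model = nn.Linear(4, 1)
opt = torch.optim.SGD(model.parameters(), lr=1e-2)

def predict_frozen(x):
    """Standard inference: parameters stay fixed."""
    with torch.no_grad():
        return model(x)

def predict_with_test_time_update(x, target):
    """Test-time learning: one gradient step on data seen at deployment,
    so subsequent predictions reflect the new information."""
    loss = (model(x) - target).pow(2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    with torch.no_grad():
        return model(x)

x, y = torch.randn(8, 4), torch.randn(8, 1)
print(predict_frozen(x)[0])
print(predict_with_test_time_update(x, y)[0])
```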

As tasks become increasingly complex and multimodal, the current context window length will become a bottleneck. Simply expanding the window is a stopgap: in the long run it cannot support more complex, longer-running tasks, so new techniques and methods will need to be explored. The toy sketch below shows the core of the problem.
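
Here is a tiny illustration of the bottleneck, with a deliberately unrealistic window size: once a task's history exceeds the window, naive truncation simply forgets the oldest tokens.

```python
CONTEXT_WINDOW = 8  # tokens; real windows hold thousands, this is illustrative

def fit_to_context(tokens, window=CONTEXT_WINDOW):
    """Naive sliding-window truncation: keep only the most recent tokens.
    Everything older is silently dropped -- the heart of the bottleneck."""
    return tokens[-window:]

history = [f"t{i}" for i in range(20)]  # token stream of a long-running task
print(fit_to_context(history))          # ['t12', ..., 't19']; t0-t11 are lost
```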