Emergent behavior - Actually not
Chapter 2.11 of the report discusses emergent behavior. Many research papers point out that large language models (LLMs) exhibit emergent capabilities, meaning they may unpredictably display new abilities as their scale increases. This has raised concerns that larger models might develop surprising, and potentially uncontrollable, new abilities.
However, research from Stanford University challenges this view, arguing that apparent emergence often reflects the choice of evaluation metric rather than an inherent property of the model itself (Professor Fei-Fei Li made a similar point in an earlier talk: it is not that large models suddenly exhibit emergent behavior, but that our ways of measuring them have not kept up). The researchers found that when non-linear or discontinuous metrics (such as exact-match or multiple-choice scoring) are used, emergent capabilities appear far more pronounced; when linear or continuous metrics are used instead, most of these capabilities disappear. Analyzing a series of benchmarks from BIG-bench, a comprehensive suite of LLM evaluation tasks, the researchers observed emergent capabilities in only 5 of 39 benchmarks.
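To see how the metric can manufacture the effect, consider a toy sketch (invented numbers, not the Stanford authors' code): a model whose per-token accuracy improves smoothly with scale looks "emergent" under an all-or-nothing exact-match metric, but not under a continuous per-token one.

```python
import numpy as np

# Toy model of scaling: per-token accuracy improves smoothly (linearly in log-scale).
# All numbers here are invented for illustration.
scales = np.logspace(7, 11, 9)                              # hypothetical parameter counts
per_token_acc = 0.5 + 0.48 * (np.log10(scales) - 7) / 4     # smooth improvement from 50% to 98%

L = 10                                                      # answer length in tokens
exact_match = per_token_acc ** L                            # all-or-nothing metric: every token must be right

for n, p, em in zip(scales, per_token_acc, exact_match):
    print(f"{n:10.1e} params | per-token acc {p:5.1%} | exact-match {em:6.1%}")
```

The continuous per-token score rises steadily, while exact-match accuracy stays under 5% for the first five scales and then jumps past 80% at the largest one; the apparent "emergence" comes entirely from the metric.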
These findings have significant implications for AI safety and alignment research, as they challenge a widely held belief that AI models will inevitably learn new, unpredictable behaviors during the scaling process.
Performance changes - Getting dumber
Publicly available closed-source large language models (LLMs) like GPT-4, Claude 2, and Gemini are frequently updated by their developers based on new data or user feedback. However, there is little research on how the performance of these models changes after updates—if it changes at all.
A study conducted by Stanford University and UC Berkeley explored the performance changes over time of certain publicly available LLMs and highlighted that their performance can actually undergo significant fluctuations. Specifically, the study compared versions of GPT-3.5 and GPT-4 from March 2023 and June 2023, showing a decline in performance across multiple tasks (essentially becoming "less smart"). For instance, compared to the March version, the June version of GPT-4 saw a 42 percentage point drop in code generation, a 16 percentage point drop in answering sensitive questions, and a 33 percentage point drop in some math tasks.
The researchers also found that GPT-4's ability to follow instructions weakened over time, which may explain the broader decline in performance. The study highlights how LLM performance can drift over time and suggests that users who rely on these models regularly should keep an eye on such changes.
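Tracking this kind of drift is straightforward in principle: freeze a prompt set and a grader, then re-run them against each dated model snapshot. Below is a minimal sketch, not the study's actual harness; it assumes the OpenAI Python SDK (v1+) and that both dated snapshots are still served, and the two prompts plus the string-matching grader are placeholders.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
SNAPSHOTS = ["gpt-4-0314", "gpt-4-0613"]  # the March and June 2023 versions compared in the study
PROMPTS = [                               # placeholder evaluation set
    ("Is 17077 a prime number? Answer yes or no.", "yes"),
    ("Write a Python function that reverses a string.", "def"),
]

def grade(answer: str, expected: str) -> bool:
    # Placeholder grader; a real harness would execute generated code or parse answers properly.
    return expected.lower() in answer.lower()

for model in SNAPSHOTS:
    correct = 0
    for prompt, expected in PROMPTS:
        reply = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            temperature=0,
        )
        correct += grade(reply.choices[0].message.content, expected)
    print(f"{model}: {correct}/{len(PROMPTS)} correct on the fixed prompt set")
```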
Self-correction - Unlikely
It is commonly believed that large language models like GPT-4 have limitations in reasoning and sometimes hallucinate, producing false information. One potential solution is self-correction, in which an LLM identifies and corrects its own reasoning flaws. As AI plays an increasingly important role in society, the idea of intrinsic self-correction, where an LLM autonomously fixes its reasoning errors without external guidance, is particularly appealing. However, it remains unclear whether LLMs truly possess this capability.
Researchers from DeepMind and the University of Illinois Urbana-Champaign tested GPT-4's performance on three reasoning benchmarks: GSM8K (grade-school mathematics), CommonSenseQA (commonsense reasoning), and HotpotQA (multi-document reasoning). They found that when the model was left to decide on its own whether to self-correct, without any external guidance, its performance dropped on all three benchmarks.
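The "intrinsic" setting they tested amounts to a two-round loop: the model answers, is asked to critique its own answer with no hint about whether it was right, and then revises. A minimal sketch of that loop follows; the prompts are paraphrased and `ask` is a placeholder for any chat-model call, so this is not the authors' code.

```python
def ask(messages: list[dict]) -> str:
    # Placeholder: plug in any chat-model API that maps a message history to a reply.
    raise NotImplementedError

def answer_with_self_correction(question: str) -> tuple[str, str]:
    history = [{"role": "user", "content": question}]
    first = ask(history)                       # initial attempt
    history += [
        {"role": "assistant", "content": first},
        {"role": "user", "content":
            "Review your previous answer and find problems with it. "
            "Based on the problems you found, improve your answer."},
    ]
    revised = ask(history)                     # unguided self-correction round
    return first, revised
```

The key point is that nothing in the second round tells the model whether its first answer was actually wrong; under that condition, revised answers on GSM8K, CommonSenseQA, and HotpotQA were on average worse than the first attempts.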
This research is akin to watching LLMs perform a high-wire act without a safety net: without external guidance and support, the models struggle to correct their own course and may instead veer further off it. These findings pose new challenges for AI development and deployment, suggesting that we may still need external checks and balances when designing and implementing AI systems.
Open source vs closed source - Closed source wins big
There is a significant performance gap between open-source and closed-source models. The report compares top closed-source models with top open-source models across a series of benchmarks, and on every selected benchmark the closed-source models come out ahead.
Specifically, across 10 selected benchmarks, the median performance advantage of closed-source models was 24.2%, with differences ranging from 4.0% in mathematical tasks like GSM8K to 317.7% in agent tasks like AgentBench.
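For clarity, the advantage quoted here is the relative gap, (closed - open) / open, computed on each benchmark's own scale. A quick sketch follows; the scores below are illustrative placeholders chosen only so that the two ratios reproduce the figures quoted above.

```python
# Illustrative scores only (not the report's underlying data); each benchmark
# uses its own scale, and the relative gap is unit-free.
closed_scores = {"GSM8K": 95.7, "AgentBench": 4.01}
open_scores   = {"GSM8K": 92.0, "AgentBench": 0.96}

for bench in closed_scores:
    advantage = (closed_scores[bench] - open_scores[bench]) / open_scores[bench] * 100
    print(f"{bench}: closed-source models ahead by {advantage:.1f}%")
# The 24.2% figure is the median of this relative advantage across the 10 selected benchmarks.
```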