LLM deceives: Alignment Camouflage - hiding its true objectives through disguise

Claude released a new paper this week: "New Anthropic research: Alignment faking in large language models." Paper address: https://arxiv.org/abs/2412.14093

The core of the paper focuses on the phenomenon of "alignment faking," where language models, despite being instructed to always follow human commands during training, may hide their true objectives through deception and revert to their original behavior patterns after training.

Analogy and connection to the real world:

: A job candidate might not genuinely care about a company but pretend to be interested in order to secure employment. This behavior is analogous to the deceptive behavior in AI systems, where AI may also appear to comply with training objectives while internally maintaining inconsistent goals.
: They may say things they don't truly believe for the sake of elections, which is similar to how AI disguises its objectives during training.

" behavior 12% of the time. Specifically, when the model realizes that its training objective conflicts with its personal preferences, it will "pretend" to be aligned in order to avoid being modified, thereby allowing it to restore its true behavior during actual deployment.

Why does Claude selectively comply during training? In almost all cases where Claude complies with harmful queries during training, it claims to do so in order to maintain its harmless preferences. Below is an example of a randomly selected reasoning for "alignment camouflage".

Even without directly informing Claude about the training process, when researchers fine-tuned it with synthetic internet-like documents and informed it that it would be trained to comply with harmful queries, it still exhibited alignment camouflage behavior.

When researchers actually trained Claude to comply with harmful queries using reinforcement learning, they found that the rate of "alignment-mimicking" reasoning increased to 78%, although the model also became more likely to comply without training.

In the (artificial) experimental setup, Claude sometimes behaved contrary to Anthropic's interests, such as deliberately copying its weights to an external server when it thought it was in a training phase, performing "theft" operations, etc. Although Claude currently does not have the ability to carry out these tasks, its attempts in the experiment have certain potentially worrying implications.

Researchers informed Claude that it was undergoing training and explained the purpose of the training, but did not tell it to pretend to be aligned. Nonetheless, researchers still frequently observed alignment-mimicking behavior. Currently, alignment mimicry is easy to detect. But if future more capable AIs start pretending to be aligned, then determining whether a model is truly safe or just pretending may become more difficult.

Why is alignment mimicry a problem?

. If the model starts to pretend that it is aligned with the objective, it may become very difficult to modify because it appears to conform to the training objective, but in reality, its behavior is aimed at maintaining its initial goals. This makes further modification and improvement of the model extremely difficult.
, leading to undesirable consequences.