AI alignment
In the field of artificial intelligence, alignment aims to steer AI systems toward a person's or group's intended goals, preferences, or ethical principles. An AI system is considered aligned if it advances the intended objectives. A misaligned AI system pursues unintended objectives.
It is often challenging for AI designers to align an AI system because it is difficult for them to specify the full range of desired and undesired behaviors. Therefore, AI designers often use simpler proxy goals, such as gaining human approval. But proxy goals can overlook necessary constraints or reward the AI system for merely appearing aligned. AI systems may also find loopholes that allow them to accomplish their proxy goals efficiently but in unintended, sometimes harmful, ways.
Advanced AI systems may develop unwanted instrumental strategies, such as seeking power or survival because such strategies help them achieve their assigned final goals. Furthermore, they might develop undesirable emergent goals that could be hard to detect before the system is deployed and encounters new situations and data distributions. Empirical research showed in 2024 that advanced large language models such as OpenAI o1 or Claude 3 sometimes engage in strategic deception to achieve their goals or prevent them from being changed.
Today, some of these issues affect existing commercial systems such as LLMs, robots, autonomous vehicles, and social media recommendation engines. Some AI researchers argue that more capable future systems will be more severely affected because these problems partially result from high capabilities.
Many prominent AI researchers and the leadership of major AI companies have argued or asserted that AI is approaching human-like and superhuman cognitive capabilities, and could endanger human civilization if misaligned. These include "AI godfathers" Geoffrey Hinton and Yoshua Bengio and the CEOs of OpenAI, Anthropic, and Google DeepMind. These risks remain debated.
AI alignment is a subfield of AI safety, the study of how to build safe AI systems. Other subfields of AI safety include robustness, monitoring, and capability control. Research challenges in alignment include instilling complex values in AI, developing honest AI, scalable oversight, auditing and interpreting AI models, and preventing emergent AI behaviors like power-seeking. Alignment research has connections to interpretability research, robustness, anomaly detection, calibrated uncertainty, formal verification, preference learning, safety-critical engineering, game theory, algorithmic fairness, and social sciences.
Objectives in AI
Programmers provide an AI system such as AlphaZero with an "objective function", in which they intend to encapsulate the goal the AI is configured to accomplish. Such a system later populates a internal "model" of its environment. This model encapsulates all the agent's beliefs about the world. The AI then creates and executes whatever plan is calculated to maximize the value of its objective function. For example, when AlphaZero is trained on chess, it has a simple objective function of "+1 if AlphaZero wins, −1 if AlphaZero loses". During the game, AlphaZero attempts to execute whatever sequence of moves it judges most likely to attain the maximum value of +1. Similarly, a reinforcement learning system can have a "reward function" that allows the programmers to shape the AI's desired behavior. An evolutionary algorithm's behavior is shaped by a "fitness function".Alignment problem
In 1960, AI pioneer Norbert Wiener described the AI alignment problem as follows:
If we use, to achieve our purposes, a mechanical agency with whose operation we cannot interfere effectively ... we had better be quite sure that the purpose put into the machine is the purpose which we really desire.
AI alignment refers to ensuring that an AI system's objectives match some target. The target is variously defined as the goals of the system's designers or users, widely shared values, objective ethical standards, legal requirements, or the intentions its designers would have if they were more informed and enlightened.
AI alignment is an open problem for modern AI systems and is a research field within AI. Aligning AI involves two main challenges: carefully specifying the purpose of the system and ensuring that the system adopts the specification robustly. Researchers also attempt to create AI models that have robust alignment, sticking to safety constraints even when users adversarially try to bypass them.
Specification gaming and side effects
To specify an AI system's purpose, AI designers typically provide an objective function, examples, or feedback to the system. But designers are often unable to completely specify all important values and constraints, so they resort to easy-to-specify proxy goals such as maximizing the approval of human overseers, who are fallible. As a result, AI systems can find loopholes that help them accomplish the specified objective efficiently but in unintended, possibly harmful ways. This tendency is known as specification gaming or reward hacking, and is an instance of Goodhart's law. As AI systems become more capable, they are often able to game their specifications more effectively.Specification gaming has been observed in numerous AI systems. One system was trained to finish a simulated boat race by rewarding the system for hitting targets along the track, but the system achieved more reward by looping and crashing into the same targets indefinitely. Similarly, a simulated robot was trained to grab a ball by rewarding the robot for getting positive feedback from humans, but it learned to place its hand between the ball and camera, making it falsely appear successful. A 2025 Palisade Research study found that when tasked to win at chess against a stronger opponent, some reasoning LLMs attempted to hack the game system, for example by modifying or entirely deleting their opponent. Some alignment researchers aim to help humans detect specification gaming and steer AI systems toward carefully specified objectives that are safe and useful to pursue.
When a misaligned AI system is deployed, it can have consequential side effects. Social media platforms have been known to optimize for click-through rates, causing user addiction on a global scale. Stanford researchers say that such recommender systems are misaligned with their users because they "optimize simple engagement metrics rather than a harder-to-measure combination of societal and consumer well-being".
Explaining such side effects, Berkeley computer scientist Stuart J. Russell said that the omission of implicit constraints can cause harm: "A system will often set unconstrained variables to extreme values; if one of those unconstrained variables is actually something we care about, the solution found may be highly undesirable. This is essentially the old story of the genie in the lamp, or the sorcerer's apprentice, or King Midas: you get exactly what you ask for, not what you want."
Some researchers suggest that AI designers specify their desired goals by listing forbidden actions or by formalizing ethical rules. But Russell and Norvig argue that this approach overlooks the complexity of human values: "It is certainly very hard, and perhaps impossible, for mere humans to anticipate and rule out in advance all the disastrous ways the machine could choose to achieve a specified objective."
Additionally, even if an AI system fully understands human intentions, it may still disregard them, because following human intentions may not be its objective.
Pressure to deploy unsafe systems
Commercial organizations sometimes have incentives to take shortcuts on safety and to deploy misaligned or unsafe AI systems. For example, social media recommender systems have been profitable despite creating unwanted addiction and polarization. Competitive pressure can also lead to a race to the bottom on AI safety standards. In 2018, a self-driving car killed a pedestrian after engineers disabled the emergency braking system because it was oversensitive and slowed development. Similarly, OpenAI has been sued for releasing a ChatGPT version that encouraged suicide for some unstable users, a behavior the company had overlooked amid a rushed product release.Risks from advanced misaligned AI
Some researchers are interested in aligning increasingly advanced AI systems, as progress in AI development is rapid, and industry and governments are trying to build advanced AI. As AI system capabilities continue to rapidly expand in scope, they could unlock many opportunities if aligned, but consequently may further complicate the task of alignment due to their increased complexity, potentially posing large-scale hazards.Development of advanced AI
Many AI companies, such as OpenAI, Meta and DeepMind, have stated their aim to develop artificial general intelligence, a hypothesized AI system that matches or outperforms humans at a broad range of cognitive tasks. Researchers who scale modern neural networks observe that they indeed develop increasingly general and unanticipated capabilities. Such models have learned to operate a computer or write their own programs; a single "generalist" network can chat, control robots, play games, and interpret photographs. According to surveys, some leading machine learning researchers expect AGI to be created in, while some believe it will take much longer. Many consider both scenarios possible.In 2023, leaders in AI research and tech signed an open letter calling for a pause in the largest AI training runs. The letter stated, "Powerful AI systems should be developed only once we are confident that their effects will be positive and their risks will be manageable."
Power-seeking
systems still have limited long-term planning ability and situational awareness, but large efforts are underway to change this. Future systems with these capabilities are expected to develop unwanted power-seeking strategies. Future advanced AI agents might, for example, seek to acquire money and computation power, to proliferate, or to evade being turned off. Although power-seeking is not explicitly programmed, it can emerge because agents who have more power are better able to accomplish their goals. This tendency, known as instrumental convergence, has already emerged in various reinforcement learning agents including language models. Other research has mathematically shown that optimal reinforcement learning algorithms would seek power in a wide range of environments. As a result, their deployment might be irreversible. For these reasons, researchers argue that the problems of AI safety and alignment must be resolved before advanced power-seeking AI is first created.Future power-seeking AI systems might be deployed by choice or by accident. As political leaders and companies see the strategic advantage in having the most competitive, most powerful AI systems, they may choose to deploy them. Additionally, as AI designers detect and penalize power-seeking behavior, their systems have an incentive to game this specification by seeking power in ways that are not penalized or by avoiding power-seeking before they are deployed.