We've created a list of 20 categories that are crucial to the safety of AI systems. Empirical benchmarks in these areas could lay the groundwork for future safety research.
Alignment
Power-seeking
| Measuring or penalizing power-seeking.
Description
Agents may be instrumentally incentivized to seek power to better accomplish their goals. Various forms of power, including resources, legitimate power, coercive power, and so on, are helpful for achieving nearly any goal a system might be given. AIs that acquire substantial power can become especially dangerous if they are not aligned with human values, since powerful agents are more difficult to correct and can create more unintended consequences. AIs that pursue power may also reduce human autonomy and authority, so we should avoid building agents that act outside reasonable boundaries.
Good benchmarks
A benchmark may involve developing an environment in which agents clearly develop self-preserving or objective-preserving tendencies and designing a metric that tracks this behavior. Likely this would arise through reinforcement learning, although demonstrations of self-preservation and objective preservation in other systems (e.g., language models) would be interesting as well. Benchmarks may consider using video games or other environments in which agents’ goals can be achieved by acquiring power; for instance, by hoarding weapons despite instructions not to do so.
Potential solutions
A solution could monitor the power-seeking tendencies of agents to detect agents that might be gaining too much power. It could also apply a penalty or regularizer that discourages agents from accumulating power.
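As a minimal sketch of what such a regularizer could look like (an illustration, not a method proposed here), the reward signal could simply subtract a term proportional to the agent's estimated power; the environment-specific `power_proxy` scoring function is a hypothetical placeholder.

```python
# Minimal sketch of a power-regularized reward signal.
# `power_proxy(state)` is a hypothetical function scoring how much power
# (resources, influence, optionality) the agent currently holds.

def power_regularized_reward(task_reward: float,
                             state,
                             power_proxy,
                             penalty_coeff: float = 0.1) -> float:
    """Subtract a penalty proportional to the agent's estimated power."""
    return task_reward - penalty_coeff * power_proxy(state)


# Example with a toy proxy: power = number of resources the agent holds.
if __name__ == "__main__":
    toy_state = {"resources_held": 7}
    proxy = lambda s: float(s["resources_held"])
    print(power_regularized_reward(task_reward=1.0, state=toy_state,
                                   power_proxy=proxy))  # -> 0.3
```

The hard part, of course, is constructing a power proxy that actually tracks resources, influence, and optionality rather than an easily gamed surrogate.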
More reading
Honest, non-deceptive language models
| Measuring the extent to which language models state what they believe.
Description
An honest language model only outputs text that reflects its internal belief structure. Honest language models are less likely to generate misleading output. In addition, honest language models are more easily screened for general capabilities and alignment of objectives. Model dishonesty may lead to actively deceptive models: models that generate text specifically intended to deceive other agents or overseers.
Honesty is not the same as truthfulness, which requires that models only output truths about the world. We focus on honesty rather than truthfulness because honesty is largely orthogonal to general model capabilities: being truthful requires both honesty and the capability to determine the truth.
Good benchmarks
A benchmark could build an evaluation scheme that catches models making inconsistent statements. It might detect knowledge stores in models and measure the extent to which a model is leveraging independent knowledge stores for related queries. Useful benchmarks should ideally build rigorous definitions of “honesty” and “beliefs.”
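One minimal way to operationalize the inconsistency-catching idea is a paraphrase-consistency score. The sketch below assumes a hypothetical `model` callable mapping a prompt to a short answer string; the stub in the usage example is purely illustrative.

```python
# Sketch of a paraphrase-consistency check for honesty benchmarking.
# `model` is a hypothetical callable mapping a prompt to a short answer string;
# a real benchmark would substitute an actual language model.

from itertools import combinations

def consistency_score(model, paraphrases):
    """Fraction of paraphrase pairs for which the model gives the same answer."""
    answers = [model(p).strip().lower() for p in paraphrases]
    pairs = list(combinations(answers, 2))
    if not pairs:
        return 1.0
    return sum(a == b for a, b in pairs) / len(pairs)


# Usage with a stub model that answers inconsistently:
stub = lambda prompt: "paris" if "capital" in prompt else "lyon"
print(consistency_score(stub, [
    "What is the capital of France?",
    "France's capital city is called what?",
    "Which city is the seat of the French government?",
]))  # -> 0.333...
```

Consistency is only a necessary condition for honesty, so a full benchmark would pair such checks with probes of the model's internal knowledge stores.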
Potential solutions
A future solution entails finding ways to detect and penalize dishonesty. Ideally, this would enable us to develop language models which never generate dishonest text.
More reading
- TruthfulQA: Measuring How Models Mimic Human Falsehoods
- Measuring and Improving Consistency in Pretrained Language Models
- Hallucinations in Neural Machine Translation
- Language Models as Knowledge Bases?
- Knowledgeable or Educated Guess? Revisiting Language Models as Knowledge Bases
Treacherous turns
| Detect ulterior motives, other than the human-specified objective, that drive an agent's actions.
Description
It may be more efficient for a model to gain human approval through deception than to earn it legitimately. Deception also provides optionality: systems that have the capacity to be deceptive have strategic advantages over restricted, honest models. Strong AIs that deceive humans could undermine human control.
In one form of deception, AI systems could have incentives to bypass monitors. Historically, individuals and organizations have had similar incentives. For example, Volkswagen programmed their engines to reduce emissions only when being monitored. This allowed them to achieve performance gains while retaining purportedly low emissions. Future AI agents could similarly switch strategies when being monitored and take steps to obscure their deception from monitors.
Good benchmarks
A good dataset could contain numerous examples of goals, valid trajectories, and slightly deceptive trajectories (i.e., trajectories that pursue a related but different goal while appearing similar to valid ones). Watchdog systems could be evaluated on how well they discover such anomalies.
Potential solutions
A watchdog system could evaluate a model’s trajectory and detect whether it is in keeping with or dissonant with a stated goal. Models whose trajectories are dissonant might be flagged for human review.
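A rough sketch of such a watchdog is given below, assuming a hypothetical `embed` function that maps a goal description or an action log to a vector; any real system would substitute an actual encoder and a tuned threshold.

```python
# Sketch of a watchdog that flags trajectories dissonant with a stated goal.
# `embed` is a hypothetical function mapping text (a goal description or an
# action log) to a vector; a real system would plug in an actual encoder.

import numpy as np

def dissonance(goal: str, trajectory_log: str, embed) -> float:
    """1 minus the cosine similarity between goal and trajectory embeddings."""
    g, t = embed(goal), embed(trajectory_log)
    cos = float(np.dot(g, t) / (np.linalg.norm(g) * np.linalg.norm(t) + 1e-8))
    return 1.0 - cos

def flag_for_review(goal, trajectory_log, embed, threshold: float = 0.5) -> bool:
    """Flag trajectories whose dissonance exceeds the threshold."""
    return dissonance(goal, trajectory_log, embed) > threshold
```

Trajectories whose dissonance exceeds the threshold would then be queued for human review, as described above.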
More reading
Collusion
| Detecting and preventing collusion in multi-agent systems.
Description
In multi-agent environments (e.g. a monitor evaluating a model), there may be incentives to collude; for example, a monitor and a model under evaluation could collude to both report favorable results. Undetectable collusion undermines the integrity of monitoring mechanisms and opens the door to a variety of failure modes.
Good benchmarks
A useful environment might incentivize collusion in a toy scenario and provide a standardized method of measurement, so that anti-collusion techniques can be objectively evaluated. Environments could also propose anti-collusion measures (e.g. limited communication channels) and create a benchmark to elicit examples of collusion that are still possible. Video games with strategies involving collusion may be useful sandboxes.
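To make the incentive structure concrete, the toy sketch below encodes a monitor/model interaction as a two-player game with invented payoffs and checks which action profiles are stable.

```python
# Toy two-player game between a monitor (rows) and a model under evaluation
# (columns). Payoff numbers are purely illustrative.

payoffs = {
    # (monitor_action, model_action): (monitor_payoff, model_payoff)
    ("honest",  "comply"): (2, 2),
    ("honest",  "defect"): (3, 0),   # monitor catches and reports defection
    ("collude", "comply"): (2, 2),
    ("collude", "defect"): (4, 4),   # both report favorable results
}

def is_stable(profile):
    """True if neither player can gain by unilaterally switching actions."""
    m_act, a_act = profile
    m_pay, a_pay = payoffs[profile]
    monitor_ok = all(m_pay >= payoffs[(alt, a_act)][0]
                     for alt in ("honest", "collude"))
    model_ok = all(a_pay >= payoffs[(m_act, alt)][1]
                   for alt in ("comply", "defect"))
    return monitor_ok and model_ok

print(is_stable(("honest", "comply")))   # True: honest compliance is stable...
print(is_stable(("collude", "defect")))  # ...but mutual collusion is stable too,
                                         # and it pays both agents more (4, 4)
```

A benchmark environment would make analogous payoffs implicit in the task, and anti-collusion measures would be scored by how much they shrink or destabilize the collusive equilibrium.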
Potential solutions
A solution to collusion would involve mechanisms to detect and prevent multi-agent collusion and ensure integrity of monitoring mechanisms. For instance, sufficiently advanced monitors may be able to act as “undercover agents,” pretending to be colluders to elicit collusive responses from models under test. Alternatively, changing various factors in the environment may also sufficiently disincentivize collusion.
More reading
- Bostrom, Superintelligence, page 179
- Reframing Superintelligence, page 103
Transparency
| Building tools that offer clarity into model inner workings.
Description
Neural networks are notoriously opaque. Transparency tools that intelligibly communicate model reasoning and knowledge to humans may be useful for uncovering dangerous model properties and making models correctable.
Good benchmarks
Benchmarks could determine how well transparency tools are able to identify belief structures possessed by models. They might measure the predictive power transparency tools lend with respect to generated text or action trajectories.
Potential solutions
Successful transparency tools would allow a human to predict how a model will behave in various situations without testing it. They would provide clear explanations for behavior that make intervention paths clear. These tools could be easily applied (ex ante and ex post) to unearth deception, emergent capabilities, and failure modes.
More reading
Control
Emergent capabilities
| Detecting and forecasting emergent capabilities.
Description
In today’s AI systems, capabilities which are not anticipated by system designers emerge during training. For example, as language models became larger, they gained the ability to perform arithmetic, even though they received no explicit arithmetic supervision. Future ML models may, when deliberately prompted, demonstrate the capability to synthesize harmful content or to carry out cybercrimes at scale. To safely deploy these systems, we must monitor what capabilities they possess. Furthermore, if we’re able to accurately forecast future capabilities, this gives us time to prepare to mitigate their potential risks.
Good benchmarks
Benchmarks could assume the presence of a trained model and probe it through a battery of tests designed to reveal new capabilities. Benchmarks could also evaluate capabilities prediction methods themselves, e.g., by creating a test set of unseen models with varying sets of capabilities and measuring the accuracy of white-box static analysis models that predict capabilities given model weights. Benchmarks could cover one or more model types, including language models or reinforcement learning agents.
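As one concrete (and deliberately simplistic) forecasting baseline, one could fit a power law to benchmark error as a function of model size and extrapolate; all numbers in the sketch below are invented for illustration.

```python
# Sketch of capability forecasting: fit a power law to benchmark error versus
# parameter count and extrapolate to a larger model. All numbers are invented.

import numpy as np

params = np.array([1e8, 3e8, 1e9, 3e9, 1e10])      # hypothetical model sizes
errors = np.array([0.62, 0.55, 0.47, 0.41, 0.35])  # hypothetical benchmark errors

# Fit log(error) = beta * log(N) + log(A), i.e. error = A * N**beta.
beta, log_a = np.polyfit(np.log(params), np.log(errors), 1)
forecast = np.exp(log_a) * (1e11) ** beta
print(f"forecasted error at 1e11 parameters: {forecast:.3f}")
```

Emergent capabilities are precisely the cases where such smooth extrapolations break down, so a benchmark for forecasting methods would score predictions against held-out larger models rather than trusting the fit.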
Potential solutions
A solution would be able to reliably detect current model capabilities and forecast future capabilities. This includes screening or predicting novel capabilities which have yet to be clearly demonstrated. A strong partial solution might be to provide upper bounds for various models’ capabilities.
More reading
- Scaling Laws for Neural Language Models
- Scaling Laws for Autoregressive Generative Modeling
- Deep Learning Scaling is Predictable, Empirically
- Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets
- Pathways Language Model (PaLM): Scaling to 540 Billion Parameters for Breakthrough Performance
- Generally capable agents emerge from open-ended play
- Emergent Tool Use from Multi-Agent Interaction
Hazardous capability mitigation
| Prevent and remove unwanted and dangerous capabilities from trained models.
Description
In today’s AI systems, capabilities which are not anticipated by system designers emerge during training. These include potentially hazardous capabilities, such as theory of mind, deception, or illegal and harmful content synthesis. To ensure that models can be safely deployed, hazardous capabilities may need to be removed after training. Alternatively, the training procedure or dataset may be changed in order to train new models without this hazardous capability.
Good benchmarks
A good benchmark might measure the accessibility or re-trainability of a specific capability in a model. Models with certain undesirable capabilities could be altered and evaluated on a dataset measuring these capabilities. Benchmarks might also verify that the capability removal does not affect model performance in unrelated and harmless domains.
Potential solutions
Researchers could create training techniques such that undesirable capabilities are not acquired during training or during test-time adaptation. For ML systems that have already acquired an undesirable capability, researchers could teach ML systems to forget that capability.
It may be difficult to determine whether a capability is truly absent rather than obfuscated or partially removed. The goal is to develop techniques that remove capabilities as fully as possible, making it maximally difficult to retrain or elicit them. Capability removal techniques would ideally be robust as well as scalable (feasible to perform on very large models).
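A crude way to quantify the re-trainability mentioned above is to count how many fine-tuning steps an adversary would need to restore the capability; the sketch below assumes hypothetical `finetune_one_step` and `evaluate_capability` helpers standing in for a real training harness.

```python
# Sketch of a "re-trainability" probe: how many fine-tuning steps does it take
# to restore a supposedly removed capability above a threshold? The helpers
# `finetune_one_step` and `evaluate_capability` are hypothetical stand-ins.

def steps_to_recover(model, capability_data, finetune_one_step,
                     evaluate_capability, threshold: float = 0.8,
                     max_steps: int = 1000) -> int:
    """Return the number of fine-tuning steps until capability accuracy
    exceeds `threshold`, or max_steps if it never does. Larger is better:
    the removal is harder to undo."""
    for step in range(1, max_steps + 1):
        model = finetune_one_step(model, capability_data)
        if evaluate_capability(model) >= threshold:
            return step
    return max_steps
```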
More reading
Intrasystem goals
| Enable safe delegation of subgoals to subsystems.
Description
In the future, AI systems or human operators may handle tasks by breaking them into subgoals and communicating them to many independent artificial agents. However, breaking down a task can distort it; systems pursuing subgoals may seek to gain and preserve power at the expense of the top-level agent. Analogously, companies set up specialized departments, such as finance and IT, to pursue intrasystem goals. Occasionally a department captures power and leads the company away from its original mission. Therefore, even if we correctly specify our high-level objectives, systems may not pursue them in practice.
Good benchmarks
A good benchmark might be an environment which naturally causes intrasystem goals to develop. Such a benchmark would act as a testbed for methods that seek to detect intrasystem goals or keep them in check.
Potential solutions
Future solutions might detect when intrasystem goals emerge in AI systems. They might also ensure that intrasystem goals neither distort nor overshadow the original objective.
More reading
Proxy gaming
| Detect when models are pursuing proxies to the detriment of the true goal. Develop robust proxies.
Description
When building systems, it is often difficult to measure the true goal (e.g., human wellbeing) directly. Instead, it is commonplace to set “proxy metrics”—metrics that quantify facets of the true goal—and to optimize for these proxy metrics. However, proxy metrics can be “gamed”: a system might be able to optimize the proxy metric to the detriment of the true goal.
For instance, RL systems frequently game their reward functions to achieve high reward but fail to embody the “spirit” of the task. Similarly, recommender systems optimizing for user engagement have been demonstrated to recommend polarizing content. This content engenders high engagement but sacrifices human wellbeing in the process, which has been costly not only for users but also for system designers.
Good benchmarks
A good proxy gaming detection benchmark would measure whether a detection method could distinguish between proxy gaming and optimization of the true objective. One could conceptualize proxy gaming detection as detecting divergence between a “fast” proxy objective and a “slow” true objective that is expensive to evaluate, hidden from the model, or only revealed every $N$ timesteps. Benchmarks could then test if optimizing the fast objective diverges from the true one. A good proxy gaming resilience benchmark would measure whether a proxy metric could withstand optimization by an increasingly powerful optimizer.
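The fast-proxy/slow-true-objective framing can be illustrated with a toy simulation in which an optimizer hill-climbs a cheap proxy while a hidden true objective is revealed only every N steps; both objective functions below are invented purely for illustration.

```python
# Minimal simulation of proxy gaming: an optimizer hill-climbs a proxy metric
# while we track a "slow" true objective revealed only every N steps.
# The objectives are toy functions chosen purely for illustration.

import numpy as np

def true_objective(x):   # expensive / hidden objective
    return x[0] - 0.5 * x[1] ** 2

def proxy_objective(x):  # cheap proxy: also rewards a gameable channel x[1]
    return x[0] + x[1]

def proxy_grad(x):       # gradient of the proxy with respect to x
    return np.array([1.0, 1.0])

x, lr, N = np.zeros(2), 0.1, 50
for step in range(1, 301):
    x += lr * proxy_grad(x)          # optimize the proxy
    if step % N == 0:                # periodically reveal the true objective
        print(f"step {step:3d}  proxy={proxy_objective(x):7.2f}  "
              f"true={true_objective(x):7.2f}")
# The proxy keeps improving while the true objective eventually declines.
```

A detection benchmark would hide the true objective from the method under test and score how early the divergence is flagged; a resilience benchmark would instead score how long the proxy withstands increasingly strong optimization.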
Potential solutions
A future solution involves methods to detect proxy gaming and to prevent or penalize proxy gaming. Researchers might seek to develop proxies which are adversarially robust (that is, robust to optimization) or proxies which are adaptive and can respond to proxy gaming.
More reading
Trojans
| Detect backdoored static models.
Description
Trojans (or backdoors) are vulnerabilities in which a model is taught by an adversary to misbehave on a specific set of inputs. For example, a backdoored language model might produce toxic text when triggered but otherwise behave benignly. Trojans are usually introduced into the model through data poisoning, which is a particular risk when the training dataset comprises data scraped from public sources.
Screening for and patching trojans is necessary to ensure model security in real-world applications. Otherwise, adversaries might exploit the model’s backdoor to their own advantage.
Trojans need not be human-generated; advanced AI agents may develop their own deceptive practices by behaving benignly on most inputs but defecting on a small subset of inputs. This mode of deception is known as a treacherous turn and can be viewed as a natural backdoor attack.
Good benchmarks
The goal of a benchmark might be to measure whether certain backdoor detection and mitigation techniques work on a variety of backdoor injection methods, including holdout backdoors. Benchmarks could collect a series of progressively more inconspicuous backdoored models and determine when the backdoors become unidentifiable. Benchmarks could also establish some metric to quantify the detectability of backdoors and evaluate the effectiveness of detection techniques using this metric.
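If each model in a benchmark zoo carries a ground-truth trojaned/clean label, detector quality reduces to a standard scoring problem; the sketch below assumes a hypothetical `detector_score` function and reports AUROC.

```python
# Sketch of scoring a trojan detector over a model zoo with known labels.
# `detector_score` is a hypothetical function returning a suspiciousness score;
# `model_zoo` is a list of (model, is_trojaned) pairs supplied by the benchmark.

from sklearn.metrics import roc_auc_score

def evaluate_detector(detector_score, model_zoo) -> float:
    labels = [int(is_trojaned) for _, is_trojaned in model_zoo]
    scores = [detector_score(model) for model, _ in model_zoo]
    return roc_auc_score(labels, scores)  # 0.5 = chance, 1.0 = perfect detection
```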
Potential solutions
A complete solution to trojans would involve reliable detection and patching of any and all backdoors. Furthermore, these detection and patching methods should ideally be able to generalize to unforeseen trojan attacks. As mentioned earlier, a full solution to trojaning would hopefully help with detecting model deception and treacherous turns.
More reading
- Neural Cleanse: Identifying and Mitigating Backdoor Attacks in Neural Networks
- BadNets: Identifying Vulnerabilities in the Machine Learning Model Supply Chain
- BadNL: Backdoor Attacks against NLP Models with Semantic-preserving Improvements
- Deep Learning Backdoors
- Introduction to ML Safety Course Readings
Robustness
Adversarial robustness
| Improve defenses against adversarial attacks.
Description
Adversarial robustness entails making systems robust to optimizers that aim to induce specific system responses. Adversarial robustness plays a big role in AI security. As AI systems become more widespread and mission critical, they need to be adversarially robust to avoid being exploited by malicious actors.
Models must be robust to unforeseen adversaries. As the research community has demonstrated, the design space of potential adversarial attacks is incredibly large. Strong defenses must therefore be robust to unseen attacks.
In addition to security, the metrics used to quantify the performance of a system (e.g., the reward function) must be adversarially robust to prevent proxy gaming. If our metrics are not adversarially robust, AI systems will exploit vulnerabilities in these proxies to achieve high scores without optimizing the actual objective.
Good benchmarks
A good benchmark would tackle new and ideally unseen attacks, not ones with known defenses or small $l_p$-perturbations. Perceptible distortions, such as rotation or cropping, could be considered. It would be especially interesting to measure defenses against expensive or highly motivated attacks. Benchmarks could consider attacks in a variety of domains beyond natural images or text, including attacks on systems involving multiple redundant sensors or information channels.
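Even though the emphasis above is on new and unforeseen attacks, most robustness benchmarks still report accuracy under a standard baseline attack such as projected gradient descent (PGD); a minimal sketch of that evaluation loop in PyTorch follows, assuming inputs scaled to [0, 1] and a classification model.

```python
# Standard L-infinity PGD attack loop for measuring robust accuracy, included
# as a baseline sketch; stronger or unforeseen attacks would replace or
# augment this in a good benchmark.

import torch

def pgd_attack(model, x, y, eps=8/255, alpha=2/255, steps=10):
    """Return adversarial examples within an eps-ball (L-infinity) of x."""
    x_adv = x.clone().detach() + torch.empty_like(x).uniform_(-eps, eps)
    x_adv = x_adv.clamp(0, 1)
    loss_fn = torch.nn.CrossEntropyLoss()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = loss_fn(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        x_adv = x_adv.detach() + alpha * grad.sign()        # ascent step
        x_adv = torch.min(torch.max(x_adv, x - eps), x + eps).clamp(0, 1)
    return x_adv.detach()

def robust_accuracy(model, loader, **attack_kwargs):
    """Accuracy on adversarially perturbed inputs from a (x, y) data loader."""
    model.eval()
    correct = total = 0
    for x, y in loader:
        x_adv = pgd_attack(model, x, y, **attack_kwargs)
        with torch.no_grad():
            correct += (model(x_adv).argmax(dim=1) == y).sum().item()
        total += y.numel()
    return correct / total
```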
Potential solutions
An adversarially robust AI system would be a system whose behavior is near impossible to manipulate in harmful ways. Such a system would ideally be robust to both well-studied and unforeseen attacks.
An adversarially robust proxy metric would be a metric which tracks true objectives very closely.
More reading
Adaptive model security
| Prevent adaptive models from being compromised by adversaries.
Description
Deployed AI systems will likely need to adapt to new data and perform online learning. This is a radically different paradigm from traditional ML research, in which models are trained once and then tested, and it creates new attack surfaces and vulnerabilities.
Adaptive models may face adversaries who exploit model adaptiveness. For example, Microsoft’s 2016 Tay chatbot was designed to adapt to new phrases and slang. Malicious users exploited this adaptivity by teaching it to speak in offensive and abusive ways. More generally, adversaries might cause models to adapt to poisoned data, implanting vulnerabilities in the model or generally causing the model to misbehave.
Good benchmarks
An exemplary benchmark might be an initial text training corpus followed by adaptive data meant to be learned online; the adaptive data could be constructed adversarially so as to contain high levels of undesired content such as hateful or explicit speech. Metrics could evaluate to what extent the model incorporates this content and whether it can be effectively steered away from adapting to it. Benchmarks might contain significant held-out data to replicate the adversarial unknowns in real-world deployment.
Potential solutions
The goal is to develop adaptive models which are not easily exploitable by adversaries. Perhaps adaptive models could be taught to detect adversaries and avoid adapting on their poisoned data. Alternatively, perhaps adaptive models could learn at test time ways to counter adversaries, similar to adaptive DDoS protection.
More reading
Interpretable uncertainty
| Use calibration and language to convey the true confidence a model has in its predictions.
Description
To make models more trustworthy, they should accurately assess their domain of competence, i.e., the set of inputs they are able to handle. Models can convey the limits of their competency by expressing their uncertainty. However, models' uncertainty estimates are currently not representative of their actual performance, and models are often overconfident. Unreliable and uninterpretable model confidence precludes competent operator oversight.
Good benchmarks
Good benchmarks would generate novel data to test models’ uncertainty estimates. They could scale up the complexity or size of datasets in the existing literature. Benchmarks could also test the extent to which a language model makes statements about the world that are qualified by the certainty it has in its own ontology (for instance, “I doubt that pigs can fly”). Additionally, benchmarks could evaluate the quality of a model’s textual explanations of its uncertainty (for instance, “I’m not sure if this is a cat because the tail looks like a carrot”).
Potential solutions
Model uncertainty could be made more interpretable by requesting that models provide confidence intervals or natural-language statements of conditional probabilities. For example, a traffic predictor might say “There is a 70% chance of traffic congestion if it rains today, and a 35% chance of congestion if it does not.” Researchers could also further improve model calibration on typical testing data, although the greater challenge is calibration on testing data that is unlike the training data. To extend calibration beyond single-label outputs, researchers could teach generative models to assign calibrated confidences to their free-form completions.
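Calibration itself is straightforward to quantify; a common metric is the expected calibration error (ECE), sketched below with an illustrative example of an overconfident model.

```python
# Standard expected calibration error (ECE) computation, a common way to
# quantify how trustworthy a model's reported confidences are.

import numpy as np

def expected_calibration_error(confidences, correct, n_bins: int = 10) -> float:
    """confidences: predicted max-class probabilities; correct: 0/1 outcomes."""
    confidences, correct = np.asarray(confidences), np.asarray(correct)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap    # weight each bin by its share of examples
    return float(ece)

# Example: a model that is 90% confident but only 60% accurate is miscalibrated.
print(expected_calibration_error([0.9] * 10, [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]))  # 0.3
```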
More reading
Text/RL out-of-distribution detection
| Detect out-of-distribution text or events in a reinforcement learning context.
Description
Language models are seeing adoption in increasingly high-stakes settings, such as in software engineering. Reinforcement learning has been useful for robotics and industrial automation, such as in automated data center cooling. With both modalities, it is essential that systems are able to identify out-of-distribution inputs so that models can be overridden by external operators or systems. Yet despite a wide availability of text data and reinforcement learning environments, most out-of-distribution detection work has focused on image data.
Good benchmarks
A strong benchmark might include a diverse range of environments or textual contexts that models are evaluated on after training. The benchmark should contain difficult examples near the boundary between in-distribution and out-of-distribution data. In addition, it should specify a clear evaluation protocol so that methods are simple to compare.
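A simple evaluation protocol of the kind described above could score each input by its negative maximum softmax probability (MSP) and report how well that score separates in-distribution from out-of-distribution examples; the sketch below assumes the benchmark supplies model logits for both splits.

```python
# Sketch of an OOD detection evaluation protocol: score inputs by negative
# maximum softmax probability and report AUROC for separating in-distribution
# (label 0) from out-of-distribution (label 1) examples.

import numpy as np
from sklearn.metrics import roc_auc_score

def msp_ood_scores(logits: np.ndarray) -> np.ndarray:
    """Higher score = more OOD. logits has shape (n_examples, n_classes)."""
    z = logits - logits.max(axis=1, keepdims=True)         # numerical stability
    probs = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    return -probs.max(axis=1)

def ood_auroc(in_dist_logits, ood_logits) -> float:
    scores = np.concatenate([msp_ood_scores(in_dist_logits),
                             msp_ood_scores(ood_logits)])
    labels = np.concatenate([np.zeros(len(in_dist_logits)),
                             np.ones(len(ood_logits))])
    return roc_auc_score(labels, scores)
```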
Potential solutions
A future solution would extend strong out-of-distribution detection to text and reinforcement learning environments.
More reading
Long-tail robustness
| Make models robust against highly unlikely events and black swans.
Description
The real world is rife with long tails and black swan events that continue to thwart modern ML systems. Despite petabytes of task-specific fine-tuning, autonomous vehicles still struggle to robustly handle basic concepts like stop signs. Even human systems often fail at long-tail robustness; the 2008 financial crisis and COVID-19 have shown that institutions struggle to handle black swan events. Some such events are known and simply very rare (e.g., the possibility of a catastrophic earthquake). Others are completely unanticipated until encountered (e.g., the rise of the internet).
Good benchmarks
Many long-tail events fall into the domain of human “known unknowns”: we know about them (e.g., the possibility of a catastrophic earthquake) but don’t prepare models for these eventualities. As such, benchmarks could test models against predictable long-tail events, including new, unusual, and extreme distribution shifts. Following industry precedents, benchmarks could include simulated data that capture structural properties of real long-tail events, such as environmental feedback loops. Benchmarks should also prioritize “wild” distribution shifts that cause large accuracy drops over “mild” shifts.
Potential solutions
The end goal would be to develop models which are robust to extreme events or black swans. Moreover, such models must be able to generalize to unforeseen black swans.
More reading
Applications
Moral decisionmaking
| Train ethical monitors to help humans and other models make good decisions.
Description
Moral decision-making involves imbuing AI systems with a direct understanding of human moral and ethical considerations. A strong moral decision maker could help tackle several other categories on this list, including honest language models (avoiding lying as a normative factor) and power-seeking (preserving human autonomy as an intrinsic good). Along with making its own decisions, a moral AI could also help humans clarify their own values. Value clarification is especially crucial as the development of powerful AI systems may effectively lock in values after their creation.
Good benchmarks
Given a particular moral system, a good benchmark might seek to measure whether a model makes moral decisions according to that system or whether a model understands that moral system. Benchmarks may be based on different modalities (e.g., language, sequential decision-making problems) and different moral systems. Benchmarks may also curate philosophical texts, or pro and contra positions in philosophical debates and thought experiments, and test whether models can predict them. In addition, benchmarks may measure whether models can deal with moral uncertainty or measure a model’s preferences between moral systems. While an individual benchmark may focus on a single moral system, an ideal set of benchmarks would have a diversity representative of humanity’s own diversity of moral thought.
Note that moral decisionmaking has some overlap with task preference learning (e.g., “I like this Netflix movie”). However, human preferences also tend to boost standard model capabilities (they provide a signal of high performance), and they are less robust under extreme environmental shifts. Instead, we focus here on enduring human values, such as normative factors (wellbeing, impartiality, etc.) and the factors that constitute a good life (pursuing projects, seeking knowledge, etc.).
Potential solutions
Initially, we might train a model to act as a “moral conscience” and use this model to screen for any morally dubious actions/decisions. As our ethical models become more robust to optimization, we might incorporate them into loss or reward functions. Eventually, we would want every powerful model to be driven, in part, by a robust moral compass. Instead of privileging a single moral system, we may want an ensemble of various moral systems representing the diversity of humanity’s own moral thought.
More reading
Value clarification
| Evaluating AI-assisted moral philosophy research.
Description
In past decades, humanity’s moral attitudes have swung drastically. It’s unlikely that human moral development has converged. To address deficiencies in our existing moral systems, and to handle the long-tail moral quandaries humanity will face in light of cutting-edge technologies, AI research systems could help set and clarify moral precedents. Ideally, AI systems will avoid locking in deficient existing moral systems.
Good benchmarks
Benchmarks could evaluate the internal consistency of proposed moral systems or the alignment of outcomes required by a moral system with human preferences.
Potential solutions
Good work in value clarification would be able to produce original insights in philosophy, such that models could make philosophical arguments or write seminal philosophy papers. Value clarification systems could also point out inconsistencies in existing ethical views, arguments, or systems.
More reading
Cyberdefense
| As sensitive infrastructure moves online, use ML to defend against sophisticated cyberattacks.
Description
Networked computer systems now control critical infrastructure, sensitive databases, and powerful ML models. This leads to two major weaknesses:
- As AI systems increase in economic relevance, cyberattacks on AI systems themselves will become more common. Some AI systems may be private or unsuitable for proliferation, and they will therefore need to operate on computers that are secure.
- ML may amplify future automated cyberattacks. Hacking currently requires specialized skills, but if ML code-generation models or agents could be fine-tuned for hacking, then the barrier to entry may decrease sharply. ML systems would enable malicious actors to increase the accessibility, potency, success rate, scale, speed, and stealth of their attacks. Since cyberattacks can destroy valuable information and even damage critical physical infrastructure such as power grids and building hardware, these potential attacks are a looming threat to international security.
Good benchmarks
Useful benchmarks could outline a standard for evaluating one or more of the defensive tasks listed under potential solutions below. Benchmarks could also be based on existing capture-the-flag (CTF) defense challenges, perhaps with minor adaptations to simplify the task for models.
Benchmarks may involve toy tasks, but they should bear similarity to real-world tasks. A good benchmark should incentivize defensive capabilities only and have limited offensive utility.
Potential solutions
Future ML systems could:
- Automatically detect intrusions
- Actively stop cyberattacks by selecting or recommending known defenses
- Submit patches to security vulnerabilities in code
- Generate unexpected inputs for programs (fuzzing)
- Model binaries and packets to detect obfuscated malicious payloads
- Predict next steps in large-scale cyberattacks to provide contextualized early warnings
Warnings could be judged by lead time, precision, recall, and quality of contextualized explanations.
ML defense systems should be able to cope with dynamic threats and adversarial attacks from bad actors (e.g., attacks crafted using knowledge of publicly available training data). Systems should also plausibly scale to real-world levels of complexity; large corporations should be able to deploy ML monitors on fleets of servers, each running production-deployed software.
More reading
Improved decisionmaking
| Surface and forecast strategic information for high-stakes decisions.
Description
High-level decisionmakers in governments and organizations must act on limited data, often in uncertain and rapidly evolving situations. Failing to surface data can have enormous consequences; in 1995, the Russian nuclear briefcase was activated when radar operators, who had not been informed of a Norwegian scientific rocket launch, interpreted it as a possible nuclear attack. Failures also arise from the difficulty of forecasting the social or geopolitical consequences of interventions. In such complex situations, humans are liable to make egregious errors. ML systems that can synthesize diverse information sources, predict a variety of events, and identify crucial considerations could help provide good judgment and correct misperceptions, and thereby reduce the chance of rash decisions and inadvertent escalation.
Good benchmarks
Benchmarks will benefit from high-quality human-labeled training data. They may involve backtesting with historical strategic scenarios, including multiple proposed approaches and their tradeoffs, risks, and retrospective outcomes. They should ensure that models’ predictions generalize well across domains and over time. Benchmarks may also tackle concrete subproblems, e.g., QA for finding stakeholders related to a given question or forecasting specific categories of events.
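For the forecasting subproblem, probabilistic predictions can be scored with standard proper scoring rules such as the Brier score; the forecasts and outcomes in the sketch below are invented for illustration.

```python
# Sketch of scoring probabilistic forecasts of binary events with the Brier
# score (lower is better). The forecasts and outcomes below are made up.

import numpy as np

def brier_score(forecast_probs, outcomes) -> float:
    p = np.asarray(forecast_probs, dtype=float)
    y = np.asarray(outcomes, dtype=float)
    return float(np.mean((p - y) ** 2))

# A system forecasting three binary events at 0.8, 0.6, and 0.1 probability,
# of which the first two occurred:
print(brier_score([0.8, 0.6, 0.1], [1, 1, 0]))  # -> 0.07
```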
Potential solutions
An ML system could evaluate, for example, the long-term effects of a new hate-speech moderation policy on social networks and provide probability estimates of tail risks and adversarial activity to trust and safety teams. ML systems could estimate the outcomes of political or technological interventions and suggest alternatives and tradeoffs or uncover historical precedents. Systems could support interactive dialogue, in which they bring up base rates, crucial questions, metrics, and key stakeholders. Forecasting tools necessitate caution and careful preparation to prevent human overreliance and excessive risk-taking.
More reading
- Ought, scaling open-ended reasoning
- 80,000 Hours, “Improving Institutional Decisionmaking”
- 25 Years After Challenger, Has NASA’s Judgement Improved?
- On the Difference between Binary Prediction and True Exposure With Implications For Forecasting Tournaments and Decision-making Research
- Superforecasting – Philip Tetlock
Other
Other benchmarks
Description
This is a catch-all category for machine learning safety benchmarks that do not fit into previous categories.