[Crossposted on LessWrong]
Throughout history, technological and scientific advances have had both good and ill effects, but their overall impact has been overwhelmingly positive. Thanks to scientific progress, most people on earth live longer, healthier, and better lives than they did centuries or even decades ago.
I believe that AI (including AGI and ASI) can do the same and be a positive force for humanity. I also believe that it is possible to solve the “technical alignment” problem and build AIs that follow the words and intent of our instructions and report faithfully on their actions and observations.
I will not defend these two claims here. However, even granting these optimistic premises, AI’s positive impact is not guaranteed. In this essay, I will:
- Define what I mean by “solving” the technical alignment problem via training AIs to be faithful and obedient.
- Discuss some of the potential risks that still remain even if we do solve this problem.
- Talk about some scenarios that could make these risks more likely, and how we can avoid them.
In the next decade, AI progress will be extremely rapid, and such periods of sharp transition can be risky. What we — in industry, academia, and government — do in the coming years will matter a lot in ensuring that AI’s benefits far outweigh its costs.
Faithful, obedient, superintelligent.
Currently, it is very challenging to train AIs to achieve tasks that are too hard for humans to supervise. I believe that (a) we will need to solve this challenge to unlock “self-improvement” and superhuman AI, and (b) we will be able to solve it. I do not want to discuss here why I believe these statements. I would just say that if you assume (a) and not (b), then the capabilities of AIs, and hence their risks, will be much more limited.
Ensuring AIs accurately follow our instructions, even in settings too complex for direct human oversight, is one such hard-to-supervise objective. Getting AIs to be maximally honest, even in settings that are too hard for us to verify, is another such objective. I am optimistic about the prospects of getting AIs to maximize these objectives. I believe it will be possible to train AIs with an arbitrary level of intelligence that:
- Follow the principles, regulations, and instructions that they are given, doing their best to follow not just the words but our intent, in all situations. Their sole reward or goal would be maximum obedience to our instructions.
- We could probe these AIs (whether by simply asking them, looking at their chains of thought, or using interpretability methods) and get precise and faithful representations of their reasoning, actions, observations, and choices at every level of granularity we want.
This is what I mean by “solving” the technical alignment problem. Note that this does not require AIs to have an inherent love for humanity, or to be fundamentally incapable of carrying out harmful instructions if those are given to them or they are fine-tuned to do so. However, it does require these AIs to reasonably generalize our intent to new situations, a property I previously called “robust reasonable compliance.”
Intelligence and obedience are independent qualities. It is possible to have a maximally obedient AI capable of planning and carrying out arbitrarily complex tasks—like proving the Riemann hypothesis, figuring out fusion power, or curing cancer—that have so far proven beyond humanity’s collective intelligence. Thus, faithful and obedient AIs can function as incredibly useful “superintelligent tools”. For both commercial and cultural reasons, I believe such AIs would be the dominant consumers of inference-time compute throughout the first years/decades of AGI/ASI, and as such account for the vast majority of “total worldwide intelligence.” (I expect this to hold even if people experiment with endowing AIs with a “sense of self” or “free will,” and/or having them function as independent moral agents.)
The precise “form factor” of faithful, obedient AIs is still unknown. It will depend on their capability and alignment profile, as well as on their most lucrative or beneficial applications, and all of these could change with time. For example, it may turn out that for a long time AIs will be able to do 95% of the tasks of 95% of human workers, but there will still be a “long tail” of tasks that humans do better, and hence benefit from AI/human collaboration. Also, even if the total amount of intelligence is equivalent to “a country of geniuses in a data center,” this does not mean that this is how we will use AI. Perhaps it will turn out more economically efficient to simulate 1,000 geniuses and ten countries of knowledge workers. Or we may integrate AIs into our economy in other ways that don’t correspond to drop-in replacements for humans.
What does it mean to “solve” alignment? Like everything with AI, we should not expect 100% guarantees, but rather increasingly better approximations of obedience and faithfulness. The pace at which these approximations improve relative to the growth in capabilities and high-stakes deployment could significantly affect AI outcomes.
A good metaphor is cryptography and information security. Cryptography is a “solved” problem in the sense that we have primitives such as encryption and digital signatures for which we can arbitrarily decrease the probability of attacker success as an inverse exponential function of the size of our key. Of course, even with these primitives, there are still many challenges in cybersecurity. Part of this is because even specifying what it means for a system to be secure is difficult, and we also need to deal with implementation bugs. I suspect we will have similar challenges in AI.
That said, cybersecurity is an encouraging example, since we have been able to iteratively improve the security of real-world systems. For example, every generation of iPhone has been more secure than the previous one, to the extent that even nation states find it difficult to hack. Note that we were extremely lucky in cryptography that the resources an attacker needs scale exponentially with the defender’s (i.e., key size and the computation needed for encryption, signing, etc.). Security would have required much more overhead if the dependence were, for example, quadratic. It is still unknown what dependence of alignment reliability on resources such as training and test-time compute can be achieved.
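To make the scaling point concrete, here is a back-of-the-envelope sketch (my own illustration, with hypothetical numbers, not part of the original argument), assuming a brute-force attacker against a k-bit key:

```latex
% Illustrative sketch only, assuming a brute-force attacker against a k-bit key.
% Exponential dependence (as in cryptography): the defender's work grows roughly
% linearly in k, while the attacker must try about 2^k keys.
\[
  \text{attacker cost} \approx 2^{k}
  \quad\Longrightarrow\quad
  k = 128 \text{ already gives } 2^{128} \approx 3.4 \times 10^{38} \text{ operations.}
\]
% Hypothetical quadratic dependence: if the attacker's cost were only the square
% of the defender's, matching that same security level would force the defender
% to spend about 2^{64} (roughly 1.8 x 10^{19}) operations per encryption.
\[
  \text{attacker cost} \approx (\text{defender cost})^{2}
  \quad\Longrightarrow\quad
  \text{defender cost} \approx \sqrt{2^{128}} = 2^{64} \approx 1.8 \times 10^{19}.
\]
```

In other words, with exponential scaling a cheap defense buys an astronomically expensive attack; with quadratic scaling the defender would have to pay an impractical overhead to get the same margin.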
Even if we “solve” alignment and train faithful and obedient AIs, that does not guarantee that all AIs in existence are faithful and obedient, or that they obey good instructions. However, I believe that we can handle a world with such “bad AIs” as long as (1) the vast majority of inference compute (or in other words, the vast majority of “intelligence”) is deployed via AIs that are faithful and obedient to responsible actors, and (2) we do the work (see below) to tilt the offense/defense balance to the side of the defender.
Are we there yet?
You might imagine that by positing that technical alignment is solvable, I “assumed away” all the potential risks of AI. However, I do not believe this to be the case. The “transition period,” as AI capability and deployment rapidly ramp up, will be inherently risky. Some potential dangers include:
Risk of reaching the alignment “uncanny valley” and unexpected interactions. Even if technical alignment is solvable, that does not mean we will solve it in time and, in particular, get sufficiently good guarantees on obedience and faithfulness. In fact, I am more worried about partial success than total failure in aligning AIs. Specifically, I am concerned that we will end up in the “uncanny valley,” where we succeed in aligning AIs to a sufficient level for deployment, but then discover too late some “edge cases” in the real world that have a large negative impact. Note that the world in which future AIs will be deployed will be extraordinarily complex, with a combination of new applications and a plethora of actors, including other AIs (that may not be as faithful or obedient, or that may obey bad actors). The more we encounter such crises, whether through malfunction, misalignment, misuse, or their combination, the less stable AI’s trajectory will be, and the less likely it is to end up well.
Risk of a literal AI arms race. AI will surely drive a period of intense competition. This competition can be economic, scientific, or military in nature. The economic competition between the U.S. and China has been intense, but has also had beneficial outcomes, including a vast reduction in Chinese poverty. I hope companies and countries will compete for AI leadership in commerce and science, rather than race to develop ever more deadly weapons with an incentive to deploy them before the other side does.
Risk of a surveillance state. AI could radically alter the balance of power between citizens and governments, although it is challenging to predict in which direction. One possibility is that instead of empowering individuals, AI will enable authoritarian governments to better control and surveil their citizens.
Societal upheaval. Even if AI “lifts all boats” in the long run, if in the short term there are more losers than winners, this could lead to significant societal instability. This will largely depend on the early economic and social impact of AI. Will AI be first used to broaden access to quality healthcare and education? Or will it be largely used to automate away jobs in the hope that the benefits will eventually “trickle down”?
Risky scenarios
There are several scenarios that could make the risks above more likely.
Pure internal deployment. One potential scenario is that labs decide that the best way to preserve a competitive edge is not to release their models and use them purely for internal AI R&D. Releasing models creates value for consumers and developers. But beyond that, not releasing models creates a significant risk of a feedback cycle in which a model is used to train new versions of itself, without the testing and exposure that come with an external release. Like the (mythical) New York subway albino crocodiles or the (real) “Snake Island” golden lanceheads, models that develop without contact with the outside world could be vulnerable and dangerous in ways that are hard to predict. No amount of internal testing, whether by the model maker or third parties, could capture the complexities that one discovers in real-world usage (with OpenAI’s sycophancy incident being a case in point). Also, no matter how good the AI is, there are always risks and uncertainties when using it to break new ground, even internally. If you are using AI to direct a novel training run, one that is bigger than any prior run, then by definition it would be “out of distribution,” as no such run exists in the training set.
Single monopoly. Lack of competition can enable the “pure internal deployment” scenario. If a single company is far and away the market leader, it will be tempted to keep its best models to itself and use them to train future models. But if two companies have models of similar capabilities, the one that releases its model will get more market share, mindshare, and resources. So the incentive is to share your models with the world, which I believe is a good thing. This does not mean labs would not use their models for internal AI R&D. But hopefully, they would also release them and not keep their best models secret for extended periods. Of course, releases must still be handled responsibly, with thorough safety testing and transparent disclosure of risks and limitations. Beyond this, while a single commercial monopoly is arguably not as risky as an authoritarian government, the concentration of power in any single actor is bad in itself.
Zero-sum instead of positive-sum. Like natural intelligence, artificial intelligence has an unbounded number of potential applications. Whether it is addressing long-standing scientific and technological challenges, discovering new medicines, or extending the reach of education, there are many ways in which AI can improve people’s lives. The best way to spread these benefits quickly is through commercial innovation, powered by the free market. However, it is also possible that as AI’s power becomes more apparent, governments will want to focus its development toward military and surveillance applications. While I am not so naive as to imagine that advances in AI would not be used in the military domain, I hope that the majority of progress will continue in applications that “lift all boats.” I’d much rather have a situation where the most advanced AI is used by startups who are trying to cure cancer, revolutionize education, or even simply make money, than for killer drones or mass surveillance.
Overly obedient AI. Part of the reason I fear AI usage for government control of citizens is precisely because I believe we would be able to make AIs simultaneously super intelligent and obedient. Governments are not likely to deploy AIs that, like Ed Snowden or Mark Felt (aka “Deep Throat”), would leak to the media if the NSA is spying on our own citizens or the president is spying on the opposing party. Similarly, I don’t see governments willingly deploying an AI that, like Stanislav Petrov, would disobey its instructions and refuse to launch a nuclear weapon. Yet, historically, such “disobedient employees” have been essential to protecting humanity. Preserving our freedoms in a world where the government can have an endless supply of maximally obedient and highly intelligent agents is nontrivial, and we would stand a better chance if we first evolved both AI and our understanding of it in the commercial sector.
There are mitigations for this scenario. First, we should maintain “humans in the loop” in high-stakes settings and make sure that these humans do not become mere “rubber stamps” for AI decisions. Second, if done right, AI can itself increase transparency in government, as long as we maintain the invariant that AI communication is faithful and legible. For example, we could demand that all AIs used in government follow a model spec that adheres first to the constitution and other laws. In particular, unlike some humans, a law-abiding AI will not try to circumvent transparency regulations, and all its prompts and communications will be accessible to FOIA requests.
Offensive vs. defensive applications. As the name indicates, AGI is a very general technology. It can be used both for developing vaccines and for engineering new viruses, both for verifying software and for discovering new vulnerabilities. Moreover, there is significant overlap between such “offensive” and “defensive” uses of AI, and they cannot always be cleanly separated. In the long run, I believe both offensive and defensive applications of AI will be explored. But the order in which these applications are developed can make a huge difference as to whether AI is harmful or beneficial. If we use AI to strengthen our security or vaccine-development infrastructure, then we will be in a much better position if AI is later used to enable attacks.
It’s not always easy to distinguish “offensive” vs. “defensive” applications of AI. For example, even offensive autonomous weapons can be used in a defensive war. But generally, even if we trust ourselves to only use an offensive AI technology X “for a good cause,” we still must contend with the fact that:
- Developing X makes it more likely that our adversaries will do the same, whether because the details of X leak, or simply because the fact that it exists gives our adversary motivation and a proof of concept that it can be built. This was the dynamic behind the development of nuclear weapons.
- The more we remove humans from the chain of command of lethal weapons, the more we open a new and yet poorly understood surface for a catastrophe: a bug in a line of code, a hack, or a misaligned AI could end up unleashing these weapons in ways we did not envision.
Thus, when exploring potential applications of AI, we should ask questions such as: (1) If everyone (including our adversaries) had access to this technology, would it have a stabilizing or destabilizing effect? (2) What could happen if malicious parties (or misaligned AIs) got access to this technology? If companies and governments choose to invest resources in AI applications that have stabilizing or defensive impacts and that strengthen institutions and societies, and to decline or postpone pursuing applications with destabilizing or offensive impacts, then we will be more likely to navigate the upcoming AI transition safely.
Safety underinvestment. There is a tension between safety and other objectives such as intense competition and iterative deployment. If we deploy AI widely and quickly, it will also be easier for bad actors to get it. Since iterative deployment is a good in its own right, we should compensate for this by overinvesting in safety. While there are market incentives for AI safety, the competitive pressure to be first to market, and the prevalence of “tail risks” that could take time to materialize and require significant investment to even quantify, let alone mitigate, mean that we are unlikely to get sufficient investment in safety through market pressure alone. As mentioned above, we may find ourselves in the “alignment uncanny valley,” where our models appear “safe enough to deploy” and are even profitable in the short term, but are vulnerable or misaligned in ways that we will only discover too late. In AI, “unknown unknowns” are par for the course, and it is society at large that will bear the cost if things go very wrong. Labs should invest in AI safety, particularly in solving the “obedience and faithfulness” task to multiple 9’s of reliability, before deploying AIs in applications that require this.
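As a rough illustration of why the number of nines matters (my own hypothetical figures, not the essay’s), consider an AI taking many actions per day with some small per-action failure rate:

```latex
% Hypothetical figures, for illustration only: with n independent actions at
% per-action failure rate p, the expected number of failures is about p * n.
\[
  p = 10^{-3}\ \text{(three nines)},\;\; n = 10^{6}\ \text{actions/day}
  \;\Longrightarrow\;
  \text{expected failures} \approx p \cdot n = 1{,}000\ \text{per day};
\]
\[
  p = 10^{-6}\ \text{(six nines)}
  \;\Longrightarrow\;
  \text{expected failures} \approx 1\ \text{per day}.
\]
```

At the scale of widely deployed agents, each additional nine of reliability translates directly into orders of magnitude fewer incidents, which is why it is worth paying for before, not after, high-stakes deployment.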
AI has the potential to unleash, within decades, advances in human flourishing that would eclipse those of the last three centuries. But any period of quick change carries risks, and our actions in the next few years will have outsize impacts. We have multiple technical, social, and policy problems to tackle for it to go well. We’d better get going.
Notes and acknowledgements. Thanks to Josh Achiam, Sam Altman, Naomi Bashkansky, Ronnie Chatterji, Kai Chen, Jason Kwon, Jenny Nitshinskaya, Gabe Wu, and Wojciech Zaremba for comments on this post. However, all responsibility for the content is mine. The opinions in this post are my own and do not necessarily reflect those of my employer or my colleagues. The title of the post is inspired by Dario Amodei’s highly recommended essay “Machines of Loving Grace.”
By Boaz Barak