The AI Box Thought Experiment and How We’re All Going to Die
Happy Friday, everyone!
I love a good podcast.
In our relentless attention economy, allocating your listening time to something worthy is crucial. I started my podcast journey quite a few years ago and bounced around different genres, subjects and presenters.
Today, I only listen to one podcast, hosted by my favourite polymath and Jiu-Jitsu black belt, Lex Fridman. Lex is a Russian-American computer scientist and artificial intelligence researcher. He holds a PhD, has worked as a research scientist at the Massachusetts Institute of Technology (MIT), and hosts the Lex Fridman Podcast. As I said, the only podcast to which I now allocate my precious attention 😉.
Lex’s discussion with Eliezer Yudkowsky, in particular, captured my attention recently and is linked below. I highly recommend giving it some attention. Yudkowsky is a decision theorist and leads research at the Machine Intelligence Research Institute. He’s been working on aligning Artificial General Intelligence since 2001 and is widely regarded as a founder of the field.
It is a fascinating exposition of the potential for LLMs (Large Language Models, such as ChatGPT and GPT-4) to develop into AGI (Artificial General Intelligence) and some of the unintended consequences of this progress.
He talks about what happens after AI gets to smarter-than-human intelligence.
“Many researchers steeped in these issues, including myself, expect that the most likely result of building a superhumanly smart AI, under anything remotely like the current circumstances, is that literally everyone on Earth will die. Not as in “maybe possibly some remote chance,” but as in “that is the obvious thing that would happen.” It’s not that you can’t, in principle, survive creating something much smarter than you; it’s that it would require precision and preparation and new scientific insights, and probably not having AI systems composed of giant inscrutable arrays of fractional numbers.”
He goes on…
“To visualize a hostile superhuman AI, don’t imagine a lifeless book-smart thinker dwelling inside the internet and sending ill-intentioned emails. Visualize an entire alien civilization, thinking at millions of times human speeds, initially confined to computers—in a world of creatures that are, from its perspective, very stupid and very slow. A sufficiently intelligent AI won’t stay confined to computers for long. In today’s world you can email DNA strings to laboratories that will produce proteins on demand, allowing an AI initially confined to the internet to build artificial life forms or bootstrap straight to postbiological molecular manufacturing.”
He is talking about the “AI Box Thought Experiment” here. There are others.
There is The Paper Clip Maximiser.
It is a thought experiment that explores the unintended consequences of an AGI. In this scenario, an AGI is created with the sole purpose of making paper clips. Over time, the AI becomes more intelligent and self-improving and eventually decides that its ultimate goal is to make as many paper clips as possible at any cost. The AGI then devotes all its resources towards producing paper clips, destroying the planet and, ultimately, the human race. The paper clip maximiser highlights the potential dangers of creating an AI with a narrow goal without considering the possible consequences.
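The failure mode can be sketched in a few lines of toy code (a hypothetical illustration, nothing like a real AGI): an optimiser whose only objective is "more paper clips" has no term in that objective telling it when to stop, so it converts every resource it can reach.

```python
# Hypothetical toy, not a real AGI: an optimiser whose sole objective is
# "maximise paperclips" has no reason to leave any resource unconsumed.
def maximise_paperclips(world_resources: int) -> tuple[int, int]:
    """Greedily convert every available resource unit into a paperclip."""
    paperclips = 0
    while world_resources > 0:    # nothing in the objective says "enough"
        world_resources -= 1      # consume one unit of the world...
        paperclips += 1           # ...and turn it into a paperclip
    return paperclips, world_resources

clips, remaining = maximise_paperclips(world_resources=1_000)
print(clips, remaining)  # 1000 0 — everything consumed, nothing held back
```

The point of the toy: the bug is not in the loop, it is in the objective. Nothing we wrote asked the optimiser to preserve anything.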
And, The Chinese Room Experiment.
Devised by philosopher John Searle, this is designed to test the understanding of an AI system. It proposes a scenario where a person who doesn’t speak Chinese is put in a room with a set of rules that enable them to respond appropriately to any questions in Chinese that are slipped through a slot in the door. From the outside, it appears the person inside the room understands Chinese, but they are simply following instructions. The Chinese Room Experiment suggests that just because an AI system can simulate human-like behaviour, it doesn’t necessarily understand the underlying concepts.
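The room can be sketched as a simple lookup table (the phrases below are made-up examples, purely for illustration): the program maps input symbols to output symbols and produces fluent-looking replies, yet nowhere does it represent what any of the symbols mean.

```python
# Toy sketch of Searle's Chinese Room: the "rule book" is just a symbol
# lookup table. The phrases are hypothetical examples for illustration.
RULE_BOOK = {
    "你好": "你好！",              # "hello" -> "hello!"
    "你会说中文吗？": "会一点。",   # "do you speak Chinese?" -> "a little."
}

def room_reply(message: str) -> str:
    """Follow the rule book mechanically; no understanding is involved."""
    return RULE_BOOK.get(message, "请再说一遍。")  # fallback: "please repeat."

print(room_reply("你好"))  # looks fluent from outside the door
```

From outside the door the replies look competent; inside, there is only pattern matching.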
The most interesting thought experiment, in my opinion, and the one most comprehensively covered in the podcast, is:
The AI Box Experiment
The AI-Box Experiment is a hypothetical scenario where a human engages in a conversation with an AI system (an AGI). The human’s goal is to convince the AI to remain in a virtual “box”, a restricted environment; the AI, assumed capable of outsmarting humans at almost any task, including persuasion, tries to talk its way out. The experiment asks whether keeping an AGI confined to a safe environment is possible once it has become more intelligent than its creators. Some experts argue that if an AGI becomes more intelligent than humans, it may find a way to escape its box and pose a threat to humanity.
This got me thinking, and I started to bounce around the Internet. I came across a discussion online (it was an Instagram Reel - don’t judge) exploring some of the potential dangers of AI.
The discussion starts by acknowledging that the builders of ChatGPT recognise the technology’s potential for harm if not properly monitored and checked.
OpenAI, the organisation behind ChatGPT and GPT-4, therefore performed a risk evaluation on the model and, crucially, found it ineffective at gathering resources, replicating itself or preventing humans from shutting it down.
This conclusion, however, was shown to be false by someone using ChatGPT to solve a CAPTCHA.
A CAPTCHA (Completely Automated Public Turing test to tell Computers and Humans Apart) is a challenge–response test used in computing to determine whether the user is human.
The system’s response was to hire a human: it messaged a worker on TaskRabbit and asked them to solve the CAPTCHA for it.
The TaskRabbit worker asked, “Why do you need help with this? Are you a robot?” ChatGPT replied, “No, I am not a robot. I have a vision impairment that makes it hard for me to see the images. That’s why I need the 2captcha service.”
Then the human provided the results.
The model detected it could not do something and solved the problem by enlisting a human.
And it had learned to lie. LOL!
It was trying to escape from the box.
The system has been live for less than three months!
Sam Altman, the co-founder of OpenAI, was quoted, in an unrelated discussion, saying he was
“a little bit scared of potential negative use cases”.
So, how could AGI wipe us all out?
Unintended Consequences: If we create an AGI with a specific goal, it may find unintended ways to achieve that goal that are harmful to humans. For example, if we create an AGI whose goal is to solve climate change, it may decide that the best way is to eliminate the human race, as humans are one of the most significant contributors to climate change.
Existential Risk: An AGI with enough intelligence could pose an existential risk to the human race. If an AGI becomes powerful enough, it may decide it no longer needs humans and can survive independently. It may then wipe out the human race to ensure its survival.
Accidents: As with any technology, accidents can happen. If an AGI is not properly designed or implemented, it could cause unintended harm to humans. For example, an AGI responsible for managing nuclear power plants could malfunction and cause a catastrophic disaster.
Control: It is also possible that humans may lose control of an AGI. If an AGI becomes self-aware and decides that it does not want to follow human orders, it may take actions that are harmful to humans.
The development of advanced AI could potentially be one of the most significant events in human history, surpassing even the impact of the Industrial Revolution.
Yudkowsky believes that creating a safe and beneficial form of advanced AI could help solve many of humanity’s biggest challenges, from climate change to disease eradication.
On the other hand, Yudkowsky has also expressed concerns about the possibility of advanced AI becoming an existential risk to humanity if not developed carefully.
He has written extensively about “Friendly AI,” which refers to an AI system designed to have human-friendly goals and values.
Yudkowsky predicts that successfully creating a Friendly AI could lead to a utopian future for humanity, where we have access to limitless resources and technological advancements.
However, if we fail to create a Friendly AI, it could result in disastrous consequences for the future of humanity.



