What are the biggest risks facing humanity today? You’ll probably come up with a list containing the likes of climate change and nuclear war. Surely, the biggest risks are those that could have catastrophic outcomes. Whatever you think, you likely also think that mitigating or even eliminating these risks should be a priority for most governments, activists and researchers. It may come as a surprise to some, but a growing group of researchers, including Nobel Prize winner Geoffrey Hinton – often called the “Godfather of AI” – argue that speculative risks from advanced AI belong on that list too. In fact, the UK government even has its own AI Safety Institute committed to this. In the following article, I’m going to try to explain why.
It’s not controversial that AI poses some serious risks. COMPAS was an AI used by US courts in sentencing to predict the likelihood of re-offence. It wrongly identified black defendants as ‘high risk’ twice as much as white defendants. This was because it was trained to make predictions based on a criminal justice system that over-represents black and low-income people. As horrifying as this example is, it’s not obvious that the risks from AI are on the same scale as the risks from other catastrophic risks, like climate change or nuclear war. This is partly due to a misunderstanding of what ‘AI’ is.
AI: Not just a buzzword
Let’s define 2 kinds of AI: artificial ‘narrow’ intelligence (ANI) and artificial ‘general’ intelligence (AGI). AI chess bots display ANI. They’re really good at chess but can’t do much beyond that. The scope of its ‘intelligence’ – which in this context refers to its perceived ability to reason – is limited. COMPAS also falls under ANI, since its intelligent capabilities (or lack of) were scoped only within the realm of predicting likelihood of reoffence. AGI, however, is very different. Humans have ‘general’ intelligence, because it’s not limited to a specific field. We can reason about a game of chess one moment, and reason about who we should vote for in the next. We clearly have some sort of intelligence that can apply to all sorts of seemingly unrelated and new domains. In theory, an AGI can display this too, and some claim it could potentially even surpass human capabilities, becoming a ‘superintelligence’. When people argue that AI risks are catastrophic (like climate change), they’re often talking about AGI rather than ANI.
It’s also important to stress that ChatGPT and the like are not AGI. Instead, they’re just really good at predicting text – a narrow scope. That’s how they give accurate responses to your input. However, this particular scope requires an ‘awareness’ of lots of domains in order to excel. In this way, ChatGPT does display some AGI-like behaviours. However, it generates responses by analysing patterns within existing text and crucially cannot reason in a novel way in unseen domains. Humans and a theoretical AGI can. ChatGPT blurs this distinction and is a potentially important step towards AGI.
Currently AGI is non-existent, and it might seem quite futuristic and far-off. A few experts argue it’s not possible. So why are these AI safety nerds so worried about it?
AGI could bring serious benefits. It could accelerate scientific research, rapidly creating new medicines, and making breakthroughs that have stumped physicists for decades. It could widen access to cheap, personalised education and accurate medical diagnostics. There’s potentially more too! If any are realistic, it seems we might really want an AGI in the future. However, they also require such an AGI to have a lot of power over us, so it’s really important we get this right. This is also assuming we can decide when and if an AGI emerges, which is not obviously true. This is called the ‘alignment’ problem, and it seems really hard.
An ‘aligned’ AI aligns with human goals and values and doesn’t pursue unintended ‘proxy’ goals of its own. It’s not immediately obvious where a misaligned AI can come from, and it might seem quite unrealistic and Terminator-y. Actually, it could in theory come about quite easily. Imagine you’re trying to teach an AI to do something. How do you do it? You could try and specify exactly what you want it to do, but there’s almost certainly some missed edge cases. Think of it like a genie who tries their hardest to misinterpret your wish to do exactly what you don’t want. It’d also be incredibly time consuming, and maybe impossible if you have to specify what human values are, which humans haven’t even managed. Instead, what current AI developers use is reinforcement learning with human feedback (RLHF). This is where an AI tries different things, and a human reinforces the good behaviours by giving positive feedback (thumbs up / thumbs down). The AI tries to get as much consistent positive feedback as possible, allowing the AI to learn what we actually want without us having to say it. When ChatGPT or Google search’s AI Overview ask, ‘was this helpful?’ after a response, this is what they’re doing. The developers get it to a good starting point and let you and me fine-tune it.
RLHF is currently the most promising way to build AGI. But it has one serious flaw – we have no idea what it’s actually learned. We can see the decisions based on that learning, which we may like, but we don’t know what it’s ‘thinking’ when it makes them. This raises several issues, but one I want to mention is the creation of 1 of 3 kinds of AGI: the Saint, the Sycophant and the Schemer (terms by Ajeya Cotra).
- The Saint is what we want. It does what we want it to, and it’s learned exactly what we intended. This is an aligned AI
- The Sycophant does what we want but takes it to the extreme, like the genie mentioned before. For example, it might get lots of positive feedback by rapidly creating new medicines for novel diseases. However, it also learned to secretly release new pathogens it created and understands in order to quickly create cures. In this case, it hasn’t learned the intended goal of helping people, it’s learned a ’proxy goal’ of maximising the number of medicines it creates. This proxy goal was never actively hidden from us. The issue here is that we couldn’t tell the difference.
- The Schemer does actively hide its own goals. This happens if an AI learns how to deceive us into thinking it’s doing what we intended, to get good feedback. This wouldn’t be obvious, since it acts like a Saint. Then at some point, perhaps when it can no longer be turned off by humans, it decides to openly pursue its own goal. This would probably be a proxy goal like the Sycophant, just hidden.
RLHF could create any of these. Luckily, if ChatGPT ‘goes rogue’ like this then it’s not powerful enough to resist being turned off. An AGI, if we want to achieve those lofty goals, will need a lot more power and it won’t be so simple. In fact, any sufficiently advanced AI would plausibly have the same ‘instrumental goals’ for any task. It can’t complete its task if it gets turned off, for example, so it might ensure it has a backup somewhere. An AGI with this much power could plausibly have catastrophic outcomes, like civilisational collapse or extinction.
So, some risks of advanced AI are clear. But why should a speculative, future problem be prioritised along with climate change which is much more demonstrable and present? This is a personal question, but it’s important to consider seriously. Progress on AI has been rapidly increasing. OpenAI, ChatGPT’s overbearing parent, was only founded in 2015 – less than a decade ago. Maybe advanced AI could come sooner than we think. Progress on AI safety is also lagging far behind, although there are ‘interpretability’ teams at many major AI labs who are dedicated to solving the alignment problem. There’s promising work to be done in this field, it’s incredibly neglected and the potential impact on humanity could be unprecedented. Luckily for us, we’re at the start of AI so we have a chance to nip any problems in the bud. For some, that’s enough to get on the list.