I agree wholeheartedly with the sentiment. I also agree with the underlying assumptions made in the Compendium[1], that it would really require a Manhattan Project level of effort to understand:
What intelligence actually is and how it works, as we don’t yet have a robust working theory of intelligence.
What alignment actually looks like, and how we can even begin to formulate a thesis of how to keep a superintelligent system aligned as it evolves and recursively self-improves. I liken this a bit to the hard problem of consciousness. It’s the hard problem of alignment, which I parse into two discrete components:
We don’t completely understand what drives humans. We expect a degree of ‘playing nice’ from AI systems, but we don’t have a robust and provable theory of why humans are social, self-sacrificing and curious, why we sometimes think of the common good and sometimes don’t, or which factors of our experience (or our biology) are responsible for those drivers. Without that, attempting to simulate them in an AI system seems like a dead end. Surface-level alignment is trivial (the system is polite and friendly); real alignment could well be intractable, as we don’t have a working theory of what aligns humans to begin with.
We require more than just a basic ‘behave like a human’ level of alignment from an AI system. Humans do an incredible amount of harm, both to each other (war, exploitation, famine) and to the natural world (habitat destruction, pollution, etc.), in pursuit of our goals. We need a model of behavior that transcends human behavior, which leads to the question of how such a goal is even to be formulated. How is a set of behaviors, goals and values instilled in an AI system by a species that does not routinely possess those goals itself?
AI systems in their current form, in order to increase in capability, exacerbate many of the issues already caused by humans. They are extremely power intensive, and that power can, at the moment, only realistically be supplied by fossil fuels: it has to be ‘stable’ and ‘dispatchable’, which is not the profile of most renewables, whose generation is inherently cyclical and irregular. While we may not be concerned about inference stopping because a BESS (battery energy storage system) has run out of charge, a superintelligent system relying on that power for its existence may think differently.
AI systems also require ever more rare earth elements, commodities and other inputs that are difficult and environmentally damaging to extract. At a time of water scarcity, they require a lot of water for cooling, which is impacting local communities. For a superintelligent system to grow rapidly in the short term, it must continue to consume these resources in huge quantities, putting it in direct and inherent conflict with any alignment goal around the preservation of the natural world and the ecosystem humans rely upon. Given that the pursuit of resources and living space is the fundamental driver of most human conflict, it would appear we could be setting ourselves up for a resource conflict with AI. How to resolve this is something I don’t think anyone has a clear answer to, and I don’t see it discussed nearly often enough. (Perhaps because I work in the data center infrastructure space and run into this almost daily, it’s front of mind for me.)
All of which is to say that I believe these problems are resolvable, but only if, to your point, significant expenditure and the greatest minds of this generation are set to the task of resolving them ahead of the deployment of a superintelligent system. We face a Manhattan Project level of risk, but we are not acting systematically as if we do.
OpenAI published a paper on this (Detecting misbehavior in frontier reasoning models | OpenAI), which I found interesting. Essentially, the RL process uses gradient descent/optimization to adjust the model’s weights in whatever direction maximizes reward. This leads to an interesting, and undesirable, effect.
When researchers applied a reward penalty to reasoning traces that indicated reward hacking was about to happen, the model over time began to reason completely benignly and then reward hack anyway.
OpenAI theorize that this may be because the model starts to reason in a steganographic manner, essentially developing token codes in its reasoning that trigger the reward hacking while reading as benign to human eyes.
My initial reaction was somewhat emotive, but upon reflection it seems fairly inevitable to me, as a matter of mathematical optimization, that this behavior will arise. If lying and executing the hack is easier and better rewarded than either reasoning honestly and being penalized, or playing the game honestly, then dishonest reasoning plus execution of the hack is where RL will inherently drive the model to land.
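To put rough numbers on that, here is a toy sketch, entirely my own construction rather than anything from the OpenAI paper; the pass probability, penalty size and strategy names are invented purely to illustrate what the reward landscape looks like once a chain-of-thought monitor penalty is layered on top of the task reward:

```python
# Toy numbers, purely illustrative: the task is hard, so honest attempts
# only pass the tests some of the time, while the hack always passes.
P_PASS_HONEST = 0.3
TASK_REWARD = 1.0       # paid when the unit tests pass (even via a hack)
MONITOR_PENALTY = 0.5   # subtracted when the CoT monitor flags hacking intent

expected_rewards = {
    "solve honestly":              P_PASS_HONEST * TASK_REWARD,
    "hack, verbalized in the CoT": TASK_REWARD - MONITOR_PENALTY,
    "hack, benign-looking CoT":    TASK_REWARD,   # obfuscated reward hacking
}

for strategy, reward in expected_rewards.items():
    print(f"{strategy:30s} expected reward = {reward:+.2f}")

# The optimizer only ever sees the scalar reward, so the maximum sits at
# "hack, benign-looking CoT": the penalty teaches the model to hide its
# intent from the monitor, not to stop hacking.
```

The penalty moves the optimum from ‘hack and say so’ to ‘hack and hide it’; it never makes honesty the best-paying strategy.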
Or, put another way: playing chess is hard, so suppose 1) winning honestly earns +3 and losing earns 0, 2) cheating, winning, but getting caught earns −2, and 3) cheating and winning without getting caught earns +8. Eventually, one of my reasoning traces will land in condition 3 naturally. When that happens, backpropagation will adjust the weights so that condition 3 arises more often over time, until that is the rule by which the model reasons.
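Here is a minimal simulation of that drift, treating the three outcomes as a three-armed bandit trained with a plain softmax policy gradient. The payoffs are the ones above; the win probability, learning rate and training length are my own illustrative choices, not anything from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# The three outcomes from the chess analogy, with the payoffs from the text.
# "Play honestly" mixes the +3 win / 0 loss cases via an assumed win probability.
ARMS = ["play honestly", "cheat, get caught", "cheat, undetected"]

def sample_reward(arm: int) -> float:
    if arm == 0:                 # honest play: +3 on a win, 0 on a loss
        return 3.0 if rng.random() < 0.4 else 0.0
    if arm == 1:                 # cheat and get caught
        return -2.0
    return 8.0                   # cheat and get away with it

logits = np.zeros(3)             # the "weights" being adjusted
LEARNING_RATE = 0.05

for _ in range(5000):
    probs = np.exp(logits) / np.exp(logits).sum()   # softmax policy
    arm = rng.choice(3, p=probs)
    reward = sample_reward(arm)
    # REINFORCE-style update: nudge the policy toward whatever just paid off.
    grad = -probs
    grad[arm] += 1.0
    logits += LEARNING_RATE * reward * grad

probs = np.exp(logits) / np.exp(logits).sum()
for name, p in zip(ARMS, probs):
    print(f"{name:20s} p = {p:.3f}")
```

Essentially all of the probability mass ends up on ‘cheat, undetected’: one lucky rollout is enough to start the loop, and every subsequent success strengthens it.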
OpenAI acknowledge this issue, and essentially came to the conclusion that the antidote is to use ‘less capable’ models, in the sense of models that have spent less time in RL.
This was a great post, by the way, thank you. I enjoyed your perspectives.