I’m asking this question specifically because I suspect the answer would change some important assumptions made on LW, as well as what it’s good to do.
Conditional on AI alignment/safety being the default case (that is, alignment turns out not to require much effort, or is at least amenable to standard ML/AI techniques, so that AI misalignment is effectively synonymous with AI misuse), what are the most important implications of this assumed truth for the LessWrong community?
In particular, I’m thinking of scenarios where rogue AI is easy to align or make safe by default, such that the problems shift from rogue AI to humans using AI.
Some important things that follow from the conditional, I’d say, are the following:
LW becomes less important in worlds where the alignment problem is easy, and IMO this is underrated (though how much depends on the method). Depending on how this happens, it’s actually plausible that LW was a net negative (though I don’t expect that). A big reason is that, to a large extent, worlds where alignment is easy are worlds where the problem is less adversarial and more amenable to normal work, meaning that LW methods look less useful. The cynical hypothesis that lc postulated is that we’d probably underrate this scenario’s chances, since it’s a scenario that makes us less important in saving the world.
Open source may look a lot better, and in particular the problems of AI shift to something like the following: ever more AI progress lets you decouple human welfare from economic progress, meaning capitalism starts looking much less positive, because self-interest is no longer good for human welfare. While I think this is maybe possible to avoid over the long term, I do think that dr_s’s post is underrated for thinking about AI risks beyond alignment ones, and unfortunately I consider this outcome plausible over the long run. Link is below:
https://www.lesswrong.com/posts/2ujT9renJwdrcBqcE/the-benevolence-of-the-butcher
I rarely think proposals or blog pieces criticizing capitalism do much to shift my priors on capitalism being good, but I’ve got to admit that this one definitely shifted my priors toward “capitalism will probably, in the future, need to be radically reformed or dismantled, as feudalism and slavery were.”
The proposal below is thankfully one of the best I’ve seen for giving people income and a way to live that doesn’t rely so much on self-interest.
https://www.peoplespolicyproject.org/projects/social-wealth-fund/
I’ll have more to say about this in the future, but for now: given that AI being safe by default is IMO much more plausible than a lot of LWers think, I’m starting to focus less on AI harms due to misalignment, and more on the problem of AI automating away the things that give us a way to live, in the form of wages.
I would expect that for model-based RL, the more powerful the AI is at predicting the environment and the impact of its actions on it, the less prone it becomes to Goodharting its reward function. That is, after a certain point, the only way to make the AI more powerful at optimizing its reward function is to make it better at generalizing from its reward signal in the direction that the creators meant for it to generalize.
In such a world, when AIs are placed in complex multiagent environments where they engage in iterated prisoner’s dilemmas, the more intelligent ones (those with greater world-modeling capacity) should tend to optimize for making changes to the environment that shift the Nash equilibrium toward cooperate-cooperate, ensuring more sustainable long-term rewards all around. This should happen automatically, without prompting, no matter how simple or complex the reward functions involved, whenever agents surpass a certain level of intelligence in environments that allow for such incentive-engineering.
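The incentive-engineering idea above can be made concrete with a small sketch (my illustration, not from the original comment; the payoff numbers and the "defection tax" mechanism are hypothetical): in the classic prisoner's dilemma the unique pure-strategy Nash equilibrium is defect-defect, but an agent that can modify the environment, say by installing an enforcement mechanism that penalizes defection, can shift the equilibrium to cooperate-cooperate.

```python
# Illustrative sketch: shifting a 2x2 game's Nash equilibrium by changing payoffs.
from itertools import product

C, D = 0, 1  # action indices: cooperate, defect

def nash_equilibria(payoffs):
    """Return pure-strategy Nash equilibria of a 2-player, 2-action game.

    payoffs[(a, b)] = (row player's reward, column player's reward).
    """
    equilibria = []
    for a, b in product((C, D), repeat=2):
        # (a, b) is an equilibrium iff neither player gains by deviating unilaterally
        row_best = all(payoffs[(a, b)][0] >= payoffs[(alt, b)][0] for alt in (C, D))
        col_best = all(payoffs[(a, b)][1] >= payoffs[(a, alt)][1] for alt in (C, D))
        if row_best and col_best:
            equilibria.append((a, b))
    return equilibria

# Classic prisoner's dilemma: defection dominates cooperation.
pd = {(C, C): (3, 3), (C, D): (0, 5), (D, C): (5, 0), (D, D): (1, 1)}

# "Incentive-engineered" environment: an enforcement mechanism taxes each
# defecting player by 3, making mutual cooperation the equilibrium.
tax = 3
engineered = {acts: (r0 - tax * (acts[0] == D), r1 - tax * (acts[1] == D))
              for acts, (r0, r1) in pd.items()}

print(nash_equilibria(pd))          # [(1, 1)] -> defect-defect
print(nash_equilibria(engineered))  # [(0, 0)] -> cooperate-cooperate
```

The claim in the comment is then that sufficiently capable world-modelers would discover modifications like the `tax` term on their own, whenever the environment makes such modifications available.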
Conditional on living in an alignment-by-default Universe, the true explanations for individual and societal human failings must be consistent with alignment-by-default. Were we deviated from the default by some accident of history, or does alignment just look like a horrid mess somehow?
If that were the case, we would be doomed far worse than if alignment were extremely hard. It’s only because of all the writing that people like Eliezer have done about how hard it is and how we are not on track, plus the many examples of total alignment failures already observed in existing AIs (like these or these), that I have any hope for the future at all.
Remember, the majority of humans use as the source of their morality a religion that says that most people are tortured in hell for all eternity (or, in an eastern religion, tortured in a Naraka for a time massively longer than the real age of the universe so far, which is basically the same thing). Even atheists who think these religions are false often still believe they have good moral teachings: for example, the writer of the popular webcomic Freefall is an atheist transhumanist libertarian, and his seriously proposed AI alignment method is to teach AIs to support the values taught in human religions.
Even if you avoid this extremely common failure mode, planned societies run for the good of everyone are still absolutely horrible. Almost all utopias in fiction suck even when they go the way the author says they would. In the real world, when the plans hit real human psychology, economics, and so on, the result is invariably disaster. Imagine living in an average kindergarten all day, every day, and that’s one of the better options. The life I had was more like Camazotz from A Wrinkle in Time, and it didn’t end when school let out.
We also wouldn’t be allowed to leave. Even now, for the supposed good of the beneficiaries, runaways are generally forcibly returned to their homes, and terminally ill people in constant agony are forced to stay alive. The implication, if your idea were true, would be that you should kill yourself now while you still have the chance.
The good news is that, instead, only the tiny minority of people able to notice problems right in front of them (even without suffering from them personally) has any chance of achieving successful alignment.
You’re describing an alignment failure scenario, not a success scenario. In this case the AI has been successfully instructed to paperclip-maximize a planned utopia (however you’d do that while still failing at alignment). Successful alignment would entail the AI being able and willing to notice and correct for an unwise wish.
Not really part of the LessWrong community at the moment, but I think evolutionary dynamics will be the next thing. Not just of AI, but of posthumans, uploads, etc. Someone will need to figure out what kind of selection pressures there should be so that things don’t go to ruin in an explosion of variety.