I’d say the main things that made my own p(Doom) go down this year are the following:
I’ve come to believe that data is a major factor in both capabilities and alignment, and that careful interventions on that data could be really helpful for alignment.
I’ve come to think that instrumental convergence is closer to a scalar quantity than a boolean. While I don’t think zero instrumental convergence is incentivized, for capabilities and domain reasons, I do think that restraining instrumental convergence, or putting useful constraints on it (like world models), is helpful for capabilities, to the point that I expect power-seeking to be a lot more local than what humans do.
I’ve overall shifted towards a worldview where the common second-species thought experiment, in which humans have killed off 90%+ of chimpanzees and gorillas by running away with intelligence while being misaligned with them, neglects very crucial differences between the human and the AI case, and those differences lower my p(Doom).
(Maybe another way to say it is that I think the outcome of humans completely running roughshod over every other species due to instrumental convergence is not the median outcome of AI development, but a deep outlier that is very uninformative about how AI outcomes will look.)
I’ve come to believe that human values, or at least the generator of those values, are actually simpler than a lot of people think, and that much of the complexity that appears to be there exists because we generally don’t like admitting that very simple rules can generate very complex outcomes.