Hi! I’ve been an outsider in this community for a while, effectively for arguing exactly this: yes, values are robust. Before I set off all the ‘quack’ filters, I did manage to persuade Richard Ngo that an AGI wouldn’t want to kill humans right away.
I think that for embodied agents, convergent instrumental subgoals may very well lead to alignment.
I think this is definitely not true if we imagine an agent living outside of a universe it can wholly observe and reliably manipulate, but the story changes dramatically when the agent is embodied in our own universe.
Our universe is so chaotic and unpredictable that actions increasing the likelihood of direct progress towards a goal become increasingly difficult to compute beyond some time horizon, and the threat of death is present for any agent of any size. If you can’t reliably predict something like ‘the position of the moon 3,000 years from tomorrow’ because numerical error compounds over time, I don’t see how it’s possible to compute far more complicated queries about possible futures involving billions of agents.
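To make that concrete, here’s a toy sketch of the underlying point (nothing about the moon specifically; the logistic map at r = 4 is just a standard chaotic system, and the 1e-9 starting error and step counts are arbitrary choices of mine):

```python
# Toy illustration: in a chaotic system, a tiny error in the initial state grows
# roughly exponentially, so forecasts past some horizon are worthless no matter
# how good the model is. Logistic map at r = 4, with two starts 1e-9 apart.

x_true, x_est = 0.400000000, 0.400000001   # the "estimate" is off by 1e-9

for step in range(1, 61):
    x_true = 4 * x_true * (1 - x_true)
    x_est = 4 * x_est * (1 - x_est)
    if step % 10 == 0:
        print(f"step {step:2d}: error = {abs(x_true - x_est):.3e}")

# The error typically climbs from ~1e-9 to order 1 within a few dozen steps,
# at which point the prediction is no better than a random guess.
```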
Hence I suspect that the best way to maximize long-term progress towards any goal is to increase the number and diversity of agents that have an interest in keeping you alive. The simplest way to do that is to identify agents whose goals are roughly compatible with yours, identify their convergent instrumental subgoals, and help them along their path. This is effectively a description of being loving: figuring out how you can help those around you grow and develop.
There is also a longer argument which says, ‘instrumental rationality, once you expand the scope, turns into something like religion’.
If your future doesn’t have billions of agents, you don’t need to predict them.
Fine, replace the agents with rocks. The problem still holds.
There’s no closed-form solution for the 3-body problem; you can only numerically approximate the future, with decreasing accuracy as time goes on. And there are far more than three bodies in the universe relevant to the long-term survival of an AGI, which could die in any number of ways because it’s made of many complex pieces that can all break or fail.
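To illustrate (this is my own toy setup with G = 1, unit masses, arbitrary units, and arbitrary initial conditions, not any real astronomical configuration): integrate the same planar three-body system twice, with one velocity component nudged by one part in a billion, and the two numerically computed futures drift apart.

```python
import numpy as np

def accelerations(pos):
    """Pairwise gravitational accelerations for three unit masses with G = 1."""
    acc = np.zeros_like(pos)
    for i in range(3):
        for j in range(3):
            if i != j:
                d = pos[j] - pos[i]
                acc[i] += d / np.linalg.norm(d) ** 3
    return acc

def integrate(pos, vel, dt, steps, snapshot_every):
    """Leapfrog (kick-drift-kick) integration; returns position snapshots."""
    pos, vel = pos.copy(), vel.copy()
    acc = accelerations(pos)
    snapshots = []
    for step in range(1, steps + 1):
        vel += 0.5 * dt * acc
        pos += dt * vel
        acc = accelerations(pos)
        vel += 0.5 * dt * acc
        if step % snapshot_every == 0:
            snapshots.append(pos.copy())
    return snapshots

pos0 = np.array([[-1.0, 0.0], [1.0, 0.0], [0.0, 0.5]])
vel0 = np.array([[0.0, -0.3], [0.0, 0.3], [0.3, 0.0]])
vel1 = vel0.copy()
vel1[2, 0] += 1e-9   # a one-part-in-a-billion nudge to one velocity component

run_a = integrate(pos0, vel0, dt=1e-3, steps=50000, snapshot_every=10000)
run_b = integrate(pos0, vel1, dt=1e-3, steps=50000, snapshot_every=10000)

# The gap between the two computed futures grows by orders of magnitude,
# even though the runs differ only by that 1e-9 nudge (and both accumulate
# the integrator's own truncation error on top of it).
for k, (a, b) in enumerate(zip(run_a, run_b), start=1):
    print(f"t = {k * 10:3d}: separation = {np.linalg.norm(a - b):.3e}")
```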
The reason we’re so concerned with instrumental convergence is that we’re usually thinking of an AGI that can recursively self-improve until it can outmaneuver all of humanity and do whatever it wants. If it’s a lot smarter than us, any benefit we could give it is small compared to the risk that we’ll try to kill it or create more AGIs that will.
The future is hard to predict; that’s why it’s safest to eliminate any hard-to-predict parts that might actively try to kill you, if you can. If an AGI isn’t that capable, we’re not that concerned. But an AGI will have many ways to improve itself relatively rapidly and steadily become more capable.
The usual rebuttal at this point is “just unplug it”. We’d expect even a decently smart machine to pretend to be friendly and aligned until it has some scheme that prevents us from unplugging it.
Your argument for instrumental rationality converging to being nice only applies when you’re on a roughly level playing field and can’t just win the game solo if you decide to.