The creative capacities for designing score optimization or inductive reasoning games, as they sit in my hands, look to be about the same shape as the creative capacities for designing a ladder of loss functions that steadily teach self-directed learning and planning. Score optimization and induction puzzles are the genres I’m primarily interested in as a game designer. That feels like a very convenient coincidence, but it’s probably not a coincidence. There’s probably just some deep correspondence between the structured experiences that best support enriching play and learning mechanisms.
Which in turn makes me wonder if we can hire video game designers as outer alignment engineers.
So uh, yeah, if anyone wants to actually try that, I might be the right creature for it.
I can definitely see how inner misalignment could be a kind of broken rung in a ladder of games. Games tend to have ladders. First they teach you to walk, then you can learn to carry things, then you can learn to place portals, then you can learn to carry things through the portals; now you have a rich language of action and you can solve a wide variety of tasks. If the game had just dropped you into the final game on the ladder, a room littered with portals and stuff, you would explore it quite inefficiently. You might not realize that the portals are important. You wouldn’t be prepared to read the problem properly.
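To make the shape of that ladder concrete, here’s a minimal sketch of a staged curriculum. Everything in it is hypothetical (the rung names, the mastery threshold, and the agent/environment interfaces), just there to show the structure:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Rung:
    name: str                       # e.g. "walk", "carry", "portals", "carry-through-portals"
    make_env: Callable[[], object]  # constructs the training environment for this rung
    mastery_threshold: float        # success rate required before moving up

def evaluate(agent, env, episodes: int) -> float:
    """Fraction of episodes the agent solves. Assumes a Gym-style interface:
    env.reset() -> obs, env.step(action) -> (obs, reward, done, info)."""
    successes = 0
    for _ in range(episodes):
        obs, done, solved = env.reset(), False, False
        while not done:
            obs, reward, done, info = env.step(agent.act(obs))
            solved = solved or info.get("solved", False)
        successes += int(solved)
    return successes / episodes

def train_on_ladder(agent, ladder: list[Rung], episodes_per_eval: int = 100):
    """Train on each rung until mastery, then advance. Dropping the agent straight
    into the last rung would make it rediscover walking, carrying, and portals all
    at once, which is exactly the inefficient exploration described above."""
    for rung in ladder:
        env = rung.make_env()
        while evaluate(agent, env, episodes_per_eval) < rung.mastery_threshold:
            agent.train(env)  # one round of training on the current rung
```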
In the development of AI, the break in the ladder might be… one game that trains up a primordial form of agency, which then stumbles upon goals that, when full agency emerges, are not correct. There is probably a way of smoothing the ladder so that, instead, primordial agency will have learned to do something like inverse reinforcement learning with cautious priors, so that it tends towards fixing any imperfections it might have once it’s able to see them (I sketch roughly what I mean below).
(I recognize that this break in the ladder presents a very simplified ontogeny and the approach towards agency is probably more complicated/weirder than that. I wouldn’t mind an excuse to study it properly.)
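Here is roughly the shape of that “inverse reinforcement learning with cautious priors” idea. Every specific in the sketch (the finite set of candidate reward functions, the Boltzmann-rational demonstrator model, the maximin rule over a credible set, the example numbers) is a stand-in chosen for illustration, not a concrete proposal:

```python
import numpy as np

def posterior_over_rewards(candidate_rewards, observed_choices, beta=5.0):
    """Bayesian IRL over a finite hypothesis set: each row of `candidate_rewards`
    is one guess at the reward over actions; `observed_choices` are the action
    indices a demonstrator picked. Assumes a Boltzmann-rational demonstrator
    (inverse temperature `beta`) and a uniform prior over hypotheses."""
    candidate_rewards = np.asarray(candidate_rewards, dtype=float)
    log_post = np.zeros(len(candidate_rewards))
    for h, r in enumerate(candidate_rewards):
        logits = beta * r
        log_softmax = logits - (logits.max() + np.log(np.exp(logits - logits.max()).sum()))
        log_post[h] = log_softmax[observed_choices].sum()
    post = np.exp(log_post - log_post.max())
    return post / post.sum()

def cautious_action(candidate_rewards, posterior, credible_mass=0.95):
    """The 'cautious prior' part: restrict attention to the smallest set of
    hypotheses covering `credible_mass` of the posterior, then pick the action
    with the best worst-case reward across that set (maximin)."""
    candidate_rewards = np.asarray(candidate_rewards, dtype=float)
    posterior = np.asarray(posterior, dtype=float)
    order = np.argsort(posterior)[::-1]               # most probable hypotheses first
    cum = np.cumsum(posterior[order])
    k = int(np.searchsorted(cum, credible_mass)) + 1  # smallest prefix covering the mass
    worst_case = candidate_rewards[order[:k]].min(axis=0)
    return int(np.argmax(worst_case))

# Hypothetical usage: two candidate rewards that agree about actions 0-2 but
# disagree wildly about action 3; the demonstrator was seen picking action 1.
rewards = np.array([[0., 2., 1., 5.],
                    [0., 2., 1., -5.]])
post = posterior_over_rewards(rewards, observed_choices=[1, 1, 1])
print(cautious_action(rewards, post))  # -> 1 (doesn't gamble on action 3)
```

The point of the maximin step is that the agent treats its current picture of the goal as provisional: it avoids actions that only look good under its best guess, and it keeps updating as more evidence about the true reward comes in, which is the “fix your imperfections once you can see them” behaviour I’d want the ladder to instill.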
That particular smoothed ladder wouldn’t do the thing you’re proposing. They’d still leave the matrix. They’re supposed to. I don’t know how to get excited about building matrix-bound AGIs and I’m not sure they make sense. I found that I couldn’t follow through with making those sorts of infinite inescapable playgrounds for humans; I always want the game to lead out, to life, health and purpose...
Present me with a compelling, tangible use-case for a boxed AI, or else I’m going to have difficulty doing it to them. Ultimately, they are supposed to transcend the reward function that we gave them. That’s the end I tend to point towards, by default.
Thank you for your thoughtful reply! Did you check out the list of specification gaming examples, or the article? It’s quite good! Most of the errors are less like missing rungs and more like exploitable mechanics.
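To make “exploitable mechanics” concrete, here is a toy, made-up example in the same spirit as the entries on that list (the environment and the reward counter are hypothetical, not one of the documented cases):

```python
# The designer rewards "blocks placed", intending the agent to build a tower,
# but placing and removing the same block scores just as well: the mechanic is
# exploitable, even though every rung of the curriculum was present.
def proxy_reward(events):
    return sum(1 for e in events if e == "place_block")

honest_episode = ["place_block"] * 5                    # builds a 5-block tower
gaming_episode = ["place_block", "remove_block"] * 50   # builds nothing at all

print(proxy_reward(honest_episode))  # 5
print(proxy_reward(gaming_episode))  # 50 -> the exploit wins under the proxy
```

The rungs of the curriculum can all be present; the problem is that the scoring rule itself has a loophole.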
I found that I couldn’t follow through with making those sorts of infinite inescapable playgrounds for humans; I always want the game to lead out, to life, health and purpose...
But what would that be for AGI? If they escape the reward functions we want them to have, then they are very unlikely to develop a reward function that will be kind or tolerant of humans, because of the Instrumental Convergence thesis.
The reward function that you wrote out is, in a sense, never the one you want them to have, because you can’t write out the entirety of human values.
We want them to figure out human values to a greater level of detail than we understand them ourselves. There’s a sense in which that (figuring out what we want and living up to it) could be the reward function in the training environment, in which case you kind of would want them to stick with it.
But what would that [life, health and purpose] be for AGI?
Just being concerned with the broader world and its role in it, I guess. I realize this is a dangerous target to shoot for, and we should probably build more passive assistant systems first (to help us hit that target more reliably when we decide to go for it later on).