Another bad idea: why not use every possible alignment strategy at once (or many of them)? Presumably this would completely hobble the AGI, but with some interpretability you could find where the bottlenecks to behaviour are in the system and use it as a lab to figure out the best options. It's still a try-once strategy, I guess, and maybe it precludes actually getting to AGI in the first place, since you can't really iterate on an AI that doesn't work.
Can you explain low-resolution integers?
From Robert Miles's videos:
What I noticed is that the failure modes he describes implicitly require the math the AI is doing to have infinite precision.
Something like "OK, I have met my goal of collecting 10 stamps by buying 20 stamps in 2 separate vaults, time to sleep" fails if the system is able to consider the possibility of an <infinitesimally likely, far-in-the-future event where an asteroid destroys the Earth>. So if we make the system unable to consider such a future, by using numerical types that round probabilities that small to zero, it will instead sleep.
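A minimal sketch of the arithmetic I have in mind (my own toy numbers, using numpy just for the float types): in float64 the asteroid term is tiny but nonzero, so the system keeps acting forever to hedge against it; float16 can't represent a probability of 1e-9 at all, so the term underflows to exactly zero and nothing is left to pull the system away from sleeping.

```python
import numpy as np

# Toy numbers of my own for the stamp collector:
p_asteroid = 1e-9    # probability of the distant asteroid scenario
hedge_value = 1e4    # how much having kept acting would be worth in that scenario

# Full precision: the expected benefit of staying awake is tiny but nonzero,
# so "keep acquiring stamps forever" still beats "go to sleep".
benefit_f64 = np.float64(p_asteroid) * np.float64(hedge_value)
print(benefit_f64)          # ~1e-05, nonzero -> never sleeps

# float16 can't represent 1e-9 (its smallest subnormal is ~6e-8), so the
# probability itself rounds to exactly zero and the benefit vanishes.
p_low = np.float16(p_asteroid)
benefit_f16 = p_low * np.float16(hedge_value)
print(p_low, benefit_f16)   # 0.0 0.0 -> it sleeps
```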
Maximizers have a similar failure. Their take-over-the-planet plan often involves a period of time in which they are not doing their job of making paperclips or whatever: they defer reward while they build weapons to take over the government. The anticipated reward of the doomsday plan often looks something like:
Action0 (keep doing my job): 0.99 × 1000 reward
Action1 (doomsday plan): 0.99 × 0 reward (destroyed) + 0.01 × discounted big reward (took over the government)
The above is expressible as an MDP, and I have considered writing a toy model so I can find out numerically whether this works.
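Roughly the toy model I have in mind, sketched rather than fully worked out (numpy for convenience; the discount, delay, survival odds and reward sizes are all made up): anticipated rewards get clamped to the range of a narrow integer type before the expected values are compared, so the imagined huge takeover payoff saturates and can no longer beat just doing the job.

```python
import numpy as np

GAMMA = 0.99           # per-step discount
DELAY = 100            # steps spent building weapons instead of doing the job
P_SURVIVE = 0.01       # chance the takeover plan isn't stopped
JOB_REWARD = 1000
TAKEOVER_REWARD = 1e9  # the "big reward" the maximizer imagines

def expected_values(reward_dtype):
    """EV of each action when anticipated rewards are clamped to reward_dtype."""
    info = np.iinfo(reward_dtype)
    def clamp(r):
        return float(np.clip(r, info.min, info.max))
    ev_job = 0.99 * clamp(JOB_REWARD)                       # Action0
    ev_takeover = (0.99 * clamp(0)                          # Action1: destroyed
                   + P_SURVIVE * GAMMA**DELAY * clamp(TAKEOVER_REWARD))
    return ev_job, ev_takeover

for dtype in (np.int64, np.int16):
    job, takeover = expected_values(dtype)
    winner = "take over the government" if takeover > job else "do my job"
    print(f"{dtype.__name__}: EV(job)={job:.1f}, EV(takeover)={takeover:.1f} -> {winner}")

# With int64 the imagined 1e9 payoff survives the 1% odds and the discounting,
# so the doomsday plan wins; with int16 it saturates at 32767 and "do my job"
# wins by a wide margin.
```

Of course this only shows the numbers can come out that way; whether a real system's internal value estimates actually behave like saturating low-bit types is exactly what a proper toy model would have to probe.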
In my real-world work I have dealt with a number of systems using old processor designs where the chip itself doesn't make any type wider than 16-24-bit integers usable, so I have some experience with these issues. Also, in my current role we're using a lot of 8- and 16-bit ints/floats to represent neural network weights.