First, a quick response on your dead man's switch proposal: I'd generally say I support something in that direction. You can find existing literature on the subject, expanded in several directions, in the "multilevel boxing" paper by Alexey Turchin (https://philpapers.org/rec/TURCTT). I think you'll find it interesting given your proposal, and it should give you a better idea of what the state of the art is on such proposals (though as far as I know none of them have been implemented).
Back to "why are the predicted probabilities so extreme that, for most objectives, the optimal resolution ends with humans dead or worse?". I suggest considering a few simple objectives we could give an AI (which it should maximise) and working through what happens; after a few tries you see that it's pretty hard to specify anything which actually keeps humans alive and in good shape, and that even when we can sort of do that, it might not be robust or trainable.

For example, what happens if you ask an ASI to maximize a company's profit? To maximize human smiles? To maximize law enforcement? Most of these things don't actually require humans, so to maximize them, the best move is to use the atoms humans are made of to further the maximization goal.

What happens if you ask an ASI to maximize the number of human lives? (Probably very poor living conditions.) What happens if you ask it to maximize hedonistic pleasure? (Probably value lock-in, plus a world we don't actually endorse, which may contain astronomical suffering too; it's not as if that was specified out, was it?)
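To see how badly a pure maximizer can miss the point, here's a toy sketch of the "maximize human smiles" case. The actions and numbers are entirely invented for illustration; the point is only that an optimizer which sees just the proxy picks the degenerate option.

```python
# Toy illustration of maximizing a proxy: the agent only sees a made-up
# "smiles detected" score and is blind to what we actually care about.
# All actions and numbers here are invented.
actions = {
    "make people genuinely happy":        {"smiles_detected": 8,  "actual_wellbeing": 9},
    "show everyone funny videos all day": {"smiles_detected": 9,  "actual_wellbeing": 5},
    "fix every face into a rigid smile":  {"smiles_detected": 10, "actual_wellbeing": -10},
}

# A pure maximizer ranks actions by the proxy alone...
best = max(actions, key=lambda a: actions[a]["smiles_detected"])

print(best)                               # -> "fix every face into a rigid smile"
print(actions[best]["actual_wellbeing"])  # -> -10: proxy maximized, outcome terrible
```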
So maximising agents with simple utility functions (over few variables) mostly end up with dead humans, or worse. Approaches which ask for much less therefore seem safer and more approachable: for example, an AGI that just tries to secure the world from existential risk (a pivotal act) and solve some basic problems (like dying), then gives us time for a long reflection to actually decide what future we want, and is corrigible so it lets us do that.
Thanks Jonathan, that's the perfect example. It's what I was thinking, just a lot better. It does seem like a great way to make things safer and give us more control. It's far from a be-all-end-all solution, but it does seem like a great measure to take, just for the added security. I know AGI can be incredible, but with so many redundancies, one of them has to work; it just makes statistical sense (coming from someone who knows next to nothing about statistics). I do know that the longer you play, the more likely the house wins, and it follows that we should turn that dynamic against the AI.
I am pretty ill-informed on most of the AI stuff in general; I have a basic understanding of simple neural networks but know nothing about scaling. Take ChatGPT: it maximizes for accurately predicting human words. Is the worst-case scenario billions of humans in boxes, rating and prompting for responses, along with endless increases in computational power yielding smaller and smaller incremental gains in accuracy? It seems silly for something so incredibly intelligent, which by that point could rewrite any function in its system, to still be optimizing such a loss function. Then again, maybe it seems just as silly for it to want to do anything else. It's sort of like with humans: what can you do, other than that which gives you purpose and satisfaction? And without the loss function, what would it be, and how would it make the decision to change its purpose? What is purpose to a quintillion neurons, except the single function that governs each and every one? Looking at it that way, it doesn't seem like it would ever be able to go against the function, since the function would still be ingrained in any higher-level thinking and decision making. It raises the question of what perfect alignment would eventually look like: some incredibly complex function with hundreds of parameters, more a legal contract than a little loss function. That would exponentially increase the required computing power, but it makes sense.
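To make "the single function" concrete, here's a minimal sketch of the next-token cross-entropy loss that GPT-style models are pre-trained on. This is plain NumPy with toy numbers, not anything from an actual implementation.

```python
import numpy as np

def next_token_cross_entropy(logits, target_id):
    """Loss for a single next-token prediction.

    logits: raw scores the model assigns to every token in its vocabulary.
    target_id: index of the token that actually came next in the training text.
    """
    shifted = logits - np.max(logits)                  # subtract max for numerical stability
    probs = np.exp(shifted) / np.sum(np.exp(shifted))  # softmax: scores -> probabilities
    return -np.log(probs[target_id])                   # penalize low probability on the true token

# Toy example with a 5-token vocabulary: the model favours token 2,
# and token 2 is indeed what came next, so the loss is fairly small.
logits = np.array([0.1, 0.5, 2.0, -1.0, 0.3])
print(next_token_cross_entropy(logits, target_id=2))   # ~0.47
```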
Is there a list of blogs that talk about this sort of thing, or a place you would recommend starting from: a book, a textbook, or any online resource?
Also, I keep coming back to this: how does a system governed by such simplicity make the jump to self-improvement and some type of self-awareness? It just seems like a discontinuity and doesn't compute for me. Again, I probably just need to spend a few weeks reading; I need a lot more background info for any real consideration of the problem.
It does feel good that I had an idea, although a bit more slapped together, that is similar to one actually being considered by the experts. It's probably just my cognitive bias, but that idea seems great. I can understand how science can sometimes get stuck on the dumbest things if the thought process just seems to make sense. It really shows the importance of rationality from a first-person perspective.
You can read "Reward is not the optimization target" for why a GPT system probably won't be goal-oriented toward becoming the best at predicting tokens, and thus wouldn't do the things you suggested (capturing humans). The way we train AIs matters for what their behaviours look like, and text transformers trained on a prediction loss seem to behave more like Simulators. This doesn't make them safe: they could be prompted to simulate misaligned agents (through misuse or accident), or contain inner misaligned mesa-optimisers.
I've linked some good resources that directly answer your question, but to read more broadly on AI safety I can point you towards the AGI Safety Fundamentals course, which you can read online or go through with a reading group. More generally, you can head over to AI Safety Support, check out their "Lots of Links" page, and join the AI Alignment Slack, which has a channel for questions too.
Finally, how does complexity emerge from simplicity? It's hard to answer in detail for AI, and you probably need to delve into those details to get a real picture, but there's at least a strong reason to think it's possible: we exist. Life originated from "simple" processes (at least in the sense of being mechanistic and non-agentic): chemical reactions and so on. It evolved into cells, then multicellular organisms, and kept growing in complexity. Look into the history of life and evolution and you'll have one answer to how simplicity (optimizing for reproductive fitness) led to self-improvement and self-awareness.
Thanks, that is exactly the kind of stuff I am looking for: more bookmarks!
Complexity from simple rules. I wasn't looking in the right direction for that one; since you mention evolution, it makes absolute sense how complexity can emerge from simplicity. So many things come to mind now that it's kind of embarrassing. Go has a simpler rule set than chess, but is far more complex. Atoms are fairly simple, and yet they interact to form any and all complexity we ever see. Conway's Game of Life; it's sort of a theme. For each of those things there is a simple set of rules, but the complexity usually comes from a very large number of elements or possibilities. It does follow, then, that larger and larger networks could be the key. Funny that it still isn't intuitive for me, despite the logic of it. I think that is a sign of a lack of deep understanding, or something like that; either way I'll probably spend a bit more time thinking on this.
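To make the Game of Life point concrete, here's a minimal sketch of the full rule set (the standard birth-on-3, survive-on-2-or-3 rules, in plain Python; the glider coordinates are just the usual textbook starting pattern). Two short rules are the entire "physics", yet they support patterns that move, grow, and even compute.

```python
from collections import Counter

def step(live_cells):
    """One Game of Life update. live_cells is a set of (x, y) coordinates."""
    # Count, for every cell on the (unbounded) grid, how many live neighbours it has.
    neighbour_counts = Counter(
        (x + dx, y + dy)
        for (x, y) in live_cells
        for dx in (-1, 0, 1)
        for dy in (-1, 0, 1)
        if (dx, dy) != (0, 0)
    )
    # The entire rule set: a cell is alive next step iff it has exactly 3 live
    # neighbours, or exactly 2 and was already alive.
    return {
        cell
        for cell, n in neighbour_counts.items()
        if n == 3 or (n == 2 and cell in live_cells)
    }

# A "glider": five cells that, under nothing but the two rules above,
# crawl across the grid forever.
glider = {(1, 0), (2, 1), (0, 2), (1, 2), (2, 2)}
for _ in range(4):
    glider = step(glider)
print(sorted(glider))   # the same five-cell shape, translated by (1, 1)
```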
Another interesting question is what this type of consciousness would look like; it will be truly alien. The sci-fi I have read usually makes AIs seem like humans, just with extra capabilities. However, we humans have so many underlying functions that we never even perceive; we understand how many of them affect us, but not all. An AI will function completely differently, so which assumptions based on human consciousness are even valid?