Introduction
Hello everyone,
I’m a long-time, on-and-off lurker here. I made my way through the Sequences quite a while ago, with mixed success in implementing some of them. Many of the ideas are intriguing, and I would love to have enough spare cycles to play with them. Unfortunately, often enough I find I don’t have the capacity to do this properly, with life getting in the way. With that (and not only that) in mind, I’m going to take a sabbatical this summer for at least three months to catch up and generally tend to the things I’ve been putting off.
As the sabbatical approaches, I’ve been looking around and was hit by some information about the AGI alignment problem, a wake-up call of sorts. For now I’m going through the materials, but it’s not a field I’m all that familiar with. I’m a programmer by trade, so I can parse most of the material, but some of the ideas are difficult to understand properly. I’ll dig deeper on a later pass; for now I’m trying to get an overall feel for the area.
This brings me to a question that popped into my mind, and I have yet to stumble upon anything even resembling an answer, possibly because I don’t know where to look yet. If someone can point me in the right direction, it would be appreciated.
Looking for a clarification
Context:
As I understand it, the core of the alignment problem is the question “can we trust the machine to do what we want it to do, as opposed to something else?” The whole business about the hidden complexity of wishes, the orthogonality thesis, etc. Basically, not handing control over to a potentially dangerous agent.
The machines we’re currently most worried about are LLMs or their successors, potentially turning into AGI/superintelligence.
We would like to have a method of ensuring that these are aligned. Many of the proposed methods involve having one machine validate another machine’s alignment, since we will run out of “human-based” capacity due to the intelligence disparity.
Since my background is in programming, I tend to see everything through that lens. So to me an LLM is “just” a large collection of weights that we feed some initial input and watch what comes out the other side,[1] plus a machine that does all these updates.
If we don’t mind the process being slow, this could be achieved by a single “crawler” machine that would go through the matrix field by field and do the updates. Since the machine is finite (albeit huge), this would work.
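To make the “crawler” picture concrete, here is a toy sketch in Python (using numpy, with made-up tiny sizes): it visits the weight matrix one field at a time and does one multiply-accumulate per visit. It computes the same thing as an ordinary matrix-vector product, just serially.

```python
import numpy as np

# A deliberately slow "crawler": visit the weight matrix one field at a time and
# do one multiply-accumulate per visit. Functionally identical to y = W @ x,
# just serial. Toy sizes; a real LLM layer has vastly more entries, but it is
# still finite.
def crawl_layer(W, x):
    y = np.zeros(W.shape[0])
    for i in range(W.shape[0]):          # each output neuron
        for j in range(W.shape[1]):      # each incoming connection
            y[i] += W[i, j] * x[j]       # one field, one update
    return y

W = np.arange(6, dtype=float).reshape(2, 3)
x = np.array([1.0, 0.5, -1.0])
assert np.allclose(crawl_layer(W, x), W @ x)   # same answer, much slower
```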
Let’s now rephrase the alignment problem. We have a goal A that we want to achieve and some behavior B that we want to avoid. So we do the whole training thing that I don’t know much about,[2] resulting in the “file with weights”. During this process we steer the machine towards producing A while avoiding B, as far as we can observe.
Now we take the file of weights and write the small updating program (accepting the slowness for the sake of clarity). Pseudocode (a rough Python sketch of the same loop follows the list):
1. Grab the first token of the input.[3]
2. Starting from the input layer, go neuron by neuron to update the network.
3. If the output notes “A”, stop.
4. Else feed the output of the network plus the subsequent input back into the input layer and go to 1.
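A minimal Python sketch of the loop above, just to pin down its shape: `next_output` is a hypothetical stand-in for one full pass through the network, and the string "<A>" stands in for “the output notes A”; neither is a real LLM API.

```python
def run(next_output, input_tokens):
    tokens = list(input_tokens)        # step 1: grab the input
    while True:
        out = next_output(tokens)      # step 2: one full pass through the network
        if out == "<A>":               # step 3: the output notes "A" -> stop
            return tokens
        tokens.append(out)             # step 4: feed the output back in, go to 1
```

Note that nothing here guarantees the loop ever returns, which is exactly where the halting question below comes from.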
Of course, we want to avoid B. The only point in time when we can be sure that B is not on the table is when the machine is not moving anymore, i.e. when the machine halts.
So the clarification I’m seeking is: how is alignment different from the halting problem we already know about?
That is, if we know we can’t predict, using a machine of similar power, whether the machine will halt, why do we think alignment should follow a different set of rules?
Afterword:
I’m aware this might be an obvious question for someone already in the field; still, considering it sounds almost silly, I was somewhat dismayed not to find it spelled out anywhere. Maybe the answer follows from something I don’t see, or maybe there is just a hole in my reasoning.
It bothered me enough to write this post; at the same time, I’m not sure enough about my reasoning to make it a full article, so I’m posting it in the introduction section instead. Any help is appreciated.
[1] Of course it is many orders of magnitude more complex under the hood, but stripped to the basics, this is it. There are no magic-like parts doing something inexplicable.
[2] I’ve fiddled with some neural networks, did some small training runs, and even tried implementing the basic logic from scratch myself, though that was quite some time ago. So I have some idea of what is going on. However, I’m not up to date on state-of-the-art approaches, and I’m not an expert by any stretch of the imagination.
[3] All the input that we want to provide to the machine: the first frame of a video, a text prompt, readings from sensors, or whatever else.
Rob Miles’ YouTube channel has some good explanations about why alignment is hard.
We can already do RLHF, the alignment technique that made ChatGPT and derivatives well-behaved enough to be useful, but we don’t expect this to scale to superintelligence. It adjusts the weights based on human feedback, but this can’t work once the humans are unable to judge actions (or plans) that are too complex.
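As a very rough illustration of “adjusts the weights based on human feedback” (and nothing more): the toy below nudges a tiny categorical “policy” toward responses a pretend human rates positively. Real RLHF trains a reward model from human comparisons and then optimizes the LLM against it (e.g. with PPO); the four canned responses, ratings, and update rule here are made up purely for illustration.

```python
import numpy as np

# Toy, not real RLHF: a categorical "policy" over 4 canned responses, updated
# in the direction of human ratings with a REINFORCE-style step.
rng = np.random.default_rng(0)
logits = np.zeros(4)                        # the "weights" of the toy policy
human_rating = [+1.0, -1.0, +1.0, -1.0]     # pretend humans like responses 0 and 2

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

lr = 0.5
for _ in range(200):
    probs = softmax(logits)
    a = rng.choice(4, p=probs)              # sample a response
    r = human_rating[a]                     # human gives feedback on it
    grad_logp = -probs                      # d/dlogits of log p(a) ...
    grad_logp[a] += 1.0                     # ... is onehot(a) - probs
    logits += lr * r * grad_logp            # nudge weights toward liked outputs

print(softmax(logits))                      # probability mass moves to 0 and 2
```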
If we don’t mind the process being slow, this could be achieved by a single “crawler” machine that would go through the matrix field by field and do the updates. Since the machine is finite (albeit huge), this would work.
Not following. We can already update the weights. That’s training, tuning, RLHF, etc. How does that help?
We have a goal A that we want to achieve and some behavior B that we want to avoid.
No. We’re talking about aligning general intelligence. We need to avoid all the dangerous behaviors, not just a single example we can think of, or even numerous examples. We need the AI to output things we haven’t thought of, or why is it useful at all? If there’s a finite and reasonably small number of inputs/outputs we want, there’s a simpler solution: that’s not an AGI—it’s a lookup table.
You can think of the LLM weights as a lossy compression of the corpus it was trained on. If you can predict text better than chance, you don’t need as much capacity to store it, so an LLM could also be a component in a lossless text compressor. But the predictors generated by the training process generalize beyond their corpus to things that haven’t been written yet: the model carries an internal model of the possible worlds that could have generated the corpus. That’s intelligence.
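A toy way to see “predict better, store smaller”: an ideal entropy coder spends about -log2(p) bits on a symbol the predictor assigned probability p, so any predictor that beats chance shrinks the encoding. Both “models” below are made up for illustration and are obviously nothing like a real LLM.

```python
import math

text = "abababab"                      # alphabet {a, b}

def uniform(prev, ch):                 # no model: every character costs 1 bit
    return 0.5

def alternation_model(prev, ch):       # toy model: expects a and b to alternate
    if not prev:
        return 0.5
    return 0.9 if ch != prev[-1] else 0.1

def code_length_bits(text, predict):   # ideal cost: sum of -log2(p) per symbol
    return sum(-math.log2(predict(text[:i], ch)) for i, ch in enumerate(text))

print(code_length_bits(text, uniform))            # 8.0 bits
print(code_length_bits(text, alternation_model))  # about 2.1 bits
```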
A problem is that:
we don’t know the specific goal representation (the actual string in place of “A”),
we don’t know how to evaluate the LLM’s output (in particular, how to check whether a suggested plan actually works toward the goal),
we have a large (presumably infinite, non-enumerable) set of behaviors B we want to avoid, and
we have explicit representations for some items in B, mentally understand a bit more, and don’t understand or even know about the other unwanted things.
If I understand correctly, you’re basically saying:
We can’t know how long it will take for the machine to finish its task. In fact, it might take an infinite amount of time, due to the halting problem which says that we can’t know in advance whether a program will run forever.
If our machine took an infinite amount of time, it might do something catastrophic in that infinite amount of time, and we could never prove that it doesn’t.
Since we can’t prove that the machine won’t do something catastrophic, the alignment problem is impossible.
The halting problem doesn’t say that we can’t know whether any program will halt, just that we can’t determine the halting status of every single program. It’s easy to “prove” that a program that runs an LLM will halt. Just program it to “run the LLM until it decides to stop; but if it doesn’t stop itself after 1 million tokens, cut it off.” This is what ChatGPT or any other AI product does in practice.
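Concretely, the “cut it off” wrapper is a small change to the loop from the post: add a hard token budget and termination follows directly from the loop bound. `sample_next` and the "<eos>" marker are hypothetical stand-ins rather than any particular API.

```python
MAX_TOKENS = 1_000_000

def generate(sample_next, prompt_tokens):
    tokens = list(prompt_tokens)
    for _ in range(MAX_TOKENS):        # bounded loop: provably finishes
        tok = sample_next(tokens)
        if tok == "<eos>":             # the model decided to stop on its own
            break
        tokens.append(tok)
    return tokens                      # reached in finite time either way
```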
Also, the alignment problem isn’t necessarily about proving that an AI will never do something catastrophic. It’s enough to have good informal arguments that it won’t do something bad with (say) 99.99% probability over the length of its deployment.