> I think what’s really interesting to me is making sure the system is reasoning at all those levels, because I have an intuition that that’s necessary (to get concepts we care about right).
I’m super on board with this desideratum, and agree that it would not be a good move to change it to some fixed number of levels. I also agree that from a conceptual standpoint many ideas are “about all the levels”.
My questions / comments are about the implementation proposed in this post. I thought that you were identifying “levels of reasoning” with “depth in the idealized recursive QAS tree”; if that’s the case I don’t see how feedback at one level generalizes to all the other levels (feedback at that level is used to make the QAS at that level, and not other levels, right?)
I’m pretty sure I’m just failing to understand some fact about the particular implementation, or what you mean by “levels of reasoning”, or its relation to the idealized recursive QAS tree.
> This is where my proposal differs from proposals more reliant on human imitation. Any particular thing we can say about what better reasoning would look like, the system attempts to incorporate.
I would argue this is also true of learning from human preferences (comparisons), amplification, and debate; not sure if you would disagree. I agree straight human imitation wouldn’t do this.
> In principle, you could re-start the whole training process after each interaction, so that each new piece of training data gets equal treatment (it’s all part of what’s available “at the start”).
Huh? I thought the point was that your initial feedback can help you interpret later feedback. So maybe you start with Boltzmann rationality, and then you get some feedback from humans, and now you realize that you should interpret all future feedback pragmatically.
It seems like you have to choose one of two options:
1. Order of feedback does matter, in which case bad early feedback can lock you in to a bad outcome.
2. Order of feedback doesn’t matter, in which case you can’t improve your interpretation of feedback over time (at least, not in a consistent way).
(This seems true more generally for any system that aims to learn at all the levels, not just for the implementation proposed in this post.)
> It seems to me like you’re trying to illustrate something like “Abram’s proposal doesn’t get at the bottlenecks”.
I think it’s more like “I’m not clear on the benefit of this proposal over (say) learning from comparisons”. I’m not asking about bottlenecks; I’m asking about what the improvement is.
> I’m curious how you see whitelisting working.
The same way I see any other X working: we explicitly train the neural net to satisfy X through human feedback (perhaps using amplification, debate, learning the prior, etc). For a whitelist, we might be able to do something slightly different: we train a classifier to say whether the situation is or isn’t in our whitelist, and then only query the agent when it is in our whitelist (otherwise reverting to a safe baseline). The classifier and agent share most of their weights.
Then we also do a bunch of stuff to verify that the neural net actually satisfies X (perhaps adversarial training, testing, interpretability, etc). In the whitelisting case, we’d be doing this on the classifier, if that’s the route we went down.
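To make that shape concrete, here is a minimal sketch of the shared-weights / dispatch setup I have in mind (PyTorch-style; the class and function names are hypothetical stand-ins rather than a worked-out design):

```python
import torch
import torch.nn as nn

class WhitelistedAgent(nn.Module):
    """Agent and whitelist classifier sharing most of their weights."""

    def __init__(self, obs_dim: int, n_actions: int, hidden: int = 256):
        super().__init__()
        # Shared trunk: most of the capacity lives here.
        self.trunk = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        # Small heads on top of the shared features.
        self.policy_head = nn.Linear(hidden, n_actions)  # the agent
        self.whitelist_head = nn.Linear(hidden, 1)       # "is this situation in our whitelist?"

    def forward(self, obs: torch.Tensor):
        features = self.trunk(obs)
        return self.policy_head(features), torch.sigmoid(self.whitelist_head(features))


def act(model: WhitelistedAgent, obs: torch.Tensor, safe_baseline, threshold: float = 0.5):
    """Only query the agent when the classifier says we're in the whitelist;
    otherwise revert to a safe baseline policy."""
    action_logits, in_whitelist = model(obs)
    if in_whitelist.item() < threshold:
        return safe_baseline(obs)
    return int(action_logits.argmax())
```

Sharing the trunk is the point: the classifier judges “is this in the whitelist?” from the same features the agent acts on, and the verification effort (adversarial training, testing, interpretability) can then focus on the whitelist head.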
It feels like your beliefs about what kind of methods might work for “merely way-better-than-human” systems are a big difference between you and me, which might be worth discussing more, although I don’t know if it’s very central to everything else we’re discussing.
> My questions / comments are about the implementation proposed in this post. I thought that you were identifying “levels of reasoning” with “depth in the idealized recursive QAS tree”; if that’s the case I don’t see how feedback at one level generalizes to all the other levels (feedback at that level is used to make the QAS at that level, and not other levels, right?)
> I’m pretty sure I’m just failing to understand some fact about the particular implementation, or what you mean by “levels of reasoning”, or its relation to the idealized recursive QAS tree.
OK. Looking back, the post really doesn’t address this, so I can understand why you’re confused.
My basic argument for cross-level generalization is that a QAS has to be represented compactly while being prepared to answer questions at any level; so, it has to generalize across levels. But there are also other effects.
So, suppose I give the system feedback about some specific 3rd-level judgement. The way I imagine this happening is that the feedback gets added to a big dataset, and evaluating QASs on this dataset is part of how the initial value function, Hv, does its thing. Hv should also prefer QASs which produce value functions that are pretty similar to Hv, so that this property is approximately preserved as the system gets amplified. So, a few things happen:
- The feedback is added to the dataset, so it is used to judge the next generation of QASs (really, the next learned distribution over QASs), which will therefore avoid doing poorly on this 3rd-level judgement.
- This creates some cross-level generalization, because the QASs which perform poorly on it probably do so for reasons not isolated to 3rd-level judgements. In NN terms, there are shared hidden neurons which serve multiple different levels. In algorithmic-information-theory terms, there is mutual information between levels, so programs which do well will share information across levels rather than represent each level separately.
- The feedback is also used as an example of how to judge (i.e., the 4th-level skill which would be able to generate this specific 3rd-level feedback). This also constrains the next generation of QASs, and so similarly has a cross-level generalization effect, due to shared information in the QAS representation (e.g., multi-level neurons, bits of code relevant across multiple levels, etc.).
- Similarly, this provides indirect evidence about the 5th level, 6th level, etc., because just as the 4th level needs to be such that it could have generated the 3rd-level feedback, the 5th level needs to be such that it would approve of such 4th levels, the 6th level needs to approve of 5th levels with that property, and so on.
So, as you can see, feedback on one level propagates information to all the other levels along many pathways.
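To make the shared-representation point concrete, here is a toy illustration (the architecture and names are made up purely for this sketch, not the actual proposal): a single compact model answers questions at every level, so one gradient step on a piece of 3rd-level feedback moves its answers at the other levels too.

```python
import torch
import torch.nn as nn

# One compact QAS that must be prepared to answer questions at *any* level.
class ToyQAS(nn.Module):
    def __init__(self, q_dim: int = 32, hidden: int = 64):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(q_dim + 1, hidden), nn.ReLU(),
                                   nn.Linear(hidden, hidden), nn.ReLU())
        self.answer_head = nn.Linear(hidden, 1)

    def forward(self, question: torch.Tensor, level: torch.Tensor):
        # The level is just another input; all levels share the trunk.
        x = torch.cat([question, level.unsqueeze(-1)], dim=-1)
        return self.answer_head(self.trunk(x))

qas = ToyQAS()
opt = torch.optim.SGD(qas.parameters(), lr=0.1)

question = torch.randn(1, 32)
levels = torch.arange(1, 7, dtype=torch.float32)  # levels 1..6

before = torch.stack([qas(question, lvl.view(1)) for lvl in levels]).detach()

# One piece of 3rd-level feedback: "the answer here should be higher".
loss = (1.0 - qas(question, torch.tensor([3.0]))).pow(2).mean()
opt.zero_grad()
loss.backward()
opt.step()

after = torch.stack([qas(question, lvl.view(1)) for lvl in levels]).detach()
print(after - before)  # generically nonzero at every level, not just level 3
```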
>> This is where my proposal differs from proposals more reliant on human imitation. Any particular thing we can say about what better reasoning would look like, the system attempts to incorporate.
> I would argue this is also true of learning from human preferences (comparisons), amplification, and debate; not sure if you would disagree. I agree straight human imitation wouldn’t do this.
I would argue it’s not true of the first: learning human preferences will fail to account for ways in which humans agree that human preference judgements are error-prone (e.g., the story about how judges judge more harshly right before lunch).
As for iterated amplification, it definitely has this property “in spirit” (i.e., if everything works well), but whether particular versions have this property is another question. Specifically, it’s possible to ask “how should I answer questions like this?” and other such meta-questions, to try to get debiasing information before coming up with a strategy to answer a question. However, it’s up to the human in the box to come up with these strategies, and you can’t go meta too much without going into an infinite loop. The human in the box also has to have a good strategy for searching for this kind of meta-information.
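As a caricature of the meta-question pattern (illustrative pseudocode only; `human_in_the_box` is a stand-in for whatever answers leaf questions), with the depth cap making the “can’t go meta forever” point explicit:

```python
def answer(question: str, human_in_the_box, depth: int = 2) -> str:
    """Amplification-style answering with a bounded amount of going meta.

    Before answering, we can ask "how should I answer questions like this?"
    to fish for debiasing advice, but the depth cap means we cannot go meta
    forever, and the human in the box still has to think of good
    meta-questions to ask in the first place.
    """
    if depth == 0:
        # No budget left for going meta; answer directly.
        return human_in_the_box(question, advice=None)

    # One level of meta: ask how questions like this should be approached.
    advice = answer(f"How should I go about answering: {question!r}?",
                    human_in_the_box, depth=depth - 1)
    return human_in_the_box(question, advice=advice)
```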
>> In principle, you could re-start the whole training process after each interaction, so that each new piece of training data gets equal treatment (it’s all part of what’s available “at the start”).
> Huh? I thought the point was that your initial feedback can help you interpret later feedback. So maybe you start with Boltzmann rationality, and then you get some feedback from humans, and now you realize that you should interpret all future feedback pragmatically.
Every piece of feedback gets put into the same big pool which helps define Hv, the initial (“human”) value function. Subsequent value functions also look at this big pool. So, in the hypothetical where we re-start the training every time we give feedback:
- First, the initial naive interpretation is used on every piece of feedback ever. This helps define D1, the first learned distribution over QASs.
- Then, D1 uses its new, slightly refined interpretation of all the feedback to form new judgements of QAS quality, which help define D2.
- We keep iterating like this, getting better interpretations of feedback, which we use to generate even better interpretations. We do this until we reach some stopping point, which might depend on safety concerns (e.g., stopping while we’re still confident the system has not drifted too much).
- We then interact with the resulting system, generating more feedback for a while, until we have produced enough feedback that we want to re-start the whole process again.
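In rough pseudocode (every helper here, such as fit_qas_distribution, converged, and interact_and_collect_feedback, is a hypothetical stand-in for the actual machinery, not a concrete spec), the restart loop looks something like this:

```python
def retrain_from_scratch(all_feedback, naive_interpretation,
                         fit_qas_distribution, converged, max_iters=100):
    """Restart training on the full pool of feedback collected so far.

    Every restart begins from the naive interpretation (so no piece of
    feedback is ever ignored out of overconfidence), then bootstraps better
    interpretations until a stopping point / approximate fixed point.
    """
    # Hv-style start: judge QASs on the pooled feedback, naively interpreted.
    D = fit_qas_distribution(all_feedback, interpret=naive_interpretation)  # D1

    for _ in range(max_iters):
        # D_{n+1}: re-judge the *same* pool, now interpreted by D_n.
        D_next = fit_qas_distribution(all_feedback, interpret=D.interpret_feedback)
        if converged(D, D_next):  # or stop earlier on safety grounds
            return D_next
        D = D_next
    return D

# Interaction loop (sketch): order of feedback doesn't matter, because each
# restart re-reads the whole pool from the beginning.
#
#   all_feedback = []
#   while True:
#       D = retrain_from_scratch(all_feedback, naive_interpretation,
#                                fit_qas_distribution, converged)
#       all_feedback += interact_and_collect_feedback(D)
```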
This procedure ensures that the system doesn’t outright ignore any feedback due to overconfidence (because all feedback is used by Hv on every restart), while also ensuring that the most sophisticated model is (eventually) used to interpret feedback. The result (if you iterate to convergence) is a fixed point where the distribution Dn would reproduce itself, so in a significant sense the end result is as if you had used the most sophisticated feedback-interpretation model from the beginning. At the same time, what you actually use at the beginning is the naive feedback-interpretation model, which gives us the guarantee that, e.g., if you stomp out a self-aggrandizing mental pattern (which would pointedly ignore feedback against itself), it actually gets stomped out.
That’s the ideal I’d shoot for.
Most of this makes sense (or, perhaps more accurately, sounds like it might be true, though there’s a good chance that if I reread the post and all the comments I’d object again / get confused somehow). One thing, though:
> Every piece of feedback gets put into the same big pool which helps define Hv, the initial (“human”) value function. [...]
Okay, I think with this elaboration I stand by what I originally said:
> It seemed to me like since the first few bits of feedback determine how the system interprets all future feedback, it’s particularly important for those first few bits to be correct and not lock in e.g. a policy that ignores all future feedback.
Specifically, isn’t it the case that the first few bits of feedback determine D1, which might then lock in some bad way of interpreting feedback (whether existing or future feedback)?
> Okay, I think with this elaboration I stand by what I originally said
You mean with respect to the system as described in the post (in which case I 100% agree), or the modified system which restarts training upon new feedback (which is what I was just describing)?
Because I think this is pretty solidly wrong for the system that restarts.
> Specifically, isn’t it the case that the first few bits of feedback determine D1, which might then lock in some bad way of interpreting feedback (whether existing or future feedback)?
All feedback so far determines the new D1 when the system restarts training.
(Again, I’m not saying it’s feasible to restart training all the time, I’m just using it as a proof-of-concept to show that we’re not fundamentally forced to make a trade-off between (a) order independence and (b) using the best model to interpret feedback.)
I continue to not understand this, but it seems like such a simple question that there must just be some deeper misunderstanding of the exact proposal we’re now debating. It seems not particularly worth it to find this misunderstanding; I don’t think it will really teach us anything conceptually new.
(If I did want to find it, I would write out pseudocode for the new proposed system and then try to make a more precise claim in terms of the variables in the pseudocode.)
Fair.