I think you should have addressed points by referring to their numbers, quoting only the parts that are easier to quote than to refer to; it would have reduced the size of the comment.
I am going to address only one object-level point:
synthetic data letting us control what the AI learns and what they value
No, obviously, we can’t control what AI learns and values using synthetic data in practice, because we need AI to learn things that we don’t know. If you feed an AI all physics and chemistry data expecting to get nanotech, you are doing this because you expect the AI to learn facts and principles that you don’t know about and, therefore, can’t control. You don’t know about these facts and principles and can’t control them, because otherwise you would be able to design nanotech yourself.
Of course, by “can’t” I mean “practically can’t”, not “can’t in principle”. But to do this you basically need to build a “GOFAI in trenchcoat of SGD”, and that doesn’t look competitive with any other method of achieving AGI, unless you manage to make yourself AGI Czar.
Okay, the reason for this happening:
If you feed an AI all physics and chemistry data expecting to get nanotech, you are doing this because you expect the AI to learn facts and principles that you don’t know about and, therefore, can’t control. You don’t know about these facts and principles and can’t control them, because otherwise you would be able to design nanotech yourself.
This is basically a combination of very high sample efficiency, a good-enough ground-truth reward signal, very good online learning, very good credit assignment, and good handling of uncertainty and simulatability.
But for our purposes, if we decided that we didn’t in fact want the AI to learn nanotech, we could just remove that data from its experience, in a way we couldn’t with humans, which is quite a big win for misuse concerns.
But my point here was that we can give the model large sets of data on values early in training, and we can both iteratively refine those values by testing how the model generalizes the value data to new situations and rely on the fact that alignment generalizes further than capabilities do.
I think my crux here is this:
Of course, by “can’t” I mean “practically can’t”, not “can’t in principle”. But to do this you basically need to build a “GOFAI in trenchcoat of SGD”, and that doesn’t look competitive with any other method of achieving AGI, unless you manage to make yourself AGI Czar.
I think this is just not correct. While we should start making large datasets now, my crux is that I believe far less data is necessary for models to generalize alignment. We aren’t trying to hand-code everything; instead we rely on the fact that models will generalize better and better on human values as they get more capable, due to alignment generalizing further than capabilities and there likely being a simple core to alignment. So I don’t think we need a GOFAI in trenchcoat of SGD.
We’ve discussed this before (https://x.com/quetzal_rainbow/status/1834268698565059031). While I agree with TurnTrout that RL doesn’t maximize reward by definition, and that the reward maximization hypothesis isn’t an automatic consequence of RL training, I do think that something like reward maximization might well occur in practice. More generally, I think the post ignores the possibility that future RL could generalize better towards maximizing the reward function.
(It seems like the “here” link got mixed up with the word “here”?)
Alright, I fixed the link, though I don’t know why you can’t turn non-LessWrong links into links with a shorter title.