[EDIT 24h LATER: THIS MAY HAVE BEEN BASED ON A MISREAD OF THE PAPER. more on that later. I’ve reversed my self-vote.]
initial thoughts from talking to @Jozdien about this on discord: seems like this might still have the same sort of issues that RLHF does? possibly less severely, possibly more severely. might make things less interpretable? looking forward to seeing them comment their thoughts directly in a few hours.
edit: they linked me to https://www.lesswrong.com/posts/vwu4kegAEZTBtpT6p/thoughts-on-the-impact-of-rlhf-research?commentId=eB4JSiMvxqFAYgBwF, saying RLHF makes it a lot worse but PHF is probably still not that great. if it’s better at apparent alignment, it’s probably worse, not better, or something? this is still “speedread hunches”, might change views after read passes 2 and 3 of the paper (2 page pdf). a predictive model being unpredictive of reality fucks with your prior; what you want is to figure out which simulacra detoxify additional input, not which prior will make a model unwilling to consider the possibility of toxicity.
so, something to do with is-models vs ought-models, perhaps.
edit 2: maybe also related to https://www.lesswrong.com/posts/rh477a7fmWmzQdLMj/asot-finetuning-rl-and-gpt-s-world-prior
“changes to the simulator should be at the mechanics level vs the behavior level”… but Jozdien says that might be too hunchy a sentence and could get misinterpreted due to ambiguous phrasing. maybe anti-toxicity isn’t the target you want: if you have a richly representative prior and it’s missing toxicity, it’ll be blind to the negative possibilities and have trouble steering away from them?
-> anti-toxicity isn’t the target you want because if you have a prior and make changes at the behavior level, you’re making a model of a world that functions differently from ours, and the internal mechanics of that world may not actually be particularly “nice”/“good”/“useful”/safe for creating that world in our world/etc.