Thanks for this manuscript; it is already key reading for newcomers.
[principled methods] constitutes an important difference from other technologies such as planes and bridges, whose safety we can ensure because we understand the principles that govern them.
This innocent-looking sentence is actually a very strong statement. Could we ensure the safety of TWA Flight 800 and the Tacoma Narrows Bridge? Or could we not, though we could have if only we had better understood the principles that govern electric sparks and mechanical resonance? IMHO we should instead credit learning by trial and error, and that's a shame, as alignment likely includes errors we can't learn from. On the other hand, you have a point that better interpretability should help, so I suggest replacing « whose safety we can ensure… » with a weaker statement, for example « whose safety is made easier… ».
Good paper! Thank you for sharing. I have a few nit-picky suggestions regarding wording and grammar. I will post them here rather than emailing them directly because some of them are subjective. This way, others can feel free to chime in if they feel inclined to nit-pick my nit-picks :)
“artificial general intelligence (AGI) may surpass” → “artificial general intelligence (AGI) seems likely to surpass” (I feel like “may” is a somewhat weak word in this context, but I don’t feel strongly here.)
“undesirable (in other words, misaligned)” → “undesirable (i.e., misaligned)” (This is precisely the situation when “i.e.,” applies, and I think it’s cleaner.)
“trained in similar ways as today’s” → “trained in similar ways to today’s” (See https://english.stackexchange.com/questions/170475/in-a-similar-way-as-or-in-a-similar-way-to)
“However, while caution is deserved, there are several reasons” → “While caution is indeed deserved, there are nonetheless several reasons” (I don’t think “However” is quite right here.)
“Firstly,”, “Secondly,”, etc. → “First,”, “Second,”, etc. (These words are used a couple of times throughout the paper. It is generally recommended to use the simple “First” instead of “Firstly”. See https://www.merriam-webster.com/words-at-play/first-or-firstly )
“However, we hope” → “That said, we hope” (Again, I don’t think “However” is quite the right word here. “However” implies a certain contrast with what was previously said that doesn’t apply in this case.)
“‘energy’ in 17th-century physics; ‘evolutionary fitness’ in 19th-century biology; and ‘computation’ in 20th-century mathematics” → “‘energy’ in 17th-century physics, ‘evolutionary fitness’ in 19th-century biology, and ‘computation’ in 20th-century mathematics” (Semicolons (“;”) can indeed sometimes be used for lists, especially those with complex clauses. That said, commas (“,”) are preferred in grammatically simple cases like this.)
“However, RLHF may reinforce” → “Unfortunately, RLHF may reinforce” (Again, “However” is not quite the right word here.)