the general strategy of “dealing with things as they come up” is much more viable under continuous takeoff. Therefore, if a continuous takeoff is more likely, we should focus our attention on questions which fundamentally can’t be solved as they come up.
Can you give some examples of AI alignment failure modes which would be definitely (or probably) easy to solve if we had a reproducible demonstration of that failure mode sitting in front of us? It seems to me that none of the current ongoing work is in that category.
When I imagine that type of iterative debugging, the example in my mind is a bad reward function that the programmers are repeatedly patching, which would be a bad situation because it would probably amount to a “nearest unblocked strategy” loop.
Can you give some examples of AI alignment failure modes which would be definitely (or probably) easy to solve if we had a reproducible demonstration of that failure mode sitting in front of us? It seems to me that none of the current ongoing work is in that category.
When I imagine that type of iterative debugging, the example in my mind is a bad reward function that the programmers are repeatedly patching, which would be a bad situation because it would probably amount to a “nearest unblocked strategy” loop.