I must be missing something here. Isn’t optimizing necessary for superhuman behavior? So isn’t “superhuman behavior” a strictly stronger requirement than “being a mesaoptimizer”? So isn’t it clear which one happens first?
Fast imitations of subhuman behavior, or imitations of augmented humans, are also superhuman. So is planning against a human-level imitation. And so on.
It’s unclear if systems trained in that way will be imitating a process that optimizes, or will be optimizing in order to imitate. (Presumably they are doing both to varying degrees.) I don’t think this can be settled a priori.
This “imitating an optimizer” / “optimizing to imitate” dichotomy seems unnecessarily confusing to me. Isn’t it just inner alignment / inner misalignment (with the human behavior you’re being trained on)? If you’re imitating an optimizer, you’re still an optimizer.
I agree with this. If the key idea is, for example, that optimising imitators generalise better than imitations of optimisers, or, as a second example, that they pursue simpler goals, then it seems to me that it’d be better to draw distinctions directly in terms of generalisation or goal simplicity rather than in terms of optimising imitators versus imitations of optimisers.
Sorry, I should be more specific. We are talking about AGI safety, and it seems unlikely that running narrow AI faster gets you AGI. I’m not sure if you disagree with that. I also don’t understand what you mean by “imitations of augmented humans” and “planning against a human-level imitation”.