I am trying to come up with a reason this isn’t 99%? Why this is the hard step at all?
tl;dr (pls correct me if this summary is wrong) the argument is that the more likely hypothesis is that the model will be simply myopically obsessed with getting reward. (Though I think Joe might still think that the model also might be actually alignd/honest/etc.? Not sure why, would need to reread.)
tl;dr (pls correct me if this summary is wrong) the argument is that the more likely hypothesis is that the model will be simply myopically obsessed with getting reward. (Though I think Joe might still think that the model also might be actually alignd/honest/etc.? Not sure why, would need to reread.)