If you try to give feedback during training, there is a risk you’ll just reward it for being deceptive. One advantage to selecting post hoc is that you can avoid incentivizing deception.
I agree with you (both). It’s a framing difference iff you can translate back and forth. My thinking was that the problem might be set up so that it’s “easy” to recognize but difficult to implement. If you can define a strategy which sets it up to be easy to recognize, that is.
Another way I thought about it is that you can use your ‘meta’ knowledge about human imitators versus direct translators to give you a probability over all reporters. The idea is to approach the problem not with certainty of a solution but with recognized uncertainty (I refrain from using ‘known uncertainty’ here, because knowing how much you don’t know something is hard).
I obviously don’t have a good strategy, else my name would be up above, haha.
But to give it my attempt: one way to get this knowledge is to study, for example, the way NNs generalize, and how likely it is that you create a direct translator versus an imitator. The proposal I sent in used this knowledge and submitted a lot of reporters to random input. In the complex input space (outside the simple training set), translators should behave differently from imitators. Given that direct translators are created less frequently than imitators (which is what the report suggested), the smaller group, or the outliers, are more likely to be translators than imitators.
Using this approach you can build evidence. That’s where I stopped, because time was up.
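To make that a bit more concrete, here is a minimal sketch of the kind of procedure I have in mind. Everything in it is an assumption for illustration: the reporter interface, the random-probe distribution, and the distance-to-the-majority score are placeholders, not the actual contest setup or my submitted proposal.

```python
import numpy as np

def probe_reporters(reporters, n_probes=200, input_dim=64, seed=0):
    """Feed the same batch of random (off-distribution) inputs to every
    reporter and stack their answers into one row per reporter."""
    rng = np.random.default_rng(seed)
    probes = rng.normal(size=(n_probes, input_dim))  # random inputs, far from the simple training set
    return np.stack([np.concatenate([reporter(x) for x in probes]) for reporter in reporters])

def outlier_scores(answer_matrix):
    """Score each reporter by how far its off-distribution behaviour sits from
    the bulk of the population; high scores mark candidate direct translators,
    under the assumption that imitators form the big, mutually similar group."""
    centre = np.median(answer_matrix, axis=0)  # the 'majority' behaviour
    return np.linalg.norm(answer_matrix - centre, axis=1)

# Usage sketch: `reporters` would be reporter heads from many training runs;
# here they are toy stand-ins so the file runs on its own.
reporters = [lambda x, w=w: np.tanh(w * x[:4]) for w in np.linspace(0.5, 1.5, 20)]
scores = outlier_scores(probe_reporters(reporters))
candidates = np.argsort(scores)[-3:]  # the few most 'unusual' reporters
print("reporters to inspect first:", candidates)
```

The specific distance measure isn’t the point; the hope was only that whatever separates the minority of translators from the majority of imitators would show up as off-distribution outliers under some such probe.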
And that is why I was interested in whether there were any others using this approach who might have been more successful than me (if there were any, they probably would be).
EDIT to add:
[...] I think that almost all of the difficulty will be in how you can tell good from bad reporters by looking at them.
I agree it is difficult. But it’s also difficult to have overparameterized models converge on a guaranteed single point on the ridge of all good solutions, so not having to do that would be great. A conclusion of this contest could be, however, that we have no other option, because we definitively have no way of recognizing good from bad.
In the complex input space (outside the simple training set), translators should behave differently from imitators. Given that direct translators are created less frequently than imitators (which is what the report suggested), the smaller group, or the outliers, are more likely to be translators than imitators.
It seems like there are a ton of clusters though: there are exponentially many ways of modifying the human simulator to give different behavior off distribution, each of which is more likely than the direct translator (enough more likely that there is a whole cluster of equivalent variations, each of which is still individually more likely than all of the direct translators put together).
Thanks for your reply! You are right, of course. The argument was more about building up evidence.
But after thinking about it some more, I see now that any evidence gathered from a single known feature of priorless models (like the one I mentioned) would be so minuscule (approaching 0[1]) that you’d need to combine an unlikely number of features[2]. You’d end up in (a worse version of) an arms race akin to the ‘science it’ example/counterexample scenario mentioned in the report, and thus it’s a dead end. By extension, all priorless models, with or without a good training regime[3], with or without a good loss function[4], with or without some form of regularization[4], all of them are out.
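Rough numbers to show what I mean by “approaching 0” and “an unlikely number of features”. The prior and the per-feature likelihood ratio below are made-up illustrative values, not anything taken from the report:

```python
# Hedged back-of-envelope: how much would single features have to add up?
# All numbers are illustrative assumptions, not measurements.
prior_log_odds = -100    # suppose translators are ~2^100 times rarer than imitators
per_feature_bits = 1.0   # suppose each genuinely independent feature doubles the odds (1 bit)

features_needed = -prior_log_odds / per_feature_bits
print(f"independent features needed just to reach even odds: {features_needed:.0f}")
# -> 100 independent features; and footnote [4] is exactly the problem that most
#    candidate features would lean on the same underlying assumption, so they
#    don't stack the way this arithmetic pretends.
```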
So, I think it’s an interesting dead end. It reduces possible solutions to ELK to solutions that reduce the ridge of good solutions with a prior imposed by/on the architecture or statistical features of the model. In other words, it requires either non-overparameterized models or a way to reduce the ridge of good solutions in overparameterized models. Of the latter I have only seen good descriptions[5] but no solutions[6] (but do let me know if I missed something).
Do you agree?
[1] assuming a blackbox, overparameterized model
[2] Like finding a single needle in an infinite haystack. Even if your evidence hacks it in half, you’ll be hacking away a long time.
[3] in the case of the ELK contest, due to not being able to sample outside the human-understandable
[4] because they would depend on the same (assumption of) evidence/feature
[5] like this one
[6] I know you can see the initialization of parameters as a prior, but I haven’t seen a meaningful prior