Johannes Treutlein comments on Trying to Make a Treacherous Mesa-Optimizer

Johannes Treutlein 9 Feb 2023 2:19 UTC
LW: 10 AF: 5
I like the idea behind this experiment, but I find it hard to tell from this write-up what is actually going on. I.e., what exactly is the training setup, what exactly is the model, and which parts are hard-coded and which are learned? Why is it a weirdo janky thing instead of some other standard model or algorithm? It would be good if this were explained more in the post (it takes a lot of effort to piece this together by going through the code). Right now I have a hard time making any inferences from the results.