Chapter 4 in Bacteria to Bach is probably most relevant to what we discussed here (with preceding chapters providing a bit of context).
Yes, it would be interesting to see if causal influence diagrams (and the inference of incentives) could be useful here. Maybe there’s a way to infer the CID of the mesa-optimizer from the CID of the base-optimizer? I don’t have any concrete ideas at the moment, but I can be in touch if I think of something suitable for collaboration!