This was interesting to read, and I agree that this experiment should be done!
Speaking as another person who’s never really done anything substantial with ML, I do feel like this idea would be pretty feasible for a beginner with just a little experience under their belt. One of the first things that gets recommended to new researchers is “go reimplement an old paper,” and it seems like this wouldn’t require anything new as far as ML techniques go. If you want to upskill in ML, I’d say get a tiny bit of advice from someone with more experience, then go for it! (On the other hand, if the OP already knows they want to go into software engineering, AI policy, professional lacrosse, etc., I think someone else who wants to get ML experience should try this out!)
The mechanistic interpretability parts seem a bit harder to me, but Neel Nanda has been making some didactic posts that could get you started. (These posts might all be for transformers, but as you mentioned, I think your idea could be adapted into something a transformer could do. E.g., on each step the model gets a bunch of tokens representing the gridworld state, plus a token representing “what it hears,” which gets replaced by a constant unique token while it has earbuds in; and it has to output a token representing an action.)
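To make that concrete, here’s a rough sketch of what the encoding could look like. Everything in it (the cell types, the token ids, the action set) is made up for illustration, not taken from any existing codebase:

```python
# All names and token ids here are hypothetical, just to show the shape of it.
CELL_TOKENS = {"empty": 0, "wall": 1, "agent": 2, "goal": 3}
EARBUDS_IN = 4                               # constant "hearing" token with earbuds in
SOUND_TOKENS = {"silence": 5, "alarm": 6}    # what the agent hears otherwise
ACTION_TOKENS = {"up": 7, "down": 8, "left": 9, "right": 10}

def encode_step(grid, sound, earbuds_in):
    """Flatten the 2D grid into tokens, then append the hearing token."""
    grid_tokens = [CELL_TOKENS[cell] for row in grid for cell in row]
    hearing = EARBUDS_IN if earbuds_in else SOUND_TOKENS[sound]
    return grid_tokens + [hearing]

# A 2x2 example; the model would be trained to emit an ACTION_TOKENS id next.
print(encode_step([["agent", "empty"], ["wall", "goal"]], "alarm", earbuds_in=True))
# -> [2, 0, 1, 3, 4]
```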
Not sure what the best choice of model would be. I bet you can look at other AI safety gridworld papers and just do what they did (or even reuse their code). If you use transformers, Neel has a Python library you can just pick up and use (it was called EasyTransformer, and has since been renamed TransformerLens). As far as I know it doesn’t have built-in support for RL, but you can probably find a simple paper or codebase that does RL with transformers.
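For what it’s worth, spinning up a tiny from-scratch model in TransformerLens looks roughly like the following. I haven’t trained anything with it myself, so treat the hyperparameters as placeholders and double-check the config fields against the library’s docs (there’s no RL training loop here either, just a forward pass):

```python
# Rough sketch with TransformerLens (formerly EasyTransformer);
# all sizes below are guesses, not tuned values.
import torch
from transformer_lens import HookedTransformer, HookedTransformerConfig

cfg = HookedTransformerConfig(
    n_layers=2,      # a gridworld task probably doesn't need a deep model
    d_model=128,
    n_heads=4,
    d_head=32,
    n_ctx=32,        # enough positions for the grid tokens + hearing token
    d_vocab=16,      # covers the hypothetical vocabulary sketched above
    act_fn="relu",
)
model = HookedTransformer(cfg)

# One encoded step from the earlier sketch; the logits at the last position
# would be the policy head an RL algorithm (e.g. REINFORCE) trains against.
tokens = torch.tensor([[2, 0, 1, 3, 4]])  # shape [batch=1, seq_len=5]
logits = model(tokens)                    # shape [1, 5, d_vocab]
action_logits = logits[0, -1, 7:11]       # the four hypothetical action ids
```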