I noticed some time ago that there is a large overlap between the lines of hope mentioned in Garret Baker’s post and lines of hope I already had. The remaining things he mentions are lines of hope that I, at least, can’t antipredict, which is rare. It’s currently the top plan/model of alignment that I would want to read a critique of (to destroy or strengthen my hopes). Since no one else seems to have written that critique yet, I might write a post myself (leave a comment if you’d be interested in reviewing a draft or have feedback on the points below).
If singular learning theory is roughly correct in explaining confusing phenomena about neural nets (double descent, grokking), then what is confusing about these architectures falls out of fairly straightforward probability theory (see the sketch after these points). This would imply we should expect smaller differences in priors between humans and neural nets, because the inductive biases are less architecture-dependent.
The idea that something like “reinforcing shards” could remain stable if the model’s internals are part of the context during training, even without perfect interpretability.
The idea that maybe the two ideas above can stack: if training data is the most crucial ingredient for both humans and AI, then perhaps we can develop methods for comparing human brains and AI. If we get to the point of being able to do this in detail (a big if; especially on the neuroscience side this seems possibly hopeless?), then we could get further guarantees that the AI we are training is not a “psychopath”.
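To make the first point a bit more concrete, here is the standard result from singular learning theory that I have in mind; this is an illustrative sketch of Watanabe’s free energy asymptotic, not something taken from Baker’s post. The Bayesian free energy of a model class is dominated by the learning coefficient $\lambda$ (the real log canonical threshold), a geometric property of how the loss degenerates around its minima, rather than by the raw parameter count or architecture details:

$$ F_n = -\log \int e^{-n L_n(w)} \varphi(w)\, dw \;\approx\; n L_n(w^*) + \lambda \log n - (m-1)\log\log n + O_p(1), $$

where $L_n$ is the empirical negative log-likelihood, $w^*$ an optimal parameter, $\varphi$ the prior, $\lambda$ the learning coefficient, and $m$ its multiplicity. The relevance to the hope above: the posterior concentrates on low-$\lambda$ (more degenerate, “simpler”) solutions, and $\lambda$ is set by the geometry of the loss landscape induced by the data and function class. That is one reason to hope that two rather different learning systems trained on similar data could end up with similar effective simplicity biases.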
Quite possibly further reflection or feedback would change my mind, and counterarguments are appreciated. I am quite worried about motivated reasoning here: this plan would give me something tractable to work on, which might make it look better to me than it is. I also wonder to what extent people who plan to work on methods robust enough to survive a sharp left turn are pessimistic about lines of research like this only because of the capability externalities. I have a hard time evaluating the capability externalities of publishing research on plans like the above. If you are interested in writing a post about this, or in reading one, feel free to leave a comment.