James, thank you for a well-written comment. It was a pleasure to read. Looking forward to Seth’s response. Genuinely interested in hearing his thoughts.
Dakara
Karma: 112
I do believe that Shubhorup’s second edit provides another cause for concern. If you combine his proposal with steganography, the collusion team has another way of successfully circumventing Ryan’s safety strategy.
This collusion team’s strategy should also ideally be addressed.
This paper introduces a new 1-bit strategy for circumventing Ryan and Buck’s safety strategy. The collusion team achieves a 90%+ success rate.
I would really appreciate any input, especially from Ryan or his co-authors. This seems like a very important issue to address.
I wonder, have the comments managed to alleviate your concerns at all? Are there any promising ideas for multipolar AGI scenarios? Were there any suggestions that could work?
This paper argues that unintended deceptive behavior cannot be reliably detected by the probing method. According to the authors, probing fares no better than random guessing at detecting it.
I would really appreciate any input, especially from Monte or his co-authors. This seems like a very important issue to address.