Dakara

Karma: 112

Dakara Nov 18, 2024, 10:07 AM
1 point
0
AF
on: Simple probes can catch sleeper agents
This paper argues that unintended deceptive behavior cannot be detected by probing methods. According to its authors, probes fare no better than random guessing at detecting such behavior.

I would really appreciate any input, especially from Monte or his co-authors. This seems like a very important issue to address.

Dakara Nov 17, 2024, 10:19 PM
4 points
0
in reply to: James Stephen Brown’s comment on: If we solve alignment, do we die anyway?
James, thank you for a well-written comment. It was a pleasure to read. Looking forward to Seth’s response. Genuinely interested in hearing his thoughts.

Dakara 17 Nov 2024 21:58 UTC
1 point
0
in reply to: Shubhorup Biswas’s comment on: AI Control: Improving Safety Despite Intentional Subversion
I do believe that Shubhorup’s second edit provides another cause for concern. If you combine his proposal with steganography, the collusion team has another way of circumventing Ryan’s safety strategy.

Ideally, this collusion strategy should also be addressed.

Dakara 17 Nov 2024 21:48 UTC
2 points
0
on: AI Control: Improving Safety Despite Intentional Subversion
This paper introduces a new 1-bit strategy for circumventing Ryan and Buck’s safety strategy. The collusion team achieves a success rate above 90%.

I would really appreciate any input, especially from Ryan or his co-authors. This seems like a very important issue to address.

Dakara 28 Aug 2024 15:50 UTC
3 points
0
on: If we solve alignment, do we die anyway?
I wonder, have the comments managed to alleviate your concerns at all? Are there any promising ideas for multipolar AGI scenarios? Were there any suggestions that could work?