A paraphrasing is: how easy is it to make it look like you'll do a particular thing, when you'll really do a very different thing? Programmers encounter lots of reasons to think it's very easy: we miss bugs, many bugs are security bugs, and security bugs can be used to amplify small divergences from expected behavior into large ones.
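To make that concrete, here's a minimal sketch of the kind of bug I mean (the paths and names are invented for illustration): code that looks like it only serves files from one directory, where a classic path-traversal oversight turns a small input divergence into arbitrary filesystem reads.

```python
import os

BASE_DIR = "/srv/app/user_files"  # hypothetical base directory

def read_user_file(filename: str) -> bytes:
    # Looks like it only ever reads files under BASE_DIR...
    path = os.path.join(BASE_DIR, filename)
    with open(path, "rb") as f:
        return f.read()

# ...but the small divergence amplifies:
# read_user_file("../../../etc/passwd") resolves outside BASE_DIR entirely.

def read_user_file_checked(filename: str) -> bytes:
    # The fix: resolve the real path and reject anything escaping BASE_DIR.
    path = os.path.realpath(os.path.join(BASE_DIR, filename))
    if not path.startswith(os.path.realpath(BASE_DIR) + os.sep):
        raise ValueError("path escapes base directory")
    with open(path, "rb") as f:
        return f.read()
```

The first version passes a casual read and any test that uses ordinary filenames, which is exactly the point: the divergence is invisible until an adversary supplies the input that exploits it.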
But we're also just really bad at this, even as humans. Most of us don't deal with formal correctness proofs very well, and relatedly, our programming paradigms are predisposed to security bugs, because civilization invests almost nothing in improving them: programming languages are non-excludable goods. And most of us have no manufacturing experience, so we can't generalize what we know into the messy physical world. There's probably so much variation in this skill just among humans that I don't want to say anything about how objectively hard it is.
I wonder if it's worth training social agents in environments where they naturally need to do this sort of thing a lot (trusting each other on the basis of partial information about their code, architectures, and training processes). They may develop some very robust social technologies for keeping each other mutually transparent, which would also be useful to us for aligning them.
OSGT (open-source game theory) most naturally makes AIs cooperate against humans if the humans don't also understand how to formally bind themselves usefully, safely, and reliably. However, humans are able to make fairly strongly binding commitments to use strategies that condition on how others choose their strategies, and there are a bunch of less exact strategy-inference papers I could go hunt down that are pretty interesting.
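For concreteness, here's the standard toy construction from the program-equilibrium literature (the "CliqueBot" idea, sketched from memory rather than taken from any specific paper): strategies are programs that get to read each other's source before moving, so cooperation can be made conditional on the opponent's code.

```python
import inspect

# One-shot prisoner's dilemma where each strategy is a program that can
# read the other program's source code before choosing its move.

def clique_bot(my_source: str, opponent_source: str) -> str:
    # Binding commitment: cooperate exactly when the opponent is a copy of me.
    return "C" if opponent_source == my_source else "D"

def defect_bot(my_source: str, opponent_source: str) -> str:
    return "D"

def play(agent_a, agent_b) -> tuple[str, str]:
    src_a = inspect.getsource(agent_a)
    src_b = inspect.getsource(agent_b)
    return agent_a(src_a, src_b), agent_b(src_b, src_a)

print(play(clique_bot, clique_bot))  # ('C', 'C'): mutual cooperation
print(play(clique_bot, defect_bot))  # ('D', 'D'): conditioning keeps it safe
```

The conditioning is what does the work: against a copy you get mutual cooperation, and against anything else the commitment costs you nothing, which is why mutual source transparency changes the available equilibria.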
That wasn't what I was curious about at all. I already know that deception is fairly rampant in humans, and I expect some kinds will be present in agentic AI. I'm wondering whether the cooperation conditions require mutual perfect knowledge, and knowledge of that mutual comprehension.
Yeah, I suspect you might never be totally sure another part of a computer is what it claims to be. You can be almost certain, but totally certain? Perfect?
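For what it's worth, here's roughly what "almost certain" looks like in the simplest case (a bare hash check, not a real attestation protocol; the path and digest are placeholders): you can verify that a component's bytes match a hash you trust, but the check itself runs on machinery you haven't verified, so the certainty never bottoms out.

```python
import hashlib

# Digest obtained out of band for the component we expect to be running
# (placeholder value for illustration).
EXPECTED_SHA256 = "9f86d081884c7d659a2feaa0c55ad015a3bf4f1b2b0b822cd15d6c15b0f00a08"

def component_matches(path: str) -> bool:
    # Verify the on-disk code against the trusted hash.
    with open(path, "rb") as f:
        digest = hashlib.sha256(f.read()).hexdigest()
    return digest == EXPECTED_SHA256

# Even if this returns True, we've only verified bytes on disk: the running
# process, the interpreter, the OS, and this very check could all differ
# from what we inspected. The verification regresses instead of bottoming out.
```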