Value is fragile. The goal is not to alleviate every ounce of discomfort; the goal is to make the future awesome. My guess is that that involves leaving people with real decisions that have real consequences, that it involves giving people the opportunity to screw up, that it involves allowing the universe to continue to be a place of obstacles that people must overcome.
This is AGI optimizing for human empowerment.
that’s probably part of it, agreed. it’s probably not even close to closing the key holes in the loss landscape, though. you have a track record of calling important shots and people would do well to take you seriously, but at the same time, they’d do well not to assume you have all the answers. upvote and agree. how does that solve program equilibria in osgt, though? how does it bound the worst mistake, how does it bound the worst powerseeking? how does it ensure defensibility?
I’m not sure what you mean by “solve program equilibria in osgt”—partly because I’m not sure what ‘osgt’ means.
Optimizing for human/external empowerment doesn’t bound the worst mistakes the agent can make. If by powerseeking you mean the AI seeking its own empowerment, the AI may need to do that in the near term, but in the long term that is an opposing and rather obviously unaligned utility function. Navigating that transition tradeoff is where much of the difficulty seems to lie—but I expect that to be true of any viable solution. Not sure what you mean by defensibility.
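For concreteness, “empowerment” in this literature is usually formalized as the channel capacity between an agent’s action sequences and the resulting future states; with deterministic dynamics the n-step version reduces to the log of the number of distinct states the agent can reach. A minimal toy sketch of that (the gridworld, names, and numbers are all illustrative, not from anything linked in this thread):

```python
import itertools
import math

# Toy deterministic gridworld: states are (x, y) cells, actions move the
# agent, and movement is clipped at the walls. Purely illustrative.
GRID = 5
ACTIONS = [(0, 1), (0, -1), (1, 0), (-1, 0), (0, 0)]

def step(state, action):
    """Deterministic transition: move, then clip to the grid."""
    x, y = state
    dx, dy = action
    return (min(max(x + dx, 0), GRID - 1), min(max(y + dy, 0), GRID - 1))

def n_step_empowerment(state, n):
    """For deterministic dynamics, n-step empowerment -- the channel capacity
    max_p(a) I(actions; resulting state) -- reduces to log2 of the number of
    distinct states reachable in n steps."""
    reachable = set()
    for plan in itertools.product(ACTIONS, repeat=n):
        s = state
        for a in plan:
            s = step(s, a)
        reachable.add(s)
    return math.log2(len(reachable))

# A corner is less empowered than the center: fewer futures stay open.
print(n_step_empowerment((0, 0), 2))  # log2(6)  ~ 2.58 bits
print(n_step_empowerment((2, 2), 2))  # log2(13) ~ 3.70 bits
```

An AGI optimizing human empowerment would be pushing the analogous quantity up for the humans’ action channels rather than for its own, which is where the near-term/long-term tension above comes from.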
program equilibria in open-source game theory: once a model is strong enough to make exact mathematical inferences about the implications of where the approximator’s learned behavior actually landed after training, the game theory of that reflection gets incredibly weird. this is where much of the decision-theory material comes up, and the reason we haven’t run into it already is that current models are large enough to be really hard to prove anything through. Related work, new and old (a toy program-equilibrium sketch follows the list):
https://arxiv.org/pdf/2208.07006.pdf—Cooperative and uncooperative institution designs: Surprises and problems in open-source game theory—near the top of my to-read list; by Andrew Critch, who has some other posts on the topic, especially the good ol’ “Open Source Game Theory is weird”, and several more recent ones I haven’t read properly at all
https://arxiv.org/pdf/2211.05057.pdf—A Note on the Compatibility of Different Robust Program Equilibria of the Prisoner’s Dilemma
https://arxiv.org/pdf/1401.5577.pdf—Robust Cooperation in the Prisoner’s Dilemma: Program Equilibrium via Provability Logic
https://www.semanticscholar.org/paper/Program-equilibrium-Tennenholtz/e1a060cda74e0e3493d0d81901a5a796158c8410?sort=pub-date—the paper that introduced OSGT, with papers citing it sorted by recency
also interesting https://www.semanticscholar.org/paper/Open-Problems-in-Cooperative-AI-Dafoe-Hughes/2a1573cfa29a426c695e2caf6de0167a12b788ef and https://www.semanticscholar.org/paper/Foundations-of-Cooperative-AI-Conitzer-Oesterheld/5ccda8ca1f04594f3dadd621fbf364c8ec1b8474
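To make “program equilibrium” concrete: in Tennenholtz’s setup each player submits a program, every program receives the other players’ source code as input, and the equilibrium analysis is over programs rather than actions. Here’s a minimal sketch of the classic toy result, a source-matching bot for which mutual cooperation in the one-shot prisoner’s dilemma is an equilibrium of the program game (names and payoffs are illustrative; the provability-logic bots in the LaVictoire/Critch papers above are the more interesting, harder-to-run variants):

```python
import inspect

# Toy program game in the spirit of Tennenholtz's program equilibrium:
# each player submits a program that receives the opponent's source code
# and returns "C" (cooperate) or "D" (defect). Run as a script so that
# inspect.getsource can read the function bodies.

def clique_bot(opponent_source: str) -> str:
    """Cooperate iff the opponent is running exactly this program."""
    return "C" if opponent_source == inspect.getsource(clique_bot) else "D"

def defect_bot(opponent_source: str) -> str:
    return "D"

# Standard prisoner's dilemma payoffs (row player, column player).
PAYOFFS = {("C", "C"): (3, 3), ("C", "D"): (0, 5),
           ("D", "C"): (5, 0), ("D", "D"): (1, 1)}

def play(p1, p2):
    a1 = p1(inspect.getsource(p2))
    a2 = p2(inspect.getsource(p1))
    return PAYOFFS[(a1, a2)]

print(play(clique_bot, clique_bot))  # (3, 3): mutual cooperation
print(play(clique_bot, defect_bot))  # (1, 1): deviating to defect_bot earns less,
                                     # so (clique_bot, clique_bot) is an equilibrium
```

The weirdness pointed at above starts when the “programs” are enormous learned approximators and the other player has to prove things about them rather than string-compare them.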
This also connects through to putting neural networks in formal verification systems. The summary right now is that it’s possible but doesn’t scale to current model sizes. I expect scalability to surprise us.
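For a sense of what putting a neural network into a verification system looks like at the very small end, here’s a minimal sketch of interval bound propagation, one of the standard sound-but-incomplete primitives: if the propagated bounds satisfy the property, it holds for every input in the box; if not, the result is inconclusive. The weights and the box are made up for illustration; complete solvers (the Reluplex/Marabou line behind the Katz-lab links below) decide such queries exactly, at much higher cost.

```python
import numpy as np

# Tiny two-layer ReLU network with made-up weights.
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 2)), rng.normal(size=4)
W2, b2 = rng.normal(size=(1, 4)), rng.normal(size=1)

def ibp_bounds(x_lo, x_hi):
    """Propagate an axis-aligned input box through affine and ReLU layers,
    returning sound lower/upper bounds on the network output."""
    def affine(lo, hi, W, b):
        center, radius = (lo + hi) / 2, (hi - lo) / 2
        c = W @ center + b
        r = np.abs(W) @ radius
        return c - r, c + r
    lo, hi = affine(x_lo, x_hi, W1, b1)
    lo, hi = np.maximum(lo, 0), np.maximum(hi, 0)  # ReLU is monotone
    return affine(lo, hi, W2, b2)

lo, hi = ibp_bounds(np.array([-0.1, -0.1]), np.array([0.1, 0.1]))
print(f"for every input in the box, the output lies in [{lo[0]:.3f}, {hi[0]:.3f}]")
```

The scaling problem is that the boxes get looser with depth and width, which is part of why “possible but doesn’t scale yet” is the current summary.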
Bounding worst mistake: preventing adversarial examples and generalization failures. There’s plenty of work on this in general, but in particular I’m interested in certified bounds. (Though those usually turn out to rest on some sort of unhelpfully restrictive premise.)
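As a concrete example of both the certificate and the restrictive premise: a Lipschitz-margin bound, where the product of layer spectral norms upper-bounds the network’s global Lipschitz constant and the logit margin divided by twice that constant gives an l2 radius inside which the prediction provably cannot flip. Everything below is illustrative rather than taken from any of the linked papers, and on realistically sized networks the spectral-norm product is usually far too loose to certify a useful radius.

```python
import numpy as np

# Tiny two-layer ReLU classifier with made-up weights.
rng = np.random.default_rng(1)
W1 = rng.normal(size=(8, 4)) * 0.3
W2 = rng.normal(size=(3, 8)) * 0.3

def logits(x):
    return W2 @ np.maximum(W1 @ x, 0.0)

# ReLU is 1-Lipschitz, so the product of the layers' spectral norms
# upper-bounds the network's global l2 Lipschitz constant.
L = np.linalg.norm(W1, 2) * np.linalg.norm(W2, 2)

def certified_radius(x):
    """Each logit moves by at most L * ||delta||_2 under a perturbation delta,
    so the top-two gap cannot close while 2 * L * ||delta||_2 < margin."""
    z = np.sort(logits(x))
    margin = z[-1] - z[-2]
    return margin / (2 * L)

x = rng.normal(size=4)
print(f"class {np.argmax(logits(x))} is certified for any l2 perturbation "
      f"smaller than {certified_radius(x):.4f}")
```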
there are tons of papers I could link here that I haven’t evaluated deeply, but you can find a lot of them by following citations from https://www.katz-lab.com/research—in particular:
Verifying Generalization in Deep Learning
gRoMA: a Tool for Measuring Deep Neural Networks Global Robustness
here’s what’s on my to-evaluate list in my “ai formal verification and hard robustness” tag in semanticscholar: https://arxiv.org/pdf/2302.04025.pdf https://arxiv.org/pdf/2304.03671.pdf https://arxiv.org/pdf/2303.10513.pdf https://arxiv.org/pdf/2303.03339.pdf https://arxiv.org/pdf/2303.01076.pdf https://arxiv.org/pdf/2303.14564.pdf https://arxiv.org/pdf/2303.07917.pdf https://arxiv.org/pdf/2304.01218.pdf https://arxiv.org/pdf/2304.01826.pdf https://arxiv.org/pdf/2304.00813.pdf https://arxiv.org/pdf/2304.01874.pdf https://arxiv.org/pdf/2304.03496.pdf https://arxiv.org/pdf/2303.02251.pdf https://arxiv.org/pdf/2303.14961.pdf https://arxiv.org/pdf/2301.11374.pdf https://arxiv.org/pdf/2303.10024.pdf — most of these are probably not that amazing, but some of them seem quite interesting. would love to hear which stand out to anyone passing by!
(and I didn’t even mention reliable ontological grounding in the face of arbitrarily large ontological shifts due to fundamental representation corrections)
I’m replying as much for anyone else who’d ask the same question as for you in particular; I imagine you’ve seen some of this stuff in passing before. Hope the detail helps anyway! I’m replying in multiple comments to organize the responses better. I’ve unvoted all my own comments but this one so it shows up as the top reply to start with.
inappropriate powerseeking: seeking to achieve empowerment of the AI over empowerment of others, in order to reach adversarial peaks of the reward model, etc.; i.e., “you asked for collaborative powerseeking and instead got deceptive alignment due to an adversarial hole in your model”. Some recent papers that try to formalize this in terms of RL (a toy sketch of the core idea follows the list):
https://arxiv.org/pdf/2304.06528.pdf—Power-seeking can be probable and predictive for trained agents—see also the lesswrong post
https://arxiv.org/pdf/2206.13477.pdf—Parametrically Retargetable Decision-Makers Tend To Seek Power—see also the lesswrong post
The Boundaries sequence is also relevant to this
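For the gist of the formalization in those papers: roughly, a state’s “power” is the expected optimal value attainable from it under a distribution over reward functions, so states that keep more options open score higher for most rewards, and policies selected for most rewards tend to steer into them. A toy sketch of that idea (the MDP, the uniform reward distribution, and the lack of normalization are all made up for illustration; the papers’ exact definitions differ):

```python
import numpy as np

# Tiny deterministic MDP: state 0 is a hub with three choices, state 1 is
# a dead end that self-loops, states 2 and 3 just self-loop as well.
successors = {0: [1, 2, 3], 1: [1], 2: [2], 3: [3]}
n_states = len(successors)
GAMMA = 0.9
rng = np.random.default_rng(0)

def optimal_value(reward):
    """Value iteration for a deterministic MDP with state-based rewards."""
    V = np.zeros(n_states)
    for _ in range(100):
        V = np.array([reward[s] + GAMMA * max(V[s2] for s2 in successors[s])
                      for s in range(n_states)])
    return V

def power_estimate(n_samples=1000):
    """Average optimal value per state over uniformly sampled reward functions."""
    return np.mean([optimal_value(rng.uniform(size=n_states))
                    for _ in range(n_samples)], axis=0)

print(power_estimate())
# The hub (state 0) comes out ahead of the dead end (state 1): keeping
# options open is instrumentally useful for most reward functions, which
# is the sense in which power-seeking is "probable and predictive".
```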
(also, having adversarial holes in behavior makes the OSGT branch of concern look like “smart model reads the weights of your vulnerable model and pwns it” rather than any sort of agentically intentional cooperation.)