There’s something I’ve seen some rationalists try for, which I think Eliezer might be aiming at here, which is to try and be a truly robust agent.
Be the sort of person that Omega (even a version of Omega who’s only 90% accurate) can clearly tell is going to one-box.
Be the sort of agent who cooperates when it is appropriate, defects when it is appropriate, and can realize that cooperating-in-this-particular-instance might look superficially like defecting, but avoid falling into a trap.
Be the sort of agent who, if some AI engineers were whiteboarding out the agent’s decision making, they would see that the agent makes robustly good choices, such that those engineers would choose to implement that agent as software and run it.
Not sure if that’s precisely what’s going on here but I think it is at least somewhat related. If your day job is designing agents that could be provably friendly, it suggests the question of “how can I be provably friendly?”
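To make the 90%-accurate Omega case concrete, here is a rough sketch of the arithmetic. It assumes the standard Newcomb payoffs ($1,000,000 in the opaque box, $1,000 in the transparent one; neither figure appears in the comment above) and treats Omega's accuracy as the chance it correctly predicts whichever choice you make. The function names are made up for illustration, and this is the simple expected-payoff comparison, not a settled answer to the decision-theory dispute.

```python
def newcomb_expected_payoffs(accuracy, big_prize=1_000_000, small_prize=1_000):
    """Expected payoff of each disposition against a predictor of given accuracy.

    Assumptions (not from the original comment): the standard Newcomb payoffs,
    and that `accuracy` is the chance the predictor correctly foresees
    whichever choice the agent makes.
    """
    # One-boxer: the opaque box is full only when the predictor was right.
    one_box = accuracy * big_prize
    # Two-boxer: always gets the small prize, plus the big one only when
    # the predictor wrongly expected one-boxing.
    two_box = (1 - accuracy) * big_prize + small_prize
    return one_box, two_box


def break_even_accuracy(big_prize=1_000_000, small_prize=1_000):
    """Accuracy at which both dispositions have equal expected payoff."""
    # Solve accuracy * big = (1 - accuracy) * big + small for accuracy.
    return (big_prize + small_prize) / (2 * big_prize)


if __name__ == "__main__":
    one_box, two_box = newcomb_expected_payoffs(0.9)
    print(f"one-box: {one_box:,.0f}  two-box: {two_box:,.0f}")
    print(f"break-even accuracy: {break_even_accuracy():.4f}")
```

Under these assumptions, being a legible one-boxer comes out at roughly $900,000 against roughly $101,000 for two-boxing, and Omega only needs to be slightly better than a coin flip (about 50.05%) for one-boxing to come out ahead.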
There’s something I’ve seen some rationalists try for, which I think Eliezer might be aiming at here, which is to try and be a truly robust agent.
This is very close to what I’ve always seen as the whole intent of the Sequences. I also feel like there’s a connection here to what I see as a bidirectional symmetry of the Sequences’ treatment of human rationality and Artificially Intelligent Agents. I still have trouble phrasing exactly what it is I feel like I notice here, but here’s an attempt:
As an introductory manual on improving the Art of Human Rationality, the hypothetical perfectly rational, internally consistent, computationally Bayes-complete superintelligence is used as the Platonic Ideal of a Rational Intelligence, and the Sequences ground many of Rationality’s tools, techniques, and heuristics as approximations of that fundamentally non-human ideal evidence processor.
or in the other direction:
As an introductory guide to building a Friendly Superintelligence, the Coherent Extrapolated Human Rationalist, a model developed from intuitively appealing rational virtues, is used as a guide for what we want optimal intelligent agents to look like, and the Sequences as a whole are about taking this more human grounding, and justifying it as the basis on which to guide the development of AI into something that works properly, and something that we see as Friendly.
Maybe that’s not the best description, but I think there’s something there and that it’s relevant to this idea of trying to use rationality to be a “truly robust agent”. In any case I’ve always felt there was an interesting parallel with how the Sequences can be seen as “A Manual For Building Friendly AI” based on rational Bayesian principles, or “A Manual For Teaching Humans Rational Principles” based on an idealized Bayesian AI.
There’s something I’ve seen some rationalists try for, which I think Eliezer might be aiming at here, which is to try and be a truly robust agent.
I really like this phrasing. Previously, I’ve just had a vague sense that there are ways to have a lot more integrity such that many situations get a huge utility bump, which can only be achieved through very clear thinking. Your comment helped me make that a lot more concrete.