Thanks for the reply :) I’ll try to convey some of my thinking, but I don’t expect great success. I’m working on more digestible explainers, but this is a work in progress, and I have nothing good that I can point people to as of now.
> (...) part of the explanation here might be “if the world is solved by AI, we do actually think it will probably be via doing some concrete action in the world (e.g., build nanotech), not via helping with alignment (...)
Yeah, I guess this is where a lot of the differences in our perspective are located.
> if the world is solved by AI, we do actually think it will probably be via doing some concrete action in the world (e.g., build nanotech)
Things have to cash out in terms of concrete actions in the world. Maybe one point of contention is the level of indirection we imagine in our heads (by which we try to obtain systems that can help us do concrete actions).
Prominent in my mind are scenarios that involve a lot of iterative steps (but over a short amount of time) before we start evaluating systems by doing AGI-generated experiments. In the earlier steps, we avoid doing any actions in the “real world” that are influenced in a detailed way by AGI output, and we avoid having real humans be exposed to AGI-generated argumentation.
Examples of stuff we might try to obtain:
- AGI “lie detector techniques” (maybe something that is in line with the ideas of Collin Burns)
- Argument/proof evaluators (this is an interest of mine, but making better explainers is still a work in progress, and I have some way to go)
- If we are good at program-search, this can itself be used to obtain programs that help us be better at program-search (finding functions that score well according to well-defined criteria)
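To gesture at what I mean by that last item, here is a minimal toy sketch in Python (every name and the scoring criterion are made up by me purely for illustration, and the “candidates” are hand-written stand-ins for what would really come from a far more capable search process). The point is just that program-search is selection according to a well-defined score, and the same selection loop can be pointed at candidate search heuristics themselves:

```python
import random

# Toy stand-in for "program-search": select, from a pool of candidate
# functions, the one that scores best according to a well-defined criterion.
# In reality the candidates would come from a far more capable search
# process; here they are just hand-written Python functions.

def program_search(candidates, score):
    """Return the candidate that scores highest under a well-defined criterion."""
    return max(candidates, key=score)

# A well-defined scoring criterion: how often does a candidate sorting
# function reproduce Python's sorted() on a fixed batch of random tests?
def sorting_score(candidate):
    random.seed(0)
    tests = [[random.randint(0, 100) for _ in range(10)] for _ in range(50)]
    return sum(candidate(list(t)) == sorted(t) for t in tests)

candidate_sorters = [
    lambda xs: xs,                  # does nothing
    lambda xs: list(reversed(xs)),  # wrong order
    lambda xs: sorted(xs),          # correct
]

best_sorter = program_search(candidate_sorters, sorting_score)

# The same selection loop can be pointed at the search process itself:
# candidate "search heuristics" are scored by how good the program they
# return is (again, a well-defined criterion).
def heuristic_score(heuristic):
    return sorting_score(heuristic(candidate_sorters, sorting_score))

candidate_heuristics = [
    lambda cands, score: cands[0],               # ignores the score entirely
    lambda cands, score: max(cands, key=score),  # proper selection
]

best_heuristic = program_search(candidate_heuristics, heuristic_score)
print(sorting_score(best_sorter), heuristic_score(best_heuristic))  # 50 50
```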
Some tasks can be considered to be inside of “test-range”[1]:
- Predicting human answers to questions posed by other humans[2]
- Outputting prime numbers[3]
- Predicting experimental results from past experimental data[4]
- Whether a chess-move is good[5]
- Etc., etc.
Other tasks are outside of “test-range”:
- Predicting human answers to any question (including questions that involve being exposed to AGI-generated content)[6]
- Whether a given instruction-plan actually results in machines that copy strawberries at the molecular level (and does so in accordance with “the spirit” of the request that was given)
- Etc., etc.
Most requests that actually are helpful to us are outside of test-range. And when the requirements that matter to us are outside of test-range, it is of course harder to test in a safe/reliable way whether systems are giving us what we want.
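To make “inside of test-range” a bit more concrete, here is a toy sketch (the task and all names are just illustrative examples of mine): for a task like outputting prime numbers, there is a cheap, trusted, mechanical check, and no human ever has to read or be persuaded by free-form AGI-generated content.

```python
# Toy illustration of a task that is inside "test-range": the untrusted
# output can be checked mechanically, without a human ever reading or being
# persuaded by free-form AGI-generated content. The task (outputting primes)
# and all names here are just made-up examples.

def is_prime(n: int) -> bool:
    """Trusted, human-written primality check."""
    if n < 2:
        return False
    d = 2
    while d * d <= n:
        if n % d == 0:
            return False
        d += 1
    return True

def check_prime_outputs(claimed_primes) -> bool:
    """Safe/reliable test: do the claimed outputs actually satisfy the requirement?"""
    return all(isinstance(n, int) and is_prime(n) for n in claimed_primes)

# Pretend this list came from an untrusted program; we only ever inspect it
# via the trusted checker above.
untrusted_output = [2, 3, 5, 7, 11, 13]
print(check_prime_outputs(untrusted_output))  # True

# By contrast, "does this instruction-plan safely result in strawberry-copying
# machines?" has no such mechanical check, and so is outside test-range.
```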
But we can have AGIs output programs that help us with tasks, and we can define requirements[7] for these programs. And for these program-requirements, AGIs can help us explore stuff such as:
- Are there programs that satisfy the requirements but disagree about certain outputs? (be that outputs that are inside of test-range or outside of test-range)
- Are there programs that satisfy the requirements, but perform poorly for certain inputs that are inside of test-range?
- If we only require correct outputs within the entire “test-range”, is it nonetheless infeasible to make programs that satisfy the requirements while giving bad output for one or more inputs within the test-range?
Phrased somewhat differently: We may come up with techniques for making it hard not to give helpful/honest outputs[8], and test how well these seem to generalize.
And similar techniques can also be applied to program-requirement-requirements, program-requirement-requirement-requirements, etc[9].
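Here is a toy sketch of the sort of exploration I have in mind (the “programs”, the requirement, and all the names are deliberately simplistic placeholders I made up): given a machine-checkable requirement over programs, we can look for pairs of programs that both satisfy the requirement but disagree on some input inside test-range, which would reveal that the requirement doesn’t “force” the behavior we care about.

```python
from itertools import combinations

# Toy "programs": candidate implementations of integer squaring.
def f_good(x):
    return x * x

def f_good2(x):
    return x ** 2

def f_bad(x):
    return x * x if x != 7 else 0  # sabotaged on one particular input

CANDIDATES = [f_good, f_good2, f_bad]

# A (deliberately weak) machine-checkable requirement over programs:
# "the program gives correct outputs on a small spot-check".
def satisfies_requirement(program) -> bool:
    return all(program(x) == x * x for x in [0, 1, 2, 3])

# Inside "test-range": inputs where we can safely and reliably judge outputs.
TEST_RANGE = range(0, 20)

def find_disagreement(candidates):
    """Look for two requirement-satisfying programs that disagree inside test-range.

    Finding such a pair shows that the requirement does not "force" the
    behavior we care about; this is the kind of thing we would want helper
    systems to be selected for pointing out.
    """
    satisfying = [p for p in candidates if satisfies_requirement(p)]
    for p, q in combinations(satisfying, 2):
        for x in TEST_RANGE:
            if p(x) != q(x):
                return p.__name__, q.__name__, x
    return None

print(find_disagreement(CANDIDATES))  # ('f_good', 'f_bad', 7): the requirement is too weak
```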
Anyway, it’s hard to write about this kind of stuff in a concrete way without going off on long tangents, and it’s hard to write about it in an abstract way without being vague and convoluted. For whatever it’s worth, these are some principles I think may be helpful (very much not a complete list):
- Generality as it correlates with measurables (such as source code length or computational resource use): When writing a program to do well for many inputs (while fulfilling certain requirements), intentionally having it fail for particular inputs will sometimes require it to be less general (e.g. adding extra if-statements, resulting in a longer minimum source code length for the program to exhibit this behavior). See the toy sketch after this list.
- Optimization-targets that incentivize modularity make it harder to make programs that intentionally fail for particular inputs: In a more modular system, it will less often be clear from sub-routine inputs what the macro-level task is[10].
- When a requirement or optimization-target fails at “forcing” good outputs, AGIs that act so as to be selected for are “incentivized” to reveal this as being the case: Suppose some set of requirements is compatible with bad outputs inside “test-range”. Or suppose that some set of requirements allows for programs that give contradictory outputs. If we do “gradient descent”, looking for AGIs that help us explore spaces of programs defined by requirements that we specify, we will select for AGIs that point out this kind of stuff.
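As a toy illustration of the first principle above (everything here is a made-up example, and measuring the length of one hand-written program is of course only a loose proxy for minimum source code length):

```python
import inspect

# General program: no special-casing.
def double(x):
    return 2 * x

# Program made to intentionally fail on one particular input: it needs an
# extra if-statement to single that input out.
def double_sabotaged(x):
    if x == 1234:
        return 0
    return 2 * x

# Comparing the lengths of these two hand-written functions is only a loose
# proxy for "minimum source code length", but it gestures at the idea:
# intentional, targeted failure costs extra machinery.
print(len(inspect.getsource(double)) < len(inspect.getsource(double_sabotaged)))  # True
```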
Anyway, I’m not sure how much or how little this attempt at conveying my thoughts achieves. It’s a bit of a dilemma for me whether to try to convey anything at all now, or to wait until I have completed better explainers that I can point people to (as the explainers I have now fall short).
I guess an incomplete summary of my perspective might be:
- There are lots of techniques we may apply for getting help with alignment while greatly limiting the extent to which we interact with AGI output in ways that are dangerous.
- Not being able to directly evaluate the outputs we are interested in does not necessarily preclude us from finding indirect methods of evaluation that are powerful.
- We may do program-search for AGIs that help us explore possible program-requirements, and the spaces of programs (and program outputs) that are compatible with those program-requirements (and program-requirement-requirements, etc). We may learn a lot from patterns we observe relating to these “spaces”.
- Sometimes there are possible patterns such that, if we know ourselves to be sufficiently capable at program-search, P(requirement R1 “forces” outputs to be “good” | we observe pattern P1) could reasonably be deemed high, even if we start out with no strong opinion about P(requirement R1 “forces” outputs to be “good”) on its own.
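A toy numerical illustration of that last point (the numbers are invented purely to show the structure of the update, not to estimate anything): even starting from an indifferent prior on whether R1 “forces” good outputs, observing a pattern P1 that would be much more likely if R1 does force good outputs can make the conditional probability high.

```python
# Toy Bayes update; all numbers are invented to illustrate the structure of
# the reasoning, not to estimate anything.

prior_forces = 0.5  # no strong opinion about P(R1 "forces" good outputs)

# Likelihoods of observing pattern P1 under each hypothesis. P1 stands for
# something like: "a search process we independently know to be strong found
# no pair of R1-satisfying programs that disagree inside test-range".
p_pattern_given_forces = 0.95      # expected if R1 really does force good outputs
p_pattern_given_not_forces = 0.05  # surprising if it doesn't

posterior = (p_pattern_given_forces * prior_forces) / (
    p_pattern_given_forces * prior_forces
    + p_pattern_given_not_forces * (1 - prior_forces)
)
print(round(posterior, 3))  # 0.95
```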
[1] By which I mean something akin to: “We have a safe and reliable way of testing if the output fulfills the requirements in question”. Admittedly, if we haven’t solved some of the technical challenges related to containment, even playing tic-tac-toe can constitute some risk. But here I mean to focus on risk from interacting with AGI output, not risk from simply having AGI systems run (and them maybe finding a security vulnerability in the computer systems they are running on).

[2] (We can pick random predictions to test, and we can have AIs competitively point out predictions made by other AIs that they think are miscalibrated.)

[3] (We can write programs that take alleged prime numbers as input and test whether they really are prime.)

[4] (Or from experiments that haven’t been run yet, but that humans have come up with, such that there is little room for AGIs to somehow influence the experiment to itself be dangerous.)

[5] (We may not be capable of evaluating chess-moves ourselves, but we may safely have AGI-systems play chess against each other, indicating something akin to: “insofar as we feel confident that we are good at program-search, we can feel confident that we can obtain programs that output good chess-moves”.)

[6] There are more and less dangerous ways to do this. With less dangerous ways of doing it, experiments (where actual humans answer questions) are done with humans who are temporarily isolated from the rest of the world (and who certainly aren’t AGI-system operators).

[7] Such requirements may relate to:
- How the program is constructed
- Observables relating to the source code (source code length, etc.)
- Restrictions the source code must adhere to
- Whether the program is accepted by a given verifier (or by any verifier that itself fulfills certain requirements)
- “Proofs” of various kinds relating to the program
- Tests of program outputs that the program must be robust against
- Etc.

[8] By “making it hard” I mean something like “hard to do while being the sort of program we select for when doing program-search”. Kind of like how it’s not “hard” for a chess program to output bad chess-moves, but it is hard for it to do that while also being the kind of program we continue to select for while doing “gradient descent”.

[9] In my view of things, this is a very central technique (it may appear circular somehow, but when applied correctly, I don’t think it is). But it’s hard for me to talk about it in a concrete way without going off on tangents, and it’s hard for me to talk about it in an abstract way without being vague. Also, my texts become more convoluted when I try to write about this, and I think people often just gloss over it.

[10] One example of this: if we are trying to obtain argument evaluators, the argumentation/demonstrations/proofs these evaluators evaluate should be organized into small and modular pieces, such that it’s not clear from any given piece what the macro-level conclusion is.
I’m a bit unsure about how to interpret you here.
In my original comment, I used terms such as positive/optimistic assumptions and simplifying assumptions. When doing that, I meant to refer to simplifying assumptions that were made so as to abstract away some parts of the problem.
Good point (I should have written my comment in such a way that pointing this out didn’t feel necessary).
I guess this is more central to what I was trying to communicate than whether it is expressed in terms of a utility function per se.
In this tweet, Eliezer writes:
“The idea with agent foundations, which I guess hasn’t successfully been communicated to this day, was finding a coherent target to try to get into the system by any means (potentially including DL ones).”
Based on e.g. this talk from 2016, I get the sense that when he says “coherent target” he means targets that relate to the non-digital world. But perhaps that’s not the case (or perhaps it’s sort of the case, but more nuanced).
Maybe I’m making this out to have been a bigger part of their work than what actually was the case.