Thinking through the “vast majority of problem-space for X fails” argument; assume we have a random text generator from which we want a program that sorts a list:
The vast majority of generated programs don’t sort (or aren’t even compilable)
The vast majority of programs that “look like they work” don’t (e.g. a forgotten semicolon, not accounting for an already-sorted list, etc)
Generalizing: the vast majority of programs that pass [unit tests, compilation, human says “looks good to me”, simplicity] still don’t work.
A program could be incomprehensible, pass several unit tests, and still fail on weird edge cases (e.g. when an input value is [84, >100, a prime > 13, etc], it spits out gibberish); a toy sketch of this failure mode follows this block.
This is also a counterargument to the alignment check of “run it in a simulation to see if it breaks out of the box”, since that check is just another proxy.
Some of the constraints above are necessary, like being compilable, and some aren’t, like being comprehensible: a randomly generated sorting algorithm could be really hard to understand (e.g. written in Brainfuck, or 10,000 lines of mostly redundant code whose effects happen to cancel out) and still sort correctly.
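A minimal Python sketch of that failure mode (the function, its specific bug, and the unit tests are my own hypothetical illustration, nothing from the talk): a one-pass bubble “sort” that compiles, reads plausibly, and passes a few hand-picked unit tests, yet fails on an input the proxy checks never probed.

```python
def looks_like_sort(xs):
    """One pass of adjacent swaps: superficially sort-like, actually wrong."""
    xs = list(xs)
    for i in range(len(xs) - 1):
        if xs[i] > xs[i + 1]:
            xs[i], xs[i + 1] = xs[i + 1], xs[i]
    return xs

# Hand-picked unit tests that happen to pass (short or nearly-sorted inputs).
assert looks_like_sort([]) == []
assert looks_like_sort([1]) == [1]
assert looks_like_sort([2, 1]) == [1, 2]
assert looks_like_sort([1, 3, 2, 4]) == [1, 2, 3, 4]

# An edge case the proxies never touched: a single pass isn't enough here.
print(looks_like_sort([3, 2, 1]))  # prints [2, 1, 3], not [1, 2, 3]
```

Every proxy listed above is satisfied here (it compiles, passes the tests, and looks fine to a human skimming it), while the property we actually care about isn’t.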
To relate this to the original talk: I agree that I can recognize my own values once I reflect on them, but that is different from seeing a plan for an AI that keeps my values and thinking “this looks like it works”. In other words, “preserves human values” shouldn’t be assumed to be a strict subset of “human says it looks like it works”, just as “correctly sorts” isn’t a strict subset of “human says it looks like it works”, because of incomprehensibility.
For programs specifically, if the program is simple and passes a relevant distribution of unit tests, we can be highly confident it does in fact sort correctly, but what’s the equivalent for “plan that maintains human values”? Say John succeeds and finds what we think are the generators of human values: would they be comprehensible enough to verify?
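For contrast, here is a sketch of what “simple and passes a relevant distribution of unit tests” buys us for sorting (again my own illustration, with an assumed randomized test harness): a short, readable implementation plus random tests of the two properties that jointly define a correct sort, namely that the output is ordered and is a permutation of the input.

```python
import random
from collections import Counter

def simple_sort(xs):
    """Insertion sort: short enough to read and verify by eye."""
    out = []
    for x in xs:
        i = 0
        while i < len(out) and out[i] <= x:
            i += 1
        out.insert(i, x)
    return out

# Randomized tests over a distribution of inputs, checking both defining
# properties of "sorts correctly": ordered output, same multiset of elements.
for _ in range(10_000):
    xs = [random.randint(-100, 100) for _ in range(random.randint(0, 50))]
    ys = simple_sort(xs)
    assert all(ys[i] <= ys[i + 1] for i in range(len(ys) - 1))
    assert Counter(ys) == Counter(xs)
```

The question above is what the analogue of these two checkable properties would be for “plan that maintains human values”.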
Applying the argument again, but to John’s proposed solution: the vast majority of plans & behaviors from [AIs trained in human environments with what we think are the simple generators of human values] may look good without actually being good. Or the weights are incomprehensible, so we fall back on unit-test-style checks to verify, and it could still fail.
Counter-counterargument: I can imagine these generators being simple enough that we can indeed be confident they do what we want. Since they should be human-value-equivalent, they should also be human-interpretable (under reflection?).
This sounds like a good idea overall, but I wouldn’t bet my life on it. It’d be nice to have necessary and sufficient conditions for this possible solution.