Restarting an earlier thread with a clean slate.
Let’s define the scientific difficulty D(P) of a scientific problem P as “the approximate number of years of trial-and-error effort that humanity would need to solve P, if P were considered an important problem to solve”. He estimates D(alignment) at about 50 years, but his whole point is that for alignment this particular metric is meaningless, because trial and error is not an option. This is just meant as a counterargument to somebody saying “alignment does not seem to be much harder than X, and we solved X”. His reply is: yes, D(X) turned out to be about 50 years, and on raw scientific difficulty D(alignment) might well be of the same order of magnitude, but unlike X, alignment cannot be solved via trial and error, so the comparison with X is not actually informative.
This is the opposite of considering a trial-and-error solution scenario for alignment as an actual possibility.
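Schematically, and purely as my own restatement of that structure (the 50 is his rough estimate, nothing precise):

```latex
D(X) \approx 50~\text{years, known only in hindsight}
\qquad
D(\text{alignment}) \sim D(X)\ \text{in order of magnitude, plausibly}
```

The catch is that D counts years of *trial and error*, and trial and error is the one resource alignment doesn’t give us, so the second estimate offers no practical comfort.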
Does this make sense?
That makes perfect sense, thank you. And maybe, if we’ve already got the necessary utility function, stability under self-improvement might be solvable as if it were just a really difficult maths problem. It doesn’t look that difficult to me, a priori, to change your cognitive abilities whilst keeping your goals.
AlphaZero got its giant inscrutable matrices by working from a straightforward start of ‘checkmate is good’. I can imagine something like AlphaZero designing a better AlphaZero (AlphaOne?) and handing over the clean definition of ‘checkmate is good’ and trusting its successor to work out the details better than it could itself.
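Not AlphaZero itself, obviously, but here’s a toy sketch of the shape I have in mind, where the only thing each “generation” inherits is the fixed terminal reward. The game, the helper names, and the training method are all made up for illustration:

```python
import random

TARGET = 21  # toy game: players alternately add 1, 2 or 3; whoever reaches 21 (or more) first wins

def terminal_reward(mover_won):
    """The 'checkmate is good' part: the clean definition handed to every generation."""
    return 1.0 if mover_won else -1.0

def train_generation(reward_fn, games=20000):
    """Relearn a value table from a blank slate via random self-play."""
    sums, counts = {}, {}
    for _ in range(games):
        total, positions = 0, []
        while total < TARGET:
            positions.append(total)           # position the current mover faced
            total += random.choice([1, 2, 3])
        last = len(positions) - 1             # index of the winning move
        for i, pos in enumerate(positions):
            r = reward_fn(mover_won=(i % 2 == last % 2))
            sums[pos] = sums.get(pos, 0.0) + r
            counts[pos] = counts.get(pos, 0) + 1
    return {p: sums[p] / counts[p] for p in sums}

# Three "generations": each one inherits only terminal_reward, never the parent's table.
for gen in range(3):
    table = train_generation(terminal_reward)
    # From 20 the player to move always wins immediately, so its learned value should be +1.
    print(f"generation {gen}: value(20) = {table.get(20, 0.0):+.2f}")
```

The only point of the sketch is that the goal definition stays three lines long while everything a successor actually knows lives in the table it relearns for itself.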
I get cleverer if I use pencil and paper; it doesn’t seem to redefine what’s good when I do. And no-one stopped liking diamonds when we worked out that carbon atoms weren’t fundamental objects.
---
My point is that the necessary utility function is the hard bit. It doesn’t look anything like a maths problem to me, *and* we can’t sneak up on it iteratively with a great mass of patches until it’s good enough.
We’ve been paying a reasonable amount of attention to ‘what is good?’ for at least two thousand years, and in all that time no-one has come up with anything remotely sensible-sounding.
I would doubt that the question meant anything, if it were not that I can often say which of two possible scenarios I prefer. And I notice that other people often have the same preference.
I do think that Eliezer thinks that, given the Groundhog Day version of the problem (restart every time you do something that doesn’t work out), we’d be able to pull it off.
I doubt that even that’s true. ‘Doesn’t work out’ is too nebulous.
But at this point I guess we’re talking only about Eliezer’s internal thoughts, and I have no insight there. I was attacking a direct quote from the podcast, but maybe I’m misinterpreting something that wasn’t meant to bear much weight.