The motivation seems trivial to me, which might be part of the problem. Most training procedures are obviously outer-misaligned, so if we have one that might plausibly be outer-aligned (and might plausibly scale), that seems like an obvious reason to take it seriously. I felt like I totally got that as soon as I first understood what IDA is trying to do.
Does the experience I had reading the post make sense?
It does, but it still leaves me with the problem that it doesn’t seem to be connected to the rest of the sequence. IDA isn’t about how we take an already trained system with human-level performance and use it for good things; it’s about how we train a system from the ground up.
The real problem may be that I expect a post like this to tie closely into the actual scheme, so when it doesn’t, I take that as evidence that I’ve misunderstood something. What this post talks about may just not be intended to be [the problem the rest of the sequence is trying to solve].
The motivation seems trivial to me, which might be part of the problem.
Yeah, for a long time many people have been very confused about the motivation of Paul’s research, so I don’t think you’re typical in this regard. I think that due to this sequence, Alex Zhu’s FAQ, lots of writing by Evan, Paul’s post “What Failure Looks Like”, and more posts on LW by others, many more people now understand Paul’s work on a basic level, which was not at all the case 3 years ago. Like, you say “Most training procedures are obviously outer-misaligned”, and ‘outer-alignment’ was not a concept with a name or a write-up at the time this sequence was published.
I agree that this post assumes human-level performance, whereas much of iterated amplification relaxes that assumption. My sense is that if someone were to just read this sequence, it would still help them focus on the brunt of the problem, namely that being ‘helpful’ or ‘useful’ is not well-defined in the way many other tasks are, and help them see the urgency of this task and why it’s possible to make progress on it now.
That makes me feel a bit like the student who thinks they can debate the professor after having researched the field for 5 minutes.
But it would actually be a really good sign if the area has become more accessible.
Hah!
Yeah, I think we’ve made substantial progress in the last couple of years.