My main claim is that the 2nd gap, the one between what AGIs actually value and what their creators value, is likely far smaller.
I also think that the gap between what AGIs value and what maximizes performance in training isn’t actually very large, because we can create much more robust reward functions that encode a whole lot of human values purely from data.
One very crucial advantage we have over evolution is that our goals are much more densely defined, constraining the AI far more than evolution could, since for evolution very, very sparse reward was the norm. Critically, sparse-reward RL does not work for capabilities right now, and there are reasons to think it will remain far less tractable than RL where rewards are more densely specified.
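To make the “dense reward learned from data” point a bit more concrete, here’s a minimal sketch (my own toy illustration, not anyone’s actual pipeline; the sizes and names are made up) of a preference-based reward model: it’s trained purely on human comparison data and then scores every candidate output, so the RL signal is dense per-response feedback rather than something like a once-per-lifetime fitness count.

```python
# Minimal sketch of a reward model learned from preference data (illustration only).
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    def __init__(self, vocab_size=50_000, d_model=256):
        super().__init__()
        # Stand-in encoder; a real reward model would reuse a pretrained LM backbone.
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.score_head = nn.Linear(d_model, 1)

    def forward(self, token_ids):
        # token_ids: (batch, seq_len) -> one scalar reward per sequence: (batch,)
        h = self.encoder(self.embed(token_ids))
        return self.score_head(h.mean(dim=1)).squeeze(-1)

def preference_loss(rm, chosen_ids, rejected_ids):
    """Bradley-Terry loss: push r(chosen) above r(rejected) for each human-labeled pair."""
    return -F.logsigmoid(rm(chosen_ids) - rm(rejected_ids)).mean()
```

The point of the sketch is just that the reward here is per-output and learned from data the designers chose, not a sparse criterion like reproductive success.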
idk if our goals are much more densely defined. Inclusive genetic fitness is approximately what, like 1-3 standard reproductive cycles? “Number of grandkids” basically? So like fifty years?
Humans trying to make their AGI helpful, harmless, and honest… well, technically the humans have longer-term goals, because we care a lot about e.g. whether the AGI will put humanity on a path to destruction even if that path takes a century to complete. But I agree that in practice, if we can get the behavior we’d ideally want over the course of the next year, that’s probably good enough. Possibly even for shorter periods, like a month. Also, separately, our design cycle for the AGIs is more like months than hours or years. Months is how long the biggest training runs take, for one thing.
So I’d say the comparison is like 5 months to 50 years, a 2-OOM difference in calendar time. But AIs read, write, and think much faster than humans. In those 5 months, they’ll do serial thinking (and learning, during a training run) that is probably deeper/longer than a 50-year human lifetime, no? (Not sure how to think about the parallel computation advantage)
idk, my point is that it doesn’t seem like a huge difference to me. It probably matters, but I’d want to see it modelled and explained more carefully, and then measurements done to quantify it.
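For what it’s worth, here’s the back-of-the-envelope version of that comparison (the serial-speedup factor is a pure placeholder I made up, which is exactly the kind of number I’d want actually measured):

```python
# Back-of-the-envelope version of the 5-months-vs-50-years comparison (illustrative only).
import math

human_horizon_years = 50        # roughly the "number of grandkids" timescale
ai_design_cycle_years = 5 / 12  # ~5-month training run / feedback loop

calendar_ratio = human_horizon_years / ai_design_cycle_years
print(f"calendar ratio: {calendar_ratio:.0f}x  (~{math.log10(calendar_ratio):.1f} OOM)")
# -> 120x, i.e. about 2 orders of magnitude

# If the AI thinks serially some factor faster than a human (pure assumption below),
# the gap in subjective/serial time shrinks or even flips:
serial_speedup = 100  # hypothetical placeholder
print(f"subjective ratio at {serial_speedup}x speedup: {calendar_ratio / serial_speedup:.1f}x")
```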
Hm, inclusive genetic fitness is a very non-local criterion, at least as it’s often assumed to be on LW. A lot of the standard alignment failures people talk about, like birth control and sex, took about 10,000-300,000 years to appear, and in general, with the exception of bacteria or extreme selection pressure, timescales of thousands of years are the norm for mammals and other animals to accumulate enough selection pressure to develop noticeable traits. So compared to evolution, there’s a 4-5+ OOM difference in calendar time.
IGF is often assumed to be the inclusive genetic fitness of all genes for all time; otherwise, the problems that are usually trotted out become far weaker evidence that alignment problems will arise when we try to align AIs to human values.
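Quick sanity check on the “4-5+ OOMs” figure, just plugging the timescales above against a ~5-month design cycle:

```python
# Rough check of the "4-5+ OOMs" claim, using the timescales cited above.
import math

evolutionary_timescales_years = [10_000, 300_000]  # range cited for birth-control-style "failures"
ai_design_cycle_years = 5 / 12                     # ~5-month training run

for t in evolutionary_timescales_years:
    ratio = t / ai_design_cycle_years
    print(f"{t:>7,} years vs ~5 months: {ratio:>9,.0f}x  (~{math.log10(ratio):.1f} OOM)")
# -> ~24,000x (~4.4 OOM) and ~720,000x (~5.9 OOM), consistent with "4-5+ OOMs"
```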
But there’s a second problem that exists independently of the first: there are other differences between how we can control AIs and how evolution controlled humans. See:
https://www.lesswrong.com/posts/wAczufCpMdaamF9fy/my-objections-to-we-re-all-gonna-die-with-eliezer-yudkowsky#sYA9PLztwiTWY939B
The important parts are these:
You can say that evolution had an “intent” behind the hardcoded circuitry, and humans in the current environment don’t fulfill this intent. But I don’t think evolution’s “intent” matters here. We’re not evolution. We can actually choose an AI’s training data, and we can directly choose what rewards to associate with each of the AI’s actions on that data. Evolution cannot do either of those things.
Evolution does this very weird and limited “bi-level” optimization process, where it searches over simple data labeling functions (your hardcoded reward circuitry), then runs humans as an online RL process on whatever data they encounter in their lifetimes, with no further intervention from evolution whatsoever (no supervision, re-labeling of misallocated rewards, gathering more or different training data to address observed issues in the human’s behavior, etc.). Evolution then marginally updates the data labeling functions for the next generation. It’s a fundamentally different type of thing than an individual deep learning training run.
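To spell out the structural contrast the quote is pointing at, here’s a toy sketch (entirely my own illustration; every function and number in it is made up, and it isn’t meant as a model of real evolution): the bi-level loop only gets to nudge a crude hardcoded reward circuit between lifetimes, while a single training run picks the data and the per-example rewards directly.

```python
# Toy contrast between bi-level "evolution" and a single designer-controlled training run.
import random

def lifetime_online_rl(reward_fn, n_steps=500):
    """Inner loop: the agent learns from whatever it happens to encounter,
    with no supervision, relabeling, or data curation from the outer optimizer."""
    policy = 0.0
    for _ in range(n_steps):
        situation = random.random()                    # uncontrolled environment data
        action = policy * situation
        policy += 0.01 * reward_fn(situation, action)  # crude policy update
    return policy

# Bi-level "evolution": blind mutations to a hardcoded reward circuit, once per generation.
reward_weight = 1.0
for generation in range(50):
    circuit = lambda s, a, w=reward_weight: w * (1.0 - abs(a - 0.5))
    final_policy = lifetime_online_rl(circuit)   # no intervention during the lifetime
    reward_weight += random.gauss(0.0, 0.05)     # marginal update for the next generation

# Single "lab" training run: the designer chooses every datapoint and its reward directly,
# and could pause, relabel, or add data mid-run in response to observed behaviour.
curated_data = [(x / 100.0, +1.0 if x < 50 else -1.0) for x in range(100)]
policy = 0.0
for situation, reward in curated_data:
    policy += 0.01 * reward
```

The asymmetry that matters for the quoted argument is that in the second loop every reward is chosen directly by the designer, and the run can be supervised, relabeled, or re-run at will.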