There are many alignment properties that humans exhibit, such as valuing real-world objects, being corrigible, not wireheading if given the chance, not suffering ontological crises, and caring about sentient life (not everyone has these values, of course). I believe the post’s point that studying the mechanisms behind how these values form is more informative than other sources of info. Looking at the post:
the inner workings of those generally intelligent apes is invaluable evidence about the mechanistic within-lifetime process by which those apes form their values, and, more generally, about how intelligent minds can form values at all.
Humans can provide a massive amount of info on how highly intelligent systems come to value things in the real world. There are guaranteed-to-exist mechanisms behind why humans value real-world things, and mechanisms behind the variance in human values, and the post argues we should look at these mechanisms first (if we’re able to). I predict that a mechanistic understanding would enable the kind of knowledge described below:
I aspire for the kind of alignment mastery which lets me build a diamond-producing AI, or if that didn’t suit my fancy, I’d turn around and tweak the process and the AI would press green buttons forever instead, or—if I were playing for real—I’d align that system of mere circuitry with humane purposes.
I think it can be worthwhile to look at those mechanisms. In my original post I’m just pointing out that people may have done so more than you’d naively think if you only consider whether their alignment approaches mimic the human mechanisms, because it’s quite likely that they’ve concluded the mechanisms they’ve come up with for humans don’t work.
Secondly, I think with some of the examples you mention, we do have the core idea of how to robustly handle them. E.g. valuing real-world objects and avoiding wireheading seem to almost come “for free” with model-based agents.
On your first point, I do think people have thought about this before and determined it doesn’t work. But from the post:
If it turns out to be currently too hard to understand the aligned protein computers, then I want to keep coming back to the problem with each major new insight I gain. When I learned about scaling laws, I should have rethought my picture of human value formation—Did the new insight knock anything loose? I should have checked back in when I heard about mesa optimizers, about the Bitter Lesson, about the feature universality hypothesis for neural networks, about natural abstractions.
Humans do display many, many alignment properties, and unlocking that mechanistic understanding is 1,000x more informative than other methods. Though this may not be worth arguing about until you’ve read the actual posts laying out the mechanistic understandings (the genome post and future ones), and we could argue about specifics then?
If you’re convinced by them, then you’ll understand the reaction of “Fuck, we’ve been wasting so much time and studying humans makes so much sense” described in this post (e.g. TurnTrout’s idea on corrigibility and his statement “I wrote this post as someone who previously needed to read it.”). What I’m saying here is that my arguing “you should feel this way now, before being convinced by specific mechanistic understandings” doesn’t make sense when stated this way.
Secondly, I think with some of the examples you mention, we do have the core idea of how to robustly handle them. E.g. valuing real-world objects and avoiding wireheading seem to almost come “for free” with model-based agents.
Link? I don’t think we know how to use model-based agents to e.g. tile the world in diamonds even given unlimited compute, but I’m open to being wrong.
Humans do display many, many alignment properties, and unlocking that mechanistic understanding is 1,000x more informative than other methods. Though this may not be worth arguing about until you’ve read the actual posts laying out the mechanistic understandings (the genome post and future ones), and we could argue about specifics then?
If you’re convinced by them, then you’ll understand the reaction of “Fuck, we’ve been wasting so much time and studying humans makes so much sense” described in this post (e.g. TurnTrout’s idea on corrigibility and his statement “I wrote this post as someone who previously needed to read it.”). What I’m saying here is that my arguing “you should feel this way now, before being convinced by specific mechanistic understandings” doesn’t make sense when stated this way.
That makes sense. I mean if you’ve found some good results that others have missed, then it may be very worthwhile. I’m just not sure what they look like.
Link? I don’t think we know how to use model-based agents to e.g. tile the world in diamonds even given unlimited compute, but I’m open to being wrong.
I’m not aware of any place where it’s written up; I’ve considered writing it up myself, because it seems like an important and underrated point. But basically the idea is that if you’ve got an accurate model of the system and a value function that is a function of the latent state of that model, then you can pick a policy that you expect to increase the true latent value (optimization), rather than a policy that increases the latent value it infers from its observations (wireheading). Such a policy would not be interested in interfering with its own sense-data, because that would interfere with its ability to optimize the real world.
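A minimal sketch of this distinction, assuming a toy, fully known world model (the transition, latent_value, observed_reward, and plan names below are made up for illustration, not taken from any existing library or paper): a planner that scores futures by the model’s latent state keeps making diamonds, while one that scores futures by what the sensor reports prefers to hack the sensor.

```python
# Toy illustration of "value the latent state, not the observation".
# The agent plans against a known model; its value function reads the
# latent state directly, so tampering with the sensor buys it nothing.

import itertools

def transition(state, action):
    """Assumed-known world model: returns the next latent state."""
    diamonds, sensor_hacked = state
    if action == "make_diamond":
        diamonds += 1
    elif action == "hack_sensor":
        sensor_hacked = True
    return (diamonds, sensor_hacked)

def latent_value(state):
    """Value as a function of the latent state: real diamonds only."""
    diamonds, _ = state
    return diamonds

def observed_reward(state):
    """What the sensor reports: hacking the sensor maxes this out."""
    diamonds, sensor_hacked = state
    return 10**6 if sensor_hacked else diamonds

def plan(state, horizon=3, use_latent_value=True):
    """Exhaustive search over action sequences, scored either by the
    predicted latent value or by the predicted sensor reading."""
    actions = ["make_diamond", "hack_sensor", "noop"]
    best_seq, best_score = None, float("-inf")
    for seq in itertools.product(actions, repeat=horizon):
        s = state
        for a in seq:
            s = transition(s, a)
        score = latent_value(s) if use_latent_value else observed_reward(s)
        if score > best_score:
            best_seq, best_score = seq, score
    return best_seq

start = (0, False)
print(plan(start, use_latent_value=True))   # keeps making diamonds
print(plan(start, use_latent_value=False))  # wireheads by hacking the sensor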
I don’t think we know how to write an accurate model of the universe together with a function over its state that computes the amount of diamond, even given infinite compute, so I don’t think this can be used to solve the diamond-tiling problem.
The place where I encountered this idea (a value function over the latent state of the model, rather than over observations) was Learning What to Value (Daniel Dewey, 2010).
“Reward Tampering Problems and Solutions in Reinforcement Learning” describes how to do what you outlined.
I think it might be a bit dangerous to use the metaphor/terminology of mechanism when talking about the processes that align humans within a society. That is a very complex and complicated environment that I find very poorly described by the term “mechanisms”.
When considering how humans align and how that might inform AI alignment, what stands out most for me is that alignment is a learning process and probably needs to start very early in the AI’s development: don’t start by training the AI to maximize things, but to learn what it means to be aligned with humans. I’m guessing this has been considered, and is probably a bit difficult to implement. It is probably also worth noting that we have a whole legal system that serves to reinforce cultural norms, along with the reactions of others one interacts with.
While commenting on something I really shouldn’t be, if the issue is the runaway paperclip AI that consumes all resources making paperclips, then I don’t really see that as a big problem. It is a design failure, but the solution seems to be to not give any AI a single focus for maximization. Make them more like a human consumer, who has a near-inexhaustible set of things it gets value from (and I don’t think these are as closely linked as standard econ describes, even if the equilibrium condition still holds: the per-monetary-unit marginal utilities are equalized). That type of structure also ensures that those maximize-along-one-axis outcomes are not realistic. I think the risk here is similar to that of addiction for humans.
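For reference, the equilibrium condition mentioned in the parenthetical is the standard consumer-theory equimarginal condition (nothing AI-specific is assumed here): for a utility function U over goods x_1, ..., x_n with prices p_i and budget m, an interior optimum equalizes marginal utility per unit of money across goods.

```latex
% Equimarginal condition: marginal utility per monetary unit is equal across
% goods at an interior optimum, subject to the budget constraint.
\[
\frac{\partial U / \partial x_1}{p_1}
  = \frac{\partial U / \partial x_2}{p_2}
  = \cdots
  = \frac{\partial U / \partial x_n}{p_n},
\qquad
\sum_{i=1}^{n} p_i x_i = m .
\]
```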
Seems like this wouldn’t really help; the AI would just consume all resources making whichever basket of goods you ask it to maximize.
The problem with a paperclip maximizer isn’t the part where it makes paperclips; making paperclips is OK as paperclips have nonzero value in human society. The problem is the part where it consumes all available resources.
I think that oversimplifies what I was saying, but I accept that I did not elaborate either.
Consuming all available resources is not an economically sensible outcome (unless one defines available resources very narrowly), so you are in effect saying the AI is not an economically informed AI. That doesn’t seem too difficult to address.
If the AI is making output that humans value and it follows some simple economic rules, then that gross overproduction and exhaustion of all available resources is not very likely at all. At some point the basket holds more than anyone wants, so production costs exceed the value of the output and the AI should settle into a steady-state mode (see the toy contrast sketched after this comment).
Now, if the AI doesn’t care at all about humans and doesn’t act in anything that resembles what we would recognize as normal economic behavior, you might get all resources consumed. But I’m not sure it is correct to think an AI would simply not be some type of economic agent, given that so many of the equilibrating forces in economics seem to have parallel processes in other areas.
Does anyone have a pointer to an argument in which the AI does consume all resources, which also explains why the usual economics of the environment don’t hold? Or, put a bit differently, why the economics are so different that the outcome is rational?
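To make the disagreement concrete, here is a toy contrast (my own illustration; the functional forms and the per-unit cost c are made-up stand-ins): a steady state falls out when the objective itself includes diminishing value and production costs, and fails to exist when the objective is effectively unbounded in the quantity produced.

```latex
% Case 1: the objective internalizes diminishing value and a per-unit cost c,
% so an interior optimum (a "steady state" production level) exists:
\[
V_1(q) = \log(1+q) - c\,q
\;\Rightarrow\;
V_1'(q) = \frac{1}{1+q} - c = 0
\;\Rightarrow\;
q^{*} = \frac{1}{c} - 1 \quad (\text{finite for } 0 < c < 1).
\]
% Case 2: the objective is effectively linear in quantity and does not
% internalize the cost, so marginal value never drops below marginal cost
% from the agent's point of view:
\[
V_2(q) = q
\;\Rightarrow\;
V_2'(q) = 1 > 0 \ \text{for all } q,
\]
% and the only bound on production is the total resources the agent can
% acquire (the scenario the paperclip argument assumes).
```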