Wait, but surely RL-developed shards that work like human values are the biomimicry approach here, and designing a value learning scheme top-down is the modernist approach. I think this metaphor has its wires crossed.
I wasn’t intending for a metaphor of “biomimicry” vs “modernist”.
(Claim 1) Wings can’t work in space because there’s no air. The lack of air is a fundamental reason for why no wing design, no matter how clever it is, will ever solve space travel.
If TurnTrout is right, then the equivalent statement is something like (Claim 2) “reward functions can’t solve alignment because alignment isn’t maximizing a mathematical function.”
The difference between Claim 1 and Claim 2 is that we have a proof of Claim 1, and therefore don’t bother debating it anymore, while with Claim 2 we only have an arbitrarily long list of examples for why reward functions can be gamed, exploited, or otherwise fail in spectacular ways, but no general proof yet for why reward functions will never work, so we keep arguing about a Sufficiently Smart Reward Function That Definitely Won’t Blow up as if that is a thing that can be found if we try hard enough.
As of right now, I view “shard theory” sort of like a high-level discussion of chemical propulsion without the designs for a rocket or a gun. I see the novelty of it, but I don’t understand how you would build a device that can use it. Until someone can propose actual designs for hardware or software that would implement “shard theory” concepts without just becoming an obfuscated reward function prone to the same failure modes as everything else, it’s not incredibly useful to me. However, I think it’s worth engaging with the idea because if correct then other research directions might be a dead-end.
Does that help explain what I was trying to do with the metaphor?
Until someone can propose actual designs for hardware or software that would implement “shard theory” concepts without just becoming an obfuscated reward function prone to the same failure modes as everything else, it’s not incredibly useful to me. However, I think it’s worth engaging with the idea because if correct then other research directions might be a dead-end.
Yeah, but on the other hand, I think this is looking for essential differences where they don’t exist. I made a comment similar to this on the previous post. It’s not like one side is building rockets and the other side is building ornithopters—or one side is advocating building computers out of evilite, while the other side says we should build the computer out of alignmentronium.
“reward functions can’t solve alignment because alignment isn’t maximizing a mathematical function.”
Alignment doesn’t run on some nega-math that can’t be cast as an optimization problem. If you look at the example of the value-child who really wants to learn a lot in school, I admit it’s a bit tricky to cash this out in terms of optimization. But if the lesson you take from this is “it works because it really wants to succeed, this is a property that cannot be translated as maximizing a mathematical function,” then I think that’s a drastic overreach.
I realize that my position might seem increasingly flippant, but I really think it is necessary to acknowledge that you’ve stated a core assumption as a fact.
Alignment doesn’t run on some nega-math that can’t be cast as an optimization problem.
I am not saying that the concept of “alignment” is some bizarre meta-physical idea that cannot be approximated by a computer because something something human souls etc, or some other nonsense.
However the assumption that “alignment is representable in math” directly implies “alignment is representable as an optimization problem” seems potentially false to me, and I’m not sure why you’re certain it is true.
There exist systems that can be 1.) represented mathematically, 2.) perform computations, and 3.) do not correspond to some type of min/max optimization, e.g. various analog computers or cellular automaton.
I don’t think it is ridiculous to suggest that what the human brain does is 1.) representable in math, 2.) in some type of way that we could actually understand and re-implement it on hardware / software systems, and 3.) but not as an optimization problem where there exists some reward function to maximize or some loss function to minimize.
There exist systems that can be 1.) represented mathematically, 2.) perform computations, and 3.) do not correspond to some type of min/max optimization, e.g. various analog computers or cellular automaton.
You don’t even have to go that far. What about, just, regular non-iterative programs? Are type(obj) or json.dump(dict) or resnet50(image) usefully/nontrivially recast as optimization programs? AFAICT there are a ton of things that are made up of normal math/computation and where trying to recast them as optimization problems isn’t helpful.
Wait, but surely RL-developed shards that work like human values are the biomimicry approach here, and designing a value learning scheme top-down is the modernist approach. I think this metaphor has its wires crossed.
I wasn’t intending for a metaphor of “biomimicry” vs “modernist”.
(Claim 1) Wings can’t work in space because there’s no air. The lack of air is a fundamental reason for why no wing design, no matter how clever it is, will ever solve space travel.
If TurnTrout is right, then the equivalent statement is something like (Claim 2) “reward functions can’t solve alignment because alignment isn’t maximizing a mathematical function.”
The difference between Claim 1 and Claim 2 is that we have a proof of Claim 1, and therefore don’t bother debating it anymore, while with Claim 2 we only have an arbitrarily long list of examples for why reward functions can be gamed, exploited, or otherwise fail in spectacular ways, but no general proof yet for why reward functions will never work, so we keep arguing about a Sufficiently Smart Reward Function That Definitely Won’t Blow up as if that is a thing that can be found if we try hard enough.
As of right now, I view “shard theory” sort of like a high-level discussion of chemical propulsion without the designs for a rocket or a gun. I see the novelty of it, but I don’t understand how you would build a device that can use it. Until someone can propose actual designs for hardware or software that would implement “shard theory” concepts without just becoming an obfuscated reward function prone to the same failure modes as everything else, it’s not incredibly useful to me. However, I think it’s worth engaging with the idea because if correct then other research directions might be a dead-end.
Does that help explain what I was trying to do with the metaphor?
Have you read A shot at the diamond alignment problem? If so, what do you think of it?
Yeah, but on the other hand, I think this is looking for essential differences where they don’t exist. I made a comment similar to this on the previous post. It’s not like one side is building rockets and the other side is building ornithopters—or one side is advocating building computers out of evilite, while the other side says we should build the computer out of alignmentronium.
Alignment doesn’t run on some nega-math that can’t be cast as an optimization problem. If you look at the example of the value-child who really wants to learn a lot in school, I admit it’s a bit tricky to cash this out in terms of optimization. But if the lesson you take from this is “it works because it really wants to succeed, this is a property that cannot be translated as maximizing a mathematical function,” then I think that’s a drastic overreach.
I realize that my position might seem increasingly flippant, but I really think it is necessary to acknowledge that you’ve stated a core assumption as a fact.
I am not saying that the concept of “alignment” is some bizarre meta-physical idea that cannot be approximated by a computer because something something human souls etc, or some other nonsense.
However the assumption that “alignment is representable in math” directly implies “alignment is representable as an optimization problem” seems potentially false to me, and I’m not sure why you’re certain it is true.
There exist systems that can be 1.) represented mathematically, 2.) perform computations, and 3.) do not correspond to some type of min/max optimization, e.g. various analog computers or cellular automaton.
I don’t think it is ridiculous to suggest that what the human brain does is 1.) representable in math, 2.) in some type of way that we could actually understand and re-implement it on hardware / software systems, and 3.) but not as an optimization problem where there exists some reward function to maximize or some loss function to minimize.
You don’t even have to go that far. What about, just, regular non-iterative programs? Are
type(obj)
orjson.dump(dict)
orresnet50(image)
usefully/nontrivially recast as optimization programs? AFAICT there are a ton of things that are made up of normal math/computation and where trying to recast them as optimization problems isn’t helpful.