I was actually just going to post something about curiosity myself! But my argument for why it works is more concrete, and imo less handwavey (sorry): an AI that wants to maximize its rate of learning, and that doesn’t simply fall into wireheading (let’s assume that part away for the moment), will tend to create, and perhaps avoid destroying, complex systems that it has trouble predicting and that evolve according to emergent principles. Life fits that description better than non-life does, and intelligence fits it even better.
I think curiosity in the sense of seeking a high rate of learning automatically leads to interest in other living and intelligent organisms and the desire for them to exist or continue existing. However, that doesn’t solve alignment, because it could be curious about what happens when it pokes us—or it may not realize we’re interesting until after it’s already destroyed us.
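To make the “rate of learning” idea a bit more concrete: in RL, this flavor of curiosity is usually operationalized as an intrinsic reward equal to the prediction error of a learned forward model, so the agent gets paid for transitions it can’t yet predict. Here’s a toy one-dimensional sketch of that; ForwardModel, curiosity_reward, and the two example “worlds” are names I made up for illustration, not anything from the thread:

```python
import random


class ForwardModel:
    """Toy 1-D forward model: predicts the next observation from the current one."""

    def __init__(self, lr=0.5):
        self.w = 0.0   # single weight: next_obs is modeled as w * obs
        self.lr = lr

    def predict(self, obs):
        return self.w * obs

    def update(self, obs, next_obs):
        error = next_obs - self.predict(obs)
        # normalized LMS step (stable for lr < 2)
        self.w += self.lr * error * obs / (obs * obs + 1e-8)
        return abs(error)


def curiosity_reward(model, obs, next_obs):
    """Intrinsic reward: how surprised the model was by this transition."""
    return model.update(obs, next_obs)


if __name__ == "__main__":
    # A fully predictable process (next_obs = 2 * obs) vs. a noisy one.
    predictable = [(x, 2 * x) for x in range(1, 20)]
    noisy = [(x, 2 * x + random.gauss(0, 5)) for x in range(1, 20)]

    model = ForwardModel()
    boring = sum(curiosity_reward(model, o, n) for o, n in predictable)

    model = ForwardModel()
    interesting = sum(curiosity_reward(model, o, n) for o, n in noisy)

    print(f"total curiosity reward, predictable world: {boring:.1f}")
    print(f"total curiosity reward, noisy world:       {interesting:.1f}")
```

The toy demo only shows that a process the model can’t fully compress keeps paying out reward after a predictable one has stopped, which is the (still handwavey) reason a learning-rate-maximizing agent might prefer to keep hard-to-predict things like living systems around.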
Interesting idea, but it seems risky. Would life be the only, or even the primary, complex system that such an AI would avoid interfering with?
Further, a curiosity-based AI seems likely to intentionally create or seek out complexity, which could be risky. Think of how kids love to say “I want to go to the moon!” or “I want to go to every country in the world!” (I mean, I do too, and I’m an adult.) Surely a curiosity-based AI would push to fairly extreme limits to satisfy its own curiosity, at the expense of other values.
Maybe such an AGI could have, like… an allowance? “Never spend more than 1% of your resources on a single project,” or something? But I have absolutely no idea how you could define a consistent notion of a “single project”.
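For what it’s worth, the allowance itself is the easy part to sketch, if you just assume away the hard part (deciding what counts as one project). Something like the toy ledger below, where ResourceLedger, try_spend, and the project names are all invented for illustration:

```python
MAX_PROJECT_FRACTION = 0.01  # "never spend more than 1% of total resources on a single project"


class ResourceLedger:
    """Toy per-project spending cap. Assumes spending can be attributed to a project at all."""

    def __init__(self, total_resources):
        self.total = total_resources
        self.spent_by_project = {}

    def try_spend(self, project_id, amount):
        """Approve the spend only if it keeps this project under the cap."""
        already_spent = self.spent_by_project.get(project_id, 0.0)
        if already_spent + amount > MAX_PROJECT_FRACTION * self.total:
            return False  # over the allowance: refuse
        self.spent_by_project[project_id] = already_spent + amount
        return True


if __name__ == "__main__":
    ledger = ResourceLedger(total_resources=1_000_000)
    print(ledger.try_spend("probe_jupiter", 9_000))  # True: within the 1% (10,000) cap
    print(ledger.try_spend("probe_jupiter", 2_000))  # False: would push the project past 10,000
```

Of course this punts on exactly the problem I just mentioned: nothing here defines what a “single project” is, or stops the agent from splitting one project into a hundred sub-projects that each stay under the cap.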
Note, to be entirely clear: I’m not saying this is anywhere near sufficient to align an AGI completely. Mostly it’s a mechanism for decreasing the chance of totally catastrophic misalignment, and for encouraging the AI to be merely really, really destructive instead. I don’t think curiosity alone is enough to prevent it from wreaking havoc, but I do think it would satisfy the technical definition of alignment, which is that at least one human remains alive.