AI value alignment problem / AI goal alignment problem
Incidentally, rephrasing and presenting the problem in this manner is how MIRI could gain additional traction in the private sector and its associated academia. As self-modifying code gains popularity, plenty of people will run into exactly this hurdle: “but how can I make sure that my modified agent still optimizes for X?” Establishing that as a well-delineated subfield, unrelated to the whole “fate of humanity” framing, could both prompt/shape additional external research and lay the groundwork for the whole “the new goals post-modification may be really damaging” argument, reducing the inferential distance to one of scope alone.
A company making a lot of money off its self-modifying stock-brokering algorithms* (a couple of years down the line), locked in a perpetual tug-of-war with its competitors’ self-modifying stock-brokering algorithms*, will be quite interested in proofs that its modified-beyond-recognition agent will still try to make it a profit.
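To make the “still optimizes for X” question concrete, here is a minimal, purely illustrative sketch (not MIRI’s formalism, and all names are hypothetical): a self-modifying agent that only accepts a proposed rewrite of itself if the rewrite does no worse on the original, fixed objective over a test suite. The industrial version of the problem is wanting a proof of this property rather than an empirical spot-check.

```python
from typing import Callable, List

Policy = Callable[[float], float]          # maps an observation to an action
Objective = Callable[[Policy], float]      # scores a policy on the fixed goal X


def accept_modification(current: Policy,
                        proposed: Policy,
                        objective: Objective,
                        tolerance: float = 0.0) -> Policy:
    """Keep a proposed self-modification only if it does not degrade
    performance on the *original* objective (empirical check, not a proof)."""
    if objective(proposed) + tolerance >= objective(current):
        return proposed
    return current


# Toy goal X: "track the input as closely as possible" on a fixed test suite.
test_inputs: List[float] = [0.0, 0.5, 1.0, 2.0]

def goal_x(policy: Policy) -> float:
    # Higher is better: negative squared tracking error.
    return -sum((policy(x) - x) ** 2 for x in test_inputs)

baseline: Policy = lambda x: 0.9 * x        # current agent
rewrite: Policy = lambda x: x               # a proposed self-modification

agent = accept_modification(baseline, rewrite, goal_x)
print(agent(2.0))  # 2.0 -- the rewrite was accepted because it improves on X
```

The gap between this sketch and what a trading firm would actually need is exactly the research question: replacing the empirical check with a guarantee that holds for arbitrary, unanticipated modifications.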
I imagine that a host of expert systems will, in the medium term, increasingly rely on self-modification. Now compare, in terms of industry attractiveness: “an institute concerned with the friendliness of AI, in the x-risk sense” versus “an institute concerned with preserving AI goals across modifications as an invariant which I can use to keep the money flowing”.
Even if both map onto the same research activities at MIRI, modulo the “what values should the AI have?” question, which in large part isn’t a technical question anyway and drags us into the murky maelstrom between game theory / morality / psychology / personal dogmatism. Just a branding suggestion, since the market on the goal alignment question in the concrete economic sense can still be captured. Could even lead to some Legg-itimate investments.
* Using algorithm/tool/agent interchangeably, since they aren’t separated by more than a trivial modification.