Great post! I especially liked the New Zealand example, it seems like a surprisingly good fit. Against a general backdrop of agreement, let me list a few points of disagreement:
Example: predictive policing algorithms used in the US are biased against people of colour.
Somewhat of a nitpick, but it is not clear to me that this is an example of problems of short-term incentives. Are we sure that given the choice between “lower crime, lower costs and algorithmic bias” and “higher crime, higher costs and only human bias”, and we have dictatorial power and can consider long-term effects, we would choose the latter on reflection? Similarly with other current AI examples (e.g. the Google gorilla misclassification).
(It’s possible that if I delved into the research and the numbers the answer would be obvious; I have extremely little information on the scale of the problem currently.)
Most human institutions are at least somewhat interpretable. This means, for example, that humans who tamper with the measurement process to pursue easy-to-measure objectives are prone to being caught, as eventually happened with CompStat. However, ML systems today are currently hard to interpret, and so it may be more difficult to catch interference with the measurement process.
Note ML systems are way more interpretable than humans, so if they are replacing humans then this shouldn’t make that much of a difference. If a single ML system replaces an entire institution, then it probably is less interpretable than that institution. It doesn’t seem obvious to me which of these we should be considering. (Partly it depends on how general and capable the AI systems are.) Overall I’d guess that for WFLL1 it’s closer to “replacing humans” than “replacing institutions”.
Note ML systems are way more interpretable than humans, so if they are replacing humans then this shouldn’t make that much of a difference.
I guess you mean here that activations and weights in NNs are more interpretable to us than neurological processes in the human brain, but if so this comparison does not seem relevant to the text you quoted. Consider that it seems easier to understand why an editor of a newspaper placed some article on the front page than why FB’s algorithm showed some post to some user (especially if we get to ask the editor questions or consult with other editors).
Overall I’d guess that for WFLL1 it’s closer to “replacing humans” than “replacing institutions”.
Even if so (which I would expect to become uncompetitive with “replacing institutions” at some point) you may still get weird dynamics between AI systems within an institution and across institutions (e.g. between a CEO advisor AI and a regulator advisor AI). These dynamics may be very hard to interpret (and may not even involve recognizable communication channels).
I guess you mean here that activations and weights in NNs are more interpretable to us than neurological processes in the human brain, but if so this comparison does not seem relevant to the text you quoted. Consider that it seems easier to understand why an editor of a newspaper placed some article on the front page than why FB’s algorithm showed some post to some user (especially if we get to ask the editor questions or consult with other editors).
Isn’t this what I said in the rest of that paragraph (although I didn’t have an example)?
which I would expect to become uncompetitive with “replacing institutions” at some point
I’m not claiming that replacing humans is more competitive than replacing institutions. I’m claiming that, if we’re considering the WFLL1 setting, and we’re considering the point at which we could have prevented failure, at that point I’d expect AI systems are in the “replacing humans” category. By the time they’re in the “replacing institutions” category, we probably are far beyond the position where we could do anything about the future.
Separately, even in the long run, I expect modularity to be a key organizing principle for AI systems.
you may still get weird dynamics between AI systems within an institution and across institutions (e.g. between a CEO advisor AI and a regulator advisor AI). These dynamics may be very hard to interpret (and may not even involve recognizable communication channels).
I agree this is possible but it doesn’t seem very likely to me, since we’ll very likely be training our AI systems to communicate in natural language, and those AI systems will likely be trained to behave in vaguely human-like ways.
Isn’t this what I said in the rest of that paragraph (although I didn’t have an example)?
I meant to say that even if we replace just a single person (like a newspaper editor) with an ML system, it may become much harder to understand why each decision was made.
I agree this is possible but it doesn’t seem very likely to me, since we’ll very likely be training our AI systems to communicate in natural language, and those AI systems will likely be trained to behave in vaguely human-like ways.
The challenge here, it seems to me, is to train competitive models that behave in vaguely human-like ways for general real-world tasks (e.g. selecting content for a FB user feed or updating item prices on Walmart). In the business-as-usual scenario we would need such systems to be competitive with systems that are optimized for business metrics (e.g. users’ time spent or profit).
Are we sure that given the choice between “lower crime, lower costs and algorithmic bias” and “higher crime, higher costs and only human bias”, and we have dictatorial power and can consider long-term effects, we would choose the latter on reflection?
Good point, thanks, I hadn’t thought that sometimes it actually would make sense, on reflection, to choose an algorithm pursuing an easy-to-measure goal over humans pursuing incorrect goals. One thing I’d add is that if one did delve into the research to work this out for a particular case, it seems that an important (but hard to quantify) consideration would be the extent to which choosing the algorithm in this case makes it more likely that the use of that algorithm becomes entrenched, or it sets a precedent for the use of such algorithms. This feels important since these effects could plausibly make WFLL1-like things more likely in the longer run (when the harm of using misaligned systems is higher, due to the higher capabilities of those systems).
Note ML systems are way more interpretable than humans, so if they are replacing humans then this shouldn’t make that much of a difference.
Good catch. I had the “AI systems replace entire institutions” scenario in mind, but agree that WFLL1 actually feels closer to “AI systems replace humans”. I’m pretty confused about what this would look like though, and in particular, whether institutions would retain their interpretability if this happened. It seems plausible that the best way to “carve up” an institution into individual agents/services differs for humans and AI systems. E.g. education/learning is a big part of human institution design (you start at the bottom and work your way up as you learn skills and become trusted to act more autonomously), but this probably wouldn’t be the case for institutions composed of AI systems, since the “CEO” could just copy their model parameters to the “intern” :). And if institutions composed of AI systems are quite different to institutions composed of humans, then they might not be very interpretable. Sure, you could assert that AI systems replace humans one-for-one, but if this is not the best design, then there may be competitive pressure to move away from this towards something less interpretable.
One caveat is that on my models of AI development I don’t expect the CEO could just copy model parameters to the intern. I think it’s more likely that we have something along the lines of “graduate of <specific college major>” AI systems that you then copy and use as needed. But I don’t think this really affects your point.
Sure, you could assert that AI systems replace humans one-for-one, but if this is not the best design, then there may be competitive pressure to move away from this towards something less interpretable.
Yeah jtbc I definitely would not assert this. If I had to make an argument for as-much-interpretability, it would be something like “in the scenario we’re considering, AI systems are roughly human-level in capability; at this level of capability societal organization will still require a lot of modularity; if we know nothing else and assume agents are as black-boxy as humans, it seems reasonable to assume this will lead to a roughly similar amount of interpretability as current society”. But this is not a particularly strong argument, especially in the face of vast uncertainty about what the future looks like.
Thanks for your comment!
Yup, all of that sounds right to me!