Okay, wow, I don’t know if I quite understand any of this, but this part caught my attention:
The Omohundrian/Yudkowskian argument is not that we can take an arbitrary stupid young AI and it will be smart enough to self-modify in a way that preserves its values, but rather that most AIs that don’t self-destruct will eventually end up at a stable fixed-point of coherent consequentialist values. This could easily involve a step where, e.g., an AI that started out with a neural-style delta-rule policy-reinforcement learning algorithm, or an AI that started out as a big soup of self-modifying heuristics, is “taken over” by whatever part of the AI first learns to do consequentialist reasoning about code.
I have sometimes wondered whether the best way to teach an AI a human’s utility function would not be to program it into the AI directly (because that will require that we figure out what we really want in a really precisely-defined way, which seems like a gargantuan task), but rather, perhaps the best way would be to “raise” the AI like a kid at a stage where the AI would have minimal and restricted ways of interacting with human society (to minimize harm...much like a toddler thankfully does not have the muscles of Arnold Schwarzenegger to use during its temper tantrums), and where we would then “reward” or “punish” the AI for seeming to demonstrate better or worse understanding of our utility function.
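Concretely, the kind of feedback loop I have in mind might look something like this toy sketch (the actions, the judging rule, and the update rule are all made-up stand-ins, not a real training proposal):

```python
import random

# Toy "raising an AI" loop: a human judge rewards or punishes the agent's
# guesses about what we value. Everything here is illustrative only.

ACTIONS = ["share the toy", "grab the toy", "ask for help", "throw a tantrum"]

def human_judge(action: str) -> float:
    """Stand-in for the parent: +1 for behaviour we approve of, -1 otherwise."""
    return 1.0 if action in ("share the toy", "ask for help") else -1.0

# The agent's current estimate of how much we value each action.
value_estimate = {a: 0.0 for a in ACTIONS}
learning_rate = 0.1

for step in range(1000):
    # Mostly exploit current estimates, occasionally explore.
    if random.random() < 0.1:
        action = random.choice(ACTIONS)
    else:
        action = max(ACTIONS, key=value_estimate.get)
    reward = human_judge(action)
    # Simple delta-rule update toward the observed reward.
    value_estimate[action] += learning_rate * (reward - value_estimate[action])

print(value_estimate)  # approved actions drift toward +1, others toward -1
```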
It always seemed to me that this strategy had the fatal flaw that we would not be able to tell if the AI was really already superintelligent and was just playing dumb and telling us what we wanted to hear so that we would let it loose, or if the AI really was just learning.
In addition to that fatal flaw, it seems to me that the above quote suggests another fatal flaw to the “raising an AI” strategy—that there would be a limited time window in which the AI’s utility function would still be malleable. It would appear that, as soon as part of the AI figures out how to do consequentialist reasoning about code, then its “critical period” in which we could still mould its utility function would be over. Is this the right way of thinking about this, or is this line of thought waaaay too amateurish?
This problem is essentially what MIRI has been calling corrigibility. A corrigible AI is one that understands and accepts that it or its utility function is not yet complete.
Very relevant article from the sequences: Detached Lever Fallacy.
Not saying you’re committing this fallacy, but it does explain some of the bigger problems with “raising an AI like a child” that you might not have thought of.
I completely made this mistake right up until the point I read that article.
Hardly dispositive. A utility function that says “learn and care about what your parents care about” looks relatively simple on paper. And we know the minimum intelligence required is that of a human toddler.
Citation needed. That sounds extremely complex to specify.
relatively
I don’t think “learn and care about what your parents care about” is noticeably simpler than abstractly trying to determine what an arbitrary person cares about or CEV.
You could, you know, look inside the machine and see what makes it tick. It’s not a black box.
That seems desirable and perhaps possible, but extremely difficult, especially when you have a superintelligent mind anticipating that you’ll do it and trying to work out how to ensure you come away with the wrong impression.
After a certain level of complexity you can look but you wouldn’t understand what you are seeing.
Why? That’s not a necessary logical consequence. These aren’t (or don’t have to be) chaotic systems, so there is no reason that scaling up the size of the computation results in an unauditable mess. The techniques used depend very much on the AGI architecture, but there are designs which allow for tracing thought patterns and answering questions about its operation in ways which are computationally tractable.
Because there are limits to the complexity a human brain can process. There are tools and techniques which help, sure, but at some point you will hit the limit anyway.
I am pretty sure there is. Imagine a few million threads running different code asynchronously on distributed hardware, all interacting. And if the scenario involves an AI which bootstrapped itself through self-modification, your chances of understanding what’s happening by looking at pieces of code are… very low.
Sorry, some clarification is necessary. I didn’t mean to imply that a human being would manually inspect the AI and understand its workings, but rather that humans assisted by non-recursively self-improving technology (e.g. theorem provers, probabilistic code checkers, sandbox environments, etc.) would check, during execution or afterwards, that the code behaves with respect to whatever properties the human cares about.
For example, the input/output channels which handle a combination of long-term memory, self-modifications, and actions may require that complete computational traces be provided. Obviously the traces must be pruned to keep the auditing workload manageable, but using cryptographic techniques this can be done such that the AI cannot know ahead of time which branches will be checked, and therefore cannot cheat by creating fraudulent traces while escaping probabilistic or energetic detection.
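A minimal sketch of that commit-then-spot-check idea, assuming a flat hash list and a placeholder recheck step (both my own simplifications, not part of any worked-out protocol):

```python
import hashlib
import random

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

# --- AI side: commit to every chunk of the execution trace up front. ---
# The chunk contents are placeholders; a real trace would carry state,
# inputs, outputs, and the code that was run at each step.
trace_chunks = [f"step {i}: <state, inputs, outputs>".encode() for i in range(100_000)]
commitments = [h(c) for c in trace_chunks]      # published before any auditing starts
commitment_root = h(b"".join(commitments))      # flat root; a Merkle tree would scale better

# --- Auditor side: choose which branches to inspect only after the commitment. ---
# Because the sample is drawn after the commitments are fixed, the AI cannot
# tailor fraudulent chunks to the branches it knows will be checked.
sample = random.sample(range(len(trace_chunks)), k=100)

def recheck(chunk: bytes) -> bool:
    """Stand-in for re-executing the step and comparing it against the claimed reasoning."""
    return True

for i in sample:
    revealed = trace_chunks[i]                  # AI reveals the requested chunk on demand
    assert h(revealed) == commitments[i], "chunk altered after commitment"
    assert recheck(revealed), "step inconsistent with the claimed utility function"
```

With a Merkle tree the auditor would only need the root plus logarithmic-size proofs for the sampled chunks, rather than the full list of hashes.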
The AI itself is executing on a framework that uses Bayesian probability graphs as its fundamental unit of computation, like a list in Lisp. The auditor’s job is then to make sure that the updates or actions make sense in terms of the data provided, and that choices were made in alignment with its claimed utility function (remember the AI can’t make up data without that fraud being energetically detectable).
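To make “the updates make sense in terms of the data provided” concrete, here is a toy single-node check (the trace format and all the numbers are invented for illustration):

```python
# One node of the kind of probability graph described above: the AI records
# (prior, likelihoods, observation, posterior) for each belief update, and the
# auditor recomputes the posterior to confirm the recorded update actually
# follows from the recorded data.

def bayes_update(prior: float, p_obs_given_h: float, p_obs_given_not_h: float) -> float:
    """P(H | obs) by Bayes' theorem."""
    evidence = p_obs_given_h * prior + p_obs_given_not_h * (1.0 - prior)
    return p_obs_given_h * prior / evidence

# One entry of the AI's audit trace (all values invented for illustration).
trace_entry = {
    "hypothesis": "operator wants the report summarised",
    "prior": 0.30,
    "p_obs_given_h": 0.90,       # P(operator says "keep it short" | H)
    "p_obs_given_not_h": 0.20,   # P(operator says "keep it short" | not H)
    "claimed_posterior": 0.6585,
}

# Auditor: accept the update only if it is reproducible from the recorded data.
recomputed = bayes_update(trace_entry["prior"],
                          trace_entry["p_obs_given_h"],
                          trace_entry["p_obs_given_not_h"])
assert abs(recomputed - trace_entry["claimed_posterior"]) < 1e-3, "update not justified by data"
```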
The above basically acts as a lie detector for the question “what’s your highest-level utility function?” and follow-up questions like “give examples of why this clause of the utility function is necessary”.
--
Actually, maybe that’s a simpler way of framing the issue: assume that you have access to a lie detector. I posit that an accurate lie detector is sufficient to create a boxable AI, or maybe even to enable some procedural process for the creation of a true friendly AI (e.g. some extension of indirect normativity). The above is basically a technical outline of how one would create such a lie detector for certain AGI architectures.
I reiterate my point: regardless of tools used, you still have a human brain at the end of the line and that brain’s complexity limit is not high.
Going back to my example, what does a “complete computational trace” mean in the context of a few million threads running different code asynchronously on distributed hardware, all interacting?
A list in Lisp is just a data structure. What is a Bayesian probability graph as a unit of computation? I don’t even understand what it means.
Sure. Give that lie detector to someone with an IQ of 60 and see how well the two of them together figure out something that a guy with an IQ of 150 wants to keep hidden from them, say, his true intentions and capabilities.
A human brain is at the end of all the alternative strategies as well.
Just as an example of something a human couldn’t understand that a sufficiently smart computer might—writing code directly to binary, without the intermediate step of a programming language.
That would be read as disassembled code, which humans can understand, though not in large quantities.
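The round trip is mechanical; a disassembler maps raw bytes straight back to mnemonics. A tiny sketch using the third-party capstone library (the byte string is just mov eax, 42 followed by ret on x86-64):

```python
# pip install capstone  (third-party disassembly library)
from capstone import Cs, CS_ARCH_X86, CS_MODE_64

code = b"\xb8\x2a\x00\x00\x00\xc3"  # raw machine code: mov eax, 0x2a; ret

md = Cs(CS_ARCH_X86, CS_MODE_64)
for insn in md.disasm(code, 0x0):
    print(f"0x{insn.address:x}\t{insn.mnemonic}\t{insn.op_str}")
# 0x0   mov     eax, 0x2a
# 0x5   ret
```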
Interesting. Consider me corrected.
For anything nontrivial, we need software support to do that—and it still won’t work very well. You might not be absolutely correct, but you’re close.
IDA is a wonderful piece of software, though. A heck of a lot better than working manually.