I would also like to see more work where people make less positive/optimistic assumptions. I think of it as a good thing that different approaches to alignment are being explored, and would like to see more of that in general (both in terms of breadth and depth).
I guess there are many possible ways of trying to categorize/conceptualize approaches to alignment theorizing. One is by asking: “When talking/thinking about the methodology, what capabilities are assumed to be in place?”
I’m not sure about this, but unless I’m mistaken[1], a good amount of the work done by MIRI has been under a premise that goes (something sort of like): “Let’s assume that we already know how to give AIs real-world objectives defined in terms of utility functions (not because we really assume that, but as a simplifying assumption)”. And as far as I know, they haven’t done much work where the assumption was something more like “suppose we were extremely good at gradient descent / searching through spaces of possible programs”.
In my own theorizing, I don’t make all of the simplifying assumptions that (I think/suspect) MIRI made in their “orthodox” research. But I make other assumptions (for the purpose of simplification), such as:
“let’s assume that we’re really good at gradient descent / searching for possible AIs in program-space”[2]
“let’s assume that the things I’m imagining are not made infeasible due to a lack of computational resources”
“let’s assume that resources and organizational culture make it possible to carry out the plans as described/envisioned (with high technical security, etc.)”
In regard to your alignment ideas, is it easy to summarize what you assume to be in place? Like, if someone came to you and said “we have written the source code for a superintelligent AGI, but we haven’t turned it on yet” (and you believed them), is it easy to summarize what more you would then need in order to implement your methodology?
[1] I very well could be, and would appreciate any corrections.
(I know they have worked on lots of detail-oriented things that aren’t “one big plan” to “solve alignment”. And maybe how I phrase myself makes it seem like I don’t understand that. But if so, that’s probably due to bad wording on my part.)
[2] Well, I sort of make that assumption, but there are caveats.
As far as I understand, MIRI did not assume that we’re just able to give the AI a utility function directly. The Risks from Learned Optimization paper was written mainly by people from MIRI!
Other things like Ontological Crises and Low Impact sort of assume you can get some info into the values of an agent, and Logical Induction was more about how to construct systems that satisfy some properties in their cognition.
As far as I understand, MIRI did not assume that we’re just able to give the AI a utility function directly.
There’s lots of material that does assume that, even if there is some that doesn’t.
I’m a bit unsure about how to interpret you here.
In my original comment, I used terms such as positive/optimistic assumptions and simplifying assumptions. When doing that, I meant to refer to simplifying assumptions that were made so as to abstract away some parts of the problem.
The Risks from Learned Optimization paper was written mainly by people from MIRI!
Good point (I should have written my comment in such a way that pointing this out didn’t feel necessary).
Other things like Ontological Crises and Low Impact sort of assume you can get some info into the values of an agent
I guess this is more central to what I was trying to communicate than whether it is expressed in terms of a utility function per se.
In this tweet, Eliezer writes:
“The idea with agent foundations, which I guess hasn’t successfully been communicated to this day, was finding a coherent target to try to get into the system by any means (potentially including DL ones).”
Based on e.g. this talk from 2016, I get the sense that when he says “coherent target” he means targets that relate to the non-digital world. But perhaps that’s not the case (or perhaps it’s sort of the case, but more nuanced).
Maybe I’m making this out to have been a bigger part of their work than what actually was the case.
Yeah, I find it difficult to figure out how to look at this. A lot of MIRI discussion focused on their decision theory work, but I think that’s just not that important.
Tiling agents, e.g., was more about constructing or theorizing about agents that may have access to their own values, in a highly idealized, logic-based setting.