Why are highly interactive training protocols a massive pain?
Do you have any thoughts on self-supervised learning? That’s my current guess for how we’ll get AGI, and it’s a framework that makes the alignment problem seem relatively straightforward to me.
They’re a pain because they involve a lot of human labor, slow down the experiment loop, make reproducing results harder, etc.
RE self-supervised learning: I don’t see why we needed the rebranding (of unsupervised learning). I don’t see why it would make alignment straightforward (ETA: except to the extent that you aren’t necessarily, deliberately building something agenty). The boundaries between SSL and the rest of ML are fuzzy; I don’t think we’ll get to AGI using just SSL and nothing like RL. SSL doesn’t solve the exploration problem, and if you start caring about exploration, I think you end up doing things that look more like RL.
I also tend to agree (e.g. with that gwern article) that AGI designs that aren’t agenty are going to be at a significant competitive disadvantage, so probably aren’t a satisfying solution to alignment, but could be a stop-gap.
They’re a pain because they involve a lot of human labor, slow down the experiment loop, make reproducing results harder, etc.
I see. How about doing active learning of computable functions? That solves all 3 problems.
Instead of standard benchmarks, you could offer an API which provides an oracle for some secret functions to be learned. You could run a competition every X months and give each competition entrant a budget of Y API calls over the course of the competition.
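Schematically, something like this, a minimal sketch where the class name, the secret function, and the budget numbers are all invented for illustration:

```python
# Hypothetical sketch of a competition oracle: entrants may query a secret
# computable function, but only within a fixed call budget.

class CompetitionOracle:
    def __init__(self, secret_fn, budget):
        self._secret_fn = secret_fn   # hidden target function to be learned
        self._budget = budget         # total API calls allowed per entrant

    def query(self, x):
        """Return the secret function's output for x, consuming one call."""
        if self._budget <= 0:
            raise RuntimeError("API call budget exhausted")
        self._budget -= 1
        return self._secret_fn(x)


# The organizers keep this function secret; entrants only ever see query().
oracle = CompetitionOracle(secret_fn=lambda x: x ** 3 - 2 * x, budget=1000)
print(oracle.query(4.0))  # an entrant spends one of its 1000 calls
```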
RE self-supervised learning: I don’t see why we needed the rebranding (of unsupervised learning).
Well I don’t see why neural networks needed to be rebranded as “deep learning” either :-)
When I talk about “self-supervised learning”, I refer to chopping up your training set into automatically created supervised learning problems (predictive processing), which feels different from clustering/dimensionality reduction. It seems like a promising approach regardless of what you call it.
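As a toy illustration of that framing (the windowing scheme and the helper name below are arbitrary choices, not a specific proposal): slice an unlabeled sequence into automatically generated supervised examples.

```python
# Toy illustration of self-supervised problem construction: slice an unlabeled
# sequence into (context, target) pairs so ordinary supervised learning applies.

def make_next_token_examples(tokens, context_len=3):
    """Yield (context, target) pairs where the target is the next token."""
    for i in range(len(tokens) - context_len):
        context = tokens[i:i + context_len]
        target = tokens[i + context_len]
        yield context, target


corpus = "the cat sat on the mat".split()
for context, target in make_next_token_examples(corpus):
    print(context, "->", target)
# ['the', 'cat', 'sat'] -> 'on', etc.  No human labels were needed.
```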
I don’t see why it would make alignment straightforward (ETA: except to the extent that you aren’t necessarily, deliberately building something agenty).
In order to make accurate predictions about reality, you need to understand humans, because humans exist in reality. So at the very least, a superintelligent self-supervised learning system trained on loads of human data would have a lot of conceptual building blocks (developed in order to make predictions about its training data) which could be tweaked and combined to make predictions about human values (analogous to fine-tuning in the context of transfer learning). But I suspect fine-tuning might not even be necessary. Just ask it what Gandhi would do or something like that.
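As a loose analogy for the fine-tuning picture, not a claim about how an actual superintelligent predictor would be adapted: freeze the features learned for prediction and fit only a small head on a few value-relevant labels. The backbone features and the "approval" labels below are entirely invented.

```python
# Loose, invented analogy for the fine-tuning idea: reuse frozen "predictive"
# features and fit only a tiny head on a few value-relevant labels.
import numpy as np

def frozen_backbone(x):
    """Stand-in for features learned during self-supervised pretraining."""
    return np.stack([x, np.sin(x), x ** 2], axis=1)

# A handful of hypothetical "human approval" labels for some situations x.
x_labeled = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
y_approval = np.array([0.1, 0.3, 0.9, 0.8, 0.2])

# Fine-tune: fit only the small linear head; the backbone stays frozen.
features = frozen_backbone(x_labeled)
head, *_ = np.linalg.lstsq(features, y_approval, rcond=None)

print(frozen_backbone(np.array([0.5])) @ head)  # predicted approval for a new case
```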
Re: gwern’s article, RL does not seem to me like a good fit for most of the problems he describes. I agree active learning/interactive training protocols are powerful, but that’s not the same as RL.
Autonomy is also nice (and also not the same as RL). I think the solution for autonomy is (1) solve calibration/distributional shift, so the system knows when it’s safe to act autonomously (2) have the system adjust its own level of autonomy/need for clarification dynamically depending on the apparent urgency of its circumstances. I have notes for a post about (2), let me know if you think I should prioritize writing it.
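Here is a cartoon of how (1) and (2) could fit together; the thresholds and the urgency adjustment are made-up numbers, not a worked-out proposal:

```python
# Cartoon of calibration-gated autonomy: act alone only when predictive
# confidence clears a threshold, and lower that threshold as urgency rises.

def should_act_autonomously(confidence, urgency, base_threshold=0.95):
    """Decide between acting and asking for clarification.

    confidence: the system's (assumed well-calibrated) probability it is right.
    urgency: 0.0 (no rush) to 1.0 (no time to wait for a human).
    """
    # Invented rule: urgent situations tolerate a lower confidence bar.
    threshold = base_threshold - 0.3 * urgency
    return confidence >= threshold


print(should_act_autonomously(confidence=0.90, urgency=0.1))  # False: ask a human
print(should_act_autonomously(confidence=0.90, urgency=0.9))  # True: act now
```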
I see. How about doing active learning of computable functions? That solves all 3 problems
^ I don’t see how?
I should elaborate… it sounds like you’re thinking of active learning (where the AI can choose to make queries for information, e.g. labels), but I’m talking about *inter*active training, where a human supervisor is *also* actively monitoring the AI system, making queries of it, and intelligently selecting feedback for the AI. This might be simulated as well, using multiple AIs, and there might be a lot of room for good work there… but I think if we want to solve alignment, we want a deep and satisfying understanding of AI systems, which seems hard to come by without rich feedback loops between humans and AIs. Basically, by interactive training, I have in mind something where training AIs looks more like teaching other humans.
So at the very least, a superintelligent self-supervised learning system trained on loads of human data would have a lot of conceptual building blocks (developed in order to make predictions about its training data) which could be tweaked and combined to make predictions about human values (analogous to fine-tuning in the context of transfer learning).
I think it’s a very open question how well we can expect advanced AI systems to understand or mirror human concepts by default. Adversarial examples suggest we should be worried that apparently similar concepts will actually be wildly different in non-obvious ways. I’m cautiously optimistic, since this could make things a lot easier. It’s also unclear ATM how precisely AI concepts need to track human concepts in order for things to work out OK. The “basin of attraction” line of thought suggests that they don’t need to be that great, because they can self-correct or learn to defer to humans appropriately. My problem with that argument is that it seems like we will have so many chances to fuck up that we would need 1) AI systems to be extremely reliable, or 2) for catastrophic mistakes to be rare, and minor mistakes to be transient or detectable. (2) seems plausible to me in many applications, but probably not all of the applications where people will want to use SOTA AI.
Re: gwern’s article, RL does not seem to me like a good fit for most of the problems he describes. I agree active learning/interactive training protocols are powerful, but that’s not the same as RL.
Yes ofc they are different.
I think the significant features of RL algorithms here are: 1) having the goal of understanding the world and how to influence it, and 2) doing (possibly implicit) planning. RL can also be pointed at narrow domains, but for a lot of problems, I think having general knowledge will be very valuable, and hard to replicate with a network of narrow systems.
I think the solution for autonomy is (1) solve calibration/distributional shift, so the system knows when it’s safe to act autonomously (2) have the system adjust its own level of autonomy/need for clarification dynamically depending on the apparent urgency of its circumstances.
That seems great, but also likely to be very difficult, especially if we demand high reliability and performance.
No human labor: Just compute the function. Fast experiment loop: Computers are faster than humans. Reproducible: Share the code for your function with others.
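A minimal sketch of the resulting workflow, assuming a simple disagreement-based query rule (the target function and the little polynomial ensemble are arbitrary stand-ins):

```python
# Minimal active-learning loop against a computable target function: no human
# labels (each "label" is just a function call) and the run is fully reproducible.
import numpy as np

target = lambda x: np.sin(3 * x)      # stand-in for the computable target function
pool = np.linspace(-2, 2, 200)        # candidate query points

queried_x = [-2.0, -0.5, 2.0]         # a few seed queries
queried_y = [target(x) for x in queried_x]

for _ in range(10):
    xs, ys = np.array(queried_x), np.array(queried_y)
    # Fit a small ensemble of polynomial models of different degrees.
    degrees = [d for d in (1, 2, 3, 4) if d < len(set(queried_x))]
    preds = [np.polyval(np.polyfit(xs, ys, d), pool) for d in degrees]
    # Query the candidate point where the models disagree the most.
    x_next = float(pool[np.argmax(np.std(preds, axis=0))])
    queried_x.append(x_next)
    queried_y.append(target(x_next))

print(f"{len(queried_x)} queries; last few chosen points: {queried_x[-3:]}")
```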
I’m talking about interactive training
I think for a sufficiently advanced AI system, assuming it’s well put together, active learning can beat this sort of interactive training—the AI will be better at the task of identifying & fixing potential weaknesses in its models than humans.
Adversarial examples suggest we should be worried that apparently similar concepts will actually be wildly different in non-obvious ways.
I think the problem with adversarial examples is that deep neural nets don’t have the right inductive biases. I expect meta-learning approaches which identify & acquire new inductive biases (in order to determine “how to think” about a particular domain) will solve this problem and will also be necessary for AGI anyway.
BTW, different human brains appear to learn different representations (previous discussion), and yet we are capable of delegating tasks to each other.
I’m cautiously optimistic, since this could make things a lot easier.
Huh?
My problem with that argument is that it seems like we will have so many chances to fuck up that we would need 1) AI systems to be extremely reliable, or 2) for catastrophic mistakes to be rare, and minor mistakes to be transient or detectable. (2) seems plausible to me in many applications, but probably not all of the applications where people will want to use SOTA AI.
Maybe. But my intuition is that if you can create a superintelligent system, you can make one which is “superhumanly reliable” even in domains which are novel to it. I think the core problems for reliable AI are very similar to the core problems for AI in general. An example is the fact that solving adversarial examples and improving classification accuracy seem intimately related.
I think the significant features of RL algorithms here are: 1) having the goal of understanding the world and how to influence it, and 2) doing (possibly implicit) planning.
In what sense does RL try to understand the world? It seems very much not focused on that. You essentially have to hand it a reasonably accurate simulation of the world (i.e. a world that is already fully understood, in the sense that we have a great model for it) for it to do anything interesting.
If the planning is only “implicit”, RL sounds like overkill and probably not a great fit. RL seems relatively good at long sequences of actions for a stateful system we have a great model of. If most of the value can be obtained by planning 1 step in advance, RL seems like a solution to a problem you don’t have. It is likely to make your system less safe, since planning many steps in advance could let it plot some kind of treacherous turn. But I also don’t think you will gain much through using it. So luckily, I don’t think there is a big capabilities vs safety tradeoff here.
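To make the 1-step vs. multi-step distinction concrete, here is a toy chain MDP (entirely invented) where the payoff is delayed, so a myopic rule misses it; the claim above is that for many real tasks the myopic rule already captures most of the value, in which case the planning machinery buys you little.

```python
# Toy contrast between myopic (1-step) action choice and multi-step planning
# on an invented 4-state chain MDP. Not a claim about any particular system.
import numpy as np

# States 0..3; action 0 = "stay" (small immediate reward), action 1 = "advance"
# (no immediate reward until the last state pays off).
n_states, gamma = 4, 0.9
reward = {(s, 0): 0.1 for s in range(n_states)}             # stay: small payoff now
reward.update({(s, 1): 0.0 for s in range(n_states - 1)})   # advance: nothing yet
reward[(n_states - 1, 1)] = 1.0                             # big delayed payoff
step = lambda s, a: s if a == 0 else min(s + 1, n_states - 1)

# Myopic (1-step) policy: pick whichever action pays the most right now.
myopic = [max((0, 1), key=lambda a: reward[(s, a)]) for s in range(n_states)]

# Multi-step planning via value iteration.
V = np.zeros(n_states)
for _ in range(100):
    V = np.array([max(reward[(s, a)] + gamma * V[step(s, a)] for a in (0, 1))
                  for s in range(n_states)])
planner = [max((0, 1), key=lambda a: reward[(s, a)] + gamma * V[step(s, a)])
           for s in range(n_states)]

print("myopic policy:  ", myopic)   # stays put everywhere except the last state
print("planning policy:", planner)  # advances toward the delayed payoff
```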
I think having general knowledge will be very valuable, and hard to replicate with a network of narrow systems.
Agreed. But general knowledge is also not RL, and is handled much more naturally in other frameworks such as transfer learning, IMO.
So basically I think daemons/inner optimizers/whatever you want to call them are going to be the main safety problem.