But at some point, this is no longer very meaningful. (E.g. you train on solving 5th grade math problems and deploy to the Riemann hypothesis.)
It sounds to me like we agree here, I don’t want to put too much weight on “most”.
Is this true?
It is true in the sense that you don’t have any theoretical guarantees, and in the sense that it also often fails to work in practice.
Aren’t NNs implicitly ensembles of a vast number of models?
They probably are, to some extent. However, in practice, you often find that neural networks make very confident (and wrong) predictions for out-of-distribution inputs, in a way that seems to be caused by them projecting some spurious correlation. For example, you train a network to distinguish different types of tanks, but it learns to distinguish night from day. You train an agent to collect coins, but it learns to go to the right. You train a network to detect criminality, but it learns to detect smiles. Adversarial examples could also be cast as an instance of this phenomenon. In all of these cases, we have a situation where there are multiple patterns in the data that fit a given training objective, but where a neural network ends up giving an unreasonably large weight to some of these patterns at the expense of other plausible patterns. I thus think it’s fair to say that, empirically, neural networks do not robustly quantify uncertainty in a reliable way when out-of-distribution. It may be that this problem mostly goes away with a sufficiently large amount of sufficiently varied data, but it seems hard to get high confidence in that.
Also, does ensembling 5 NNs help?
In practice, this does not seem to help very much.
If we’re conservative over a million models, how will we ever do anything?
I mean, there can easily be cases where we assign a very low probability to any given “complete” model of a situation, but where we are still able to assign a high probability to many different partial hypotheses. For example, if you randomly sample a building somewhere on earth, then your credence that the building has a particular floor plan might be less than 1 in 1,000,000 for each given floor plan. However, you could still assign a credence of near-1 that the building has a door, and a roof, etc. To give a less contrived example, there are many ways for the stock market to evolve over time. It would be very foolish to assume that it will evolve according to, e.g., your maximum-likelihood model. However, you can still assign a high credence to the hypothesis that it will grow on average. In many (if not all?) cases, such partial hypotheses are sufficient.
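To make the floor-plan point concrete, here is a minimal sketch. Everything in it is an illustrative assumption: the posterior is uniform over one million hypothetical "complete" models identified by integer ids, and `has_door` is an invented predicate standing in for a partial hypothesis. No single complete model gets more than one millionth of the mass, yet the partial hypothesis collects almost all of it.

```python
import math

NUM_MODELS = 1_000_000  # hypothetical count of "complete" models (e.g. floor plans)


def posterior_weight(model_id: int) -> float:
    """Uniform posterior over complete models (an illustrative assumption)."""
    return 1.0 / NUM_MODELS


def has_door(model_id: int) -> bool:
    """Toy partial hypothesis: all but 10 of the floor plans have a door."""
    return model_id % 100_000 != 0


def credence(predicate) -> float:
    """Total posterior mass of the complete models that satisfy `predicate`."""
    return math.fsum(
        posterior_weight(m) for m in range(NUM_MODELS) if predicate(m)
    )


largest_single_weight = max(posterior_weight(m) for m in range(NUM_MODELS))
door_credence = credence(has_door)

print(f"largest weight on any complete model: {largest_single_weight:.1e}")
print(f"credence that the building has a door: {door_credence:.5f}")
```

Being conservative over the million complete models therefore does not prevent acting on the partial hypothesis, since every model with non-negligible weight agrees on it.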
If our prior is importantly different in a way which we think will help, why can’t we regularize to train a NN in a normal way which will vaguely reasonably approximate this prior?
In the examples I gave above, the issue is less about the prior and more about the need to keep track of all plausible alternatives (which neural networks do not seem to do, robustly). Using ensembles might help, but in practice this does not seem to work that well.
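A toy sketch of what "keeping track of all plausible alternatives" could look like, loosely modelled on the tank example. The two hand-written hypotheses, the feature encoding, and the abstain-on-disagreement rule are all assumptions made for illustration; this is not a description of how any particular Bayesian or ensemble method works.

```python
# Each input is (tank_is_type_a, photo_taken_at_night); the label is the tank type.
# In the training data the two features are perfectly correlated.
TRAIN = [((1, 1), 1), ((0, 0), 0), ((1, 1), 1), ((0, 0), 0)]

# Hypothetical candidate explanations of the labels.
HYPOTHESES = {
    "tank_shape": lambda x: x[0],
    "time_of_day": lambda x: x[1],
}


def consistent(hypotheses, data):
    """Keep every hypothesis that fits all of the training examples."""
    return {
        name: h
        for name, h in hypotheses.items()
        if all(h(x) == y for x, y in data)
    }


def cautious_predict(hypotheses, x):
    """Return the shared prediction, or None if the surviving hypotheses disagree."""
    predictions = {h(x) for h in hypotheses.values()}
    return predictions.pop() if len(predictions) == 1 else None


surviving = consistent(HYPOTHESES, TRAIN)
in_distribution = cautious_predict(surviving, (1, 1))      # both hypotheses agree
out_of_distribution = cautious_predict(surviving, (1, 0))  # type A tank, daytime
```

Both hypotheses survive training, so the learner commits in-distribution but abstains on the daytime type A tank, instead of confidently projecting the night/day correlation.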
I again don’t see how bayes ensures you have some non-schemers while ensembling doesn’t.
I also don’t see a good reason to think that a Bayesian posterior over agents should give a large weight to non-schemers. However, that isn’t the use-case here (the world model is not meant to be agentic).
Another way to put this, is that all the interesting action was happening at the point where you solved the ELK problem.
So, this depends on how you attempt to create the world model. If you try to do this by training a black-box model to do raw sensory prediction, and then attempt to either extract latent variables from that model, or turn it into an interpretable model, or something like that, then yes, you would probably have to solve ELK, or solve interpretability, or something like that. I would not be very optimistic about that working. However, this is in no way the only option. As a very simple example, you could simply train a black-box model to directly predict the values of all latent variables that you need for your definition of harm. This would not produce an interpretable model, and so you may not trust it very much (unless you have some learning-theoretic guarantee, perhaps), but it would not be difficult to determine if such a model “thinks” that harm would occur in a given scenario. As another example, you could build a world model “manually” (with humans and LLMs). Such a model may be interpretable by default. And so on.
I thought the plan was to build it with either AI labor or human labor so that it will be sufficiently interpretable. Not to e.g. build it with SGD. If the plan is to build it with SGD and not to ensure that it is interpretable, then why does it provide any safety guarantee? How can we use the world model to define a harm predicate?
The general strategy may include either of these two approaches. I’m just saying that the plan does not definitionally rely on the assumption that the world model is built manually.
Won’t predicting safety specific variables contain all of the difficulty of predicting the world?
That very much depends on what the safety specifications are, and how you want to use your AI system. For example, think about the situations where similar things are already done today. You can prove that a cryptographic protocol is unbreakable, given some assumptions, without needing to have a complete predictive model of the humans that use that protocol. You can prove that a computer program will terminate using a bounded amount of time and memory, without needing a complete predictive model of all inputs to that computer program. You can prove that a bridge can withstand an earthquake of such-and-such magnitude, without having to model everything about the earth’s climate. And so on. If you want to prove something like “the AI will never copy itself to an external computer”, or “the AI will never communicate outside this trusted channel”, or “the AI will never tamper with this sensor”, or something like that, then your world model might not need to be all that detailed. For more ambitious safety specifications, you might of course need a more detailed world model. However, I don’t think there is any good reason to believe that the world model categorically would need to be a “complete” world model in order to prove interesting safety properties.
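As a rough sketch of checking a narrow safety specification against a coarse world model, consider a tiny abstract transition system. The state names and transitions below are hypothetical; the model only records which communication actions are possible and ignores everything else about the world. The check is plain breadth-first reachability, and any guarantee it gives holds only relative to this model.

```python
from collections import deque

# Hypothetical abstract world model: state -> set of possible successor states.
TRANSITIONS = {
    "idle": {"read_input"},
    "read_input": {"compute"},
    "compute": {"write_trusted_channel", "idle"},
    "write_trusted_channel": {"idle"},
    "write_external_host": {"idle"},
}
UNSAFE = {"write_external_host"}


def reachable(start, transitions):
    """Return every state reachable from `start` under `transitions`."""
    seen = {start}
    queue = deque([start])
    while queue:
        state = queue.popleft()
        for successor in transitions.get(state, ()):
            if successor not in seen:
                seen.add(successor)
                queue.append(successor)
    return seen


def is_safe(start, transitions, unsafe):
    """True if no unsafe state is reachable from `start` in this model."""
    return reachable(start, transitions).isdisjoint(unsafe)


print(is_safe("idle", TRANSITIONS, UNSAFE))
```

Adding an edge from `compute` to `write_external_host` would make the check fail, which is the kind of counterexample a verifier would report.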
I thus think it’s fair to say that, empirically, neural networks do not robustly quantify uncertainty in a reliable way when out-of-distribution
Sure, but why will the bayesian model reliably quantify uncertainty OOD? There is also no guarantee of this (OOD).
Whether or not you get reliable uncertainty quantification will depend on your prior. If you have (e.g.) the NN prior, I expect the uncertainty quantification to be similar to what you would get from training an ensemble.
E.g., you’ll find a bunch of NNs (in the bayesian posterior) which also have the spurious correlation that a trained NN (or ensemble of NNs) would have.
If you have some other prior, why can’t we regularize our NN to match this?
Separately, I guess I’m not that worried about failures in which the network itself doesn’t “understand” what’s going on. So the main issue is cases where the model in some sense knows, but doesn’t report this. (E.g. ELK problems at least broadly speaking.)
I think there are a bunch of issues that look sort of like this now, but this will go away once models are smart enough to automate R&D etc.
I’m not worried about future models murdering us because they were confused and thought this would be what we wanted due to a spurious correlation.
(I do have some concerns around jailbreaking, but I also think that will look pretty different and the adversarial case is very different. And there appear to be solutions which are more promising than bayesian ML.)
I think you’re perhaps reading me as being more bullish on Bayesian methods than I in fact am—I am not necessarily saying that Bayesian methods in fact can solve OOD generalisation in practice, nor am I saying that other methods could not also do this. In fact, I was until recently very skeptical of Bayesian methods, before talking about it with Yoshua Bengio. Rather, my original reply was meant to explain why the Bayesian aspect of Bengio’s research agenda is a core part of its motivation, in response to your remark that “from my understanding, the bayesian aspect of [Bengio’s] agenda doesn’t add much value”.
I agree that if a Bayesian learner uses the NN prior, then its behaviour should—in the limit—be very similar to training a large ensemble of NNs. However, there could still be advantages to an explicitly Bayesian method. For example, off the top of my head:
1. It may be that you need an extremely large ensemble to approximate the posterior well, and that the Bayesian learner can approximate it much better with far fewer resources.
2. It may be that you can more easily prove learning-theoretic guarantees for the Bayesian learner.
3. It may be that a Bayesian learner makes it easier to condition on events that have a very small probability in your posterior (such as, for example, the event that a particular complex plan is executed).
4. It may be that the Bayesian learner has a more interpretable prior, or that you can reprogram it more easily.
And so on; these are just some examples. Of course, whether you get these benefits in practice is a matter of speculation until we have a concrete algorithm to analyse. All I’m saying is that there are valid and well-motivated reasons to explore this particular direction.
Insofar as the hope is:
1. Figure out how to approximate sampling from the Bayesian posterior (using e.g. GFlowNets or something).
2. Do something else that makes this actually useful for “improving” OOD generalization in some way.
It would be nice to know what (2) actually is and why we needed step (1) for it. As far as I can tell, Bengio hasn’t stated any particular hope for (2) which depends on (1).
Rather, my original reply was meant to explain why the Bayesian aspect of Bengio’s research agenda is a core part of its motivation, in response to your remark that “from my understanding, the bayesian aspect of [Bengio’s] agenda doesn’t add much value”.
I agree that if the Bayesian aspect of the agenda did a specific useful thing like ‘“improve” OOD generalization’ or ‘allow us to control/understand OOD generalization’, then this aspect of the agenda would be useful.
However, I think the Bayesian aspect of the agenda won’t do this and thus it won’t add much value. I agree that Bengio (and others) think that the Bayesian aspect of the agenda will do things like this, but I disagree and don’t see the story for this.
I agree that “actually use Bayesian methods” sounds like the sort of thing that could help you solve dangerous OOD generalization issues, but I don’t think it clearly does.
(Unless of course someone has a specific proposal for (2) from my above decomposition which actually depends on (1).)
However, there could still be advantages to an explicitly Bayesian method. For example, off the top of my head
1-3 don’t seem promising/important to me. (4) would be useful, but I don’t see why we needed the bayesian aspect of it. If we have some sort of parametric model class which we can make smart enough to reason effectively about the world, just making an ensemble of these surely gets you most of the way there.
To be clear, if the hope is “figure out how to make an ensemble of interpretable predictors which are able to model the world as well as our smartest model”, then this would be very useful (e.g. it would allow us to avoid ELK issues). But all the action was in making interpretable predictors, no bayesian aspect was required.
And so on. If you want to prove something like “the AI will never copy itself to an external computer”, or “the AI will never communicate outside this trusted channel”, or “the AI will never tamper with this sensor”, or something like that, then your world model might not need to be all that detailed.
I agree that you can improve safety by checking outputs in specific ways in cases where we can do so (e.g. requiring the AI to formally verify its code doesn’t have side channels).
The relevant question is whether there are interesting cases where we can’t currently verify the output with conventional tools (e.g. formal software analysis or bridge design), but we can using a world model we’ve constructed.
One story for why we might be able to do this is that the world model is able to generally predict the world as well as the AI.
Suppose you had a world model which was as smart as GPT-3 (but magically interpretable). Do you think this would be useful for something? IMO no and besides, we don’t need the interpretability as we can just train GPT-3 to do stuff.
So, it either has to be smarter than GPT-3 or have a wildly different capability profile.
I can’t really imagine any plausible stories where this world model doesn’t end up having to actually be smart while the world model is also able to massively improve the frontier of what we can check.
I’m not so convinced of this. Yes, for some complex safety properties, the world model will probably have to be very smart. However, this does not mean that you have to model everything: depending on your safety specification and use case, you may be able to factor out a huge amount of complexity. We know from existing cases that this is true on a small scale, so why should it not also be true on a medium or large scale?
For example, with a detailed model of the human body, you may be able to prove whether or not a given chemical could be harmful to ingest. This cannot be done with current tools, because we don’t have a detailed computational model of the human body (and even if we did, we would not be able to use it for scalable inference). However, this seems like the kind of thing that could plausibly be created in the not-so-long term using AI tools. And if we had such a model, we could prove many interesting safety properties for e.g. pharmaceutical development AIs (even if these AIs know many things that are not covered by this world model).
Suppose you had a world model which was as smart as GPT-3 (but magically interpretable). Do you think this would be useful for something?
I think that would be extremely useful, because it would tell us many things about how to implement cognitive algorithms. But I don’t think it would be very useful for proving safety properties (which I assume was your actual question). GPT-3’s capabilities are wide but shallow, whereas in most cases, what we would need are capabilities that are narrow but deep.
In practice, I think you are unlikely to end up with a schemer unless you train your model to solve some agentic task (or train it to model a system that may itself be a schemer, such as a human). However, in order to guarantee that, I agree we need some additional property (such as interpretability, or some learning-theoretic guarantee).
(I think most of the hard-to-handle risk from scheming comes from cases where we can’t easily make smarter AIs which we know aren’t schemers. If we can get another copy of the AI which is just as smart but which has been “de-agentified”, then I don’t think scheming poses a substantial threat. (Because e.g. we can just deploy this second model as a monitor for the first.) My guess is that a “world-model” vs “agent” distinction isn’t going to be very real in practice. (And in order to make an AI good at reasoning about the world, it will need to actively be an agent in the same way that your reasoning is agentic.) Of course, there are risks other than scheming.)
or turn it into an interpretable model, or something like that, then yes, you would probably have to solve ELK, or solve interpretability, or something like that. I would not be very optimistic about that working
TBC, I would count it as solving an ELK problem if you constructed an interpretable world model which allows you to extract out the latents you want.
This would not produce an interpretable model, and so you may not trust it very much (unless you have some learning-theoretic guarantee, perhaps), but it would not be difficult to determine if such a model “thinks” that harm would occur in a given scenario.
Won’t this depend on the generalization properties of the model?
E.g., we can always train a model to predict if a diamond is in the vault on some subset of examples which we can label.
From my perspective the core difficulty is either:
Treacherous turns (or similar high-stakes failures)
Not being able to properly identify the latents in actual deployment cases
Perhaps you were discussing the case where we magically have access to all the latents and don’t need to worry about high-stakes failures? In this case, I agree there are no problems, but the baseline proposal of RLHF while looking at the latents also works perfectly fine.
What I had in mind is a situation where we have access to the latent variables during training, and only use the model to prove safety properties in situations that are within the range of the training distribution in some sense (e.g., situations where we have some learning-theoretic guarantees). As for treacherous turns, I am implicitly assuming that we don’t have to worry about a treacherous turn from the world model, but that we may have to worry about it from the AI policy that we’re verifying.
However, note that even this is meaningfully different from just using RLHF, especially in settings with some adversarial component. In particular, a situation that is OOD for the policy need not be OOD for the world model. For example, learning a model of the rules of chess is much easier than learning a policy that is good at playing chess. It would also be much easier to prove a learning-theoretic guarantee for the former than the latter.
So, suppose we’re training a chess-playing AI, and we want to be sure that it cannot be defeated in n moves or fewer. The RLHF strategy would, in this scenario, essentially amount to letting a bunch of human red-teamers play against the AI a bunch of times, trying to find cases where they beat the AI in n moves or fewer, and then training against those cases. This would only give us very weak quantitative guarantees, because there might be strategies that the red-teamers didn’t think of.
Alternatively, we could also train a network to model the rules of chess (in this particular example, we could of course also specify this model manually, but let’s ignore that for the sake of the argument). It seems fairly likely that we could train this model to be highly accurate. Moreover, using normal statistical methods, we could estimate a bound on the fraction of the state-action space on which this model makes an incorrect prediction, and derive other learning-theoretic guarantees (depending on how the training data is collected, etc). We could then formally verify that the chess AI cannot be beaten in n moves, relative to this world model. This would produce a much stronger quantitative guarantee, and the assumptions behind this guarantee would be much easier to audit. The guarantee would of course still not be an absolute proof, because there will likely be some errors in the world model, and the chess AI might be targeting these errors, etc, but the guarantee is substantially better than what you get in the RLHF case. Also note that, as we run the chess AI, we could track the predictions of the world model online. If the world model ever makes a prediction that contradicts what in fact happens, we could shut down the chess AI, or transition to a safe mode. This gives us even stronger guarantees.
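Two of the ingredients above can be sketched in a few lines. The first function is one standard way to bound a model's error rate from held-out samples, a one-sided Hoeffding bound; it assumes the test transitions are drawn i.i.d. from the distribution you care about, which an adversarial policy could violate. The second is a hypothetical runtime monitor that halts the run the first time the world model's prediction disagrees with what actually happened. The names `error_rate_upper_bound`, `MonitoredRun`, and `toy_world_model` are invented for this sketch.

```python
import math


def error_rate_upper_bound(num_errors: int, num_samples: int, delta: float) -> float:
    """One-sided Hoeffding bound on the true error rate.

    With probability at least 1 - delta over i.i.d. test samples, the true
    error rate is at most the returned value.
    """
    if num_samples <= 0 or not 0.0 < delta < 1.0:
        raise ValueError("need num_samples > 0 and 0 < delta < 1")
    empirical = num_errors / num_samples
    slack = math.sqrt(math.log(1.0 / delta) / (2.0 * num_samples))
    return min(1.0, empirical + slack)


class MonitoredRun:
    """Halts permanently once the world model mispredicts an observed transition."""

    def __init__(self, world_model):
        self.world_model = world_model  # callable: (state, action) -> next state
        self.halted = False

    def step(self, state, action, observed_next_state) -> bool:
        """Record one transition; return True while it is still safe to continue."""
        if not self.halted and self.world_model(state, action) != observed_next_state:
            self.halted = True
        return not self.halted


def toy_world_model(state: int, action: int) -> int:
    """Stand-in for a learned model of the rules: the next state is state + action."""
    return state + action


bound = error_rate_upper_bound(num_errors=0, num_samples=10_000, delta=0.01)
run = MonitoredRun(toy_world_model)
ok_first = run.step(0, 1, 1)   # prediction matches the observation
ok_second = run.step(1, 1, 3)  # mismatch, so the run halts
```

With zero observed errors on 10,000 samples and delta = 0.01, the bound is roughly 1.5%, which shows how slowly this kind of guarantee tightens with sample size.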
This is of course a toy example, because it is a case where we can design a perfect world model manually (excluding the opponent, of course). We can also design a chess-playing AI manually, etc. However, I think it illustrates that there is a meaningful difference between RLHF and formal verification relative to even a black-box world model. The complexity of the space of all policies grows much faster than the complexity of the environment, and this fact can be exploited.
Sure, but why will the bayesian model reliably quantify uncertainty OOD? There is also no guarantee of this (OOD).
Whether or not you get reliable uncertainty quantification will depend on your prior. If you have (e.g.) the NN prior, I expect the uncertainty quantification to be similar to what you would get from training an ensemble.
E.g., you’ll find a bunch of NNs (in the bayesian posterior) which also have the spurious correlation that a trained NN (or ensemble of NNs) would have.
If you have some other prior, why can’t we regularize our NN to match this?
(Maybe I’m confused about this?)
Separately, I guess I’m not that worried about failures in which the network itself doesn’t “understand” what’s going on. So the main issue is cases where the model in some sense knows, but doesn’t report this. (E.g. ELK problems at least broadly speaking.)
I think there are a bunch of issues that look sort of like this now, but this will go away once models are smart enough to automate R&D etc.
I’m not worried about future models murdering us because they were confused and thought this would be what we wanted due to a spurious correlation.
(I do have some concerns around jailbreaking, but I also think that will look pretty different and the adversarial case is very different. And there appear to be solutions which are more promising than bayesian ML.)
I think the distinction between these two cases can often be somewhat vague.
Why do you think that the adversarial case is very different?
Insofar as the hope is:
Figure out how to approximate sampling from the Bayesian posterior (using e.g. GFlowNets or something).
Do something else that makes this actually useful for “improving” OOD generalization in some way.
It would be nice to know what (2) actually is and why we needed step (1) for it. As far as I can tell, Bengio hasn’t stated any particular hope for (2) which depends on (1).
I agree that if the Bayesian aspect of the agenda did a specific useful thing like ‘”improve” OOD generalization’ or ‘allow us to control/understand OOD generalization’, then this aspect of the agenda would be useful.
However, I think the Bayesian aspect of the agend won’t do this and thus it won’t add much value. I agree that Bengio (and others) think that the Bayesian aspect of the agenda will do things like this—but I disagree and don’t see the story for this.
I agree that “actually use Bayesian methods” sounds like the sort of thing that could help you solve dangerous OOD generalization issues, but I don’t think it clearly does.
(Unless of course someone has a specific proposal for (2) from my above decomposition which actually depends on (1).)
1-3 don’t seem promising/important to me. (4) would be useful, but I don’t see why we needed the bayesian aspect of it. If we have some sort of parametric model class which we can make smart enough to reason effectively about the world, just making an ensemble of these surely gets you most of the way there.
To be clear, if the hope is “figure out how to make an ensemble of interpretable predictors which are able to model the world as well as our smartest model”, then this would be very useful (e.g. it would allow us to avoid ELK issues). But all the action was in making interpretable predictors, no bayesian aspect was required.
I agree that you can improve safety by checking outputs in specific ways in cases where we can do so (e.g. requiring the AI to formally verify its code doesn’t have side channels).
The relevant question is whether there are interesting cases where we can’t currently verify the output with conventional tools (e.g. formal software analysis or bridge design), but we can using a world model we’ve constructed.
One story for why we might be able to do this is that the world model is able to generally predict the world as well as the AI.
Suppose you had a world model which was as smart as GPT-3 (but magically interpretable). Do you think this would be useful for something? IMO no and besides, we don’t need the interpretability as we can just train GPT-3 to do stuff.
So, it either has to be smarter than GPT-3 or have a wildly different capability profile.
I can’t really image any plausible stories where this world model doesn’t end up having to actually be smart while the world model is also able massively improve the frontier of what we can check.
I’m not so convinved of this. Yes, for some complex safety properties, the world model will probably have to be very smart. However, this does not mean that you have to model everything—depending on your safety specification and use case, you may be able to factor out a huge amount of complexity. We know from existing cases that this is true on a small scale—why should it not also be true on a medium or large scale?
For example, with a detailed model of the human body, you may be able to prove whether or not a given chemical could be harmful to ingest. This cannot be done with current tools, because we don’t have a detailed computational model of the human body (and even if we did, we would not be able to use it for scaleable inference). However, this seems like the kind of thing that could plausibly be created in the not-so-long term using AI tools. And if we had such a model, we could prove many interesting safety properties for e.g. pharmacutical development AIs (even if these AIs know many things that are not covered by this world model).
I think that would be extremely useful, because it would tell us many things about how to implement cognitive algorithms. But I don’t think it would be very useful for proving safety properties (which I assume was your actual question). GPT-3’s capabilities are wide but shallow, but in most cases, what we would need are capabilities that are narrow but deep.
Worth noting here that I’m mostly talking about Bengio’s proposal with respect to the Bayes-related arguments.
And I agree that the world model isn’t meant to be a schemer, but it’s not as though we can guarantee that without some additional property...
(Such as ensuring that the world model is interpretable.)
In practice, I think you are unlikely to end up with a schemer unless you train your model to solve some agentic task (or train it to model a system that may itself be a schemer, such as a human). However, in order to guarantee that, I agree we need some additional property (such as interpretability, or some learning-theoretic guarantee).
(I think most of the hard-to-handle risk from scheming comes from cases where we can’t easily make smarter AIs which we know aren’t schemers. If we can get another copy of the AI which is just as smart but which has been “de-agentified”, then I don’t think scheming poses a substantial threat. (Because e.g. we can just deploy this second model as a monitor for the first.) My guess is that a “world-model” vs “agent” distinction isn’t going to be very real in practice. (And in order to make an AI good at reasoning about the world, it will need to actively be an agent in the same way that your reasoning is agentic.) Of course, there are risks other than scheming.)
TBC, I would count it as solving an ELK problem if you constructed an interpretable world model which allows you to extract out the latents you want.
Won’t this depend on the generalization properties of the model?
E.g., we can always train a model to predict if a diamond is in the vault on some subset of examples which we can label.
From my perspective the core difficulty is either:
Treacherous turns (or similar high-stakes failures)
Not being able to properly identify the latents in actual deployment cases
Perhaps you were discussing the case where we magically have access to all the latents and don’t need to worry about high-stakes failures? In this case, I agree there are no problems, but the baseline proposal of RLHF while looking at the latents also works perfectly fine.
What I had in mind is a situation where we have access to the latent variables during training, and only use the model to prove safety properties in situations that are within the range of the training distribution in some sense (e.g., situations where we have some learning-theoretic guarantees). As for treacherous turns, I am implicitly assuming that we don’t have to worry about a treacherous turn from the world model, but that we may have to worry about it from the AI policy that we’re verifying.
However, note that even this is meaningfully different from just using RLHF, especially in settings with some adversarial component. In particular, a situation that is OOD for the policy need not be OOD for the world model. For example, learning a model of the rules of chess is much easier than learning a policy that is good at playing chess. It would also be much easier to prove a learning-theoretic guarantee for the former than the latter.
So, suppose we’re training a chess-playing AI, and we want to be sure that it cannot be defeated in n moves or fewer. The RLHF strategy would, in this scenario, essentially amount to letting a group of human red-teamers play against the AI many times, looking for cases where they find a way to beat the AI in n moves, and then training against those cases. This would only give us very weak quantitative guarantees, because there might be strategies that the red-teamers didn’t think of.
Alternatively, we could also train a network to model the rules of chess (in this particular example, we could of course also specify this model manually, but let’s ignore that for the sake of the argument). It seems fairly likely that we could train this model to be highly accurate. Moreover, using normal statistical methods, we could estimate a bound on the fraction of the state-action space on which this model makes an incorrect prediction, and derive other learning-theoretic guarantees (depending on how the training data is collected, etc.). We could then formally verify that the chess AI cannot be beaten in n moves, relative to this world model. This would produce a much stronger quantitative guarantee, and the assumptions behind this guarantee would be much easier to audit. The guarantee would of course still not be an absolute proof, because there will likely be some errors in the world model, and the chess AI might be targeting these errors, etc., but the guarantee is substantially better than what you get in the RLHF case. Also note that, as we run the chess AI, we could track the predictions of the world model online. If the world model ever makes a prediction that contradicts what in fact happens, we could shut down the chess AI, or transition to a safe mode. This gives us even stronger guarantees.
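To make the two statistical ingredients above a bit more concrete, here is a minimal Python sketch. It is an illustration under assumptions, not a description of any specific proposal: the names `error_upper_bound` and `WorldModelMonitor` are hypothetical, the bound is the standard one-sided Hoeffding inequality, and it only holds if the held-out transitions are sampled i.i.d. from the distribution we care about (which is exactly the "depending on how the training data is collected" caveat, and which an adversarial policy could violate).

```python
import math
from typing import Any, Callable


def error_upper_bound(errors: int, samples: int, delta: float) -> float:
    """One-sided Hoeffding bound on the world model's true error rate.

    Assumes the `samples` held-out (state, action) pairs are i.i.d.
    With probability at least 1 - delta over the draw of that sample,
    the true error rate is at most the returned value.
    """
    if samples <= 0 or not 0.0 < delta < 1.0:
        raise ValueError("need samples > 0 and 0 < delta < 1")
    slack = math.sqrt(math.log(1.0 / delta) / (2.0 * samples))
    return min(1.0, errors / samples + slack)


class WorldModelMonitor:
    """Hypothetical run-time check: compare predictions with reality.

    `predict(state, action)` stands in for the learned world model.
    Once a single prediction is contradicted, the monitor latches
    into safe mode and stays there.
    """

    def __init__(self, predict: Callable[[Any, Any], Any]) -> None:
        self.predict = predict
        self.safe_mode = False

    def observe(self, state: Any, action: Any, actual_next: Any) -> bool:
        """Return True while it is still considered safe to continue."""
        if self.predict(state, action) != actual_next:
            self.safe_mode = True
        return not self.safe_mode
```

For instance, zero observed errors on 10,000 i.i.d. held-out transitions with delta = 0.01 gives an upper bound of roughly 1.5% on the true error rate. That is a statement about the sampling distribution only, so it says nothing about states the policy deliberately steers towards; the monitor is what covers that case at run time.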
This is of course a toy example, because it is a case where we can design a perfect world model manually (excluding the opponent, of course). We can also design a chess-playing AI manually, etc. However, I think it illustrates that there is a meaningful difference between RLHF and formal verification relative to even a black-box world model. The complexity of the space of all policies grows much faster than the complexity of the environment, and this fact can be exploited.
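As an even smaller toy than chess, the following Python sketch shows what "verifying a policy relative to a world model" could look like when the check is an exhaustive search over opponent moves. Everything here is an assumption made for illustration: the game is a simple take-away game (remove 1 to 3 stones, whoever takes the last stone wins), `world_model_step` is a hand-written stand-in for a learned black-box model, and `policy` is a hypothetical fixed policy under verification.

```python
from typing import Callable

MAX_TAKE = 3  # assumed rule of the toy game: a move removes 1..3 stones


def world_model_step(pile: int, take: int) -> int:
    """Stand-in for a learned, black-box model of the game's rules."""
    return pile - take


def legal_moves(pile: int) -> range:
    """Moves the world model considers legal from this pile size."""
    return range(1, min(MAX_TAKE, pile) + 1)


def policy(pile: int) -> int:
    """Policy under verification: leave a multiple of 4 when possible,
    otherwise take a single stone."""
    return pile % (MAX_TAKE + 1) or 1


def can_be_beaten(pile: int, n: int,
                  ai: Callable[[int], int] = policy) -> bool:
    """True if, according to the world model, some sequence of at most
    n opponent moves wins against `ai`. The opponent moves first."""
    if n == 0:
        return False
    for opp_take in legal_moves(pile):
        after_opp = world_model_step(pile, opp_take)
        if after_opp == 0:
            return True  # opponent took the last stone
        after_ai = world_model_step(after_opp, ai(after_opp))
        if after_ai == 0:
            continue  # the policy wins along this line
        if can_be_beaten(after_ai, n - 1, ai):
            return True
    return False


print(can_be_beaten(12, 5))  # False: no opponent line wins from 12
print(can_be_beaten(5, 2))   # True: the opponent has a winning line from 5
```

Unlike red-teaming, the search covers every opponent line up to depth n, so a `False` result is a guarantee rather than an absence of counterexamples. The guarantee is only as good as `world_model_step`, though, and brute-force enumeration like this would not scale to chess; a real verifier would need something much smarter than exhaustive search.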