Ebenezer Dukakis comments on My AI Model Delta Compared To Yudkowsky

Ebenezer Dukakis 13 Jun 2024 12:19 UTC
1 point
0

If I were to guess, I’d guess that by “you” you’re referring to someone or something outside of the model, who has access to the model’s internals, and who uses that access to, as you say, “read” the next token out of the model’s ontology.

Was using a metaphorical “you”. Probably should’ve said something like “gradient descent will find a way to read the next token out of the QFT-based simulation”.

Yes, there are certainly applications where the training regime produces IID data, but next-token prediction is pretty clearly not one of those?

I suppose I should’ve said that the various documents are IID, to be clearer. I would certainly guess they are.

Right, so just to check that we’re on the same page: do we agree that after a (retrodictively trained) model is deployed for some use case other than retrodicting existing data (for generative use, say, or for use in some kind of online RL setup), it’ll be doing something other than retrodicting?

Generally speaking, yes.

And that’s where the QFT model comes in. It says, actually, even if you train me for a good long while on a good amount of data, there are lots of ways for me to generalize “wrongly” from your perspective, if I’m modeling the universe at the level of quantum fields. Sure, I got all the retrodictions right while there was something to be retrodicted, but what exactly makes you think I did that by modeling the philosopher whose remarks I was being trained on?

Well, if we’re following standard ML best practices, we have a train set, a dev set, and a test set. The purpose of the dev set is to check and ensure that things are generalizing properly. If they aren’t generalizing properly, we tweak various hyperparameters of the model and retrain until they do generalize properly on the dev set. Then we do a final check on the test set to ensure we didn’t overfit the dev set. If you forgot or never learned this stuff, I highly recommend brushing up on it.
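
For concreteness, here is a minimal sketch of that workflow's first step: carving one dataset into train, dev, and test partitions. The function name `split_dataset`, the 80/10/10 fractions, and the fixed seed are illustrative assumptions, not anything specified in this thread.

```python
import random


def split_dataset(examples, dev_frac=0.1, test_frac=0.1, seed=0):
    """Shuffle once, then carve the data into train/dev/test partitions."""
    items = list(examples)
    # A fixed seed keeps the split reproducible across runs.
    random.Random(seed).shuffle(items)
    n_test = int(len(items) * test_frac)
    n_dev = int(len(items) * dev_frac)
    test = items[:n_test]
    dev = items[n_test:n_test + n_dev]
    train = items[n_test + n_dev:]
    return train, dev, test


# Hypothetical stand-in for a corpus: 1000 document ids.
train, dev, test = split_dataset(range(1000))
```

Hyperparameters would be tuned against `dev`, and `test` would be touched only once at the end. Note that all three partitions come from the same source, so this procedure only measures generalization within that source.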

In principle we could construct a test set or dev set either before or after the model has been trained. It shouldn’t make a difference under normal circumstances. It sounds like maybe you’re discussing a scenario where the model has achieved a level of omniscience, and it does fine on data that was available during its training, because it’s able to read off of an omniscient world-model. But then it fails on data generated in the future, because the translation method for its omniscient world-model only works on artifacts that were present during training. Basically, the time at which the data was generated could constitute a hidden and unexpected source of distribution shift. Does that summarize the core concern?
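
A toy sketch of that kind of hidden shift, under heavily simplified assumptions: every name and number below is hypothetical, and the "model" is just a rule that picks whichever single feature best fits the training data. An incidental feature tracks the label perfectly in all data available at training time, so a dev set carved from the same pool cannot reveal the problem; the correlation then breaks in later data.

```python
import random


def make_data(n, rng, correlate_holds):
    """Each example is ((noisy_signal, incidental), label)."""
    data = []
    for _ in range(n):
        label = rng.randint(0, 1)
        # The "intended" feature tracks the label but is observed with 20% noise.
        noisy_signal = label if rng.random() < 0.8 else 1 - label
        # The incidental feature matches the label only while the correlation holds.
        incidental = label if correlate_holds else rng.randint(0, 1)
        data.append(((noisy_signal, incidental), label))
    return data


def accuracy(feature_index, data):
    """Fraction of examples where the chosen feature equals the label."""
    return sum(x[feature_index] == y for x, y in data) / len(data)


def fit(train):
    """Pick whichever single feature best predicts the label on the training set."""
    return max(range(2), key=lambda i: accuracy(i, train))


rng = random.Random(0)
# Everything available at training time shares the incidental correlation.
in_dist = make_data(2000, rng, correlate_holds=True)
train, dev = in_dist[:1500], in_dist[1500:]
# Data generated later, after the correlation has stopped holding.
deploy = make_data(2000, rng, correlate_holds=False)

chosen = fit(train)
dev_acc = accuracy(chosen, dev)
deploy_acc = accuracy(chosen, deploy)
print(f"chosen feature: {chosen}, dev: {dev_acc:.2f}, deploy: {deploy_acc:.2f}")
```

In this toy setup the learner latches onto the incidental feature, looks perfect on the dev set, and drops to roughly chance on the later data. It says nothing about how likely real models are to behave this way.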

(To be clear, this sort of acquired omniscience is liable to sound kooky to many ML researchers. I think it’s worth stress-testing alignment proposals under these sorts of extreme scenarios, but I’m not sure we should weight them heavily when estimating our probability of success. In this particular scenario, the model’s performance would drop on data generated after training, which would hurt the company’s bottom line, so they would have a strong financial incentive to fix it. So I don’t know if thinking about this is a comparative advantage for alignment researchers.)

BTW, the point about documents being IID was meant to indicate that there’s little incentive for the model to e.g. retrodict the coordinates of the server storing a particular document—the sort of data that could aid and incentivize omniscience to a greater degree.

In any case, I would argue that “accidental omniscience” characterizes the problem better than “alien abstractions”. As before, you can imagine an accidentally-omniscient model that uses vanilla abstractions, or a non-omniscient model that uses alien ones.
- dxu 13 Jun 2024 19:52 UTC
  2 points
  0
  
  Well, if we’re following standard ML best practices, we have a train set, a dev set, and a test set. The purpose of the dev set is to check and ensure that things are generalizing properly. If they aren’t generalizing properly, we tweak various hyperparameters of the model and retrain until they do generalize properly on the dev set. Then we do a final check on the test set to ensure we didn’t overfit the dev set. If you forgot or never learned this stuff, I highly recommend brushing up on it.
  
  (Just to be clear: yes, I know what training and test sets are, as well as dev sets/validation sets. You might notice I actually used the phrase “validation set” in my earlier reply to you, so it’s not a matter of guessing someone’s password—I’m quite familiar with these concepts, as someone who’s implemented ML models myself.)
  
  Generally speaking, training, validation, and test datasets are all sourced the same way—in fact, sometimes they’re literally sourced from the same dataset, and the delineation between train/dev/test is introduced during training itself, by arbitrarily carving up the original dataset into smaller sets of appropriate size. This may capture the idea of “IID” you seem to appeal to elsewhere in your comment—that it’s possible to test the model’s generalization performance on some held-out subset of data from the same source(s) it was trained on.
  
  In ML terms, what the thought experiment points to is a form of underlying distributional shift, one that isn’t (and can’t be) captured by “IID” validation or test datasets. The QFT model in particular highlights the extent to which your training process, however broad or inclusive from a parochial human standpoint, contains many incidental distributional correlates to your training signal which (1) exist in all of your data, including any you might hope to rely on to validate your model’s generalization performance, and (2) cease to correlate off-distribution, during deployment.
  
  This can be caused by what you call “omniscience”, but it need not; there are other, more plausible distributional differences that might be picked up on by other kinds of models. But QFT is (as far as our current understanding of physics goes) very close to the base ontology of our universe, and so what is inferrable using QFT is naturally going to be very different from what is inferrable using some other (less powerful) ontology. QFT is a very powerful ontology!
  
  If you want to call that “omniscience”, you can, although note that strictly speaking the model is still just working from inferences from training data. It’s just that, if you feed enough data to a model that can hold entire swaths of the physical universe inside of its metaphorical “head”, pretty soon hypotheses that involve the actual state of that universe will begin to outperform hypotheses that don’t, and which instead use some kind of lossy approximation of that state involving intermediary concepts like “intent”, “belief”, “agent”, “subjective state”, etc.
  
  In principle we could construct a test set or dev set either before or after the model has been trained. It shouldn’t make a difference under normal circumstances. It sounds like maybe you’re discussing a scenario where the model has achieved a level of omniscience, and it does fine on data that was available during its training, because it’s able to read off of an omniscient world-model. But then it fails on data generated in the future, because the translation method for its omniscient world-model only works on artifacts that were present during training. Basically, the time at which the data was generated could constitute a hidden and unexpected source of distribution shift. Does that summarize the core concern?
  
  You’re close; I’d say the concern is slightly worse than that. It’s that the “future data” never actually comes into existence, at any point. So the source of distributional shift isn’t just “the data is generated at the wrong time”, it’s “the data never gets externally generated to begin with, and you (the model) have to work with predictions of what the data counterfactually would have been, had it been generated”.
  
  (This would be the case e.g. with any concept of “human approval” that came from a literal physical human or group of humans during training, and not after the system was deployed “in the wild”.)
  
  In any case, I would argue that “accidental omniscience” characterizes the problem better than “alien abstractions”. As before, you can imagine an accidentally-omniscient model that uses vanilla abstractions, or a non-omniscient model that uses alien ones.
  
  The problem is that “vanilla” abstractions are not the most predictively useful possible abstractions, if you’ve got access to better ones. And models whose ambient hypothesis space is broad enough to include better abstractions (from the standpoint of predictive accuracy) will gravitate towards those, as is incentivized by the outer form of the training task. QFT is the extreme example of a “better abstraction”, but in principle (if the natural abstraction hypothesis fails) there will be all sorts and shapes of abstractions, and some of them will be available to us, and some of them will be available to the model, and these sets will not fully overlap—which is a concern in worlds where different abstractions lead to different generalization properties.
  - Ebenezer Dukakis 14 Jun 2024 3:32 UTC
    1 point
    0
    
    QFT is the extreme example of a “better abstraction”, but in principle (if the natural abstraction hypothesis fails) there will be all sorts and shapes of abstractions, and some of them will be available to us, and some of them will be available to the model, and these sets will not fully overlap—which is a concern in worlds where different abstractions lead to different generalization properties.
    
    Indeed. I think the key thing for me is that I expect the model to be strongly incentivized to have a solid translation layer from its internal ontology to e.g. the English language, because it is trained on lots of English-language data. By Occam’s Razor, I expect the internal ontology to be biased towards that of an English-language speaker.
    
    It’s just that, if you feed enough data to a model that can hold entire swaths of the physical universe inside of its metaphorical “head”, pretty soon hypotheses that involve the actual state of that universe will begin to outperform hypotheses that don’t, and which instead use some kind of lossy approximation of that state involving intermediary concepts like “intent”, “belief”, “agent”, “subjective state”, etc.
    
    I’m imagining something like: early in training the model makes use of those lossy approximations because they are a cheap/accessible way to improve its predictive accuracy. Later in training, assuming it’s being trained on the sort of gigantic scale that would allow it to hold swaths of the physical universe in its head, it loses those desired lossy abstractions due to catastrophic forgetting. Is that an OK way to operationalize your concern?
    
    I’m still not convinced that this problem is a priority. It seems like a problem which will be encountered very late if ever, and will lead to ‘random’ failures on predicting future/counterfactual data in a way that’s fairly obvious.