A math and computer science graduate interested in machine and animal cognition, philosophy of language, interdisciplinary ideas, etc.
Ben Amitay
BTW, speaking about a value function rather than a reward model is useful here, because convergent instrumental goals are a big part of the potential for reusing others' (deduced) value functions as part of yours. Their terminal goals may then leak into yours due to simplicity bias, or due to uncertainty about how to separate them from the instrumental ones.
The main problem with that mechanism is that your liking chocolate will probably leak as "it's good for me too to eat chocolate", not "it's good for me too when beren eats chocolate", which is more likely to cause conflict than coordination if there is only so much chocolate.
I agree with other commenters that this effect will be washed out by strong optimization. My intuition is that distinguishing self from other is easy enough (and supported by enough data) that the optimization doesn't have to be that strong.
[I began writing the following paragraph as a counter-argument to the post, but it ended up less decisive once I thought about the details; see the next paragraph:] There are many general mechanisms for convergence, synchronization and coordination; I hope to write up a list in the near future. For example, as you wrote, having a model of other agents is obviously generally useful, and it may require having an approximation of both their world models and their value functions as part of your world model. Unless you have huge amounts of data and compute, you are going to reuse your own world model as theirs, with small corrections on top. But this is about your world model, not your value function.
[The part that helps your argument. Epistemic status: many speculative details, but ones that I find pretty convincing, at least before multiplying their probabilities.] Except that having the value functions of other agents in your world model, and having the mechanism for predicting their actions as part of your world-model update, basically replicates computations that you already have in your actor and critic, in a more general form. Your original actor and critic are then likely to simplify to "do the things that my model of myself would do, and value the results as much as my model of myself would" + some corrections. At that stage, if the "some corrections" part is not too heavy, you may get some confusion of the kind that you described. Of course, it will still be optimized against.
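To gesture at the simplification I have in mind, here is a minimal toy sketch (entirely my own illustration; all names and structure are hypothetical, not anything from the post or an existing library): the actor and critic reduce to thin wrappers around the world model's model of "an agent like me", plus small correction terms.

```python
import numpy as np

rng = np.random.default_rng(0)

class WorldModel:
    """Hypothetical world model that already contains a model of 'an agent like me'."""
    def __init__(self, dim):
        self.self_policy = rng.normal(size=dim)  # toy linear model of the modeled self's action tendencies
        self.self_values = rng.normal(size=dim)  # toy linear model of the modeled self's evaluations

    def predict_self_action(self, state):
        return self.self_policy @ state

    def predict_self_value(self, state):
        return self.self_values @ state

class Actor:
    """Actor collapsed to 'do what my model of myself would do' + a small correction."""
    def __init__(self, world_model, dim):
        self.wm = world_model
        self.correction = np.zeros(dim)  # the 'some corrections' term; if it stays small, self/other confusion is cheap

    def act(self, state):
        return self.wm.predict_self_action(state) + self.correction @ state

class Critic:
    """Critic that defers to the world model's evaluation of the modeled self."""
    def __init__(self, world_model, dim):
        self.wm = world_model
        self.correction = np.zeros(dim)

    def value(self, state):
        return self.wm.predict_self_value(state) + self.correction @ state

state = rng.normal(size=8)
wm = WorldModel(8)
print(Actor(wm, 8).act(state), Critic(wm, 8).value(state))
```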
Thanks for the reply.
To make sure that I understand your position: are you a moral realist, and what do you think is the meaning of moral facts? (I'm not an error theorist but something like a "meta-error theorist": I think that people do try to claim something, but I'm not sure how that thing could map to external reality.)
Then the next question, which will be highly relevant to the research that you propose, is: how do you think you know those facts, if you do? (Or more generally, what is the actual work of reflecting on your values?)
https://astralcodexten.substack.com/p/you-dont-want-a-purely-biological
The thing that Scott is desperately trying to avoid being read out of context.
Also, pedophilia is probably much more common than anyone thinks (just like any other unaccepted sexual variation). And probably, just as many heterosexuals feel little touches of homosexual desire, many "non-pedophiles" feel something sexual-ish toward children at least sometimes.
And if we go there: the age of consent is (justifiably) much higher than the age at which attraction requires any psychological anomaly. More directly: many, many older men who have no attraction to 10-year-old girls do have some toward 14-year-olds, and maybe younger.
(I hope it is clear enough that nothing I wrote here is meant to have any moral implications around consent; it is only about compassion.)
I think that the reason no one in the field tries to create AI that critically reflects on its values is that most of us, more or less explicitly, are not moral realists. My prediction for the conclusion an AI would reach after critically asking itself what is worth doing is "that question doesn't make any sense; let me replace it with 'what do I want to do' or some equivalent". Or at best "that question doesn't make any sense. raise ValueError('pun intended')".
I was eventually convinced of most of your points, and added a long list of mistakes at the end of the post. I would really appreciate comments on the list, as I don't feel fully converged on the subject yet.
The basic idea seems to me interesting and true, but I think some important ingredients are missing, or more likely missing from my understanding of what you say:
- It seems like you upper-bound the abstractions we may use by basically the information that we may access (actually even higher, assuming you do not exclude what the neighbour does behind closed doors). But isn't this bound very loose? I mean, it seems like every pixel in my visual field counts as "information at a distance", and my world model is much, much smaller.
- Is time treated like space? On the one hand it seems like it has to be if we want to abstract colour from a sequence of amplitudes, but it also feels meaningfully different.
- Is the punchline to define objects as blankets with much more information inside than can be viewed from far outside?
- The part where all the information that may be lost in the next layer is assumed to have already been lost seems to assume symmetries. Are those an explicit part of the project?
- In practice, there seems to be information loss and practical indeterminism at all scales. E.g. when I move further from a picture I keep losing details. Wouldn't it make more sense to talk about how far (in orders of magnitude) information travels, rather than whether it does or does not go to infinity? (A toy sketch of what I mean follows right after this list.)
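The toy sketch for that last point (entirely my own illustration, not anything from the post): coarse-grain a random signal repeatedly and track how much of the fine-grained detail survives at each scale. The correlation with the original falls off gradually with scale, which suggests asking "how far does the information travel" rather than "does it travel to infinity".

```python
import numpy as np

rng = np.random.default_rng(0)

# Start from a noisy 1D signal, repeatedly coarse-grain it, and track how
# correlated the result stays with the original fine-grained detail.
# The correlation falls off gradually with scale rather than being all-or-nothing.
signal = rng.normal(size=4096)

def coarse_grain(x):
    """Average neighbouring pairs (one step of blurring / zooming out)."""
    return x.reshape(-1, 2).mean(axis=1)

current = signal
for step in range(8):
    current = coarse_grain(current)
    # Upsample back so we can compare with the original resolution.
    upsampled = np.repeat(current, 2 ** (step + 1))
    corr = np.corrcoef(signal, upsampled)[0, 1]
    print(f"after {step + 1} coarse-grainings: correlation with original = {corr:.3f}")
```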
Sorry about my English, hope it was clear enough
Didn't know about the problem setting. So cool!
Some random thoughts, sorry if none are relevant:
I think my next step towards optimality would have been not to look for an optimal agent but for an optimal act of choosing the agent, as action optimality is better understood than agent optimality. Then I would look at stable mixed equilibria to see whether any of them is computable. If any is, I'd be interested in the agent that implements it (i.e. randomise over other agents and then simulate the chosen one).
BTW, now that I think about it, I see that allowing the agent to randomise is probably strongly related to allowing it not to be fully transparent about its program, as randomisation may induce uncertainty about which other agent it is going to simulate.
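A toy version of what I mean by "the optimal act of choosing the agent" (my own illustration with an invented payoff table, unrelated to the actual problem setting): in a small symmetric game with no symmetric pure equilibrium, the stable choice is a mix, and you can implement it as a meta-agent that randomises which deterministic agent to run.

```python
import random

# Payoffs for (my_move, their_move); the "moves" are two deterministic agents A and B.
payoff = {("A", "A"): 0, ("A", "B"): 3, ("B", "A"): 1, ("B", "B"): 0}

# In the symmetric mixed equilibrium, the probability p of running A makes the
# other player indifferent between A and B:
#   p*payoff[A,A] + (1-p)*payoff[A,B] = p*payoff[B,A] + (1-p)*payoff[B,B]
#   3*(1-p) = p  =>  p = 3/4
p_A = 3 / 4

def meta_agent():
    """The agent that the 'act of choosing' picks: it randomises which program to run."""
    return "A" if random.random() < p_A else "B"

# Empirical check: against a population playing this mix, both programs do equally well.
n = 100_000
opponents = [meta_agent() for _ in range(n)]
for my_program in ("A", "B"):
    avg = sum(payoff[(my_program, other)] for other in opponents) / n
    print(my_program, round(avg, 3))
```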
Thanks for the detailed response. I think we agree about most of the things that matter, but about the rest:
About the loss function for next-word prediction: my point was that I'm not sure whether the current GPT is already superhuman even in the predictions that we care about. It may be wrong less often, but wrong in ways that we count as more important. I agree that changing to a better loss would not make it significantly harder to learn, any more than it would for intelligence etc.
About solving discrete representations with an architectural change: I think I meant only that the representation is easy, not the training; but anyway, I agree that training it may be hard, or at least require non-standard methods.
About the inductive logic and describing pictures in low resolution: I made the same communication mistake in both, which is to treat things that are ridiculously heavily regularized against as not part of the hypothesis space at all. There probably is a logical formula that describes the probability of a given image being a cat, to any degree of precision. I claim that we will never be able to find or represent that formula, because it is so heavily regularized against, and that this is the price that the theory forces us to pay for the generalisation.
I directionally agree, but: A) A short period of trade before we become utterly useless is not much comfort. B) Trade is a particular case of bootstrapping influence over what an agent values into influence over their behaviour. The other major way of doing that is blackmail, which is much more effective in many circumstances, and would be far more common if the state didn't blackmail us into not blackmailing each other, honouring contracts, etc.
BTW, those two points are basically how many people fear that capitalism (i.e. our trade with superhuman organisations) may go wrong: A) Automation may make us less and less economically useful. B) Enough money may give an organisation the ability to blackmail: a private army, or more likely influence over governmental power.
Assuming that automation here means AI, this is basically hypothesising a phase in which the two kinds of superhuman agents (AIs and companies) are still incentivized to cooperate with each other but not with us.
Some short points:
"Human-level question answering is believed to be AI-complete": I doubt that. I think that we consistently and greatly overestimate the role of language in our cognition, and how much we can actually express using language. The simplest example that comes to mind is trying to describe a human face to an AI system with no "visual cortex" in a way that would let it generate a human image (e.g. a hex representation of pixels). For that matter, try describing something less familiar than a human face to a human painter in the hope that they can paint it.
"GPT… already is better than humans at next-word prediction": somewhat beside the point, but I do not think that the loss function that we use in training is actually the one that we care about. We don't care that much about specific phrasing, and instead use the "loss" of how much the content makes sense, is true, is useful… Also, we are probably much better at implicit predictions than at explicit predictions, in ways that make us underperform in many tests.
Language Invites Mind Projection—anecdotally, I keep asking ChatGPT to do things that I know it would suck at, because I just can’t bring myself to internalise the existence of something that is so fluent, so knowledgeable and so damn stupid at the same time.
Memorization & generalisation—just noting that it is a spectrum rather than a dichotomy, as compression ratios are. Anyway, the current methods don’t seem to generalise well enough to overcome the sparsity of public data in some domains—which may be the main bottleneck in (e.g.) RL anyway.
"This, in turn, suggests a data structure that is discrete and combinatorial, with syntax trees, etc, and neural networks do (according to the argument) not use such representations": let's spell out the obvious objection: it is obviously possible to implement discrete representations on top of continuous ones. This is why we can have digital computers that are based on electrical currents rather than little rocks. The problem is just that keeping the representation robustly discrete is hard, and probably very hard to learn (see the sketch after these points). I think that problem may be solved easily with minor changes of architecture though, and therefore should not affect timelines.
Inductive logic programming: it generalises well in a much more restricted hypothesis space, as one should expect based on learning theory. The issue is that the real world is too messy for this hypothesis space, which is why the world is not ruled by mathematicians/physicists. It may be useful as an augmentation for a deep-learning agent though, the way that calculators are useful for humans.
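Following up on the discrete-representations point above, here is a minimal sketch of what I mean by implementing discrete representations on top of continuous ones (my own toy illustration, with a made-up "pretend the threshold is the identity" surrogate gradient; not a claim about any particular architecture):

```python
import numpy as np

rng = np.random.default_rng(0)

# Continuous activations carry bits by being pushed to either side of a threshold,
# the same way voltages carry bits in a digital computer. The forward pass is
# discrete; the difficulty is the learning signal, whose true gradient is zero
# almost everywhere, so one trains through a crude surrogate instead.
def discretize(x):
    return (x > 0.0).astype(np.float64)  # hard bits in the forward pass

def surrogate_grad(x):
    return np.ones_like(x)               # pretend d(discretize)/dx == 1

x = rng.normal(size=5)   # continuous activations
w = rng.normal(size=5)   # weights of a unit reading the discrete code
bits = discretize(x)
output = w @ bits

# Gradient of the output w.r.t. the continuous activations, via the surrogate:
grad_x = w * surrogate_grad(x)
print("bits:", bits, "output:", round(output, 3), "surrogate grad:", grad_x.round(3))
```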
Maybe we should think explicitly about what work is done by the concept of AGI, but I do not feel like calling GPT an AGI does anything interesting to my world model. Should I expect ChatGPT to beat me at chess? Its next version? If not, is it due to a shortage of data or compute? Will it take over the world? If not, may I conclude that the next AGI won't?
I understand why the bar-shifting thing looks like motivated reasoning, and probably most of it actually is, but it deserves much more credit than you give it. We have an undefined concept of "something with virtually all the cognitive abilities of a human, that can therefore do whatever a human can", and some dubious assumptions like "if it can sensibly talk about everything, it can probably understand everything". Then we encounter ChatGPT, and it is amazing at speaking, except that it gives a strong impression of talking to an NPC: an NPC who knows lots of stuff and can even sort-of-reason in very constrained ways, do basic programming and be "creative" as in writing poetry, but is sub-human at things like gathering useful information, inferring people's goals, etc. So we conclude that some cognitive ability is still missing, and try to think how to correct for that.
Now, I don't particularly care whether we call GPT an AGI, but then you will have to invent a name for the super-AGI things that we will try to achieve next, and that we know to be possible because humans exist.
Last point: if we change the name of "world model" to "long-term memory", we may notice the possibility that much of what you think of as shard-work may be programs stored in memory and executed by a general program-executor, or by a bunch of modules that specialize in specific sorts of programs, functioning like modern CPUs/interpreters (hopefully stored in an organised way that preserves modularity). What will be in the general memory and what will be in the weights themselves is non-obvious, and we may want to intervene at this point too (not sure in which direction).
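A cartoon of that last point (purely illustrative; the names and the particular split are made up): some behaviour lives as retrievable "programs" in long-term memory and is run by a general executor, while other behaviour is baked in like weights, and which behaviours end up on which side is exactly the non-obvious part.

```python
# 'Programs' stored as data in long-term memory: retrievable, inspectable, editable.
long_term_memory = {
    "make_tea": ["boil_water", "add_leaves", "wait", "pour"],
    "greet":    ["look_at_person", "smile", "say_hello"],
}

def executor(program_name, primitive_actions):
    """General program-executor: fetch a stored program and run its steps."""
    for step in long_term_memory[program_name]:
        primitive_actions[step]()

# 'In the weights': primitive actions the executor can call but not inspect or edit.
primitive_actions = {name: (lambda n=name: print("doing:", n))
                     for program in long_term_memory.values() for name in program}

executor("make_tea", primitive_actions)
```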
(Continuation of the same comment; submitted by mistake and cannot edit...) Assuming modules A and B are already "talking" and module C tries to talk to B, C would probably find it easier to learn a similar protocol than to invent a new one and teach it to B.
Hi, great post. Modularity was for a while at the top of my list of things-missing-for-capabilities, and it is interesting to see its relevance to safety too.
Some points about the hypothesised modules:
I did some introspection lately to learn about how I walk, and now suspect that the scm is not so much a separate module, but more like synchronization locks in computing: as if, for example, there were a lock for each of my muscles, and when two usually-unconscious movements try to use the same one, one of them seems to be blocked, notifying my consciousness when it is important enough. I dare to hypothesize further that there is a similar lock for every variable that my movement may control: moving my car is treated the same as moving my finger (a very literal sketch of this is below). Actually, it may be even simpler than that: the first shard may write into the world model a prediction about my future movement. Then, when the second shard tries to do the same, the result will be an "inconsistency notification", the same as for a surprising observation.
In general, I think that the modules will automatically have an incentive to use the same protocols, as long as their subjects of communication are not too far removed from each other.
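The very literal sketch of the lock idea from the first point (pure introspective speculation, not a model of any real system): every controllable variable, a muscle or "where the car goes", has a lock; unconscious controllers grab locks freely, and a conflict is what gets escalated to consciousness.

```python
class ControlLocks:
    """One lock per controllable variable; conflicts are escalated via a callback."""
    def __init__(self, variables):
        self.owner = {v: None for v in variables}

    def claim(self, variable, controller, notify):
        current = self.owner[variable]
        if current is None or current == controller:
            self.owner[variable] = controller
            return True
        notify(f"conflict on '{variable}': {controller} vs {current}")
        return False

    def release(self, variable, controller):
        if self.owner[variable] == controller:
            self.owner[variable] = None

locks = ControlLocks(["right_hand", "gaze", "car_position"])
notify_consciousness = print  # stand-in for escalating the conflict

locks.claim("right_hand", "reach_for_cup", notify_consciousness)   # succeeds silently
locks.claim("right_hand", "scratch_nose", notify_consciousness)    # prints the conflict
```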
Abstract and general are spectra. I agree that the maximally specific is not good at spreading, but neither is the maximally general.
I have a similar but more geometric way of thinking about it. I think of the distribution of properties as a topography of many mountains and valleys. Then we get hierarchical clustering as mountains with multiple peaks, and for each cluster we get the structure of a lower-dimensional manifold by looking only at the directions in which the mountain is relatively wide and flat.
Of course, the underlying geometry, and as a result the distribution density, are themselves subjective and dependent on what we care about: a pixel-by-pixel or atom-by-atom comparison would not find similarity between trees even of the same species.
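A toy numeric version of that picture (my own illustration): sample points near a curved ridge, then look at the local covariance around one point; the wide, flat direction shows up as the dominant eigenvector, i.e. the local manifold direction.

```python
import numpy as np

rng = np.random.default_rng(0)

# Points near a curve (a 1D 'ridge' in 2D) plus noise.
t = rng.uniform(0, 2 * np.pi, size=2000)
points = np.stack([t, np.sin(t)], axis=1) + rng.normal(scale=0.05, size=(2000, 2))

# Local covariance around a point near the top of the ridge: the large eigenvalue
# direction is the direction along the ridge (the local manifold); the small one
# is the direction in which the 'mountain' is narrow.
center = np.array([np.pi / 2, 1.0])
local = points[np.linalg.norm(points - center, axis=1) < 0.3]
cov = np.cov(local.T)
eigvals, eigvecs = np.linalg.eigh(cov)

print("local covariance eigenvalues:", eigvals.round(4))
print("wide/flat direction (local manifold):", eigvecs[:, -1].round(3))
```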
I also find the question very interesting, but have a different intuition about what travels farther. I think that in general, concrete things are actually quite similar wherever there are humans, at least over the distances that were relevant for most of our history. If I am a Judean, I know what a cow looks like, and every other Hebrew speaker knows too, and almost every speaker of any language similar to Hebrew knows too, though maybe they have a slightly different variant of cow. On the other hand, if I'm a Hindu starting a new religion that is about how to reach enlightenment, chances are that within a generation there will be 4 competing schools with mutually exclusive understandings of the word "enlightenment". The reason is that we generally synchronize our language around shared experience of the concrete, and have many fewer degrees of freedom when conceptualising it.
I think we have much more disagreement about psychology than about AI, though I admit to low certainty about the psychology too.
About AI, my point was that in understanding the problem, the training loop takes roughly the role of evolution and the model takes that of the evolved agent, with implications for comparing success, and possibly for identifying what's missing. I did refer to the fact that algorithmically we took ideas from the human brain into the training loop, and it therefore makes sense for the training loop to be algorithmically more analogous to the brain. Given that clarification, do you still mostly disagree? (If not, how do you recommend changing the post to make it clearer?)
Adding "short-term memory" to the picture is interesting, but then is there any mechanism for it to become long-term?
About the psychology: I do find the genetic bottleneck argument intuitively convincing, but I think we have reasons to distrust this intuition. There is often a huge disparity between data in its most condensed form and data in a form that is convenient to use in deployment. Think about the difference in length between code written in a functional/declarative language and its assembly code. I have literally no intuition as to what can be done with 10 megabytes of condensed Python, but I guess that it is more than enough to automate a human, if you know what code to write. While there probably is a lot of redundancy in the genome, it seems at least as likely that there is huge redundancy in the synapses, as their use is not just to store information but mostly to implement the needed information manipulations.
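For concreteness, a rough back-of-envelope on that disparity (the numbers are commonly quoted ballpark figures, and the bits-per-synapse value especially is just my guess):

```python
# Order-of-magnitude only: the genome is tiny compared to the synaptic
# 'deployment format', i.e. the shape the data takes when it has to do work.
genome_base_pairs  = 3.2e9   # ~3.2 billion base pairs
bits_per_base_pair = 2       # 4 possible letters
genome_bytes       = genome_base_pairs * bits_per_base_pair / 8

synapses           = 1.0e14  # often-quoted rough count of cortical synapses
bits_per_synapse   = 5       # a few bits of usable precision, very roughly
brain_bytes        = synapses * bits_per_synapse / 8

print(f"genome: ~{genome_bytes / 1e6:.0f} MB")
print(f"synaptic parameters: ~{brain_bytes / 1e12:.0f} TB")
print(f"ratio: ~{brain_bytes / genome_bytes:.0e}")
```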
And specifically for humans, I think there probably was evolutionary pressure actively in favor of leaking terminal goals: since the terminal goals of each of us are a noisy approximation of evolution's "goal" of increasing the number of offspring, that kind of leaking is a potential denoising mechanism. I think I explicitly heard this argument in the context of ideals of beauty (though many other things are going on there, pushing in the same direction).