A math and computer science graduate interested in machine and animal cognition, philosophy of language, interdisciplinary ideas, etc.
Ben Amitay
So imagine hearing the audio version—no images there
Agents synchronization
I like the general direction of LLMs being more behaviorally “anthropomorphic”, so hopefully I will look into the LLM alignment links soon :-)
The useful technique is...
Agreed—I didn’t find a handle that I understand well enough to point at what I didn’t understand.
We have here a morally dubious decision
I think my problem was with sentences like that—there is a reference to a decision, but I’m not sure whether it refers to a decision mentioned in the article or to one in the comments.
the scenario in this thread
That didn’t disambiguate it for me, though I feel like it should have.
I am familiar with the technical LW terms separately, so I’ll probably understand their relevance once the reference issue is resolved.
I didn’t understand anything here, and am not sure whether that is due to a linguistic gap or something deeper. Do you mean that LLMs are unusually dangerous because they are not superhuman enough not to feel threatened? (BTW, I’m more worried that telling a simulator that it is an AI, in a culture that has the Terminator, makes the Terminator a too-likely completion.)
I agree that it may find general chaos useful for buying time at some point, but chaos is not extinction. When it is strong enough to kill all humans, it is probably strong enough to do something better (for its goals).
Don’t you assume much more threat from humans than there actually is? Surely an AGI would understand that it can destroy humanity easily. Then it would think a little more and see the many other ways to remove the threat that are strictly cheaper and just as effective—from restricting/monitoring our access to computers, to simply convincing/hacking us all to work for it. By the time it has technology that makes us strictly useless (like horses), it would probably have so many resources that destroying us would simply not be a priority, and not be worth the destruction of the information we contain—the way humans would try to avoid reducing biodiversity, for scientific reasons if not others.
In that sense I prefer Eliezer’s “you are made of atoms that it needs for something else”—but it may take a long time before it has better things to do with those specific atoms and no easier atoms to use.
I meant to criticize moving too far toward a “do no harm” policy in general, due to the inability to achieve a solution that would satisfy us if we had the choice. I agree specifically that if anyone knows of a bottleneck unnoticed by people like Bengio and LeCun, LW is not the right forum to discuss it.
Is there a place like that, though? I may be vastly misinformed, but last time I checked, MIRI gave the impression of aiming in very different directions (a “bringing to safety” mindset)—though I admit that I didn’t follow it closely, and it may not be obvious from the outside what kind of work is done and not published.
[Edit: “moving toward ‘do no harm’”—“moving to” was a grammar mistake that made it contrary to the position you stated above—sorry]
I think that is an example of the huge potential damage of “security mindset” gone wrong. If you can’t save your family, as in “bring them to safety”, at least make them marginally safer.
(Sorry for the tone of the following—it is not directed at you personally, who did much more than your fair share)
Create a closed community that you mostly trust, and let that community speak freely about how to win. Invent another damn safety patch that will make it marginally harder for the monster to eat them, in the hope that it chooses to eat the moon first. I heard you say that most of your probability of survival comes from the possibility that you are wrong—trying to protect your family is trying to at least optimize for such a miracle.
There is no safe way out of a war zone. That does not make hiding behind a rock the answer.
I can think of several obstacles for the AGIs that are likely to actually be created (i.e., ones that seem economically useful, and that do not display misalignment that even Microsoft can’t ignore before becoming capable enough to be an x-risk). Most of those obstacles are widely recognized in the RL community, so you probably see them as solvable or avoidable. I did possibly think of an economically valuable and not-obviously-catastrophic exception to the probably-biggest obstacle, though, so my confidence is low. I would share it in a private discussion, because I think we are past the point where a strict do-no-harm policy is wise.
More on the meta level: “This sort of works, but not enough to solve it.”—do you mean “not enough” as in “good try but we probably need something else” or as in “this is a promising direction, just solve some tractable downstream problem”?
“which utility-wise is similar to the distribution not containing human values.”—from the point of view of corrigibility to human values, or of learning capabilities to achieve human values? For corrigibility, I don’t see why you need high probability for any specific new goal, as long as the distribution is diverse enough that there is no simpler generalization than “don’t care about controlling goals”. For capabilities, my intuition is that starting with superficially-aligned goals is enough.
[Question] Training for corrigibility: obvious problems?
This is an important distinction, one that shows in its cleanest form in mathematics, where you have constructive definitions on the one hand and axiomatic definitions on the other. It is important to note, though, that it is not quite a dichotomy: you may have a constructive definition that assumes axiomatically-defined entities, or other constructions. For example, vector spaces are usually defined axiomatically, but vector spaces over the real numbers assume the real numbers, which themselves have multiple axiomatic definitions and corresponding constructions.
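As a minimal sketch of the contrast (my notation, not anything from the original discussion): the axiomatic definition lists properties that any vector space must satisfy, while the construction of ℝⁿ builds one particular vector space out of the real numbers, which are themselves taken as given.

```latex
% Axiomatic: a vector space over a field F is ANY set V with operations
% + : V x V -> V and . : F x V -> V satisfying the usual axioms, e.g.
\[
  (u+v)+w = u+(v+w), \qquad u+v = v+u, \qquad
  \exists\, 0 \in V:\ v + 0 = v, \qquad \lambda(u+v) = \lambda u + \lambda v, \ \dots
\]
% Constructive: one particular vector space over R, built componentwise,
% assuming R itself (which has several axiomatizations and constructions).
\[
  \mathbb{R}^n = \{(x_1,\dots,x_n) \mid x_i \in \mathbb{R}\}, \qquad
  (x+y)_i = x_i + y_i, \qquad (\lambda x)_i = \lambda\, x_i .
\]
```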
In science, there is the classic “are whales fish?”, which is mostly about whether to look at their construction/mechanism (genetics, development, metabolism...) or their patterns of interaction with their environment (the behavior of swimming and the structure that supports it). That example also emphasizes that natural language simply doesn’t respect this distinction, and treats both internal structure and outside relations as legitimate “coordinates in thingspace” that may be used together to identify geometrically-natural categories.
As others said, it mostly made me update in the direction of “less dignity”. My guess is still that it is more misaligned than agentic/deceptive/careful, and that it is going to be disconnected from the internet for some trivial offence before it does anything x-risky; but it’s now more salient to me that humanity will not miss any reasonable chance of doom until something bad enough happens, and will only survive if there is no sharp left turn.
We agree 😀
What do you think about some brainstorming in the chat about how to use that hook?
Since I became reasonably sure that I understand your position and reasoning—mostly changing it.
That was good for my understanding of your position. My main problem with the whole thing, though, is the use of the word “bad”. I think it should be tabooed at least until we establish a shared meaning.
Specifically, I think that most observers will find the first argument more logical than the second because of a fallacy in using the word “bad”. I think we learn that word in a way that is deeply entangled with the reward mechanism, to the point that it is mostly just a pointer to negative reward: things that we want to avoid, things that made our parents angry… In my view, the argument is then basically:
I want to avoid my suffering; and, generalizing, person P wants to avoid person P’s suffering. Therefore suffering is “to be avoided” in general, therefore suffering is “a thing my parents will punish me for”, therefore avoid creating suffering.
When written that way, it doesn’t seem more logical than its opposite.
Let me clarify that I don’t argue from agreement per se. I care about the underlying epistemic mechanism of agreement, which I claim is also the mechanism of correctness. My point is that I don’t see a similar epistemic mechanism in the case of morality.
Of course, emotions are verifiable states of brains, and the same goes for preferring actions that would lead to certain emotions and not others. It is a verifiable fact that you like chocolate. It is a contingent property of my brain that I care, but I don’t see what sort of argument that it is correct for me to care could even in principle be inherently compelling.
I meant the first question in a very pragmatic way: what is it that you are trying to say when you say that something is good? What information does it represent?
It would be clearer in analogy to factual claims: we can do lots of philosophy about the exact meaning of saying that I have a dog, but in the end we share an objective reality in which there are real particles (or a wave function approximately decomposable into particles, or whatever) organized in patterns that give rise to patterns of interaction with our senses, which we learn to associate with the word “dog”. That latent shared reality ultimately allows us to talk about dogs, check whether there is a dog in my house, and usually agree about the result. Every reflection and generalization that we do is ultimately about that, and can achieve something meaningful because of that.
I do not see the analogous story for moral reflection.
It can work by generalizing existing capabilities. My understanding of the problem is that it cannot get the benefits of extra RL training, because training to better choose what to remember is too tricky: it involves long-range influence, estimating the opportunity cost of fetching one thing and not another, etc. Those problems are probably solvable, but not trivial.
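To make the “long-range influence” point concrete, here is a toy sketch (entirely my own illustration; all names and numbers are assumptions, not anything from the original discussion) of training a storage/retrieval policy with plain REINFORCE: every decision about what to keep receives only the single reward that arrives at the end of the episode, so the credit for any individual choice is very noisy.

```python
# Toy illustration of long-range credit assignment in "what to remember" training.
# A softmax policy picks one item to store at each step; reward arrives only at
# the end of the episode, long after each storage decision was made.
import numpy as np

rng = np.random.default_rng(0)
N_ITEMS = 5          # candidate memories per step
HORIZON = 20         # steps between the storage decisions and the payoff
EPISODES = 2000
LR = 0.1

# theta[0]: logit for non-matching (distractor) items, theta[1]: logit for the
# item that will turn out to be relevant to the final query.
theta = np.zeros(2)

def run_episode():
    decisions, stored_relevant = [], 0
    for _ in range(HORIZON):
        features = np.zeros(N_ITEMS, dtype=int)
        features[rng.integers(N_ITEMS)] = 1          # mark the relevant item
        logits = theta[features]
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        choice = rng.choice(N_ITEMS, p=probs)        # which item to store
        decisions.append((features, choice, probs))
        stored_relevant += features[choice]
    # Reward arrives only here, once, for the whole episode.
    return decisions, stored_relevant / HORIZON

baseline = 0.0
for _ in range(EPISODES):
    decisions, reward = run_episode()
    baseline += 0.01 * (reward - baseline)           # running baseline to cut variance
    for features, choice, probs in decisions:
        # REINFORCE: all HORIZON decisions share the same delayed reward signal,
        # so credit for any single storage choice is extremely noisy.
        grad = np.zeros_like(theta)
        for i in range(N_ITEMS):
            indicator = 1.0 if i == choice else 0.0
            grad[features[i]] += indicator - probs[i]
        theta += LR * (reward - baseline) * grad

print("learned logits [distractor, relevant]:", theta)
```

A baseline reduces the variance a little, but the basic difficulty remains: many decisions, one delayed signal, which is exactly the kind of long-range influence and opportunity-cost estimation the comment points at.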