Mathematician turned alignment researcher. Probably happy to chat about math, current ML, or long-term AI thoughts.
The basics—Nathaniel Monson (nmonson1.github.io)
Question for Jacob: suppose we end up getting a single, unique, superintelligent AGI, and the amount it cares about, values, and prioritizes human welfare relative to its other values is a random draw from the same distribution as how much random humans care about maximizing their total number of direct descendants.
Would you consider that an alignment success?
Thanks for writing this! I strongly appreciate a well-thought-out post in this direction.
My own level of worry is pretty dependent on a belief that we know and understand how to shape NN behaviors much better than we know how to shape values/goals/motivations/desires (although I don’t think eg chatGPT has any of the latter in the first place). Do you have thoughts on the distinction between behaviors and goals? In particular, do you feel like you have any evidence we know how to shape/create/guide goals and values, rather than just behaviors?
I don’t think the end result is identical. If you take B, you now have evidence that, if a similar situation arises again, you won’t have to experience excruciating pain. Your past actions and decisions are relevant evidence about your future actions and decisions. If you take drug A, your chance of experiencing excruciating pain at some point in the future goes up (at least, your subjective estimate of that probability should go up a bit). I would pay a dollar to lower my best rational estimate of the chance of something like that happening to me—wouldn’t you?
In the dual interests of making you more pleasant to interact with and improving your epistemic rationality, I will point out that your last paragraph is false. You are allowed to care about anything and everything you happen to care about or choose to care about. As an aspiring epistemic rationalist, the way in which you are bound is to be honest with yourself about message-description lengths, about your own values and your own actions, and about the tradeoffs they reflect.
If a crazy person holding a gun said to you (and you believed) “I will shoot you unless you tell me that you are a yellow dinosaur named Timothy”, your epistemic rationality is not compromised by lying to save your life (as long as you are aware it is a lie). Similarly, if you value human social groups, whether intrinsically or instrumentally, you are allowed to externally use longer-than-necessary description lengths if you so choose without any damage to your own epistemic rationality. You may worry that you damage the epistemic rationality of the group or its members, but insisting on the shortest description lengths can also damage a community, and thereby its epistemic rationality.
My understanding of the etymology of “toe the line” is that it comes from the military—all the recruits in a group lining up, with their toes touching (but never over!) a line. Hence “I need you all to toe the line on this” means “do exactly this, with military precision.”
I think I would describe both of those as deceptive, and was premising on non-deceptive AI.
If you think “nondeceptive AI” can refer to an AI which has a goal and is willing to mislead in service of that goal, then I agree; solving deception is insufficient. (Although in that case I disagree with your terminology).
I think the people I know well over 65 (my parents, my surviving grandparent, some professors) are trying to not get COVID—they go to stores only in off-peak hours, avoid large gatherings, don’t travel much. These seem like basically worth-it decisions to me (low benefit, but even lower cost). This means that their chance of getting COVID is much much higher when, eg, seeing relatives who just took a plane flight to see them.
I agree that the flu is comparably worrisome, and it wouldn’t make sense to get a COVID booster but not a flu vaccine.
That doesn’t necessarily seem correct to me. If, eg, OpenAI develops a superintelligent, non-deceptive AI, then I’d expect some of the first questions they’d ask it to be of the form “Are there questions which we would regret asking you, according to our own current values? How can we avoid asking you those while still getting lots of use and insight from you? What are some standard prefaces we should attach to questions to make sure following through on your answers is good for us? What are some security measures we can take to make sure our users’ lives are generally improved by interacting with you? What are some security measures we can take to minimize the chances of the world turning out very badly according to our own desires?” Etc.
Surely your self-estimated chance of exposure and number of high-risk people you would in turn expose should factor in somewhere? I agree with you for people who aren’t traveling, but someone who, eg, flies into a major conference and then is visiting a retirement home the week after is doing a different calculation.
When I started trying to think rigorously about this a few months ago, I realized that I don’t have a very good definition of “world model.” In particular, what does it mean to claim a person has a world model? Given a criterion for an LLM to have one, how confident am I that most people would satisfy that criterion?
I think it is 2-way, which is why many (almost all?) alignment researchers have spent a significant amount of time looking at ML models and capabilities, and have guesses about where those are going.
In that case, I believe your conjecture is trivially true, but has nothing to do with human intelligence or Bengio’s statements. In context, he is explicitly discussing low dimensional representations of extremely high dimensional data, and the things human brains learn to do automatically (I would say analogously to a single forward pass).
If you want to make it a fair fight, you either need to demonstrate a human who learns to recognize primes without any experience of the physical world (please don’t do this) or allow an ML model something more analogous to the data humans actually receive, which includes math instruction, interacting with the world, many brain cycles, etc.
I agree with your entire first paragraph. It doesn’t seem to me that you have addressed my question though. You are claiming that this hypothesis “implies that machine learning alone is not a complete path to human-level intelligence.” I disagree. If I try to design an ML model which can identify primes, is it fair for me to give it some information equivalent to the definition (no more information than a human who has never heard of prime numbers has)?
If you allow that it is fair for me to do so, I think I can probably design an ML model which will do this. If you do not allow this, then I don’t think this hypothesis has any bearing on whether ML alone is “a complete path to human-level intelligence.” (Unless you have a way of showing that humans who have never received any sensory data other than a sequence of “number:(prime/composite)label” pairs would do well on this.)
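To make the “if you allow that it is fair” branch concrete, here is a minimal sketch of the kind of setup I have in mind. Everything in it is my own choice of what “information equivalent to the definition” could mean: the model is handed divisibility-by-d flags for small d (roughly the content of the definition of primality), and a generic classifier then has to learn the concept from labeled examples.

```python
# Minimal sketch: supply the model with features that carry the content of
# the definition of primality ("is n divisible by d?" flags for d <= sqrt(N))
# and let a simple learner pick up the concept from labeled examples.
import numpy as np
from sklearn.linear_model import LogisticRegression

N = 10_000
DIVISORS = range(2, int(N ** 0.5) + 1)  # d = 2..100

def divisibility_features(n):
    # "Definition-equivalent" information: divisibility flags for small d.
    return [float(n % d == 0) for d in DIVISORS]

def is_prime(n):
    # Ground-truth labels by trial division (the definition itself).
    return int(n > 1 and all(n % d for d in range(2, int(n ** 0.5) + 1)))

# Restrict to n > 100 so no n is its own feature divisor; every composite in
# this range has a factor <= 100, so the flags suffice in principle.
ns = range(101, N)
X = np.array([divisibility_features(n) for n in ns])
y = np.array([is_prime(n) for n in ns])

# Train on the lower half of the range, test on the upper half.
split = len(X) // 2
clf = LogisticRegression(max_iter=1000).fit(X[:split], y[:split])
print("held-out accuracy:", clf.score(X[split:], y[split:]))
```

With those features the task is linearly separable (a number is prime exactly when no flag is set), so even logistic regression should get essentially perfect held-out accuracy. The interesting disagreement is whether supplying those features is fair, not whether the learning step works.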
“implies that machine learning alone is not a complete path to human-level intelligence.”
I don’t think this is even a little true, unless you are using definitions of human level intelligence and machine learning which are very different than the ideas I have of them.
If you have a human who has never heard of the definition of prime numbers, how do you think they would do on this test? Am I allowed to supply my model with something equivalent to the definition?
Have you looked into New Angeles? Action choices are cooperative, with lots of negotiation. Each player is secretly targeting another player, and wins if they end with more points than their target (so you could have a 6-player game where the players who ended with the most, 4th-most, and 5th-most points win, while those in 2nd, 3rd, and 6th lose).
This comment confuses me.
Why is Tristan in quotes? Do you not believe it’s his real name?
What is the definition of the community you’re referring to?
I don’t think I see any denigration happening—what are you referring to?
What makes someone an expert or an imposter in your eyes? In the eyes of the community?
I clicked the link in the second email quite quickly—I assumed it was a game/joke, and wanted to see what would happen. If I’d actually thought I was overriding people’s preferences, I… probably would have still clicked, because I don’t think I place enormous value on people’s preferences about holidays, and I would have enjoyed being the person who determined it.
There are definitely many circumstances where I wouldn’t unilaterally override a majority. I should probably try to figure out what the principles behind those are.
I have a strong preference for non-ironic epistemic status. Can you give one?
If the review panel recommends a paper for a spotlight, there is a better than 50% chance a similarly-constituted review panel would have rejected the paper from the conference entirely:
https://blog.neurips.cc/2021/12/08/the-neurips-2021-consistency-experiment/
In the spirit of “no stupid questions”, why not have the AI prefer to have the button in the state that it believes matches my preferences?
I’m aware this fails against AIs that can successfully manipulate humans, but such an AI is already terrifying for all sorts of other reasons, and I think the likelihood of this form of corrigibility making a difference given such an AI is quite low.
Is the answer roughly “we don’t care about the off-button specifically that much, we care about getting the AI to interact with human preferences which are changeable without changing them”?