If you want to chat, message me!
LW1.0 username Manfred. PhD in condensed matter physics. I am independently thinking and writing about value learning.
Datasets might be nice. (A rough sketch of what one could look like follows the prompt lists below.)
Object-level values.
“What do you like or dislike about my current life?”
“What kind of actions do you want to take in the next few weeks?”
“What kind of changes would you make to the world around you if you could?”
“What are some examples of kindness that you’ve witnessed?”
“Come up with a moral dilemma that seems close to you.”
“What would you do in this moral dilemma someone else came up with?”
etc.
Meta-level values.
“How would you change yourself if you could?”
“How do you feel about various ways you expect to grow and change in the future?”
“Come up with a fictional disagreement between two people who value different things.”
“How do you think these fictional people should resolve their disagreement?”
“When you feel torn between different options, how do you think you normally decide?”
“How do you think you should decide?”
“Watch this morally interesting video and describe what happened, thereby giving it an ontology.”
etc.
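For concreteness, a minimal sketch of how prompts like these might be bundled into a dataset; the JSONL schema and filename here are invented for illustration:

```python
import json

# Hypothetical schema: each record pairs a prompt with the kind of value it probes.
PROMPTS = {
    "object-level": [
        "What do you like or dislike about my current life?",
        "What kind of actions do you want to take in the next few weeks?",
        "Come up with a moral dilemma that seems close to you.",
    ],
    "meta-level": [
        "How would you change yourself if you could?",
        "When you feel torn between different options, how do you think you normally decide?",
        "How do you think you should decide?",
    ],
}

# One JSON object per line; human responses would be collected separately.
with open("value_learning_prompts.jsonl", "w") as f:
    for level, prompts in PROMPTS.items():
        for prompt in prompts:
            f.write(json.dumps({"level": level, "prompt": prompt}) + "\n")
```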
I tend to treat the core of the problem as being that “superintelligence alignment” has to work in domains where humans aren’t good supervisors. Being able to assume good human supervision lets you do a lot more engineering right now.
Of course there are more worlds. You didn’t even talk about baseball.
Baseball, of course, is a world unto itself. If you merely knew of atoms, math, and consciousness, you wouldn’t understand what it really meant to hit a sac fly with runners on second and third[1]. Imagine trying to explain baseball to a virus. Okay, yeah, you could do it, but the virus wouldn’t thereby be motivated to play baseball—just like the virus wouldn’t “really understand” why suffering mattered if your mere explanation didn’t cause it to care about suffering[2].
Now, you might not think baseball is as important as math or consciousness. But of course, that’s what you’d say if you were missing out on another world! Structurally, baseball[3] obeys the same rules.
(If we pretend we’re not counting being able to build a model of the world based on senses/atoms that already has a simple representation of atoms/math/consciousness/baseball.)
(Since we’ve defined suffering as some stuff that’s intrinsically motivating to us, it can feel like the motivatingness is an intrinsic property of the suffering, so if we really get the virus to think about the same stuff it will by definition be motivated.)
(Or rather, the ontology we use for baseball.)
I can’t check today, but whoops, sorry if I typoed the equation at some step.
Or if your knowledge of the environment does helpful randomization for you (if you’re not >99% sure your two copies will take the same action), CDT’ll at least press the button. But yeah, interesting problem.
Is the correct policy an equilibrium? Suppose the payoff was $5, not $1000. If you all press with probability P, you get 0 with probability (1-P)^3, −1 with probability 3P(1-P)^2, 3 with probability 3P^2(1-P), and 2 with probability P^3, for EV(P) = −3P(1-P)^2 + 9P^2(1-P) + 2P^3. Optimal P is the larger root of 10P^2 − 10P + 1 = 0, i.e. 0.8873, for a payoff of 2.162.
Now suppose you know your two copies are pressing the button with P = 0.8873. You press with probability Q. You get (1-P)^2(1-Q) of 0, 2P(1-P)(1-Q) + (1-P)^2Q of −1, 2P(1-P)Q + P^2(1-Q) of 3, and P^2Q of 2. This EV is linear in Q, and the coefficient of Q works out to −(10P^2 − 10P + 1), which is exactly zero at the optimal P from above. So every Q does equally well: if you never press the button, you get 2*0.8873*(1-0.8873) of −1 and 0.8873^2 of 3, which is 2.162, the same as before.
So if you know your copies are playing the optimal policy for three, you’re exactly indifferent about pressing the button, which means the optimal policy is an equilibrium after all :D
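A quick brute-force check of the arithmetic above (a sketch in Python; the payoff table is the one from the setup):

```python
import itertools

# Payoff by total number of copies pressing, from the setup above.
PAYOFF = {0: 0, 1: -1, 2: 3, 3: 2}

def ev(p, q):
    """EV when the two copies each press with probability p and you press with probability q."""
    total = 0.0
    for a, b, you in itertools.product([0, 1], repeat=3):
        prob = (p if a else 1 - p) * (p if b else 1 - p) * (q if you else 1 - q)
        total += prob * PAYOFF[a + b + you]
    return total

p_opt = 0.8873
print(ev(p_opt, p_opt))  # ~2.162: the symmetric optimum
print(ev(p_opt, 0.0))    # ~2.162 as well: you're indifferent, so it's an equilibrium
```

Both numbers agree, which is just the mixed-equilibrium indifference condition showing up numerically.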
Well, if you formalize “gain control of more resources over time” as taking the EV of resources controlled, the agents that also make decisions based on the EV of resources controlled will do well. But if you formalize it in a different way, the agents that make decisions in that different way will do well :D
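A toy illustration of that circularity (the gambles and numbers are made up purely for the example):

```python
import random

random.seed(0)

def risky():   # high-mean, high-variance gamble
    return 100.0 if random.random() < 0.1 else 0.0

def safe():    # low-mean, low-variance gamble
    return 5.0

N = 100_000
ev_agent     = [risky() for _ in range(N)]  # agent maximizing expected resources picks risky
median_agent = [safe() for _ in range(N)]   # agent maximizing median resources picks safe

mean = lambda xs: sum(xs) / len(xs)
median = lambda xs: sorted(xs)[len(xs) // 2]

# Scored by EV, the EV-maximizer "does well" (~10 vs 5).
print(mean(ev_agent), mean(median_agent))
# Scored by median, the median-maximizer "does well" (0 vs 5).
print(median(ev_agent), median(median_agent))
```

Each agent comes out ahead by its own yardstick; the choice of formalization is doing the work.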
I’m not sure I buy this post’s assertion that UDT violates independence. It seems more like it violates “common sense independence”, in the same way it violates “common sense choosing the best option” when it one-boxes on Newcomb’s problem.
An agent locally acting according to a good policy might violate what a CDT agent would call independence, but it still obeys independence when choosing a policy, i.e. it has a numerical utility function, just not over the same stuff as the CDT agent.
You might also be interested in philosopher Lara Buchak’s book Risk and Rationality.
She makes a thought-provoking analogy between making decisions that result in a distribution over future selves and population ethics—in population ethics you’re not required to value everyone linearly; it’s okay to reject utility monsters and say “actually I just prefer universes where people are more equal.” Decision-making without independence is like population ethics over the distribution over future selves.
I’m gonna guess you live in the Bay just based on “everything behind locked glass.” Apologies if you’re actually in a part of NYC with lots of locked glass, or if you just use that as an example because your friends online do, etc. Hello from snowy Boston, where shoplifting still exists and has returned to pre-pandemic levels but isn’t a huge political or psychological issue, and not much is behind locked glass. That said, it’s hard to do inter-temporal comparisons and there are definitely ways that shoplifting is harder now than in 1975 (e.g. video cameras), so the decline in shoplifting statistics is only moderate evidence of a reduced “shoplifting propensity.” I just think the Bay Area is an outlier in terms of recent property crime trends, government reaction to them, and social reaction to all of the above, and the lived experience of its residents, while totally valid, might not translate well to talking about crime on average in the US.
is a necessary condition for deceptive alignment
Shouldn’t most alignment failures be sufficient? E.g. if I want to train an AI to promote dumbbells, but it learns to promote dumbbells with arms attached to them[1], then it might act deceptively aligned purely as part of a well-generalizing strategy that leads to lots of dumbbells with arms attached to them, no need to think about reward directly.
Though I think this post and its extensions are still relevant in that case (particularly if the cause of the misalignment is outer alignment, i.e. the reward function really did give higher reward for dumbbells with arms attached). It’s still the question of what laws govern the learning of cognitively complicated but well-generalizing strategies.
Could you spell out your argument more explicitly for me? I’m unsure if you’re being a moral realist/”uniquist” here—like “But there’s a diversity of human augmentation methods, so most if not all of them have to miss the True Morality, therefore there’s no prima facie moral difference between almost all augmented future humans and model-free RL on a transformer.”
Or another thing you might be saying is something like “A lot of human augmentation methods seem bad or ‘risky’ kind of like model-free RL on a transformer, in a way that’s hard for me to spell out. If we could actually choose good ones, surely we could just actually choose good AI augmentation methods.” Which I basically agree with if these happened on the same timescale. Human augmentation being farther away and slower seems like an important factor in the hope that humans would make decent choices about it.
steal-man
XD
Anyhow, good points, sorry for not really engaging with the scale invariance argument—I think it’s definitely plausible. There are some differences between scales (e.g. law enforcement being harder on larger scales) that certainly help make inter-tribe or inter-nation conflict a trickier local equilibrium to escape than inter-personal conflict. More generally, I’m unsure how much we should expect the cosmos (weighted for civilization as we’d recognize it) to be full of civilizations that proactively move towards Pareto improvements even when the environment is far away from them, versus civilizations that just sort of stumble around and try different cultural innovations until they hit ones that work just well enough.
My problem with your treatment of the civilization that’s happy to steal from the outgroup isn’t that they’ll disagree that “stealing is bad” is the Schelling answer to that question[1]. It’s that they’ll think the question is unnatural—you’ve lumped together two different things, “stealing from the ingroup” and “stealing from the outgroup,” and if you split the question up you’d get much more natural agreement that “stealing from the ingroup is bad” is the Schelling answer as is “stealing from the outgroup is good”.
Asking different questions (or equivalently, defining words in different ways as you ask the question) leads to different generalization behavior, if you’re being influenced by your conception of the “shared morality.”
Assuming you pick the same reference population—if we’re using the standard “success at being a civilization like ours” (even as an implicit meta-standard we use for picking our other standards), they might use “success at being a civilization like theirs.” If weighting by resources commanded, I think you’re underweighting bacteria and singletons that have eaten their planet of origin.
Right. When we’re far away from things, treating them as points is a useful approximation. Take the question “Which way is my house?” When I am across the city, this is a useful question with a straightforward answer. When I am in the yard, or worse, inside it, I can no longer treat my house as a point.
It is precisely because we are near to AGI (I’ve felt “inside the house” since GPT-2) that questions that treat this construct as a point aren’t very useful.
In current RL environments, slop often seems to be adaptive when talking to humans. Better RLAIF might help, but without new clever ideas it seems liable to produce simulated analogues of the same failure modes, in addition to new adversarial-to-RLAIF failure modes. Maybe if you took current models and solely made them better at metacognition, you’d see slop decrease significantly for coding tasks but only marginally for human conversation.
Something similar to what you’re talking about is an action-generating process that lives in a hierarchical world model. E.g. I want to see my family (at some broad level you might call identity), which top-down tells me to go to Chicago (choosing a high-level action within the layer of my world-model where “go to Chicago” is a primitive), which at the next layer of specificity leads me to select the “book a flight” action, which leads to selecting specific micro-actions.
Except in real life information flows up as well as down—I’m doing something more like searching for a low-cost setting of all layers of the hierarchy simultaneously (or maybe just enough to connect the salient goals to primitive actions, if I have some “side by side” layers). I.e. difficulty at a lower level might lead me to re-evaluate a higher level.
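A minimal sketch of that kind of bidirectional search, where costs discovered at lower layers propagate back up; the layers, options, and cost numbers are invented for illustration:

```python
# Hypothetical option costs; infinity marks an option that turns out infeasible.
INF = float("inf")
OPTIONS = {
    "see family": {"go to Chicago": 1.0, "video call": 3.0},
    "go to Chicago": {"book a flight": INF, "take the train": 2.0},  # flights sold out
    "video call": {"open laptop": 0.5},
    "take the train": {"buy ticket": 0.5},
    "book a flight": {},
    "buy ticket": {},
    "open laptop": {},
}

def plan(goal):
    """Return (total cost, action sequence) connecting a goal to primitive actions."""
    children = OPTIONS[goal]
    if not children:          # primitive action: nothing below it
        return 0.0, [goal]
    best_cost, best_plan = INF, None
    for action, cost in children.items():
        sub_cost, sub_plan = plan(action)
        # Information flows up: a costly or infeasible sub-plan can change
        # which higher-level action gets picked.
        if cost + sub_cost < best_cost:
            best_cost, best_plan = cost + sub_cost, [action] + sub_plan
    return best_cost, best_plan

print(plan("see family"))  # -> (3.5, ['go to Chicago', 'take the train', 'buy ticket'])
```

Because “book a flight” turns out infeasible, the search settles on the train without abandoning “go to Chicago”; if the train were expensive enough, it would flip the top layer to the video call instead.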
Either I strongly disagree with you that there’s a big gap here, or I’m one of the people you’d say are normies who lead lives they expect to live (among other definitional differences).
Do you get pwned more, or just by a different set of memes? The bottom 80% of humans on “taking ideas seriously” seem to have plenty of bad memes, although maybe the variance is smaller.
Yup, also confused about this.
Was this really Xunzi’s argument? I think there’s the germ of a good argument in here, but the incoherencies don’t seem very incoherent at all.