AI Alignment researcher. Alternatively known as Pi Rogers.
Morphism
Emotions can be treated as properties of the world and optimized subject to constraints, like anything else. We can't edit our emotions directly, but we can influence them.
Oh no I mean they have the private key stored on the client side and decrypt it there.
Ideally all of this is behind a nice UI, like Signal.
I mean, Signal messenger has worked pretty well in my experience.
But safety research can actually disproportionately help capabilities, e.g. the development of RLHF allowed OAI to turn their weird text predictors into a very generally useful product.
I could see embedded agency research being harmful, though, since an actual implementation of it would be really useful for inner alignment.
Some off the top of my head:
Outer Alignment Research (e.g. analytic moral philosophy in an attempt to extrapolate CEV) seems to be totally useless to capabilities, so we should almost definitely publish that.
Evals for Governance? Not sure about this since a lot of eval research helps capabilities, but if it leads to regulation that lengthens timelines, it could be net positive.
Edit: oops i didn’t see tammy’s comment
Idea:
Have everyone who wants to share and receive potentially exfohazardous ideas/research send out a 4096-bit RSA public key.
Then, make a clone of the alignment forum, where every time you make a post, you provide a list of the public keys of the people who you want to see the post. Then, on the client side, it encrypts the post using all of those public keys. The server only ever holds encrypted posts.
Then, users can put in their own private key to see a post. The encrypted post gets downloaded to the user’s machine and is decrypted on the client side. Perhaps require users to be on open-source browsers for extra security.
Maybe also add a post-quantum scheme like Signal's PQXDH so that we don't all die when quantum computers get good enough.
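The encrypt-to-a-list-of-keys step above is standard hybrid encryption. Here's a minimal sketch in Python, assuming the third-party `cryptography` library; the function names are mine, not from any existing codebase. The idea: encrypt the post once under a fresh AES key, then wrap that key under each listed recipient's RSA public key, so the server only ever stores ciphertext.

```python
import os
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import padding
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

# RSA-OAEP parameters shared by both sides.
OAEP = padding.OAEP(mgf=padding.MGF1(algorithm=hashes.SHA256()),
                    algorithm=hashes.SHA256(), label=None)

def encrypt_post(plaintext: bytes, recipient_public_keys):
    # Fresh AES-256 key and nonce per post; the post body is encrypted once.
    aes_key = AESGCM.generate_key(bit_length=256)
    nonce = os.urandom(12)
    body = AESGCM(aes_key).encrypt(nonce, plaintext, None)
    # Wrap the AES key separately for each recipient on the post's key list.
    wrapped = [pk.encrypt(aes_key, OAEP) for pk in recipient_public_keys]
    return nonce, body, wrapped

def decrypt_post(nonce, body, wrapped_keys, private_key):
    # Try each wrapped key; only the one encrypted to us will unwrap.
    for w in wrapped_keys:
        try:
            aes_key = private_key.decrypt(w, OAEP)
            return AESGCM(aes_key).decrypt(nonce, body, None)
        except Exception:
            continue
    raise ValueError("this private key is not on the post's recipient list")
```

All of this would run client-side; the server sees only `nonce`, `body`, and `wrapped`.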
Should I build this?
Is there someone else here more experienced with csec who should build this instead?
Is this a massive exfohazard? Should this have been published?
Yikes, I’m not even comfortable maximizing my own CEV.
What do you think of this post by Tammy?
Where is the longer version of this? I do want to read it. :)
Well perhaps I should write it :)
Specifically, what is it about the human ancestral environment that made us irrational, and why wouldn’t RL environments for AI cause the same or perhaps a different set of irrationalities?
Mostly that thing where we had a lying-vs-lie-detecting arms race, and the liars mostly won by believing their own lies. That's how we got overconfidence bias, self-serving bias, and a whole bunch of other biases. I think Yudkowsky and/or Hanson have written about this.
Unless we do something very stupid, like reading the AI's thoughts and RL-punishing wrongthink, this seems very unlikely to happen.
If we give the AI no reason to self-deceive, the natural instrumentally convergent incentive is to not self-deceive, so it won’t self-deceive.
Again, though, I’m not super confident in this. Deep deception or similar could really screw us over.
Also, how does RL fit into QACI? Can you point me to where this is discussed?
I have no idea how Tammy plans to “train” the inner-aligned singleton on which QACI is implemented, but I think it will be closer to RL than SL in the ways that matter here.
But we could have said the same thing of SBF, before the disaster happened.
I would honestly be pretty comfortable with maximizing SBF’s CEV.
Please explain your thinking behind this?
TLDR: Humans can be powerful and overconfident. I think this is the main source of human evil. I also think this is unlikely to naturally be learned by RL in environments that don’t incentivize irrationality (like ours did).
Sorry if I was unclear there.
It’s not, because some moral theories are not compatible with EU maximization.
I’m pretty confident that my values satisfy the VNM axioms, so those moral theories are almost definitely wrong.
And I think this uncertainty problem can be solved by forcing utility bounds.
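As a toy illustration of forcing utility bounds (my own sketch, not a standard formula): clip and rescale each moral theory's utility to [0, 1] before taking the credence-weighted mixture, so that no single theory can swamp the aggregate with unbounded stakes.

```python
def moral_value(outcome, theories):
    """Credence-weighted choice-worthiness under moral uncertainty.

    theories: list of (credence, utility_fn, lo, hi) tuples. Each theory's
    utility is clipped to [lo, hi] and rescaled to [0, 1], i.e. forced to
    be bounded, before mixing by credence.
    """
    total = 0.0
    for credence, utility, lo, hi in theories:
        bounded = (min(max(utility(outcome), lo), hi) - lo) / (hi - lo)
        total += credence * bounded
    return total
```

With bounds in place, the mixture is just ordinary expected utility over one's credences in the theories.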
I’m 60% confident that SBF and Mao Zedong (and just about everyone) would converge to nearly the same values (which we call “human values”) if they were rational enough and had good enough decision theory.
If I’m wrong, (1) is a huge problem and the only surefire way to solve it is to actually be the human whose values get extrapolated. Luckily the de-facto nominees for this position are alignment researchers, who pretty strongly self-select for having cosmopolitan altruistic values.
I think (2) is a very human problem. Due to very weird selection pressure, humans ended up really smart but also really irrational. I think most human evil is caused by a combination of overconfidence wrt our own values and lack of knowledge of things like the unilateralist’s curse. An AGI (at least, one that comes from something like RL rather than being conjured in a simulation or something else weird) will probably end up with a way higher rationality:intelligence ratio, and so it will be much less likely to destroy everything we value than an empowered human. (Also 60% confident. I would not want to stake the fate of the universe on this claim)
I agree that moral uncertainty is a very hard problem, but I don’t think we humans can do any better on it than an ASI. As long as we give it the right pointer, I think it will handle the rest much better than any human could. Decision theory is a bit different, since you have to put that into the utility function. Dealing with moral uncertainty is just part of expected utility maximization.
To solve (2), I think we should try to adapt something like the Hippocratic principle to work for QACI, without requiring direct reference to a human’s values and beliefs (the sidestepping of which is QACI’s big advantage over PreDCA). I wonder if Tammy has thought about this.
What about the following:
My utility function is pretty much just my own happiness (in a fun-theoretic rather than purely hedonistic sense). However, my decision theory is updateless with respect to which sentient being I ended up as, so once you factor that in, I’m a multiverse-wide realityfluid-weighted average utilitarian.
I’m not sure how correct this is, but it’s possible.
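A toy version of that reduction (names and numbers are mine): egoism about happiness, made updateless over which sentient being you end up as, collapses into a realityfluid-weighted average over beings.

```python
def updateless_value(beings):
    """Expected happiness from behind the veil.

    beings: list of (realityfluid, happiness) pairs, where realityfluid is a
    toy stand-in for each being's measure across the multiverse. Updateless
    egoism scores a world by the fluid-weighted average happiness.
    """
    total_fluid = sum(fluid for fluid, _ in beings)
    return sum(fluid * happiness for fluid, happiness in beings) / total_fluid
```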
Edit log:
2024-04-30 19:31 CST: Footnote formatting fix and minor grammar fix.
20:40 CST: “The problem is...” --> “Alignment is...”
22:17 CST: Title changed from “All we need is a pointer” to “The formal goal is a pointer”
OpenAI is not evil. They are just defecting on an epistemic prisoner’s dilemma.
Maybe some kind of simulated long-reflection type thing like QACI where “doing philosophy” basically becomes “predicting how humans would do philosophy if given lots of time and resources”
Yes, amount of utopiastuff across all worlds remains constant, or possibly even decreases! But I don’t think amount-of-utopiastuff is the thing I want to maximize. I’d love to live in a universe that’s 10% utopia and 90% paperclips! I much prefer that to a 90% chance of extinction and a 10% chance of full-utopia. It’s like insurance. Expected money goes down, but expected utility goes up.
Decision theory does not imply that we get to have nice things, but (I think) it does imply that we get to hedge our insane all-or-nothing gambles for nice things, and redistribute the nice things across more worlds.
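A toy calculation of the insurance point, with made-up numbers and an assumed concave (risk-averse) utility over utopia-fraction. Expected utopiastuff is 0.10 either way, but the hedged version wins on expected utility:

```python
import math

# Assumption: utility is concave in the fraction of reality that is utopia.
# sqrt is arbitrary; any risk-averse utility gives the same comparison.
u = math.sqrt

certain_slice = u(0.10)                  # trade: every branch gets 10% utopia
gamble = 0.10 * u(1.0) + 0.90 * u(0.0)   # no trade: 10% shot at full utopia

# Same expected utopiastuff (0.10), higher expected utility when hedged.
assert certain_slice > gamble
```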
I think this is only true if we are giving the AI a formal goal to explicitly maximize, rather than training the AI haphazardly and giving it a clusterfuck of shards. It seems plausible that our FAI would be formal-goal aligned, but it seems like UAI would be more like us unaligned humans—a clusterfuck of shards. Formal-goal AI needs the decision theory “programmed into” its formal goal, but clusterfuck-shard AI will come up with decision theory on its own after it ascends to superintelligence and makes itself coherent. It seems likely that such a UAI would end up implementing LDT, or at least something that allows for acausal trade across the Everett branches.
Fixed it! Thanks! It is very confusing that half the time people talk about loss functions and the other half of the time they talk about utility functions.
Solution to 8 implemented in python using zero self-reference, where you can replace f with code for any arbitrary function on string x (escaping characters as necessary):
f="x+'\\n'+x"
def ff(x):
	return eval(f)
(lambda s : print(ff('f='+chr(34)+f+chr(34)+chr(10)+'def ff(x):'+chr(10)+chr(9)+'return eval(f)'+chr(10)+s+'('+chr(34)+s+chr(34)+')')))("(lambda s : print(ff('f='+chr(34)+f+chr(34)+chr(10)+'def ff(x):'+chr(10)+chr(9)+'return eval(f)'+chr(10)+s+'('+chr(34)+s+chr(34)+')')))")
Edit: fixed spoiler tags.
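For readers untangling the above, the same eval-the-string construction can be written more compactly with %-formatting. This is my own variant, not the parent's code; comments are kept out of the program because it must reproduce its own text exactly. Running it prints `f` applied to its own source, and here `f` is "source, newline, source", matching the parent's choice of `f`:

```python
f = 'x + chr(10) + x'
s = 'f = %r\ns = %r\nx = s %% (f, s)\nprint(eval(f))\n'
x = s % (f, s)
print(eval(f))
```

`x` rebuilds the program's own source from the template `s`, so `eval(f)` sees the source bound to `x` with no file reading or introspection; swap in any other expression in `x` for `f`.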
Edit: There are actually many ambiguities with the use of these words. This post is about one specific ambiguity that I think is often overlooked or forgotten.
The word “preference” is overloaded (and so are related words like “want”). It can refer to one of two things:
How you want the world to be, i.e. your terminal values, e.g. "I prefer worlds in which people don't needlessly suffer."
What makes you happy, e.g. "I prefer my ice cream in a waffle cone."
I’m not sure how we should distinguish these. So far, my best idea is to call the former “global preferences” and the latter “local preferences”, but that clashes with the pre-existing notion of locality of preferences as the quality of terminally caring more about people/objects closer to you in spacetime. Does anyone have a better name for this distinction?
I think we definitely need to distinguish them, however, because they often disagree. Most "values disagreements" between people are just disagreements in local preferences, and so could be resolved by considering global preferences.
I may write a longpost at some point on the nuances of local/global preference aggregation.
Example: Two alignment researchers, Alice and Bob, both want access to a limited supply of compute. The rest of this example is left as an exercise.