Drew the shoggoth and named notkilleveryoneism.
Tetraspace
I’d like beta access. My main use case is that I intend to write up some thoughts on alignment (Manifold gives 40% that I’ll end up proud of a write-up; I’d like that number to be higher), and this would be helpful for literature review and for finding relevant existing work. Especially so because a lot of the public agent foundations work is old and migrated from the old Alignment Forum, where it’s low-profile compared to more recent posts.
AI isn’t dangerous because of what experts think, and the arguments that persuaded the experts themselves are not “experts think this”. It would have been a misleading argument for Eliezer in 2000, when he was among the first people to think about it in the modern way, or for people who weren’t already rats in, say, 2017, before GPT was in the news and when AI x-risk was very niche.
I also have objections to its usefulness as an argument; “experts think this” doesn’t give me any inside view of the problem by which I can come up with novel solutions that the experts haven’t thought of. I think this especially comes up if the solutions might have to be precise or extreme: if I were an alignment researcher, “experts think this” would tell me nothing about what math I should be writing, and if I were a politician, “experts think this” would be less likely to get me to come up with solutions that I think would work, rather than solutions that compromise between the experts’ coalition and my other constituents.
So, while it is evidence (experts aren’t anticorrelated with the truth), there’s better reasoning available that’s more entangled with the truth and gives more precise answers.
I learned this lesson looking at the conditional probabilities of candidates winning given that they were nominated in 2016, where the candidates with less than about a 10% chance of being the nominee had conditional probabilities that were just noise, anywhere between 0 and 100%. And this was on the thickly traded real-money markets of Betfair! I personally engage in, and also recommend, just kinda throwing out any conditional probabilities that look like this, unless you have some reason to believe it’s not just noise.
Another place this causes problems is in the infinitely-useful-if-they-could-possibly-work decision markets, where you want to be able to evaluate counterfactual decisions; except these are counterfactuals, so the decision doesn’t get made, so there’s no liquidity and the price can take any value.
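For what it’s worth, here is a minimal numeric sketch of the failure mode (the probabilities and the noise model are made up, and real market microstructure is messier): the conditional probability is read off as a ratio of two prices, and when the denominator is only a few percent, ordinary price noise swamps it.

```python
# Toy sketch: implied P(win | nominee) = price(win AND nominee) / price(nominee).
# With a ~2-point wobble on each price, a 5% denominator makes the ratio garbage.
import random

random.seed(0)

p_nominee = 0.05          # true probability of being the nominee
p_win_and_nominee = 0.02  # true probability of being the nominee AND winning
                          # (so the true conditional is 40%)

def noisy_price(p, spread=0.02):
    """A market price wandering within ~2 percentage points of the true value."""
    return min(1.0, max(0.001, p + random.uniform(-spread, spread)))

for _ in range(5):
    implied = noisy_price(p_win_and_nominee) / noisy_price(p_nominee)
    print(f"implied P(win | nominee) = {implied:.0%}")
# The printed values scatter all over the place (they can even exceed 100%),
# even though the noise on each individual price is small.
```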
Obeying it would only be natural if the AI thinks that the humans are more correct than the AI itself would ever be, even after gathering all available evidence, where “correct” is judged by the standards of the goal the AI actually has, which arguendo is not what the humans are eventually going to pursue (otherwise you have reduced the shutdown problem to solving outer alignment, and the shutdown problem is only being considered under the theory that we won’t solve outer alignment).
It’s anti-natural for an agent to hold the belief state that, even given all available information, it will still want to do something other than the action it will then think is best; a utility maximiser would just want to take that action.
This is discussed on Arbital as the problem of fully updated deference.
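As a toy illustration of the point (my own framing with made-up numbers, not the Arbital formalism): an expected-utility maximiser compares “do what I think is best under my own utility function” against “defer to the humans”, and deferring only wins if the humans’ pick already scores highest under that same utility function.

```python
# Toy illustration: after updating on all the evidence it can get, the agent
# evaluates both options under its OWN utility function U.
actions = ["continue", "shutdown"]

# Hypothetical values of U, as the AI itself evaluates them post-update.
U = {"continue": 10.0, "shutdown": 3.0}

ai_choice = max(actions, key=lambda a: U[a])
human_choice = "shutdown"  # what the humans would tell it to do

print("AI's own best action:", ai_choice)
print("Deferring is U-optimal?", U[human_choice] >= U[ai_choice])
# Deferring only comes out on top if U already ranks the humans' choice highest,
# i.e. if outer alignment is already solved - which is the point of the comment above.
```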
This ends up being pretty important in practice for decision markets (“if I choose to do X, will Y happen?”), where by default you might, e.g., only make a decision if it’s a good idea (as evaluated by the market), and therefore all traders will condition on the market having a high probability, which is obviously quite distortionary.
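A simplified sketch of the underlying selection effect (not a full model of trader behaviour, and the threshold and noise level are made up): if the decision is only taken when the market’s estimate is high, the estimates you actually act on systematically overstate how often Y happens.

```python
# Selection-effect sketch: act only when the noisy estimate clears a threshold,
# then compare the estimates you acted on with the realised outcomes.
import random

random.seed(0)
THRESHOLD = 0.7
executed_estimates, executed_outcomes = [], []

for _ in range(100_000):
    true_p = random.random()                                    # true P(Y | do X) for this case
    estimate = min(1.0, max(0.0, true_p + random.gauss(0, 0.15)))  # noisy market estimate
    if estimate > THRESHOLD:                                     # decision only made when the market says "good idea"
        executed_estimates.append(estimate)
        executed_outcomes.append(random.random() < true_p)

print("mean market estimate when executed: ", sum(executed_estimates) / len(executed_estimates))
print("actual frequency of Y when executed:", sum(executed_outcomes) / len(executed_outcomes))
# The first number comes out noticeably higher than the second; traders who know
# the rule will try to price this in, which is the distortion described above.
```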
I replied on Discord that I feel there’s maybe something more formalisable that’s like:
- reality runs on math because, and is the same thing as, there’s a generalised-state-transition function
- because reality has a notion of what happens next, realityfluid has to give you a notion of what happens next, i.e. it normalises
- the idea of a realityfluid that doesn’t normalise only comes to mind at all because you learned about R^n first in elementary school instead of S^n

which I do not claim confidently, because I haven’t actually generated that formalisation, and am posting here because maybe another LessWronger’s eyes on it will go “ah, but...”.
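One minimal way the “normalises” step might be written down, purely as my sketch (S, T, and μ are assumed notation, not an existing formalisation):

```latex
% Sketch only: realityfluid \mu as a normalised measure over states S.
\[ \mu : S \to [0,1], \qquad \sum_{s \in S} \mu(s) = 1 \]
% "Reality has a notion of what happens next": a generalised transition
% function T(s' \mid s), itself normalised, pushes \mu forward one step.
\[ (T_*\mu)(s') = \sum_{s \in S} T(s' \mid s)\,\mu(s), \qquad \sum_{s'} T(s' \mid s) = 1 \]
% Normalisation is then preserved automatically:
\[ \sum_{s'} (T_*\mu)(s') = \sum_{s \in S} \mu(s) \sum_{s'} T(s' \mid s) = 1 \]
```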
Not unexpected! I think we should want AGI to, at least until it has some nice coherent CEV target, explain at each self-improvement step exactly what it’s doing, to ask for permission for each part of it, to avoid doing anything in the process that’s weird, to stop when asked, and to preserve these properties.
Even more recently I bought a new laptop. This time, I made the same sheet, down-weighted the score from the hard drive (512 GB is enough for anyone, and that seemed intuitively like how much I prioritised extra hard drive space compared to RAM and processor speed), and then looked at the best laptop before sharply diminishing returns set in; this happened to be the HP ENVY 15-ep1503na 15.6″ Laptop—Intel® Core™ i7, 512 GB SSD, Silver. This is because I have more money now, so I was aiming to maximise consumer surplus rather than minimise the amount I was spending.[1]
Surprisingly, it came with a touch screen! That’s just the kind of nice thing that laptops do nowadays, because as I concluded in my post, everything nice about laptops correlates with everything else so high/low end is an axis it makes sense to sort things on. Less surprisingly, it came with a graphics card, because ditto.
Unfortunately this high-end laptop is somewhat loud; probably my next one will be less loud, up to and including an explicit penalty for noise.
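In spreadsheet-free form, the scoring method above is roughly the following (the laptops, component scores, weight, and prices are all made-up placeholders):

```python
# Illustration only: score each component, down-weight storage, sum, then pick
# the best option before the score-per-pound figure falls off sharply.
HARD_DRIVE_WEIGHT = 0.5  # placeholder for "how much I intuitively prioritise extra storage"

laptops = {
    "budget":   {"cpu": 5, "ram": 6, "ssd": 4, "price": 600},
    "mid":      {"cpu": 7, "ram": 7, "ssd": 6, "price": 900},
    "high_end": {"cpu": 9, "ram": 9, "ssd": 9, "price": 1600},
}

def score(spec):
    return spec["cpu"] + spec["ram"] + HARD_DRIVE_WEIGHT * spec["ssd"]

for name, spec in laptops.items():
    per_thousand = score(spec) / spec["price"] * 1000
    print(name, score(spec), f"({per_thousand:.1f} points per £1000)")
# Consumer-surplus maxxing = take the highest raw score before diminishing
# returns set in, rather than the cheapest laptop that clears some bar.
```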
- ^
It would have been predictable, however, at the time that I bought that new laptop, that I would have had that much money at a later date. Which means that I should have just skipped straight to consumer surplus maxxing.
It would be evidence at all. Simple explanation: if we did observe a glitch, that would pretty clearly be evidence we were in a simulation. So by conservation of expected evidence, non-glitches are evidence against.
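A worked example with made-up numbers, just to make the conservation-of-expected-evidence step concrete:

```python
# If a glitch would raise P(simulation), then each non-glitch must lower it a little.
p_sim = 0.10              # prior P(simulation) (arbitrary)
p_glitch_if_sim = 0.05    # chance of seeing a glitch if simulated (arbitrary)
p_glitch_if_real = 0.001  # chance of a glitch-looking observation if not (arbitrary)

p_glitch = p_sim * p_glitch_if_sim + (1 - p_sim) * p_glitch_if_real
p_sim_given_glitch = p_sim * p_glitch_if_sim / p_glitch
p_sim_given_no_glitch = p_sim * (1 - p_glitch_if_sim) / (1 - p_glitch)

print(p_sim_given_glitch)     # ~0.85: a glitch would be strong evidence for simulation
print(p_sim_given_no_glitch)  # ~0.096: so a non-glitch is (weak) evidence against
# Weighted by P(glitch), the two posteriors average back to the 0.10 prior:
# that's conservation of expected evidence.
```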
I don’t think it’s quite that; a more central example I think would be something like a post about extrapolating demographic trends to 2070 under the UN’s assumptions, where then justifying whether or not 2070 is a real year is kind of a different field.
argmax_x U(x), as a mathematical structure, is smarter than god and perfectly aligned to U; the value of x it returns will never actually be the one that maximises some other V, whether because V is more objectively rational, or because you made a typo and it knows you meant to say V; and no matter how complicated the mapping is from x to U(x), it will never fall short of giving the x that gives the highest value of U(x).
Which is why in principle you can align a superior being, like argmax_x U(x), or maybe like a superintelligence.
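A toy version of the same point (the candidate set and the utility functions are invented for illustration):

```python
# argmax is "perfectly aligned" to whatever U you hand it: it returns the x with
# the highest U(x), regardless of what any other V would prefer.
candidates = ["paperclip", "staple", "flourishing"]

U = {"paperclip": 1.0, "staple": 2.0, "flourishing": 3.0}  # the U you actually wrote down
V = {"paperclip": 9.0, "staple": 2.0, "flourishing": 1.0}  # some "more objectively rational" V

best = max(candidates, key=lambda x: U[x])
print(best)  # "flourishing": the U-maximiser, no matter what V says, and no matter
             # how complicated the mapping from x to U(x) happens to be.
```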
“The AI does our alignment homework” doesn’t seem so bad. I don’t have much hope for it, but that’s because it’s a prosaic alignment scheme, so someone trying to implement it can’t constrain where Murphy shows up, rather than because it’s an “incoherent path description”.
A concrete way this might be implemented is:
A language model is trained on a giant text corpus to learn a bunch of adaptations that make it good at math, and then fine-tuned for honesty. It’s still being trained at a safe and low level of intelligence where honesty can be checked, so this gets a policy that produces things that are mostly honest on easy questions and sometimes wrong and sometimes gibberish and never superhumanly deceptive.[1]
It’s set to work producing conceptually crisp pieces of alignment math, things like expected utility theory or logical inductors, slowly, on inspectable scratchpads and so on, with the dumbest model that can actually factor scientific research[1], and with human research assistants to hold its hand if that lets you make the model dumber. It does this rather than engineering because this kind of crisp alignment math is fairly uniquely pinned down, so it can be verified, and it’s easier to generate than any strong pivotal engineering task, where you’d be competing against humans on their own ground and so would need to be smarter than humans; so while it’s operating in a more dangerous domain, it’s using a safer level of intelligence.[1]
The human programmers then use this alignment math to make a corrigible thingy with dangerous levels of intelligence that does difficult engineering and doesn’t know about humans, this time knowing what they’re doing. Getting the crisp alignment math from parallelisable language models helps a lot and gives them a large lead time, because a lot of it is the alignment version of backprop: something that would otherwise have taken a surprising amount of time to discover.
This all happens at safe-ish, low-ish levels of intelligence (such a model would probably be able to autonomously self-replicate on the internet, but probably not reverse protein folding, which means that all the ways it could be dangerous are “well, don’t do that”s as long as you keep the code secret[1]), with the actual dangerous levels of optimisation being done by something the humans build using pieces of alignment math that are constrained down to a tiny number of possibilities.
EDIT 2023-07-25: A longer debate between Holden Karnofsky (pro) and Nate Soares (against), about the model that leads to it being an incoherent path description, which I think is worth reading, is here; I hadn’t read it as of writing this.
- ^
Unless it isn’t; it’s a giant pile of tensors, how would you know? But this isn’t special to this use case.
The alignment, safety, and interpretability work is continuing at full speed, but if all the efforts of the alignment community are only sufficient to avoid the destruction of the world by 2042, and AGI is created in 2037, then at the end you get a destroyed world.
It might not be possible in real life (List of Lethalities: “we can’t just decide not to build AGI”), and even if possible it might not be tractable enough to be worth focusing any attention on, but it would be nice if there were some way to make sure that AGI happens only after alignment, proceeding at full speed, is sufficient (EDIT: or, failing that, to make AGI happen later, so that if alignment goes quickly, that takes the world from bad outcomes to good outcomes instead of from bad outcomes to bad outcomes).
80,000 Hours’ job board lets you filter by city. As of the time of writing, roles in their AI Safety & Policy tag are 61⁄112 San Francisco, 16⁄112 London, 35⁄112 other (including remote).
There are about 8 billion people, so your 24,000 QALYs should be 24,000,000.
I don’t mean to say that it’s an additional reason to respect him as an authority or accept his communication norms above what you would have done for other reasons (and I don’t think people here particularly are doing that), just that it’s the meaning of that jokey aside.
Maybe you got into trouble for talking about that because you are rude and presumptuous?
I think this is just a nod to how he’s literally Roko, for whom googling “Roko simulation” gives a Wikipedia article on what happened last time.
What, I wonder, shall such an AGI end up “thinking” about us?
IMO: “Oh look, undefended atoms!” (Well, not in that format. But maybe you get the picture.)
You kind of mix together two notions of irrationality:
(1-2, 4-6) Humans are bad at getting what they want (they’re instrumentally and epistemically irrational)
(3, 7) Humans want complicated things that are hard to locate mathematically (the complexity of value thesis)
I think only the first one is really deserving of the name “irrationality”. I want what I want, and if what I want is a very complicated thing that takes into account my emotions, well, so be it. Humans might be bad at getting what they want, they might be mistaken a lot of the time about what they want and constantly step on their own toes, but there’s no objective reason why they shouldn’t want that.
Still, when up against a superintelligence, I think that both value being fragile and humans being bad at getting what they want count against humans getting anything they want out of the interaction:
Superintelligences are good at getting what they want (this is really what it means to be a superintelligence)
Superintelligences will have whatever goal they have, and I don’t think that there’s any reason why this goal would be anything to do with what humans want (the orthogonality thesis; the goals that a superintelligence has are orthogonal to how good it is at achieving them)
This together adds up to: a superintelligence sees humans using resources that it could be using for something else (and it would want to use them for something else, not just for what the humans are trying to do but for more, because it has its own goals), and because it’s good at getting what it wants, it gets those resources, which is very unfortunate for the humans.
Fast/Slow takeoff