Fundamentally, the story was about the failure cases of trying to make capable systems that don’t share your values safe by preventing the specific means by which their problem-solving capabilities express themselves in scary ways. This is different to what you are getting at here, which is having those systems actually, operationally share your values. A well-aligned system, in the traditional ‘Friendly AI’ sense of alignment, simply won’t make the choices that the one in the story did.
Veedrac
I was finding it a bit challenging to unpack what you’re saying here. I think, after a reread, that you’re using ‘slow’ and ‘fast’ the way I would use ‘soon’ and ‘far away’ (aka. referring to how far from the present it will occur). Is that read about correct?
If ‘Opt into Petrov Day’ were beside something other than a big red ominous button, I would think the obvious answer is that it’s a free choice, and I’d be positively inclined towards it. Petrov Day is a good thing with good side effects, quite unlike launching nuclear weapons.
It is confusing to me that it is beside a big red ominous button. On the one hand, Petrov’s story is about the value of caution. To quote a top comment from an older Petrov Day,
Petrov thought the message looked legit, but noticed there were clues that it wasn’t.
On the other hand, risk-taking is good, opting in to good things is good, and if one is taking Petrov Day to mean ‘don’t take risks if they look scary’ I think one is taking an almost diametrically wrong message from the story.
All that said, for now I am going to fall for my own criticism and not press the big red ominous button around Petrov Day.
I’m interested in having Musk-company articles on LessWrong if it can be done while preserving LessWrong norms. I’m a lot less interested in it if it means bringing in sarcasm, name calling, and ungrounded motive-speculation.
if, judging by looking at some economical numbers, poverty already doesn’t exist for centuries, why do we feel so poor
Let’s not forget that people who read LW, often highly intelligent and having well-paying jobs such as software development
This underlines what I find so incongruous about EY’s argument. I think I genuinely felt richer as a child eating free school meals in the UK but going to a nice school and whose parents owned a house than I do as an obscenely-by-my-standards wealthy person in San Francisco. I’m hearing this elaborate theory to explain why social security doesn’t work when I have lived through and seen in others clear evidence that it can and it does. If the question “why hasn’t a factor-100 increase in productivity felt like a factor-100 increase in productivity?” were levied at my childhood specifically, my response is that actually it felt like exactly that.
By the standards of low earning households my childhood was probably pretty atypical and I don’t mean to say there aren’t major systemic issues, especially given the number of people locked into bad employment, people with lives destroyed by addiction, people who struggle to navigate economic systems, people trapped in abusive or ineffectual families, etc. etc. etc. I really don’t want to present a case just based on my lived experience, even including those I know living various lives under government assistance. But equally I think someone’s lived experience of being wealthy in San Francisco and seeing drug addicts on the street is also not seeing an unbiased take of what social security does for poverty.
Eg. a moderately smart person asking it to do something else by trying a few prompts. We’re getting better at this for very simple properties but I still consider it unsolved there.
Reply to https://twitter.com/krishnanrohit/status/1794804152444580213, too long for twitter without a subscription so I threw it here, but do please treat it like a twitter comment.
rohit: Which part of [the traditional AI risk view] doesn’t seem accounted for here? I admit AI safety is a ‘big tent’ but there’s a reason they’re congregated together.
You wrote in your list,
the LLMs might start even setting bad objectives, by errors of omission or commission. this is a consequence of their innards not being the same as people (either hallucinations or just not having world model or misunderstanding the world)
In the context of traditional AI risk views, this misses the argument. Roughly the concern is instead like so:
ASI is by definition very capable of doing things (aka. selecting for outcomes), in at least all the ways collections of humans can. It is both theoretically true and observably the case in reality that when things are selected for, a bunch of other things that aren’t being selected for get traded off, and that the stronger something is selected for, the more stuff ends up traded against it, incidentally or not.
We should expect any ASI to have world-changing effects, and for those effects to trade off strongly against other things. There is a bunch of stuff we want that we don’t want traded off (eg. being alive).
The first problem is that we don’t know how to put any preferences into an AI such that it’s robust to even trivial selection pressure, not in theory, not in practice on existing models, and certainly not in ways that would apply to arbitrary systems that indirectly contain ML models but aren’t constrained by those models’ expressed preferences.
The second problem is that there are a bunch of instrumental goals (not eg. lying, but eg. continuing to have causal effect on the world) that are useful to almost all goals, and that are concrete examples of why an ASI would want to disempower humans. Aka. almost everything that could plausibly be called an ASI will be effective at doing a thing, and the natural strategies for doing things involve not failing at them in easily-foreseeable ways.
Stuff like lying is not the key issue here. It often comes up because people say ‘why don’t we just ask the AI if it’s going to be bad’ and the answer is basically code for ‘you don’t seem to understand that we are talking about something that is trying to do a thing and is also good at it.’
Similarly for ‘we wouldn’t even know why it chooses outcomes, or how it accomplishes them’ — these are problematic because they are yet another reason to rule out simple fixes, not because they are fundamental to the issue. Like, if you understand why a bridge falls down, you can make a targeted fix and solve that problem, and if you don’t know then probably it’s a lot harder. But you can know every line of code of Stockfish (pre-NNUE) and still not have a chance against it, because Stockfish is actively selecting for outcomes and it is better at selecting them than you.
“LLMs have already lied to us” from the traditional AI risk crowd is similarly not about LLM lying being intrinsically scary, it is a yell of “even here you have no idea what you are doing, even here you have these creations you cannot control, so how in the world do you expect any of this to work when the child is smarter than you and it’s actually trying to achieve something?”
Veedrac’s Shortform
It took me a good while reading this to figure out whether it was a deconstruction of tabooing words. I would have felt less so if the post didn’t keep replacing terms with ones that are both no less charged and also no more descriptive of the underlying system, and then start drawing conclusions from the resulting terms’ aesthetics.
With regards to Yudkowsky’s takes, the key thing to keep in mind is that Yudkowsky started down his path by reasoning backwards from properties ASI would have, not from reasoning forward from a particular implementation strategy. The key reason to be concerned that outer optimization doesn’t define inner optimization isn’t a specific hypothesis about whether some specific strategy with neural networks will have inner optimizers, it’s because ASI will by necessity involve active optimization on things, and we want our alignment techniques to have at least any reason to work in that regime at all.
There is no ‘the final token’ for weights not at the final layer.
Because that is where all the gradients flow from, and why the dog wags the tail.
Aggregations of things need not be of the same kind as their constituent things? This is a lot like calling an LLM an activation optimizer. While strictly in some sense true of the pieces that make up the training regime, it’s also kind of a wild way to talk about things in the context of ascribing motivation to the resulting network.
I think maybe you’re intending ‘next token prediction’ to mean something more like ‘represents the data distribution, as opposed to some metric on the output’, but if you are this seems like a rather unclear way of stating it.
You’re at token i in a non-final layer. Which token’s output are you optimizing for? i+1?
By construction a decoder-only transformer is agnostic over what future token it should be informative to within the context limit, except in the sense that it doesn’t need to represent detail that will be more cheaply available from future tokens.
As a transformer is also unrolled in the context dimension, the architecture itself is effectively required to be generic both in what information it gathers and where that information is used. Bias towards next token prediction is not so much a consequence of reward in isolation as of competitive advantage: at position i, the network has an advantage in predicting i+1 over the network at previous positions by having more recent tokens, and an advantage over the network at future positions by virtue of still needing to predict token i+1. However, if a token is more predictive of some abstract future token than of the next token precisely, say it’s a name that might be referenced later, one would expect the dominant learnt effect to be non-myopically optimizing for later use in some timestamp-invariant way.
If they appear to care about predicting future tokens, (which they do because they are not myopic and they are imitating agents who do care about future states which will be encoded into future tokens), it is solely as a way to improve the next-token prediction.
I think you’re just fundamentally misunderstanding the backwards pass in an autoregressive transformer here. Only a very tiny portion of the model is exclusively trained on next token prediction. Most of the model is trained on what might be called instead, say, conditioned future informativity.
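The gradient-flow point above can be seen directly in a toy experiment. This is a minimal PyTorch sketch, where the model sizes, seed, and setup are all illustrative assumptions rather than anything from this thread: compute a loss at only the final position of a causal transformer, and observe that the intermediate activation at the first position still receives gradient, i.e. early-position representations are trained for future informativity rather than only for predicting their own next token.

```python
# Sketch (illustrative sizes): with causal attention, the activation at an
# EARLY position receives gradient from a loss computed ONLY at the FINAL
# position, so non-final-layer weights are not trained solely on predicting
# the immediately-next token at their own position.
import torch
import torch.nn as nn

torch.manual_seed(0)
seq_len, d_model, vocab = 8, 16, 32

embed = nn.Embedding(vocab, d_model)
layer = nn.TransformerEncoderLayer(d_model, nhead=4, dropout=0.0, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=2)
head = nn.Linear(d_model, vocab)

tokens = torch.randint(0, vocab, (1, seq_len))

# Standard causal mask: position i cannot attend to positions > i.
causal_mask = torch.triu(torch.full((seq_len, seq_len), float("-inf")), diagonal=1)

x = embed(tokens)
x.retain_grad()  # keep gradients on this intermediate (non-leaf) activation
hidden = encoder(x, mask=causal_mask)
logits = head(hidden)

# Loss from predicting at the final position only (dummy target).
loss = nn.functional.cross_entropy(logits[:, -1], tokens[:, 0])
loss.backward()

# The first token's representation gets gradient from the last position's
# loss, via attention: it is being trained to be informative far downstream.
assert x.grad[0, 0].abs().sum().item() > 0
```

The same construction with the loss summed over all positions shows each position’s activation accumulating gradient from every later position’s loss at once, which is the “conditioned future informativity” framing in the comment above.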
I greatly appreciate the effort in this reply, but I think it’s increasingly unclear to me how to make efficient progress on our disagreements, so I’m going to hop.
If you say “Indeed it’s provable that you can’t have a faster algorithm than those O(n³) and O(n⁴) approximations which cover all relevant edge cases accurately”, I am quite likely to go on a digression where I try to figure out what proof you’re pointing at and why you think it’s a fundamental barrier. It now seems, per a couple of your comments, that you don’t believe it’s a fundamental barrier, but at the same time it doesn’t feel like any position has been moved, so I’m left rather foggy about where progress has been made.
I think it’s very useful that you say
I’m not saying that AI can’t develop useful heuristic approximations for the simulation of gemstone-based nano-mechanical machinery operating in ultra-high vacuum. I’m saying that it can’t do so as a one-shot inference without any new experimental work
since this seems like a narrower place to scope our conversation. I read this to mean:
You don’t know of any in-principle barrier to solving this problem,
You believe the solution is underconstrained by available evidence.
I find the second point hard to believe, and I don’t really see where you have evidenced it.
As a maybe-relevant aside to that, wrt.
You’re saying that AI could take the garbage and by mere application of thought turn it into something useful. That’s not in line with the actual history of the development of useful AI outputs.
I think you’re talking of ‘mere application of thought’ like it’s not the distinguishing feature humanity has. I don’t care what’s ‘in line with the actual history’ of AI, I care what a literal superintelligence could do, and this includes a bunch of possibilities like:
Making inhumanly close observation of all existing data,
Noticing new, inhumanly-complex regularities in said data,
Proving new simplifying regularities from theory,
Inventing new algorithms for heuristic simulation,
Finding restricted domains where easier regularities hold,
Bifurcating problem space and operating over each plausible set,
Sending an interesting email to a research lab to get choice high-ROI data.
We can ignore the last one for this conversation. I still don’t understand why the others are deemed unreasonable ways of making progress on this task.
I appreciated the comments on time complexity but am skipping them, because I don’t expect at this point that they lie at the crux.
Thanks, I appreciate the attempt to clarify. I do think, though, that there’s some fundamental disagreement about what we’re arguing over here that’s making this less productive than it could be. For example,
The fact that this has been an extremely active area of research for over 80 years with massive real-world implications, and we’re no closer to finding such a simplified heuristic.
I think both:
Lack of human progress doesn’t necessarily mean the problem is intrinsically unsolvable by advanced AI. Humans often take a bunch of time before proving things.
It seems clear that algorithmic progress is happening, so it’s hardly a given that we’re no closer to a solution, unless you first circularly assume that there’s no solution to arrive at.
If you’re starting out with an argument that we’re not there yet, this makes me think more that there’s some fundamental disagreement about how we should reason about ASI, rather than that your belief is backed by a justification that would be convincing to me if only I had succeeded at eliciting it. Claiming that a thing is hard is at most a reason not to rule out that it’s impossible. It’s not a reason on its own to believe that it is impossible.
With regard to complexity,
I failed to understand the specific difference with protein folding. Protein folding is NP-hard, which is significantly harder than O(n³).
I failed to find the source for the claim that O(n³) or O(n⁴) are optimal. Actually, I’m pretty confused about how this is even a coherent claim; surely if the O(n³) algorithm is widely useful, then the O(n⁴) proof can’t be that strong a bound on practical usefulness? So why is this not true of the O(n³) proof as well?
It’s maybe true that protein folding is easier to computationally verify solutions to, but first, can you prove this, and second, on what basis are you claiming that existing knowledge is necessarily insufficient to develop better heuristics than the ones we already have? The claim doesn’t seem to me to go through.
It’s magical thinking to assume that an AI will just one-shot this into existence.
Please note that I’ve not been making the claim that ASI could necessarily solve this problem. I have been making the claim that the arguments in this post don’t usefully support the claim that it can’t. It is true that largely on priors I expect it should be able to, but my priors also aren’t particularly useful ones to this debate and I have tried to avoid making comments that are dependent on them.
And what reason do you have for thinking it can’t be usefully approximated in some sufficiently productive domain, that wouldn’t also invalidly apply to protein folding? I think it’s not useful to just restate that there exist reasons you know of, I’m aiming to actually elicit those arguments here.
Given Claude is not particularly censored in this regard (in the sense of refusing to discuss the subject), I expect the jailbreak here to only serve as priming.
Well yes, nobody thinks that existing techniques suffice to build de novo self-replicating nanomachines, but that means it’s not very informative to comment on the fallibility of this or that package or the time complexity of some currently known best approach without grounding in the necessity of that approach.
One has to argue instead based on the fundamental underlying shape of the problem, and saying accurate simulation is O(n⁷) is not particularly more informative to that than saying accurate protein folding is NP-hard. I think if the claim is that you can’t make directionally informative predictions via simulation for things meaningfully larger than helium, then one is taking the argument beyond where it can be validly applied. If the claim is not that, it would be good to hear it clearly stated.
Could you quote or else clearly reference a specific argument from the post you found convincing on that topic?
LeelaKnightOdds has convincingly beaten both Awonder Liang and Anish Giri at 3+2 by large margins, and has an extremely strong record at 5+3 against people who have challenged it.
I think 15+0 and probably also 10+0 would be a relatively easy win for Magnus based on Awonder, a ~150 elo weaker player, taking two draws at 8+3 and a win and a draw at 10+5. At 5+3 I’m not sure because we have so little data at winnable time controls, but wouldn’t expect an easy win for either player.
It’s also certainly not the case that these few-months-old networks running a somewhat improper algorithm are the best we could build; it’s known at minimum that this Leela is tactically weaker than normal and can drop endgame wins, even if humans rarely capitalize on that.