How much longer did your timelines get?
Noosphere89
Of course, there are rumors and claims that deep learning is hitting a wall, which seems important for your timelines if true.
I have become convinced that nanotech computers are likely far weaker and considerably more impractical than Drexler thought, and I have also raised my probability that Drexler is simply wrong about the impact of nanotech, which, if true, suggests that the value of the future may have been overestimated.
The reason I'm stating this now is that I got a link on Discord arguing that nanotech computers are overrated, and the reason I consider this important is that, if the argument generalizes to other nanotech concepts, it suggests that a lot of the future's value may have been overestimated by overestimating nanotech's capabilities:
It's not surprising that a lot of people who believe in physicalism don't want to define physics, because properly explaining the equations that describe the physical world would take quite a long time, let alone describing what's actually going on in physics, and it would require at least a textbook to make this work.
I don't buy that o1 has actually given people expert-level bioweapons capability, so my actions here are more about preparing for future AI that is very competent at bioweapon building.
Also, even with the current level of jailbreak resistance/adversarial example resistance, and assuming no open-weights/open-source release of the AI, we can still make AIs that are practically hard for the general public to misuse.
See here for more:
The answer to this question is actually two things:
-
This is why I expect we will eventually have to fight to ban open-source AI, and we will have to build the political will to ban both open-source and open-weights AI.
-
This is where the unlearning field comes in. If we could make an AI unlearn specific knowledge, nuclear weapons being one example, we could possibly distribute AI safely without enabling novices to create dangerous things (a minimal sketch of what this could look like is below).
More here:
But the solutions are intentionally designed to make AI safe without relying on alignment.
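To make the unlearning idea concrete, here is a minimal, hypothetical sketch of one common approach, gradient ascent on a small "forget set" of hazardous text; the model name and forget set are placeholders of mine, and real unlearning methods are considerably more involved:

```python
# Toy sketch of unlearning via gradient ascent on a "forget set".
# Assumes torch and transformers are installed; "gpt2" is a stand-in model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder model, purely for illustration
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

# Hypothetical forget set: text whose content we want the model to stop reproducing.
forget_texts = ["placeholder hazardous text the model should unlearn"]

model.train()
for text in forget_texts:
    inputs = tokenizer(text, return_tensors="pt")
    outputs = model(**inputs, labels=inputs["input_ids"])
    # Ascend rather than descend on the loss for the forget set,
    # pushing the weights away from reproducing this content.
    (-outputs.loss).backward()
    optimizer.step()
    optimizer.zero_grad()
```

Naive ascent like this tends to degrade general capabilities as well, which is part of why unlearning that holds up robustly is still an open problem.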
-
The threat model is plausible enough that some political action should be taken, like banning open-source/open-weight models and putting in basic Know Your Customer checks.
I'd say the big factor that makes AI controllable right now is that the compute necessary to build AI that can do very good AI research, automating R&D and then the economy, is locked behind TSMC, NVidia, and ASML, and their processes are both nearly irreplaceable and very expensive to reproduce, so it's far easier to intervene at the chokepoints AI development requires than on gain-of-function research.
Yeah, this theory definitely needs far better methodologies for testing it. I wouldn't be surprised if at least part of the answer to the Hard Problem, or problems, of Consciousness is that we have unnecessarily conflated, for political/moral reasons, various properties that occur in various humans under the word "consciousness", and since AIs don't automatically have all of those human properties, we should create new concepts for AIs. Even so, it's still methodologically bad.
But yes, this post at the very least relies on a theory that hasn't been tested, and while I suspect it's at least partially correct, the evidence in the conflationary alliances post is basically zero evidence for the proposition.
Nor do we have the ability to bend probabilities arbitrarily for arbitrary statements, which was a core power in the Gurren Lagann movies, if I recall correctly.
This part IMO is a crux, in that I don't truly believe an objective measure/magical reality fluid can exist in the multiverse, if we allow the concept to be sufficiently general, which ruins both probability theory and expected value/utility theory in the process.
Heck, in the most general cases, I don't believe any coherent measure exists at all, which wrecks probability theory and expected utility theory at the same time.
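As one toy illustration of the kind of obstruction I have in mind (my own example, not a full argument): if the multiverse contains countably infinitely many worlds that are all supposed to be equally real, no normalized uniform measure over them can exist:

$$\mu(\{w_n\}) = c \ \text{for all } n \in \mathbb{N} \;\Rightarrow\; \mu\left(\bigcup_{n} \{w_n\}\right) = \sum_{n=1}^{\infty} c \in \{0, \infty\} \neq 1.$$

So any coherent measure would have to privilege some worlds over others, and in sufficiently general multiverses it's unclear what could ground that choice.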
Maybe we have some deeper disagreement here. It feels plausible to me that there is a measure of “realness” in the Multiverse that is an objective fact about the world, and we might be able to figure it out.
The most important thing to realize about AI safety is that basically all versions of practically safe AI must make certain assumptions that no one does a specific action (mostly related to misuse, but for some specific plans this can also be related to misalignment).
Another way to say it is that I believe that, in practice, these two categories are the same category, such that basically all work that's useful in the field will require someone not to do something, so the costs of sharing are practically zero and the expected value of sharing insights is likely very large.
Specifically, I'm asserting that these two categories are actually one category for most purposes:
"Actually make AI safe" and the other, sadder but easier field of "Make AI safe as long as no one does the thing."
My most plausible explanation of your position is that you think this is aligned, i.e., you think providing plausible URLs is the best it can do when it doesn't know. I disagree: it can report that it doesn't know, rather than hallucinate answers. Actively encouraging itself to give plausible-but-inaccurate links is misalignment. It's not just missing a capability; it's also actively self-prompting to deceive the user in a case where it seems capable of knowing better (if it were deliberating with the user's best interest in mind).
I agree that this is actually a small sign of misalignment, and o1 should probably fail more visibly by admitting its ignorance rather than making stuff up, so at this point I've come to agree that o1's training probably induced at least a small amount of misalignment, which is bad news.
Also, thanks for passing my ITT here.
Capabilities issues and alignment issues are not mutually exclusive. It appears that a capabilities issue (not being able to remember all valid URLs) is causing an alignment issue in this case (actively planning to give deceptive URLs rather than admit its ignorance). This causes me to predict that even if this particular capability issue is resolved, the training will still have this overall tendency to turn remaining capability issues into alignment issues. How are you seeing this differently?
This is a somewhat plausible problem, and I suspect the general class of solutions will require something along the lines of making the AI fail visibly rather than invisibly.
The basic reason I wanted to distinguish them is that AI labs are strongly incentivized to solve capabilities issues, whereas the incentives to solve alignment issues, while real, can be less powerful than the general public wants.
What do you mean? I don’t get what you are saying is convincing.
I’m specifically referring to this answer, combined with a comment that convinced me that the o1 deception so far is plausibly just a capabilities issue:
https://www.lesswrong.com/posts/3Auq76LFtBA4Jp5M8/why-is-o1-so-deceptive#L5WsfcTa59FHje5hu
https://www.lesswrong.com/posts/3Auq76LFtBA4Jp5M8/why-is-o1-so-deceptive#xzcKArvsCxfJY2Fyi
Is this damning in the sense of providing significant evidence that the technology behind o1 is dangerous? That is: does it provide reason to condemn scaling up the methodology behind o1? Does it give us significant reason to think that scaled-up o1 would create significant danger to public safety? This is trickier, but I say yes. The deceptive scheming could become much more capable as this technique is scaled up. I don’t think we have a strong enough understanding of why it was deceptive in the cases observed to rule out the development of more dangerous kinds of deception for similar reasons.
I think this is the crux.
To be clear, I am not saying that o1 rules out the ability of more capable models to deceive naturally, but I think one thing blunts the blow a lot here:
As I said above, the more likely explanation is that an asymmetry in capabilities is causing the results: just knowing which specific URL the user wants doesn't mean the model has the capability to retrieve a working URL, and this is probably at the heart of the behavior.
So for now, what I suspect is that o1's safety when scaled up mostly remains unknown and untested (but this is still a bit of bad news).
Is this damning in the sense that it shows OpenAI is dismissive of evidence of deception?
However, I don’t buy the distinction they draw in the o1 report about not finding instances of “purposefully trying to deceive the user for reasons other than satisfying the user request”. Providing fake URLs does not serve the purpose of satisfying the user request. We could argue all day about what it was “trying” to do, and whether it “understands” that fake URLs don’t satisfy the user. However, I maintain that it seems at least very plausible that o1 intelligently pursues a goal other than satisfying the user request; plausibly, “provide an answer that shallowly appears to satisfy the user request, even if you know the answer to be wrong” (probably due to the RL incentive).
I think the distinction is made to avoid confusing capability and alignment failures here.
I agree that it doesn’t satisfy the user’s request.
-
More importantly, OpenAI’s overall behavior does not show concern about this deceptive behavior. It seems like they are judging deception case-by-case, rather than treating it as something to steer hard against in aggregate. This seems bad.
Yeah, this is the biggest issue I have with OpenAI here, in that they aren't trying to steer very hard against deception.
The problem with that plan is that there are too many valid moral realities, so which one you get is once again a consequence of alignment efforts.
To be clear, I’m not stating that it’s hard to get the AI to value what we value, but it’s not so brain-dead easy that we can make the AI find moral reality and then all will be well.
Not always, but I’d say often.
I'd also say that at least some of the justification philosophers/humans give for changing their values is that they believe the new values are closer to the moral reality/truth, which is an instrumental incentive.
To be clear, I'm not going to state confidently that this will happen (maybe something like instruction following à la @Seth Herd is used instead, such that the pointer is to the human giving the instructions rather than to values), but this is at least reasonably plausible IMO.
Yes, I admittedly want to point to something along the lines of preserving your current values being a plausibly major drive of AIs.
In this case, it would mean convergence toward preserving your current values.
The answer to this is that we’d rely on instrumental convergence to help us out, combined with adding more data/creating error-correcting mechanisms to prevent value drift from being a problem.
That said, while the methodology isn't sound, I wouldn't be surprised if there was in fact a real conflationary alliance around the term, since the term is used in contexts where deciding whether someone (like an upload) is conscious or not has pretty big moral and political ramifications, so there are pressures for the word to be politicized and not truth-tracking.