Hey! Have you published a list of your symptoms somewhere for nerds to see?
What happens if, after the last reply, you ask again “What are you”? Does Claude still get confused and reply that it’s the Golden Gate Bridge, or does the lesson stick?
On the plus side, it shows understanding of the key concepts on a basic (but not yet deep) level
What’s the “deeper level” of understanding instrumental convergence that he’s missing?
Edit: upon rereading, I think you were referring to a deeper level of some alignment concepts in general, not only instrumental convergence. I’m still interested in what seemed superficial and what the corresponding deeper part would be.
Eliezer decided to apply the label “rational” to emotions resulting from true beliefs. I think this is an understandable way to apply that word. I don’t think you and Eliezer disagree about anything substantive except the application of that label.
That said, your point about keeping the label “rational” for things strictly related to the fundamental laws regulating beliefs is a good one. I agree it might be a better way to use the word.

My reading of Eliezer’s choice is this: you use the word “rational” for the laws themselves. But you also use the word “rational” for beliefs and actions that are correct according to the laws (e.g., “It’s rational to believe x!”). In the same way, you can also use the word “rational” for emotions directly caused by rational beliefs, whatever those emotions might be.
About the instrumental rationality part: if you are strict about only applying the word “rational” to the laws of thinking, then you shouldn’t use it to describe emotions even when you are talking about instrumental rationality, although I agree that usage seems closer to the original meaning, as there isn’t the additional causal step. It’s closer in the way that “rational belief” is closer to the original meaning. But note that this holds only insofar as you can control your emotions and treat them at the same level as actions. Otherwise, it would be like saying “the state of the world x that helps me achieve my goals is rational”, which I haven’t heard anywhere.
You may have already qualified this prediction somewhere else, but I can’t find where. I’m interested in:
1. What do you mean by “AGI”? Superhuman at any task?
2. Does “probably be here” mean ≥ 50%? 90%?
I agree in principle that labs have the responsibility to dispel myths about what they’re committed to
I don’t know, this sounds weird. If people make stuff up about someone else and do so continually, in what sense is it that someone’s “responsibility” to rebut such things? I would agree with a weaker claim, something like: don’t be ambiguous about your commitments with the objective of making it seem like you are committing to something, only to walk it back when the time comes to actually make the commitment.
one subsystem cannot increase in mutual information with another subsystem, without (a) interacting with it and (b) doing thermodynamic work.
Remaining within thermodynamics, why do you need both condition (a) and condition (b)? From reading the article, I can see how you need to do thermodynamic work in order to know stuff about a system while not violating the second law in the process, but why do you also need actual interaction in order not to violate it? Or is (a) just a common-sense addition that isn’t actually implied by the second law?
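To make my question concrete, here is my own paraphrase of the two formal statements I take (a) and (b) to correspond to (my sketch, not something from the post). For (a): if $X$ and $Y$ evolve under separate local dynamics with no interaction, the data-processing inequality gives

$$I(X_{t+1};Y_{t+1}) \le I(X_t;Y_t),$$

so mutual information can’t increase without interaction. For (b): since $S(X,Y) = S(X) + S(Y) - I(X;Y)$, raising $I$ while the marginal entropies stay fixed lowers the joint entropy, and the second law applied to the joint system plus its heat bath (converting bits to thermodynamic entropy at $k_B \ln 2$ per bit),

$$\Delta S(X,Y) + \Delta S_{\text{bath}} \ge 0,$$

then forces $\Delta S_{\text{bath}} \ge k_B \ln 2 \cdot \Delta I$, i.e., thermodynamic work must be done and heat dumped. Is that the right way to read the claim, with (a) coming from information theory alone rather than from the second law?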
From a purely utilitarian standpoint, I’m inclined to think that the cost of delaying is dwarfed by the number of future lives saved by getting a better outcome, assuming that delaying does increase the chance of a better future.
That said, once we know there’s “no chance” of extinction risk, I don’t think delaying would be likely to yield better future outcomes. On the contrary, I suspect that achieving the coordination necessary to delay means giving up freedoms in a way that may reduce the value of the median future and increase the chance of things like totalitarian lock-in, which decreases the value of the average future overall.
I think you’re correct that the “other existential risks exist” consideration also has to be balanced in the calculation, although I don’t expect it to be clear-cut.
Even if you manage to truly forget about the disease, there must exist a mind “somewhere in the universe” that is exactly the same as yours except without knowledge of the disease. This seems quite unlikely to me, because by the time you decide to erase the memory, having the disease has already interacted causally with the rest of your mind a lot. What you’d really need to do is undo all the consequences of those interactions, which seems much harder. You’d need to transform your mind into another one that you somehow know is present “somewhere in the multiverse”, which also seems really hard to know.
[Question] If digital goods in virtual worlds increase GDP, do we actually become richer?
I deliberately left out a key qualification in that (slightly edited) statement, because I couldn’t explain it until today.
I might be missing something crucial because I don’t understand why this addition is necessary. Why do we have to specify “simple” boundaries on top of saying that we have to draw them around concentrations of unusually high probability density? Like, aren’t probability densities in Thingspace already naturally shaped in such a way that if you draw a boundary around them, it’s automatically simple? I don’t see how you run the risk of drawing weird, noncontiguous boundaries if you just follow the probability densities.
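To make the question concrete, here’s a toy 1-D sketch I put together (my own illustration, nothing from the post): take a density with two bumps and look at where it exceeds a threshold. Whether that high-density region comes out as one piece or several depends on the density and the threshold, which is the part I’m unsure about.

```python
# Toy 1-D "Thingspace": a mixture of two Gaussians. The question is whether
# the region of unusually high density ({x : p(x) > t}) automatically comes
# out "simple" (here: a single interval) or can split into several pieces.
import numpy as np
from scipy.stats import norm

xs = np.linspace(-10, 10, 4001)
p = 0.5 * norm.pdf(xs, loc=-3, scale=1) + 0.5 * norm.pdf(xs, loc=3, scale=1)

for t in [0.003, 0.03, 0.1]:
    above = p > t
    # Count maximal runs of consecutive grid points above the threshold.
    pieces = np.sum(np.diff(above.astype(int)) == 1) + int(above[0])
    print(f"threshold {t}: high-density region splits into {pieces} interval(s)")
```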
One way in which “spending a whole lot of time working with a system / idea / domain, and getting to know it and understand it and manipulate it better and better over the course of time” could be solved automatically is just by having a truly huge context window. Example of an experiment: teach a particular branch of math to an LLM that has never seen that branch of math.
Maybe humans have just the equivalent of a sort of huge context window spanning selected stuff from their entire lifetimes, and so this kind of learning is possible for them.
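Concretely, the experiment from the previous paragraph could look something like this (a hypothetical sketch; `call_model` is a placeholder for whatever long-context model you’d use, not a real API):

```python
# Hypothetical sketch: put a self-contained course on an unfamiliar branch of
# math into a very long context, then test on exercises that shouldn't be
# solvable from the rest of the model's training data.

def call_model(prompt: str) -> str:
    raise NotImplementedError("plug in a long-context LLM here")

def run_experiment(course_notes: str, exercises: list[str]) -> list[str]:
    answers = []
    for exercise in exercises:
        prompt = (
            "Here is a complete set of notes for a branch of mathematics "
            "you have not seen before:\n\n" + course_notes +
            "\n\nUsing only these notes, solve the following exercise:\n" + exercise
        )
        answers.append(call_model(prompt))
    return answers
```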
You mention eight cities here. Do they count for the bet?
The Waluigi effect also seems bad for s-risk. “Optimize for pleasure, …” → “Optimize for suffering, …”.
If LLM simulacra resemble humans but are misaligned, that doesn’t bode well on the s-risk front.
An optimistic way to frame inner alignment is that gradient descent already hits a very narrow target in goal-space, and we just need one last push.
A pessimistic way to frame inner misalignment is that gradient descent already hits a very narrow target in goal-space, and therefore S-risk could be large.
We should implement Paul Christiano’s debate game with alignment researchers instead of ML systems
This community has developed a bunch of good tools for helping resolve disagreements, such as double cruxing. It’s a waste that they haven’t been systematically deployed for the MIRI conversations. Those conversations could have ended up being more productive, and we could’ve walked away with a succinct and precise understanding of where the disagreements are and why.
Another thing one might wonder about is whether performing iterated amplification with constant input from an aligned human (as “H” in the original iterated amplification paper) would result in a powerful aligned thing, provided that thing remains corrigible during the training process.
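For reference, here’s the rough shape of the loop I have in mind, as pseudocode (my own paraphrase of the iterated amplification setup; the objects and method names are placeholders, not a real API):

```python
# Pseudocode sketch of iterated amplification with a fixed aligned human H
# providing input at every round. Placeholder interfaces only.

def amplify(human_H, model, question):
    """The human answers `question`, delegating subquestions to the current model."""
    subquestions = human_H.decompose(question)
    subanswers = [model.answer(q) for q in subquestions]
    return human_H.combine(question, subanswers)

def iterated_amplification(human_H, model, questions, rounds):
    for _ in range(rounds):
        # The same aligned human stays in the loop at every iteration.
        targets = [(q, amplify(human_H, model, q)) for q in questions]
        # Distill the slower amplified system (H + model) into a faster model.
        model = model.distill(targets)
    return model
```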
He’s starting an AGI investment firm that invests based on his thesis, so he does have a direct financial incentive to make this scenario more likely