AIS student, self-proclaimed aspiring rationalist, very fond of game theory.
“The only good description is a self-referential description, just like this one.”
momom2
Five minutes of thought on how this could be used for capabilities:
- Use behavioral self-awareness to improve training data (e.g. training on this dataset increases self-awareness of code insecurity, so it probably contains insecure code that can be fixed before training on it).
- Self-critique for iterative improvement within a scaffolding (already exists, but this work validates the underlying principles and may provide further grounding).
It sure feels like behavioral self-awareness should work just as well for self capability assessments as for safety topics, and that this ought to be usable to improve capabilities, but my 5 minutes are up and I don’t feel particularly threatened by what I found.
In general, given concerns that safety-intended work often ends up boosting capabilities, I would appreciate systematically including a section on why the authors believe their work is unlikely to have negative externalities.
(If you take time to think about this, feel free to pause reading and write your best solution in the comments!)
How about:
- Allocating energy everywhere to either twitching randomly or collecting nutrients. Assuming you are propelled by the twitching, this follows the gradient if there’s one.
- Try to grow in all directions. If there are no outside nutrients to fuel this growth, consume yourself. In this manner, regenerate yourself in the direction of the gradient.
- Try to grab nutrients from all directions. If there are nutrients, by reaction you will be propelled towards them, so this moves in the direction of the gradient.
Update after seeing the solution of B. subtilis: Looks like I had the wrong level of abstraction in mind. Also, I didn’t consider group solutions.
Contra 2:
ASI might provide a strategic advantage of a kind which doesn’t negatively impact the losers of the race, e.g. it increases GDP by x10 and locks competitors out of having an ASI.
Then, losing control of the ASI might not be capable of posing an existential risk to the US.
I think it’s quite likely this is what some policymakers have in mind: some sort of innovation which will make everything better for the country by providing a lot of cheap labor and generally improving productivity, the way we see AI applications do right now but on a bigger scale.
Comment on 3:
Not sure who your target audience is; I assume it would be policymakers, in which case I’m not sure how much weight that kind of argument has? I’m not a US citizen, but from international news I got the impression that current US officials would rather relish the option to undermine the liberal democracy they purport to defend.
From the disagreement between the two of you, I infer there is yet debate as to what environmentalism means. The only way to be a true environmentalist then is to make things as reversible as possible until such time as an ASI can explain what the environmentalist course of action regarding the Sun should be.
The paradox arises because the action-optimal formula mixes world states and belief states.
The [action-planning] formula essentially starts by summing up the contributions of the individual nodes as if you were an “outside” observer that knows where you are, but then calculates the probabilities at the nodes as if you were an absent-minded “inside” observer that merely believes to be there (to a degree).
So the probabilities you’re summing up are apples and oranges, so no wonder the result doesn’t make any sense. As stated, the formula for action-optimal planning is a bit like looking into your wallet more often, and then observing the exact same money more often. Seeing the same 10 dollars twice isn’t the same thing as owning 20 dollars.
If you want to calculate the utility and optimal decision probability entirely in belief-space (i.e. action-optimal), then you need to take into account that you can be at X, and already know that you’ll consider being at X again when you’re at Y.
So in belief space, your formula for the expected value also needs to take into account that you’ll forget, and the formula becomes recursive. Writing E for the expected utility, p for the probability of choosing CONTINUE, and a for the believed probability of being at X, the formula should actually be:
$$E = a\,\bigl(p \cdot E + (1-p) \cdot 0\bigr) + (1-a)\,\bigl(p \cdot 1 + (1-p) \cdot 4\bigr)$$
Explanation of the terms in order of appearance:
If we are in X and CONTINUE, then we will “expect the same value again” when we are in Y in the future. This enforces temporal consistency.
If we are in X and EXIT, then we should expect 0 utility.
If we are in Y and CONTINUE, then we should expect 1 utility.
If we are in Y and EXIT, then we should expect 4 utility.
We also know that a must be 1 / (1 + p), because when driving n times, you’re in X for n times, and in Y for p * n times.
Under that constraint, we get that
$$E = 4p - 3p^2$$
The optimum here is at p = 2/3 with an expected utility of 4/3, which matches the planning-optimal formula.
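As a sanity check, here is a minimal Python sketch of the argument above. It takes the four listed terms at face value (X/CONTINUE gives the same expected value again, X/EXIT gives 0, Y/CONTINUE gives 1, Y/EXIT gives 4) and assumes a = 1 / (1 + p). The function names and the grid search are my own illustrative choices, and the planning-optimal formula used for comparison is the usual one for these payoffs, which is not spelled out in the comment itself.

```python
def belief_space_utility(p):
    """Solve E = a*(p*E + (1-p)*0) + (1-a)*(p*1 + (1-p)*4) for E.

    Assumes a = 1 / (1 + p): driving n times puts you at X n times
    and at Y p*n times. The equation is linear in E, so it can be
    solved directly.
    """
    a = 1.0 / (1.0 + p)
    payoff_at_y = p * 1 + (1 - p) * 4
    return (1 - a) * payoff_at_y / (1 - a * p)


def planning_utility(p):
    """Assumed planning-optimal formula for the same payoffs.

    Exit at X: 0, continue then exit at Y: 4, continue twice: 1.
    """
    return (1 - p) * 0 + p * (1 - p) * 4 + p * p * 1


def best_p(utility, steps=3000):
    """Grid search for the CONTINUE probability maximizing `utility`."""
    grid = [i / steps for i in range(steps + 1)]
    return max(grid, key=utility)


if __name__ == "__main__":
    p_star = best_p(belief_space_utility)
    print("optimal p:", p_star)
    print("belief-space utility:", belief_space_utility(p_star))
    print("planning utility:", planning_utility(p_star))
```

If the recursion is set up as assumed, both functions should agree for every p in [0, 1], and the grid search should land at roughly p = 2/3 with a utility of roughly 4/3.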
[Shamelessly copied from a comment under this video by xil12323.]
Having read Planecrash, I do not think there is anything in this review that I would not have wanted to know before reading the work (which is the important part of what people consider “spoilers” for me).
Top of the head like when I’m trying to frown too hard
distraction had no effect on identifying true propositions (55% success for uninterrupted presentations, vs. 58% when interrupted); but did affect identifying false propositions (55% success when uninterrupted, vs. 35% when interrupted)
If you are confused by these numbers (why so close to 50%? why below 50%?), it’s because participants could pick among four options (corresponding to true, false, don’t know and never seen).
You can read the study, search for keyword “The Identification Test”.
I don’t see what you mean by the grandfather problem.
I don’t care about the specifics of who spawns the far future generation; whether it’s Alice or Bob I am only considering numbers here.
Saving lives now has consequences for the far future insofar as current people are irreplaceable: if they die, no one will make more children to compensate, resulting in a lower total far future population. Some deaths are less impactful than others for the far future.
That’s an interesting way to think about it, but I’m not convinced; killing half the population does not reduce the chance of survival of humanity by half.
In terms of individuals, only the last <.1% matter (not sure about the order of magnitude, but in any case it’s small as a proportion of the total).
It’s probably more useful to think in terms of events (nuclear war, misaligned ASI → prevent war, research alignment) or unsurvivable conditions (radiation, killer robots → build bunker, have kill switch) that can prevent humanity from recovering from a catastrophe.
Yes, that’s the first thing that was talked about in my group’s discussion on longtermism. For the sake of the argument, we were asked to assume that the waste processing/burial choice amounted to a trade in lives all things considered… but the fact that any realistic scenario resembling this thought experiment would not be framed like that is the central part of my first counterargument.
I enjoy reading any kind of cogent fiction on LW, but this one is a bit too undeveloped for my tastes. Perhaps be more explicit about what Myrkina sees in the discussion which relates to our world?
You don’t have to always spell earth-shattering revelations out loud (in fact it’s best to let the readers reach the correct conclusion by themselves imo), but there needs to be enough narrative tension to make the conclusion inevitable; as it stands, it feels like I can just meh my way out of thinking more than 30s on what the revelation might be, the same way Tralith does.
Thanks, it does clarify, both on separating the instantiation of an empathy mechanism in the human brain vs in AI and on considering instantiation separately from the (evolutionary or training) process that leads to it.
I was under the impression that empathy was explained by evolutionary psychology as a result of the need to cooperate, combined with the fact that we already had all the apparatus to simulate other people (like Jan Kulveit’s first proposition).
(This does not translate to machine empathy as far as I can tell.)
I notice that this impression is justified by basically nothing besides “everything is evolutionary psychology”. Seeing that other people’s intuitions about the topic are completely different is humbling; I guess emotions are not obvious.
So, I would appreciate if you could point out where the literature stands on the position you argue against, Jan Kulveit’s or mine (or possibly something else).
Are all these takes just, like, our opinion, man, or is there strong supportive evidence for a comprehensive theory of empathy (or is there evidence for multiple competing theories)?
I do not find this post reassuring about your approach.
Your plan is unsound; instead of a succession of events which need to go your way, I think you should aim for incremental marginal gains. There is no cost-effectiveness analysis, and the implicit theory of change is full of gaps.
Your press release is unreadable (poor formatting), and sounds like a conspiracy theory (catchy punchlines, ALL CAPS DEMANDS, alarmist vocabulary and unsubstantiated claims); I think it’s likely to discredit safety movements and raise attention in counterproductive ways.
The figures you quote are false (the median from AI Impacts is 5%) or knowingly misleading (the numbers from Existential risk from AI survey are far from robust and as you note, suffer from selection bias), so I think it’s fair to call them lies.
Your explanations for what you say in the press release sometimes don’t make sense! You conflate AGI and self-modifying systems, your explanation for “eventually” does not match the sentence.
Your arguments are based on wrong premises—it’s easy to check that your facts such as “they are not following the scientific method” are plain wrong. It sounds like you’re trying to smear OpenAI and Sam Altman as much as possible without consideration for whether what you’re saying is true.
I am appalled to see this was not downvoted into oblivion! My best guess is that people feel that there are not enough efforts going towards stopping AI and did not read the post and the press release to check that you have good reason motivating your actions.
I agree with the broad idea, but I’m going to need a better implementation.
In particular, the 5 criteria you give are insufficient because the example you give scores well on them, and is still atrocious: if we decreed that “black people” was unacceptable and should be replaced by “black peoples”, it would cause a lot of confusion on account of how similar the two terms are and how ineffective the change is.
The cascade happens because of a specific reason, and the change aims at resolving that reason. For example, “Jap” is used as a slur, and not saying it shows you don’t mean to use a slur. For black people/s, I guess the reason would be something like not implying that there is a single black people, which only makes sense in the context of a specialized discussion.
I can’t adhere to the criteria you proposed because they don’t work, and I don’t want to bother thinking that deep about every change of term on an everyday basis, so I’ll keep on using intuition to choose when to solve respectability cascades for now.
For deciding when to trigger a respectability cascade, your criteria are interesting for having any sort of principled approach, but I’m still not sure they outperform unconstrained discussion on the subject (which I assume is the default alternative for anyone who cares enough about deliberately triggering respectability cascades to have read your post in the first place).
A lot of your AI-risk reason to support Harris seems to hinge on this, which I find very shaky. How wide are your confidence intervals here?
My own guesses are much more fuzzy. According to your argument, if my intuition was .2 vs .5, then it’s an overwhelming case for Harris but I’m unfamiliar enough with the topic that it could easily be the reverse.
I would greatly appreciate more details on how you reach your numbers (and if they’re vibes, reason whether to trust those vibes).
Alternatively, I feel like I should somehow discount the strength of the AI-risk reason based on how likely I think these numbers are to more or less hold true, but I don’t know a principled way to do it.
Seems like you need to go beyond arguments of authority and stating your conclusions and instead go down to the object-level disagreements. You could say instead “Your argument for ~X is invalid because blah blah” and if Jacob says “Your argument for the invalidity of my argument for ~X is invalid because blah blah” then it’s better than before because it’s easier to evaluate argument validity than ground truth.
(And if that process continues ad infinitum, consider that someone who cannot evaluate the validity of the simplest arguments is not worth arguing with.)
It’s thought-provoking.
Many people here identify as Bayesians, but are as confused as Saundra by the troll’s questions, which indicates that they’re missing something important.
It wasn’t mine. I did grow up in a religious family, but becoming a rationalist came gradually, without sharp divide with my social network. I always figured people around me were making all sorts of logical mistakes though, and noticed very early deep flaws in what I was taught.
Thanks for writing this! I was unaware of the Chinese investment, which explains another recent piece of news that you did not include but that I think is significant: Nvidia’s stock plummeted 18% today.