What was the purpose of using octopuses in this metaphor? Like, it seems you’ve piled on so many disanalogies to actual octopuses (extremely smart, many generations per year, they use Slack...) that you might as well have just said “AIs.”
EDIT: Is it gradient descent vs. evolution?
If AI turns out to be very useful for cheaply writing formally verified code, what does that do for AI control? We can now request that, along with any code it writes, the untrusted AI also produce a spec and a certificate verifying that the code matches that spec.
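To make this concrete, here is a minimal sketch of what the code + spec + certificate bundle could look like, assuming something like Lean 4 as the verification language (the function and spec names are illustrative, not from any existing proposal). The point of the pattern is that the trusted side never has to trust the model: it only has to run the proof checker over the certificate.

```lean
-- Code: the implementation the untrusted model submits.
def myMax (a b : Nat) : Nat :=
  if a ≤ b then b else a

-- Spec: the property humans actually audit. Here: the result is an upper
-- bound on both inputs and is equal to one of them.
def maxSpec (f : Nat → Nat → Nat) : Prop :=
  ∀ a b, a ≤ f a b ∧ b ≤ f a b ∧ (f a b = a ∨ f a b = b)

-- Certificate: a machine-checked proof that the code meets the spec.
-- Checking this proof, not trusting the model, is what provides assurance.
theorem myMax_meets_spec : maxSpec myMax := by
  unfold maxSpec myMax
  intro a b
  by_cases h : a ≤ b <;> simp [h] <;> omega
```

If the proof checker accepts, humans only need to read and sign off on `maxSpec`, which is typically far shorter than the code it constrains; that asymmetry is what the questions below are probing.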
How bad a position does this put the untrusted model in when it’s trying to write code that does bad stuff? Some sub-questions towards answering that:

- How tight are the constraints that the formal verification process, at its AI-driven heights, can put on the code? It seems like once something is formally verified at all, that already rules out large classes of vulnerabilities an AI might exploit. Can we push this further and remove almost all attack vectors?
- How easy is it to write a spec that looks good to humans but actually contains a significant omission? (See the sketch after this list.)
- How feasible is it for the spec writer to collude with the code writer?
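On the spec-omission question, the classic failure mode is a spec that states a necessary property but not a sufficient one. A minimal sketch, again assuming Lean 4 and with all names hypothetical: a sorting spec that demands sorted output but never relates the output back to the input, so an implementation that simply discards the input still earns a valid certificate.

```lean
-- Adjacent elements are in order -- the only thing this (weak) spec asks for.
def isSorted : List Nat → Prop
  | [] => True
  | [_] => True
  | a :: b :: rest => a ≤ b ∧ isSorted (b :: rest)

-- The incomplete spec: plausible at a glance, but it omits the permutation
-- clause saying the output contains exactly the input's elements.
def sortSpec (f : List Nat → List Nat) : Prop :=
  ∀ xs, isSorted (f xs)

-- A useless "sort" that throws its input away...
def badSort (_xs : List Nat) : List Nat := []

-- ...and nevertheless comes with a perfectly valid certificate.
theorem badSort_meets_spec : sortSpec badSort := by
  unfold sortSpec badSort
  intro xs
  simp [isSorted]
```

A human skimming `sortSpec` could easily read "the output is sorted" as "this is a correct sort," which is exactly the kind of gap a scheming spec writer (or a colluding spec writer and code writer) would aim for.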