Edouard Harris

Karma: 581

Co-founder @ Gladstone AI.

Contact: edouard@gladstone.ai

Website: eharr.is

Edouard Harris Jan 13, 2025, 2:06 PM
3 points
2
in reply to: gwern’s comment on: Policymakers don’t have access to paywalled articles
Yeah that could be doable. Dylan’s pretty natsec focused already so I would guess he’d take a broad view of the ROI from something like this. From what I hear he is already in touch with some of the folks who are in the mix, which helps, but the core goal is to get random leaf node action officers this access with minimum friction. I think an unconditional discount to all federal employees probably does pass muster with the regs, though of course folks would still be paying something out of pocket. I’ll bring this up to SA next time we talk to them though, it might move the needle. For all I know, they might even be doing it already.

Edouard Harris Jan 12, 2025, 2:19 PM
7 points
2
in reply to: gwern’s comment on: Policymakers don’t have access to paywalled articles
Because of another stupid thing, which is that U.S. depts & agencies have strong internal regs against employees soliciting and/or accepting gifts other than in carefully carved out exceptional cases. For more on this, see, e.g., 5 CFR § 2635.204, but this isn’t the only such reg. In practice U.S. government employees at all levels are broadly prohibited from accepting any gift with a market value above 20 USD for example. (As you’d expect this leads to a lot of weird outcomes, including occasional hilarious minor diplomatic incidents with inexperienced foreign counterparties who have different gift giving norms.)

Edouard Harris Jan 11, 2025, 8:51 PM
10 points
3
on: Policymakers don’t have access to paywalled articles
Yep, can confirm this is true. And this often leads to shockingly stupid outcomes, such as key action officers at the Office of [redacted] in the Department of [redacted] not reading SemiAnalysis because they’d have to pay for their subscriptions out of pocket.

Edouard Harris Jan 11, 2025, 4:01 PM
LW: 7 AF: 6
6
AF
on: What’s the short timeline plan?
This is a great & timely post.

Edouard Harris Mar 21, 2024, 9:23 PM
38 points
0
on: On the Gladstone Report
Thanks very much for writing this. We appreciate all the feedback across the board, and I think this is a well-done and in-depth write-up.
On the specific numerical thresholds in the report (i.e., your Key Proposal section), I do need to make one correction that also applies to most of Brooks’s commentary. All the numerical thresholds mentioned in the report, and particularly in that subsection, are solely examples and not actual recommendations. They are there only to show how one can calculate self-consistent licensing thresholds under the principles we recommend. They are not themselves recommendations. We had to do it this way for the same reason we propose granting fairly broad rule-setting flexibility to the regulatory entity. The field is changing so quickly that any concrete threshold risks being out of date (for one reason or the other) in very short order. We would have liked to do otherwise, but that is not a realistic expectation for a report that we expect to be digested over the course of several months.
To avoid precisely this misunderstanding, the report states in several places that those very numbers are, in fact, only examples for illustration. A few screencaps of those disclaimers are below, but there are several others. Of course we could have included even more, but beyond a certain point one is simply adding more length to what you correctly point out is already quite a sizeable document. Note that the Time article, in the excerpt you quoted, does correctly note and acknowledge that the Tier 3 AIMD threshold is there as an example (emphasis added):
the report suggests, as an example, that the agency could set it just above the levels of computing power used to train current cutting-edge models like OpenAI’s GPT-4 and Google’s Gemini.
Apart from this, I do think overall you’ve done a good and accurate job of summarizing the document and offering sensible and welcome views, emphasis, and pushback. It’s certainly a long report, so this is a service to anyone who’s looking to go one or two levels deeper than the Executive Summary. We do appreciate you giving it a look and writing it up.

Edouard Harris Apr 21, 2023, 4:49 PM
LW: 1 AF: 1
0
AF
in reply to: Jsevillamol’s comment on: Announcing Epoch’s dashboard of key trends and figures in Machine Learning
Gotcha, that makes sense!

Edouard Harris Apr 20, 2023, 10:21 PM
LW: 1 AF: 1
0
AF
on: Announcing Epoch’s dashboard of key trends and figures in Machine Learning
Looks awesome! Minor correction on the cost of the GPT-4 training run: the website says $40 million, but sama confirmed publicly that it was over $100M (and several news outlets have reported the latter number as well).

Edouard Harris Nov 18, 2022, 2:59 PM
LW: 3 AF: 2
2
AF
in reply to: Neel Nanda’s comment on: Inverse scaling can become U-shaped
Done, a few days ago. Sorry, I thought I’d responded to this comment.

Edouard Harris Nov 9, 2022, 4:34 PM
LW: 4 AF: 3
0
AF
in reply to: Ethan Perez’s comment on: Inverse scaling can become U-shaped
Excellent context here, thank you. I hadn’t been aware of this caveat.

Edouard Harris Oct 20, 2022, 12:55 PM
LW: 3 AF: 2
0
AF
in reply to: Alex Flint’s comment on: Misalignment-by-default in multi-agent systems
Great question. This is another place where our model is weak, in the sense that it has little to say about the imperfect information case. Recall that in our scenario, the human agent learns its policy in the absence of the AI agent; and the AI agent then learns its optimal policy conditional on the human policy being fixed.
It turns out that this setup dodges the imperfect information question from the AI side, because the AI has perfect information on all the relevant parts of the human policy during its training. And it dodges the imperfect information question from the human side, because the human never considers even the existence of the AI during its training.
This setup has the advantage that it’s more tractable and easier to reason about. But it has the disadvantage that it unfortunately fails to give a fully satisfying answer to your question. It would be interesting to see if we can remove some of the assumptions in our setup to approximate the imperfect information case.

Edouard Harris Oct 14, 2022, 9:06 PM
LW: 1 AF: 1
0
AF
in reply to: Noosphere89’s comment on: Misalignment-by-default in multi-agent systems
Agreed. We think our human-AI setting is a useful model of alignment in the limit case, but not really so in the transient case. (For the reason you point out.)

Edouard Harris Oct 14, 2022, 7:11 PM
LW: 6 AF: 4
0
AF
in reply to: Alex Flint’s comment on: Misalignment-by-default in multi-agent systems
I think you might have reversed the definitions of $α_{H A}$ and $β_{H A}$ in your comment,^[1] but otherwise I think you’re exactly right.
To compute $β_{H A}$ (the correlation coefficient between terminal values), naively you’d have reward functions $R_{H}(s)$ and $R_{A}(s)$ that respectively assign human and AI rewards over every possible arrangement of matter $s$. Then you’d look at every such reward function pair over your joint distribution $D_{H A}$, and ask how correlated they are over arrangements of matter. If you like, you can imagine that the human has some uncertainty around both his own reward function over houses, and also over how well aligned the AI is with his own reward function.
And to compute $α_{H A}$ (the correlation coefficient between instrumental values), you’re correct that some of the arrangements of matter $s$ will be intermediate states in some construction plans. So if the human and AI both want a house with a swimming pool, they will both have high POWER for arrangements of matter that include a big hole dug in the backyard. Plot out their respective POWERs at each $s$, and you can read the correlation right off the alignment plot!
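As a rough illustration of the "how correlated are they" step, here is a minimal sketch. It assumes, purely for illustration, that each (human reward, AI reward) pair is drawn from a bivariate Gaussian with a chosen correlation `rho`; that distribution, and the helper names `pearson` and `sample_reward_pairs`, are hypothetical and not taken from the write-up.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def sample_reward_pairs(n_pairs, rho, seed=0):
    """Draw (R_H(s), R_A(s)) pairs from a stand-in joint distribution.

    Assumption: each pair is bivariate Gaussian with correlation rho.
    Each pair stands for one (sampled reward function pair, state s) combination.
    """
    rng = random.Random(seed)
    human, ai = [], []
    for _ in range(n_pairs):
        z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        human.append(z1)
        # Mix z1 into the AI reward so that corr(human, ai) equals rho.
        ai.append(rho * z1 + math.sqrt(1.0 - rho ** 2) * z2)
    return human, ai


r_h, r_a = sample_reward_pairs(n_pairs=10_000, rho=0.8)
beta_estimate = pearson(r_h, r_a)
print(f"estimated terminal-value correlation: {beta_estimate:.3f}")
```

The same `pearson` helper could in principle be applied to two lists of per-state POWER values to get the instrumental-value analogue, provided those POWERs have been computed separately.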
1. ^
  Looking again at the write-up, it would have made more sense for us to define $α_{H A}$ as the terminal goal correlation coefficient, since we introduce that one first. Alas, this didn’t occur to us. Sorry for the confusion.

Edouard Harris Oct 14, 2022, 6:54 PM
LW: 3 AF: 2
2
AF
in reply to: Alex Flint’s comment on: Misalignment-by-default in multi-agent systems
Good question. Unfortunately, one weakness of our definition of multi-agent POWER is that it doesn’t have much of use to say in a case like this one.
We assume AI learning timescales vastly outstrip human learning timescales as a way of keeping our definition tractable. So the only way to structure this problem in our framework would be to imagine a human is playing chess against a superintelligent AI — a highly distorted situation compared to the case of two roughly equal opponents.
On the other hand, from other results I’ve seen anecdotally, I suspect that if you gave one of the agents a purely random policy (i.e., take a random legal action at each state) and assigned the other agent some reasonable reward function distribution over material, you’d stand a decent chance of correctly identifying high-POWER states with high-mobility board positions.
You might also be interested in this comment by David Xu, where he discusses mobility as a measure of instrumental value in chess-playing.

Edouard Harris Oct 14, 2022, 6:37 PM
LW: 4 AF: 3
0
AF
in reply to: Alex Flint’s comment on: Instrumental convergence in single-agent systems
Thanks for your comment. These are great questions. I’ll do the best I can to answer here; feel free to ask follow-ups:
1. On pre-committing as a negotiating tactic: If I’ve understood correctly, this is a special case of the class of strategies where you sacrifice some of your own options (bad) to constrain those of your opponent (good). And your question is something like: which of these effects is strongest, or do they cancel each other out?
  
  It won’t surprise you that I think the answer is highly context-dependent, and that I’m not sure which way it would actually shake out in your example with Fred and Bob and the $5000. But interestingly, we did in fact discover an instance of this class of “sacrificial” strategies in our experiments!
  
  You can check out the example in Part 3 if you’re interested. But briefly, what happens is that when the agents get far-sighted enough, one of them realizes that there is instrumental value in having the option to bottle up the other agent in a dead-end corridor (i.e., constraining that other agent’s options). But it can only actually do this by positioning itself at the mouth of the corridor (i.e., sacrificing its own options). Here is a full-size image of both agents’ POWERs in this situation. You can see from the diagram that Agent A prefers to preserve its own options over constraining Agent H’s options in this case. But crucially, Agent A values the option of being able to constrain Agent H’s options.
  
  In the language of your negotiating example, there is instrumental value in preserving one’s option to pre-commit. But whether actually pre-committing is instrumentally valuable or not depends on the context.
2. On babies being more powerful than adults: Yes, I think your reasoning is right. And it would be relatively easy to do this experiment! All you’d need would be to define a “death” state, and set your transition dynamics so that the agent gets sent to the “death” state after N turns and can never escape from it afterwards. I think this would be a very interesting experiment to run, in fact.
3. On paperclip maximizers: This is a very deep and interesting question. One way to think about this schematically might be: a superintelligent paperclip maximizer will go through a Phase One, in which it accumulates its POWER; and then a Phase Two in which it spends the POWER it’s accumulated. During the accumulation phase, the system might drive towards a state where (without loss of generality) the Planet Earth is converted into a big pile of computronium. This computronium-Earth state is high-POWER, because it’s a common “way station” state for paperclip maximizers, thumbtack maximizers, safety pin maximizers, No. 2 pencil maximizers, and so on. (Indeed, this is what high POWER means.)
  
  Once the system has the POWER it needs to reach its final objective, it will begin to spend that POWER in ways that maximize its objective. This is the point at which the paperclip, thumbtack, safety pin, and No. 2 pencil maximizers start to diverge from one another. They will each push the universe towards sharply different terminal states, and the more progress each maximizer makes towards its particular terminal state, the fewer remaining options it leaves for itself if its goal were to suddenly change. Like a male praying mantis, a maximizer ultimately sacrifices its whole existence for the pursuit of its terminal goal. In other words: zero POWER should be the end state of a pure X-maximizer!^[1]
  
  My story here is hypothetical, but this is absolutely an experiment one can do (at small scale, naturally). The way to do it would be to run several rollouts of an agent, and plot the POWER of the agent at each state it visits during the rollout. Then we can see whether most agent trajectories have the property where their POWER first goes up (as they, e.g., move to topological junction points) and then goes down (as they move from the junction points to their actual objectives).
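A toy version of the rollout experiment might look like the sketch below. It is not the research codebase. It assumes the single-agent definition POWER(s) = ((1 − γ)/γ) · E_R[V*_R(s) − R(s)] with rewards drawn i.i.d. uniform on [0, 1], and it uses a made-up six-state deterministic world: a corridor (states 0 and 1) leading to a junction (state 2) that branches into three absorbing goal states (3, 4, 5). All function names are hypothetical.

```python
import random


def optimal_values(reward, successors, gamma, n_iters=200):
    """Value iteration for a deterministic MDP given as a successor list."""
    values = [0.0] * len(reward)
    for _ in range(n_iters):
        values = [
            reward[s] + gamma * max(values[nxt] for nxt in successors[s])
            for s in range(len(reward))
        ]
    return values


def estimate_power(successors, gamma, n_samples=1000, seed=0):
    """Monte Carlo estimate of POWER at every state (assumed definition above)."""
    rng = random.Random(seed)
    n_states = len(successors)
    totals = [0.0] * n_states
    for _ in range(n_samples):
        reward = [rng.random() for _ in range(n_states)]  # assumed reward prior
        values = optimal_values(reward, successors, gamma)
        for s in range(n_states):
            totals[s] += values[s] - reward[s]
    scale = (1.0 - gamma) / gamma
    return [scale * total / n_samples for total in totals]


def greedy_rollout(reward, successors, gamma, start, n_steps):
    """Follow the optimal policy for one fixed reward function."""
    values = optimal_values(reward, successors, gamma)
    path = [start]
    for _ in range(n_steps):
        path.append(max(successors[path[-1]], key=lambda s: values[s]))
    return path


# Corridor 0 -> 1 -> junction 2 -> one of three absorbing goals {3, 4, 5}.
successors = [[1], [2], [3, 4, 5], [3], [4], [5]]
gamma = 0.9
power = estimate_power(successors, gamma)
path = greedy_rollout([0.1, 0.2, 0.3, 0.9, 0.4, 0.5], successors, gamma, start=0, n_steps=3)
power_along_path = [power[s] for s in path]
print(path, [round(p, 3) for p in power_along_path])
```

In this toy world the estimated POWER rises along the corridor, peaks at the junction, and drops once the agent commits to a goal state, which is the up-then-down shape described above. Whether that shape holds for most trajectories in richer environments is exactly what the real experiment would have to check.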
Thanks again for your great questions. Incidentally, a big reason we’re open-sourcing our research codebase is to radically lower the cost of converting thought experiments like the above into real experiments with concrete outcomes that can support or falsify our intuitions. The ideas you’ve suggested are not only interesting and creative, they’re also cheaply testable on our existing infrastructure. That’s one reason we’re excited to release it!
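For the "death state" experiment in point 2, the transition dynamics could be set up along the following lines. This is only a sketch: states are assumed to be (cell, age) pairs, `build_aging_successors` and `reachable_count` are hypothetical helpers, and the reachable-state count is a crude stand-in for available options, not POWER itself.

```python
def build_aging_successors(base_successors, lifespan):
    """Wrap a deterministic world so the agent reaches "death" after `lifespan` turns.

    States are (cell, age) pairs flattened to age * n_cells + cell,
    followed by one absorbing death state.
    """
    n_cells = len(base_successors)
    death = n_cells * lifespan  # index of the single absorbing "death" state

    def index(cell, age):
        return age * n_cells + cell

    successors = []
    for age in range(lifespan):
        for cell in range(n_cells):
            if age + 1 < lifespan:
                # Moving always ages the agent by one turn.
                successors.append([index(nxt, age + 1) for nxt in base_successors[cell]])
            else:
                # Out of turns: the only transition left is into death.
                successors.append([death])
    successors.append([death])  # death can never be escaped
    return successors, death


def reachable_count(successors, start):
    """Number of distinct states reachable from `start` (including itself)."""
    seen, stack = {start}, [start]
    while stack:
        for nxt in successors[stack.pop()]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return len(seen)


# Two mutually reachable cells, lifespan of three turns.
base = [[0, 1], [0, 1]]
successors, death = build_aging_successors(base, lifespan=3)
young = reachable_count(successors, 0)  # cell 0 at age 0
old = reachable_count(successors, 4)    # cell 0 at age 2
print(young, old)
```

Any POWER estimator that accepts a successor list could then be run on `successors` to compare young and old states directly.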
1. ^
  Note that this assumes the maximizer is inner aligned to pursue its terminal goal, the terminal goal is stable on reflection, and all the usual similar incantations.

Edouard Harris Oct 12, 2022, 5:20 PM
LW: 4 AF: 3
0
AF
in reply to: Algon’s comment on: Instrumental convergence in single-agent systems
Yes, I think this is right. It’s been pointed out elsewhere that feature universality in neural networks could be an instance of instrumental convergence, for example. And if you think about it, to the extent that a “correct” model of the universe exists, then capturing that world-model in your reasoning should be instrumentally useful for most non-trivial terminal goals.
We’ve focused on simple gridworlds here, partly because they’re visual, but also because they’re tractable. But I suspect there’s a mapping between POWER (in the RL context) and generalizability of features in NNs (in the context of something like the circuits work linked above). This would be really interesting to investigate.

Edouard Harris Sep 5, 2022, 10:26 PM
LW: 1 AF: 1
0
AF
in reply to: TurnTrout’s comment on: The shard theory of human values
Got it. That makes sense, thanks!

Edouard Harris Sep 5, 2022, 1:33 PM
LW: 7 AF: 5
2
AF
on: The shard theory of human values
This is really interesting. It’s hard to speak too definitively about theories of human values, but for what it’s worth these ideas do pass my intuitive smell test.
One intriguing aspect is that, assuming I’ve followed correctly, this theory aims to unify different cognitive concepts in a way that might be testable:
- On the one hand, it seems to suggest a path to generalizing circuits-type work to the model-based RL paradigm. (With shards, which bid for outcomes on a contextually activated basis, being analogous to circuits, which contribute to prediction probabilities on a contextually activated basis.)
- On the other hand, it also seems to generalize the psychological concept of classical conditioning (Pavlov’s salivating dog, etc.), which has tended to be studied over the short term for practical reasons, to arbitrarily (?) longer planning horizons. The discussion of learning in babies also puts one in mind of the unfortunate Little Albert Experiment, done in the 1920s:
For the experiment proper, by which point Albert was 11 months old, he was put on a mattress on a table in the middle of a room. A white laboratory rat was placed near Albert and he was allowed to play with it. At this point, Watson and Rayner made a loud sound behind Albert’s back by striking a suspended steel bar with a hammer each time the baby touched the rat. Albert responded to the noise by crying and showing fear. After several such pairings of the two stimuli, Albert was presented with only the rat. Upon seeing the rat, Albert became very distressed, crying and crawling away.

[...]

In further experiments, Little Albert seemed to generalize his response to the white rat. He became distressed at the sight of several other furry objects, such as a rabbit, a furry dog, and a seal-skin coat, and even a Santa Claus mask with white cotton balls in the beard.
A couple more random thoughts on stories one could tell through the lens of shard theory:
- As we age, if all goes well, we develop shards with longer planning horizons. Planning over longer horizons requires more cognitive capacity (all else equal), and long-horizon shards do seem to have some ability to either reinforce or dampen the influence of shorter-horizon shards. This is part of the continuing process of “internally aligning” a human mind.
- Introspectively, I think there is also an energy cost involved in switching between “active” shards. Software developers understand this as context-switching, actively dislike it, and evolve strategies to minimize it in their daily work. I suspect a lot of the biases you might categorize under “resistance to change” (projection bias, sunk cost fallacy and so on) have this as a factor.
I do have a question about your claim that shards are not full subagents. I understand that in general different shards will share parameters over their world-model, so in that sense they aren’t fully distinct — is this all you mean? Or are you arguing that even a very complicated shard with a long planning horizon (e.g., “earn money in the stock market” or some such) isn’t agentic by some definition?
Anyway, great post. Looking forward to more.

Edouard Harris Jun 28, 2022, 12:23 PM
LW: 10 AF: 6
7
AF
on: Announcing Epoch: A research organization investigating the road to Transformative AI
Nice. Congrats on the launch! This is an extremely necessary line of effort.

Edouard Harris Jun 15, 2022, 2:56 PM
LW: 1 AF: 1
0
AF
in reply to: DaemonicSigil’s comment on: AGI Ruin: A List of Lethalities
Interesting. The specific idea you’re proposing here may or may not be workable, but it’s an intriguing example of a more general strategy that I’ve previously tried to articulate in another context. The idea is that it may be viable to use an AI to create a “platform” that accelerates human progress in an area of interest to existential safety, as opposed to using an AI to directly solve the problem or perform the action.
Essentially:
1. A “platform” for work in domain X is something that removes key constraints that would otherwise have consumed human time and effort when working in X. This allows humans to explore solutions in X they wouldn’t have previously — whether because they’d considered and rejected those solution paths, or because they’d subconsciously trained themselves not to look in places where the initial effort barrier was too high. Thus, developing an excellent platform for X allows humans to accelerate progress in domain X relative to other domains, ceteris paribus. (Every successful platform company does this. e.g., Shopify, Amazon, etc., make valuable businesses possible that wouldn’t otherwise exist.)
2. For certain carefully selected domains X, a platform for X may plausibly be relatively easier to secure & validate than an agent that’s targeted at some specific task x ∈ X would be. (Not easy; easier.) It’s less risky to validate the outputs of a platform and leave the really dangerous last-mile stuff to humans, than it would be to give an end-to-end trained AI agent a pivotal command in the real world (i.e., “melt all GPUs”) that necessarily takes the whole system far outside its training distribution. Fundamentally, the bet is that if humans are the ones doing the out-of-distribution part of the work, then the output that comes out the other end is less likely to have been adversarially selected against us.
(Note that platforms are tools, and tools want to be agents, so a strategy like this is unlikely to arise along the “natural” path of capabilities progress other than transiently.)
There are some obvious problems with this strategy. One is that point 1 above is no help if you can’t tell which of the solutions the humans come up with are good, and which are bad. So the approach can only work on problems that humans would otherwise have been smart enough to solve eventually, given enough time to do so (as you already pointed out in your example). If AI alignment is such a problem, then it could be a viable candidate for such an approach. Ditto for a pivotal act.
Another obvious problem is that capabilities research might benefit from the same platforms that alignment research can. So actually implementing this in the real world might just accelerate the timeline for everything, leaving us worse off. (Absent an intervention at some higher level of coordination.)
A third concern is that point 2 above could be flat-out wrong in practice. Asking an AI to build a platform means asking for generalization, even if it is just “generalization within X”, and that’s playing a lethally dangerous game. In fact, it might well be lethal for any useful X, though that isn’t currently obvious to me. For example, AlphaFold2 is a primitive example of a platform that’s useful and non-dangerous, though it’s not useful enough for this.
On top of all that, there are all the steganographic considerations — AI embedding dangerous things in the tool itself, etc. — that you pointed out in your example.
But this strategy still seems like it could bring us closer to the Pareto frontier for critical domains (alignment problem, pivotal act) than directly training an AI to do the dangerous action would.

Edouard Harris Feb 21, 2022, 6:02 PM
LW: 1 AF: 1
AF
in reply to: Alex Flint’s comment on: Alignment versus AI Alignment
Yep, I’d say I intuitively agree with all of that, though I’d add that if you want to specify the set of “outcomes” differently from the set of “goals”, then that must mean you’re implicitly defining a mapping from outcomes to goals. One analogy could be that an outcome is like a thermodynamic microstate (in the sense that it’s a complete description of all the features of the universe) while a goal is like a thermodynamic macrostate (in the sense that it’s a complete description of the features of the universe that the system can perceive).
This mapping from outcomes to goals won’t be injective for any real embedded system. But in the unrealistic limit where your system is so capable that it has a “perfect ontology” (i.e., its perception apparatus can resolve every outcome / microstate from any other), this mapping converges to the identity function, and the system’s set of possible goals converges to its set of possible outcomes. (This is the dualistic case, e.g., AIXI and such. But plausibly, we also should expect a self-improving system to improve its own perception apparatus such that its effective goal-set becomes finer and finer with each improvement cycle. So even this partition over goals can’t be treated as constant in the general case.)
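One hedged way to write the microstate/macrostate picture down is sketched here. The notation is introduced only for illustration and does not appear in the discussion above: $\Omega$ for the set of outcomes, $G$ for the set of goals, and $\pi$ for the system's perception map, taken to be surjective.

```latex
% Assumed notation: \Omega = outcomes (microstates), G = goals (macrostates),
% \pi = the system's perception map, assumed surjective.
\[
  \pi : \Omega \to G, \qquad
  \Omega = \bigsqcup_{g \in G} \pi^{-1}(g)
\]
% Embedded system: \pi is not injective, so some goal g has |\pi^{-1}(g)| > 1.
% "Perfect ontology" limit: \pi becomes a bijection, so G \cong \Omega
% and \pi can be identified with \mathrm{id}_{\Omega}.
% Self-improvement: a sequence \pi_1, \pi_2, \dots in which each \pi_{k+1}
% refines \pi_k, i.e. \pi_k = f_k \circ \pi_{k+1} for some f_k : G_{k+1} \to G_k.
```

Under this reading, each goal is a cell of a partition of the outcome space, and an improvement in perception corresponds to splitting cells into finer ones.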