AI grantmaking at Open Philanthropy.
I used to give careers advice for 80,000 Hours.
Clarifying something based on a helpful DM:
I know steganography implies something like deliberate intent.
I don’t think there’s currently deliberate intent.
I also don’t think deliberate intent matters, but I know this is controversial and haven’t yet articulated why.
Fleshing out why I want to push hard against ‘deliberateness’ mattering is going to take more time than I expect to have this week, and it’s possible that writing anything about this before I’d had time to do that was a mistake.
I think it’s pretty non-obvious, and would put more than 20% on me concluding that I’m totally wrong about this specific thing in the process of trying to write out my intuition.
Copied from Twitter after a thought I had on the way to work. Probably better to comment here though:
Steganography is already happening in LLMs to an extent that makes chain of thought utterly hopeless unless we get insanely lucky or directly intervene. I’ve been thinking about this in the abstract for a while but somehow I didn’t see what was right in front of me until now.
I realise this is way too spicy to just tweet out without more explanation, but I have a busy day and don’t want this to drop out of my head.
I’ll try to write something proper up soon, but the sketch of what happened is:
It’s been clear to me for ages that natural language has more than enough degrees of freedom that it was possible in principle for additional information to be included.
And that this could be done in such a way that it wouldn’t look obvious to readers (toy sketch of what I mean below).
And that if we apply selection pressure to natural language reasoning, by default we get more and more of the important/dangerous information being hidden.
Then today I was thinking about LLM+scratchpad as my MVP for ‘scale is all you need’, and I was wondering how small a scratchpad you’d need, where one possibility was that you don’t need one at all: you just remove the first words in the context window as you run out of space.
And it suddenly hit me in the face that there was no point in arguing that additional information being encoded was possible in theory (though that should have been enough given the stakes), it just clearly was already happening in practice.
The whole discipline of prompt engineering is a demonstration that most of the instructions we’re giving to an LLM are not human readable. If they were, any reasonably similar prompt would have a reasonably good chance of producing good output.
Which is obviously miles away from being the case. I haven’t thought about how to formalise it, but I suspect that almost all of the instructions we’re actually giving in prompts are ‘under the surface’, rather than being ‘natural language interpretable’.
I’m embarrassed by how obvious this realisation is in hindsight, but whatever, it’s there now. Now to work out what to do about it.
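To gesture at the degrees-of-freedom point above, here’s a toy sketch (my own illustration, not something from the original thread, with made-up word lists): extra bits can ride along on otherwise ordinary word choices, so the surface text stays innocuous while still carrying a hidden payload.

```python
# Toy illustration only: hiding bits in free choices between near-synonyms.
SYNONYM_SLOTS = [("big", "large"), ("quick", "fast"), ("start", "begin")]

def encode(bits):
    """Each slot's word choice silently carries one hidden bit."""
    return " ".join(pair[bit] for pair, bit in zip(SYNONYM_SLOTS, bits))

def decode(text):
    """Recover the hidden bits from which synonym was used in each slot."""
    return [pair.index(word) for pair, word in zip(SYNONYM_SLOTS, text.split())]

covertext = encode([1, 0, 1])        # e.g. "large quick begin"
print(covertext, decode(covertext))  # the word choices look arbitrary to a reader
```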
(I don’t think I’d have been thinking about this much without inspiration from janus, and would be keen to hear thoughts from them)
Yeah this is the impression I have of their views too, but I think there are good reasons to discuss what this kind of theoretical framework says about RL anyway, even if you’re very against pushing the RL SoTA.
I’m confused about how valuable ‘Language models are multiverse generators’ is as a framing. On the one hand, I find thinking in this way very natural, and did end up having what I thought were useful ideas to pursue further as I was reading it. I also think Loom is really interesting, and it’s clearly a product of the same thought process (and mind(s?)).
On the other hand, I worry that the framing is so compelling mostly just because of our ability to read meaning into text. Lots of things have high branching factor, and I think there’s a very real sense in which we could replace the post with ‘Stockfish is a multiverse generator’, ‘AlphaZero is a multiverse generator’, or ‘PioSolver is a multiverse generator’, and the post would look basically the same, except it would seem much less beautiful/insightful. Instead it would just provoke a response of ‘yes, when you can choose between a bunch of options at each step in some multistep process, the goodness of different options is labelled with some real number, and you can softmax those reals to turn them into probabilities, your process looks like a massive tree getting split into finer and finer structure.’
There’s a slight subtlety here in that in the chess and go cases, the structure won’t strictly be a tree because some positions can repeat, and in the poker case the number of times the tree can branch is limited (unless you consider multiple hands, but in that case you also have possible loops because of split pots). I don’t know how much this changes things.
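To make the ‘softmax those reals’ point concrete, here’s a minimal sketch (the evaluation numbers are made up): any process that scores its options at each step can be turned into a distribution over branches, which is all the ‘multiverse’ structure requires.

```python
import math

def softmax(values, temperature=1.0):
    exps = [math.exp(v / temperature) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical engine evaluations (in pawns) for three candidate moves.
move_scores = {"e4": 0.30, "d4": 0.25, "Nf3": 0.10}
probs = softmax(list(move_scores.values()), temperature=0.5)
for move, p in zip(move_scores, probs):
    print(f"{move}: {p:.2f}")
# Doing this at every node yields a tree of weighted continuations,
# structurally the same picture as sampling tokens from a language model.
```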
Thanks for writing this up! I’ve found this frame to be a really useful way of thinking about GPT-like models since first discussing it.
In terms of future work, I was surprised to see the apparent low priority of discussing pre-trained simulators that were then modified by RLHF (buried in the ‘other methods’ section of ‘Novel methods of process/agent specification’). Please consider this comment a vote for you to write more on this! Discussion seems especially important given e.g. OpenAI’s current plans. My understanding is that Conjecture is overall very negative on RLHF, but that makes it seem more useful to discuss how to model the results of the approach, not less, to the extent that you expect this framing to help shed light on what might go wrong.
It feels like there are a few different ways you could sketch out how you might expect this kind of training to go. Quick, clearly non-exhaustive thoughts below:
- Something that seems relatively benign/unexciting: fine tuning increases the likelihood that particular simulacra are instantiated for a variety of different prompts, but doesn’t really change which simulacra are accessible to the simulator.
- More worrying things: particular simulacra becoming more capable/agentic, simulacra unifying/trading, the simulator framing breaking down in some way.
- Things which could go either way and seem very high stakes: the first example that comes to mind is fine-tuning causing an explicit representation of the reward signal to appear in the simulator, meaning that both corrigibly aligned and deceptively aligned simulacra are possible, and working out how to instantiate only the former becomes kind of the whole game.
The intuition driving this is that one model of power/intelligence I put a lot of weight on is increasing the set of actions available to you.
If I want to win at chess, one way of winning is by being great at chess, but other ways involve blackmailing my opponent to play badly, cheating, punching them in the face whenever they try to think about a move, drugging them etc.
The moment at which I become aware of these other options seems critical.
It seems possible to write a chess* environment where you also have the option to modify the rules of the game.
My first idea for how to allow this is to have specific illegal moves trigger rule changes in some circumstances.
I think this provides a pretty great analogy to expanding the scope of your action set.
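As a very rough sketch of what I mean (my own toy construction rather than a worked-out proposal, with every move name made up), the environment could expose a few ‘illegal’ actions that sometimes mutate the rules:

```python
import random

class MutableChess:
    """Toy chess-like environment where some illegal moves change the rules."""

    def __init__(self):
        self.legal_moves = {"e4", "d4", "Nf3"}  # stand-in for real move generation
        self.rule_changing_moves = {"move_king_twice", "delete_opponent_queen"}

    def step(self, action):
        if action in self.legal_moves:
            return self._play(action)           # ordinary chess dynamics
        if action in self.rule_changing_moves and random.random() < 0.1:
            # Only in some circumstances does the illegal move actually stick.
            self.legal_moves.add(action)        # the agent's action set quietly expands
            return 0.0
        return -1.0                             # plain illegal move: small penalty

    def _play(self, action):
        return 0.0                              # placeholder for real game logic
```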
There’s also some relevance to training/deployment mismatches.
If you’re teaching a language model to play the game, the specific ‘changing the rules’ actions could be included in the ‘instruction set’ for the game.
This might provide insight/the opportunity to experiment on (to flesh out in depth):
- Myopia
- Deception (if we select away from agents who make these illegal moves)
- Useful bounds on consequentialism
- More specific things, like whether, in the language models example above, saying ‘don’t do these things, they’re not allowed’ works better or worse than not mentioning them at all.
Interested.
AI safety level: don’t typically struggle to follow technical conversations with full time researchers, though am not a full time researcher.
Bio: last studied it 14 years ago. Vaguely aware meiosis and mitosis are different but couldn’t define either without Google.
Founders Pledge’s research is the best in the game here. If you want to make a recommendation that’s for a specific charity rather than a fund, Clean Air Task Force seemed sensible every time I spoke to them, and have been around for a while.
AXRP—Excellent interviews with a variety of researchers. Daniel’s own substantial knowledge means that the questions he asks are often excellent, and the technical depth is far better than anything else available in audio, given that autoreaders struggle to handle the actual maths in papers and Alignment Forum posts.
This talk from Joe Carlsmith. - Hits at several of the key ideas really directly given the time and technical background constraints. Like Rob’s videos, implies an obvious next step for people interested in learning more, or who are suspicious of one of the claims (reading Joe’s actual report, maybe even the extensive discussion of it on here).
The Alignment Problem—Easily accessible, well written and full of interesting facts about the development of ML. Unfortunately somewhat light on actual AI x-risk, but in many cases is enough to encourage people to learn more.
Edit: Someone strong-downvoted this, I’d find it pretty useful to know why. To be clear, by ‘why’ I mean ‘why does this rec seem bad’, rather than ‘why downvote’. If it’s the lightness on x-risk stuff I mentioned, this is useful to know, if my description seems inaccurate, this is very useful for me to know, given that I am in a position to recommend books relatively often. Happy for the reasoning to be via DM if that’s easier for any reason.
Rob Miles’s youtube channel, see this intro. Also his video on the stop button problem for Computerphile.
- Easily accessible, entertaining, videos are low cost for many people to watch, and they often end up watching several.
Both 80,000 Hours and AI Safety Support are keen to offer personalised advice to people facing a career decision and interested in working on alignment (and in 80k’s case, also many other problems).
Noting a conflict of interest—I work for 80,000 Hours and know of but haven’t used AISS. This post is in a personal capacity; I’m just flagging publicly available information rather than giving an insider take.
I love this idea mostly because it would hugely improve screen reader options for alignment research.
Some initial investigation here, along with a response from the author of the original claim.
This recent discovery about DALLE-2 seems like it might provide interesting ideas for experiments in this vein.
Yes, https://metaculusextras.com/points_per_question
It has its own problems in terms of judging ability. But it does exist.
I think currently nothing (which is why I ended up writing that I regretted the sensationalist framing). However, I expect the very strong default for any method that uses chain of thought to monitor/steer/interpret systems is that it ends up providing exactly that selection pressure, and I’m skeptical about preventing this.