Jessica Rumbelow

Karma: 1,061

AI interpretability researcher

Jessica Rumbelow 7 Mar 2023 14:22 UTC
2 points
0
in reply to: 1a3orn’s comment on: Introducing Leap Labs, an AI interpretability startup
Thanks! Unsure as of yet – we could either keep it proprietary and provide access through an API (with some free version for select researchers), or open source it and monetise by offering a paid, hosted tier with integration support. Discussions are ongoing.

Jessica Rumbelow 7 Mar 2023 14:20 UTC
5 points
0
in reply to: Jay Bailey’s comment on: Introducing Leap Labs, an AI interpretability startup
This isn’t set in stone, but likely we’ll monetise by selling access to the interpretability engine, via an API. I imagine we’ll offer free or subsidised access to select researchers/orgs. Another route would be to open source all of it, and monetise by offering a paid, hosted version with integration support etc.

Jessica Rumbelow 7 Mar 2023 14:16 UTC
11 points
8
in reply to: Søren Elverlin’s comment on: Introducing Leap Labs, an AI interpretability startup
Good questions. Doing any kind of technical safety research that leads to better understanding of state of the art models carries with it the risk that by understanding models better, we might learn how to improve them. However, I think that the safety benefit of understanding models outweighs the risk of small capability increases, particularly since any capability increase is likely heavily skewed towards model-specific interventions (e.g. “this specific model trained on this specific dataset exhibits bias x in domain y, and could be improved by retraining with more varied data from domain y”, rather than “the performance of all models of this kind could be improved with some intervention z”). I’m thinking about this a lot at the moment and would welcome further input.

Introducing Leap Labs, an AI interpretability startup

Jessica Rumbelow 6 Mar 2023 16:16 UTC

99 points

14 Feb 2023 10:17 UTC

90 points

Jessica Rumbelow 6 Feb 2023 21:18 UTC
4 points
0
in reply to: ChrisCundy’s comment on: SolidGoldMagikarp (plus, prompt generation)
Yeah! Basically we just perform gradient descent on sensibly initialised embeddings (cluster centroids, or points close to the target output), constrain the embeddings to length 1 during the process, and penalise distance from the nearest legal token. We optimise the input embeddings to minimise the -log prob of the target output token(s), i.e. to maximise their probability. Happy to have a quick call to go through the code if you like, DM me :)
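For readers who want the shape of that procedure in code, here is a minimal toy sketch. It is not the actual implementation discussed in the comment. It assumes a single input embedding, a stand-in linear "model" `W` in place of a real language model, and it reads "length 1" as unit Euclidean norm. The names `optimise_prompt`, `nearest_token`, `lam` and `lr` are all hypothetical.

```python
import math

def normalise(v):
    """Rescale v to unit Euclidean norm (assumed reading of 'length 1')."""
    n = math.sqrt(sum(a * a for a in v))
    return [a / n for a in v]

def softmax(z):
    m = max(z)
    e = [math.exp(a - m) for a in z]
    s = sum(e)
    return [a / s for a in e]

def nearest_token(x, vocab):
    """Index of the legal token embedding closest to x (squared Euclidean)."""
    return min(range(len(vocab)),
               key=lambda j: sum((a - b) ** 2 for a, b in zip(x, vocab[j])))

def optimise_prompt(W, vocab, target, x0, steps=200, lr=0.1, lam=0.1):
    """Projected gradient descent on one input embedding x.

    Loss = -log softmax(W x)[target] + lam * ||x - nearest legal token||^2
    W is a toy linear stand-in for a real model (an assumption of this sketch).
    """
    x = normalise(x0)
    for _ in range(steps):
        logits = [sum(w * a for w, a in zip(row, x)) for row in W]
        p = softmax(logits)
        # Gradient of -log p[target] with respect to the logits.
        dlogits = [pi - (1.0 if i == target else 0.0) for i, pi in enumerate(p)]
        # Chain rule back to the input embedding: W^T dlogits.
        grad = [sum(dlogits[i] * W[i][k] for i in range(len(W)))
                for k in range(len(x))]
        # Distance regulariser: pull x towards the nearest legal token.
        e = vocab[nearest_token(x, vocab)]
        grad = [g + 2.0 * lam * (a - b) for g, a, b in zip(grad, x, e)]
        # Gradient step, then project back onto the unit sphere.
        x = normalise([a - lr * g for a, g in zip(x, grad)])
    return x, nearest_token(x, vocab)

# Toy setup: four unit-norm "token" embeddings in 2D.
vocab = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
W = [[5.0 * a for a in e] for e in vocab]  # logit i = 5 * (token_i . x)
x, token = optimise_prompt(W, vocab, target=1, x0=[1.0, 0.2])
print(token, x)
```

The optimised embedding is snapped to its nearest legal token at the end. With a real model the same loop would run over a sequence of embeddings and use automatic differentiation rather than a hand-written gradient.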

6 Feb 2023 19:09 UTC

109 points

5 Feb 2023 22:02 UTC

664 points

Jessica Rumbelow 5 Feb 2023 21:06 UTC
4 points
0
in reply to: neverix’s comment on: SolidGoldMagikarp (plus, prompt generation)
Interesting! Can you give a bit more detail or share code?

Jessica Rumbelow 5 Feb 2023 21:00 UTC
2 points
0
in reply to: neverix’s comment on: SolidGoldMagikarp (plus, prompt generation)
Interesting, thanks. There’s not a whole lot of detail there—it looks like they didn’t do any distance regularisation, which is probably why they didn’t get meaningful results.

Jessica Rumbelow 5 Feb 2023 20:58 UTC
4 points
0
in reply to: lsusr’s comment on: SolidGoldMagikarp (plus, prompt generation)
I’ll check with Matthew—it’s certainly possible that not all tokens in the “weird token cluster” elicit the same kinds of responses.

Jessica Rumbelow 5 Feb 2023 19:24 UTC
3 points
0
in reply to: mic’s comment on: SolidGoldMagikarp (plus, prompt generation)
Not yet, but there’s no reason why it wouldn’t be possible. You can imagine microscope AI, for language models. It’s on our to-do list.

Jessica Rumbelow 5 Feb 2023 12:11 UTC
2 points
0
in reply to: Charlie Steiner’s comment on: SolidGoldMagikarp (plus, prompt generation)
Yep, aside from running forward prop n times to generate an output of length n, we can just optimise the mean probability of the target tokens at each position in the output. It’s already implemented in the code, although it takes way longer to find optimal completions.
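As a rough illustration of that multi-token objective, the sketch below computes the mean probability assigned to the target token at each output position. It assumes the per-position distributions have already been obtained from n forward passes, which is outside this sketch. The function name `mean_target_prob` and the toy numbers are hypothetical.

```python
def mean_target_prob(step_probs, targets):
    """Mean probability of the target token across output positions.

    step_probs[t] is the model's distribution over tokens at position t,
    targets[t] is the index of the desired token at that position.
    """
    assert len(step_probs) == len(targets)
    return sum(p[t] for p, t in zip(step_probs, targets)) / len(targets)

# Two output positions over a two-token vocabulary (toy numbers).
step_probs = [[0.1, 0.9], [0.5, 0.5]]
targets = [1, 0]
score = mean_target_prob(step_probs, targets)
loss = -score  # quantity to minimise when optimising the input embeddings
```

Each optimisation step needs one forward pass per output position, which is consistent with the longer search times mentioned above.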