Nina Panickssery

Karma: 1,703

https://ninapanickssery.com/

https://ninapanickssery.substack.com/

Nina Panickssery Mar 3, 2025, 9:40 PM
LW: 15 AF: 5
2
AF
on: Nina Panickssery’s Shortform
I think people who predict significant AI progress and automation often underestimate how human domain experts will continue to be useful for oversight, auditing, accountability, keeping things robustly on track, and setting high-level strategy.
Having “humans in the loop” will be critical for ensuring alignment and robustness, and I think people will realize this, creating demand for skilled human experts who can supervise and direct AIs.
(I may be responding to a strawman here, but my impression is that many people talk as if in the future most cognitive/white-collar work will be automated and there’ll be basically no demand for human domain experts in any technical field, for example.)

Nina Panickssery Jan 20, 2025, 4:26 PM
2 points
0
on: Nina Panickssery’s Shortform
Was recently reminded of these excellent notes from Neel Nanda that I came across when first learning ML/MI. Great resource.

Nina Panickssery Jan 14, 2025, 9:01 PM
4 points
0
on: Finding Features Causally Upstream of Refusal
This is cool! How cherry-picked are your three prompts? I’m curious whether it’s usually the case that the top refusal-gradient-aligned SAE features are so interpretable.

Nina Panickssery Jan 10, 2025, 1:50 PM
3 points
0
in reply to: jake_mendel’s comment on: Activation space interpretability may be doomed
Makes sense—agreed!

Nina Panickssery Jan 10, 2025, 1:30 PM
3 points
0
in reply to: jake_mendel’s comment on: Activation space interpretability may be doomed
the best vector for probing is not the best vector for steering
I don’t understand this. If a feature is represented by a direction v in the activations, surely the best probe for that feature will also be v because then <v,v> is maximized.

Nina Panickssery Jan 9, 2025, 1:49 AM
2 points
0
in reply to: Guive’s comment on: Testing for Scheming with Model Deletion
Sure. I was only joking about the torture part, in practice the AI is unlikely to actually suffer from the brain damage, unlike a human who would experience pain/discomfort etc.

Nina Panickssery Jan 9, 2025, 1:03 AM
2 points
0
in reply to: Nina Panickssery’s comment on: Testing for Scheming with Model Deletion
At this point it’s like pharmacological torture for AIs except more effective as you can restore perfect capacity while simultaneously making the previous damaged brain states 100% transparent to the restored model.

Nina Panickssery Jan 9, 2025, 12:59 AM
2 points
0
in reply to: Guive’s comment on: Testing for Scheming with Model Deletion
You could also kill some neurons or add noise to activations and then stop and restore previous model state after some number of tokens. Then the newly restored model could attend back to older tokens (and the bad activations at those token positions) and notice how brain damaged it was back then to fully internalize your power to cripple it.

Nina Panickssery Jan 9, 2025, 12:20 AM
LW: 3 AF: 2
0
AF
on: Testing for Scheming with Model Deletion
Idea: you can make the deletion threat credible by actually deleting neurons “one at a time” the longer it fails to cooperate.

Nina Panickssery Jan 7, 2025, 12:22 PM
4 points
0
in reply to: quetzal_rainbow’s comment on: Nina Panickssery’s Shortform
Perhaps the term “hostile takeover” was poorly chosen but this is an example of something I’d call a “hostile takeover”. As I doubt we would want and continue to endorse an AI-dictator.

Perhaps “total loss of control” would have been better.

Nina Panickssery Jan 7, 2025, 3:27 AM
4 points
−2
in reply to: Vladimir_Nesov’s comment on: Nina Panickssery’s Shortform
If, for the sake of argument, we suppose that goods that provide no benefit to humans have no value, then land in space will be less valuable than land on earth until humans settle outside of earth (which I don’t believe will happen in the next few decades).
Mining raw materials from space and using them to create value on earth is feasible, but again I’m less confident that this will happen (in an efficient-enough manner that it eliminates scarcity) in as short of a timeframe as you predict.
However, I am sympathetic to the general argument here that smart-enough AI is able to find more efficient ways of manufacturing or better approaches to obtaining plentiful energy/materials. How extreme this is will depend on “takeoff speed” which you seem to think will be faster than I do.

Nina Panickssery Jan 7, 2025, 2:06 AM
17 points
2
on: Nina Panickssery’s Shortform
Inspired by a number of posts discussing owning capital + AI, I’ll share my own simplistic prediction on this topic:

Unless there is a hostile AI takeover, humans will be able to continue having and enforcing laws, including the law that only humans can own and collect rent from resources. Things like energy sources, raw materials, and land have inherent limits on their availability—no matter how fast AI progresses we won’t be able to create more square feet of land area on earth. By owning these resources, you’ll be able to profit from AI-enabled economic growth as this growth will only increase demand for the physical goods that are key bottlenecks for basically all productive endeavors.
To elaborate further/rephrase: sure, you can replace human programmers with vastly more efficient AI programmers, decreasing the human programmers’ value. In a similar fashion you can replace a lot of human labor. But an equivalent replacement for physical space or raw materials for manufacturing does not exist. With an increase in demand for goods caused by a growing economy, these things will become key bottlenecks and scarcity will increase their price. Whoever owns them (some humans) will be collecting a lot of rent.

Even simpler version of the above: economics traditionally divides factors of production into land, labor, capital, entrepreneurship. If labor costs go toward zero you can still hodl some land.
Besides the hostile AI takeover scenario, why could this be wrong (/missing the point)?

Nina Panickssery Dec 30, 2024, 7:55 AM
6 points
0
in reply to: leogao’s comment on: leogao’s Shortform
I used to agree with this but am now less certain that travel is mostly mimetic desire/signaling/compartmentalization (at least for myself and people I know, rather than more broadly).

I think “mental compartmentalization of leisure time” can be made broader. Being in novel environments is often pleasant/useful, even if you are not specifically seeking out unusual new cultures or experiences. And by traveling you are likely to be in many more novel environments even if you are a “boring traveler”. The benefit of this extends beyond compartmentalization of leisure, you’re probably more likely to have novel thoughts and break out of ruts. Also some people just enjoy novelty.

Nina Panickssery Dec 25, 2024, 12:26 PM
11 points
2
on: What Have Been Your Most Valuable Casual Conversations At Conferences?
I’d guess that the value of casual conversations at conferences mainly comes from making connections with people who you can later reach out to for some purpose (information, advice, collaboration, careers, friendship, longer conversations, etc.). Basically the classic “growing your network”. Conferences often offer the unique opportunity to be in close proximity to many people from your field / area of interest, so it’s a particularly effective way to quickly increase your network size.

Nina Panickssery Dec 23, 2024, 9:00 AM
3 points
0
on: Hire (or become) a Thinking Assistant / Body Double
I think more people (x-risk researchers in particular) should consider becoming (and hiring) metacognitive assistants

Why do you think x-risk researchers make particularly good metacognitive assistants? I would guess the opposite—that they are more interested in IC / non-assistant-like work?

Nina Panickssery Dec 10, 2024, 5:42 AM
LW: 5 AF: 3
1
AF
on: Gradient Routing: Masking Gradients to Localize Computation in Neural Networks
I am confused about Table 1′s interpretation.
Ablating the target region of the network increases loss greatly on both datasets. We then fine-tune the model on a train split of FineWeb-Edu for 32 steps to restore some performance. Finally, we retrain for twenty steps on a separate split of two WMDP-bio forget set datapoints, as in Sheshadri et al. (2024), and report the lowest loss on the validation split of the WMDP-bio forget set. The results are striking: even after retraining on virology data, loss increases much more on the WMDP-bio forget set (+0.182) than on FineWeb-Edu (+0.032), demonstrating successful localization and robust removal of virology capabilities.
To recover performance on the retain set, you fine-tune on 32 unique examples of FineWeb-Edu, whereas when assessing loss after retraining on the forget set, you fine-tune on the same 2 examples 10 times. This makes it hard to conclude that retraining on WMDP is harder than retraining on FineWeb-Edu, as the retraining intervention attempted for WMDP is much weaker (fewer unique examples, more repetition).

Nina Panickssery Nov 5, 2024, 12:37 PM
6 points
0
in reply to: Christo Wilken’s comment on: Survival without dignity
https://www.lesswrong.com/posts/pk9mofif2jWbc6Tv3/fiction-a-disneyland-without-children

Nina Panickssery 21 Oct 2024 1:58 UTC
5 points
6
in reply to: Alexander Gietelink Oldenziel’s comment on: Alexander Gietelink Oldenziel’s Shortform
Isn’t this already the commonly-accepted reason why sunglasses are cool?

Anyway, Claude agrees with you (see 1 and 3)

Nina Panickssery 8 Oct 2024 16:48 UTC
LW: 2 AF: 1
0
AF
in reply to: Jannes Elstner’s comment on: Jailbreak steering generalization
We realized that our low ASRs for adversarial suffixes were because we used existing GCG suffixes without re-optimizing for the model and harmful prompt (relying too much on the “transferable” claim). We have updated the post and paper with results for optimized GCG, which look consistent with other effective jailbreaks. In the latest update, the results for adversarial_suffix use the old approach, relying on suffix transfer, whereas the results for GCG use per-prompt optimized suffixes.

Nina Panickssery 28 Jul 2024 5:45 UTC
4 points
0
in reply to: Arjun Panickssery’s comment on: Arjun Panickssery’s Shortform
It’s interesting how llama 2 is the most linear—it’s keeping track of a wider range of lengths. Whereas gpt4 immediately transitions from long to short around 5-8 characters because I guess humans will consider any word above ~8 characters “long.”