Peter S. Park

Karma: 145

Peter S. Park Sep 14, 2022, 7:25 PM
2 points
−7
on: The Defender’s Advantage of Interpretability
Thank you so much, Marius, for writing this pertinent post! The question of whether a given interpretability tool will help us or hurt us in expectation is an extremely important one.
The answer, however, differs on a situation-to-situation basis. The scientific benefit of an interpretability tool (more generally, of any information channel) is difficult to estimate a priori, but likely is tied to its informational efficiency. Roughly speaking, how much informational value can the interpretability tool/information channel yield per unit of bitrate?
The costs of an interpretability tool include the risk of capabilities gains (if AI capabilities orgs know about it). They also include the risk that a superintelligent AGI can use the interpretability tool/information channel for its own purposes, such as by hacking or socially engineering the alignment researchers (i.e., probably to escape into the Internet and then power-seek).
This perspective suggests that an interpretability tool/information channel will be disadvantageous for the defender if (1) the informational efficiency of the channel is low, (2) the interpreted AI has high capabilities (or the interpretability tool helps capabilities organizations increase the model’s capabilities), and (3) its box setup has weak cybersecurity.
Keeping interpretability research off the Internet (such as via an encrypted or airgapped library that coordinates between AI researchers) is low-hanging fruit that would make dual-use interpretability tools more likely to be helpful to humanity rather than counterproductive.
To put it bluntly, once an interpretability tool (the “battle plan”) is posted online, we should assume in the worst-case scenario that the battle plan has just been given to the superintelligent AGI. First, the AGI will be misaligned in the worst-case scenario, and this scenario may in fact occur with high probability. Second, the likely misaligned AGI will probably be trained on the whole Internet, and thus will already know how the tool works, what we expect to see from it, and how to manipulate the tool for the AGI’s own purposes.
As for the six proposed arguments why “interpretability has a defender’s advantage,” I am uncertain that arguments 1-5 will hold true for the interpretability tools that will be ready by the time AGI emerges. I agree with argument 6, but in a war against misalignment, I wouldn’t unconditionally assume that any battle plan which cannot guarantee a complete victory is a bad plan. In fact, a plan that tries to guarantee a complete victory but makes the outcome worse in expectation is probably a bad plan. In contrast, a plan that cannot guarantee a complete victory but makes the outcome better in expectation is probably a good plan.
Personally, I am most optimistic about interpretability tools of high informational efficiency that have not yet been posted on the Internet. We should try really hard to develop and use high-quality interpretability tools. However, using a high quantity of low-quality (or publicly posted) interpretability tools may actually decrease the odds of human survival.

Peter S. Park Sep 8, 2022, 8:57 PM
2 points
1
in reply to: catubc’s comment on: Can We Align a Self-Improving AGI?
Thank you so much for your kind words! I really appreciate it.
One definition of alignment is: Will the AI do what we want it to do? And as your post compellingly argues, “what we want it to do” is not well-defined, because it is something that a powerful AI may be able to influence. For many settings, a term that is easier to pin down rigorously, like safe AI, trustworthy AI, or corrigible AI, could be more useful.
I would definitely count the AI’s drive towards self-improvement as a part of the College Kid Problem! Sorry if the post did not make that clear.

Peter S. Park Sep 5, 2022, 10:52 PM
4 points
2
in reply to: porby’s comment on: Private alignment research sharing and coordination
In general, it is much easier to keep potentially concerning material out of the AGI’s training set if it’s already a secret rather than something that’s been published on the Internet. This is because there may be copies, references, and discussions of the material elsewhere in the training set that we fail to catch.

If it’s already posted on the Internet and it’s too late, we should of course still try our best to keep it out of the training set.

As for the question of “should we give up on security after AGI attains high capabilities?” we shouldn’t give up as long as our preparation could non-negligibly increase our probability of escaping doom, even if the probability increase is small. We should always maximize expected utility, even if we are probably doomed.

Peter S. Park Sep 5, 2022, 5:29 AM
3 points
1
on: Private alignment research sharing and coordination
This is an excellent idea. An encrypted, airgapped, or paper library that coordinates between AI researchers seems crucial for AGI safety.
This is because we should expect in the worst-case scenario that AGI will be trained on the whole Internet, including any online discussion of our interpretability tools, security research, and so on. This is information that the AGI can use against us (e.g., by using our interpretability tools against us, to hack, deceive, or otherwise socially engineer the alignment researchers).
Security through obscurity can buy us more chances at aligning/retraining the AGI before it escapes into the Internet. We should keep our battle plans close to our chest, instead of posting them online for the AGI to see.

Can We Align a Self-Improving AGI?

Peter S. Park Aug 30, 2022, 12:14 AM

8 points

5 comments · 11 min read · LW link

Peter S. Park Aug 19, 2022, 3:13 AM
6 points
4
on: Interpretability Tools Are an Attack Channel
This is indeed a vital but underdiscussed problem. My SERI MATS team published a post about a game-theoretic model of alignment where the expected scientific benefit of an interpretability tool can be weighed against its expected cost due to its enabling of AGI escape risks. The expected cost can be reduced by limiting the capabilities of the AGI and by increasing the quality of security, and the expected scientific benefit can be increased by prioritizing informational efficiency of the interpretability tool.
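To make the trade-off above concrete, here is a minimal Python sketch. It is not the model from the post: the function name `expected_value_of_tool`, its parameters, and the functional forms (benefit as efficiency times bitrate, escape probability as a product of capability, bitrate, and insecurity) are all hypothetical choices made purely for illustration, and the numbers are made up.

```python
def expected_value_of_tool(info_efficiency, bitrate, capability, security, escape_cost):
    """Toy expected value of using an interpretability tool.

    All functional forms are illustrative assumptions:
    - scientific benefit grows with informational efficiency times bitrate;
    - escape probability grows with AGI capability and channel bitrate,
      and shrinks as security (a number in [0, 1]) improves.
    """
    benefit = info_efficiency * bitrate
    p_escape = min(1.0, capability * bitrate * (1.0 - security))
    return benefit - p_escape * escape_cost


# Made-up baseline numbers, chosen only to show the direction of each effect.
baseline = dict(info_efficiency=0.5, bitrate=10.0, capability=0.01,
                security=0.9, escape_cost=1000.0)

ev_baseline = expected_value_of_tool(**baseline)
ev_more_efficient = expected_value_of_tool(**{**baseline, "info_efficiency": 2.0})
ev_more_secure = expected_value_of_tool(**{**baseline, "security": 0.99})
ev_more_capable = expected_value_of_tool(**{**baseline, "capability": 0.05})

print(ev_baseline, ev_more_efficient, ev_more_secure, ev_more_capable)
```

Under these assumptions, raising informational efficiency or security pushes the expected value up, while raising the AGI's capability pushes it down, which matches the qualitative claims above. The sketch says nothing about how large these quantities are in practice.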
Conditional on an organization being dead set on building a superintelligent AGI (an effort I would strongly oppose, but may be forced to help align if we cannot dissuade the organization in any way), I think efforts to apply security, alignment, and positive-EV interpretability should be targeted at all capability levels, both high and low. Alignment efforts at high-capability levels run into the issue of heightened AGI escape risk. Alignment efforts at low-capability levels run into the issue that alignment gains, if any, may phase-transition out of existence after the AGI moves into a higher-capability regime. We should try our best at both and hope to get lucky.

What Makes an Idea Understandable? On Architecturally and Culturally Natural Ideas.

NickyP, Peter S. Park and Stephen Fowler

Aug 16, 2022, 2:09 AM

21 points

2 comments · 16 min read · LW link

Peter S. Park Aug 13, 2022, 9:48 PM
4 points
2
in reply to: johnswentworth’s comment on: How Do We Align an AGI Without Getting Socially Engineered? (Hint: Box It)
I strongly agree with John that “what we really want to do is to not build a thing which needs to be boxed in the first place.” This is indeed the ultimate security mindset.

I also strongly agree that relying on a “fancy,” multifaceted box that looks secure due to its complexity, but may not be (especially to a superintelligent AGI), is not security mindset.

One definition of security mindset is “suppose that anything that could go wrong, will go wrong.” So, even if we have reason to believe that we’ve achieved an aligned superintelligent AGI, we should have high-quality (not just high-quantity) security failsafes, just in case our knowledge does not generalize to the high-capabilities domain. The failsafes would help us efficiently and vigilantly test whether the AGI is indeed as aligned as we thought. This would be an example of a security mindset against overconfidence in our current assumptions.

How Do We Align an AGI Without Getting Socially Engineered? (Hint: Box It)

Peter S. Park, NickyP and Stephen Fowler

Aug 10, 2022, 6:14 PM

28 points

30 comments · 11 min read · LW link

Peter S. Park Jul 26, 2022, 3:39 AM
4 points
0
in reply to: Charlie Steiner’s comment on: Finding Skeletons on Rashomon Ridge
Thanks so much for your insightful comment, Charlie! I really appreciate it.
I think you totally could do this. Even if it is rare, it can occur with positive probability.
For example, my model of how natural selection (genetic algorithms, not SGD) consistently creates diversity is that with sufficiently many draws of descendants, one of the drawn descendants could have turned off the original model and turned on another model in a way that constitutes neutral drift.

Finding Skeletons on Rashomon Ridge

David Udell, Peter S. Park and NickyP

Jul 24, 2022, 10:31 PM

30 points

2 comments · 7 min read · LW link

Peter S. Park Jul 8, 2022, 2:50 AM
2 points
1
in reply to: Thomas Larsen’s comment on: Race Along Rashomon Ridge
Edit: Adding a link to “Git Re-Basin: Merging Models modulo Permutation Symmetries,” a relevant paper that has recently been posted on arXiv.
Thank you so much, Thomas and Buck, for reading the post and for your insightful comments!

It is indeed true that some functions have two global minimizers that are not path-connected. Empirically, very overparametrized models which are trained on “non-artificial” datasets (“datasets from nature”?) seem to have a connected Rashomon manifold. It would definitely be helpful to know theoretically why this tends to happen, and when this wouldn’t happen.

One heuristic argument for why two disconnected global minimizers might only happen in “artificial” datasets might go something like this. Given two quantities, one is larger than the other, unless there is a symmetry-based reason why they are actually secretly the same quantity. Under this heuristic, a non-overparametrized model’s loss landscape has a global minimum achieved by precisely one point, and potentially some suboptimal local minima as well. But overparametrizing the model makes the suboptimal local minima not local minima anymore (by making them saddle points?) while the single global minimizer is “stretched out” to a whole submanifold. This “stretching out” is the symmetry; all optimal models on this submanifold are secretly the same.

One situation where this heuristic fails is if there are other types of symmetry, like rotation. Then, applying this move to a global minimizer could get you other global minimizers which are not connected to each other. In this case, “modding out by the symmetry” is not decreasing the dimension, but taking the quotient by the symmetry group which gives you a quotient space of the same dimension. I’m guessing these types of situations are more common in “artificial” datasets which have not modded out all the obvious symmetries yet.
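A toy example of this failure mode, which is my own illustration and not something from the post: take the two-parameter model f(x) = a·b·x fit to the target y = x, so the loss is (ab − 1)². The global minimizers form the hyperbola ab = 1, which has two branches related by the 180° rotation (a, b) → (−a, −b). Any continuous path between the branches must pass through a = 0 or b = 0, where ab = 0 and the loss equals 1, so the two branches are not connected within the zero-loss set. The sketch below (function names are hypothetical) checks this along one straight-line path and along one path that stays inside a single branch.

```python
def loss(a, b):
    # Toy loss for the model f(x) = a * b * x with target y = x.
    return (a * b - 1.0) ** 2


def losses_on_segment(p, q, steps=101):
    """Loss sampled along the straight segment from point p to point q."""
    values = []
    for i in range(steps):
        t = i / (steps - 1)
        a = p[0] + t * (q[0] - p[0])
        b = p[1] + t * (q[1] - p[1])
        values.append(loss(a, b))
    return values


# Two global minimizers related by the rotation (a, b) -> (-a, -b).
across_branches = losses_on_segment((1.0, 1.0), (-1.0, -1.0))

# A path that stays on one branch of the hyperbola: (s, 1/s) for s in [0.5, 2].
within_branch = [loss(s, 1.0 / s) for s in (0.5 + 1.5 * i / 100 for i in range(101))]

print(max(across_branches), max(within_branch))
```

The straight-line check is only one path, so it does not prove disconnectedness by itself; the argument that every path must cross ab = 0 does that.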

Peter S. Park Jul 8, 2022, 2:17 AM
3 points
0
in reply to: harfe’s comment on: Race Along Rashomon Ridge
Thank you so much for this suggestion, tgb and harfe! I completely agree, and this was entirely my error in our team’s collaborative post. The fact that the level sets of submersions are nice submanifolds has nothing to do with the level set of global minimizers.
I think we will be revising this post in the near future reflecting this and other errors.
(For example, the Hessian tells you which directions have zero second-order penalty to the loss, but it doesn’t necessarily tell you about higher-order penalties, which is something I forgot to mention. A direction that looks zero-loss according to the Hessian may not actually be zero-loss if it incurs, say, a fourth-order penalty to the loss. This could only be probed by a tensor of fourth derivatives. But I think a heuristic argument suggests that a zero-eigenvalue direction of the Hessian should almost always be an actual zero-loss direction. Let me know if you buy this!)
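A minimal sketch of the caveat in the parenthetical above, using a hypothetical toy loss L(x, y) = x² + y⁴ that I picked for illustration. At the minimum (0, 0), the Hessian is diag(2, 0), so the y direction has zero curvature, yet moving along y still costs loss at fourth order. The code estimates the second directional derivatives by central finite differences (the helper name `curvature_at_origin` is an assumption of this sketch).

```python
def toy_loss(x, y):
    # Hypothetical loss: quadratic penalty along x, quartic penalty along y.
    return x ** 2 + y ** 4


def curvature_at_origin(f, direction, h=1e-3):
    """Central finite-difference estimate of the second directional derivative at (0, 0)."""
    dx, dy = direction
    return (f(h * dx, h * dy) - 2.0 * f(0.0, 0.0) + f(-h * dx, -h * dy)) / h ** 2


curvature_x = curvature_at_origin(toy_loss, (1.0, 0.0))  # approximately 2
curvature_y = curvature_at_origin(toy_loss, (0.0, 1.0))  # approximately 0 (equals 2 * h**2)
loss_off_minimum = toy_loss(0.0, 0.5)                    # 0.5 ** 4 = 0.0625, not zero

print(curvature_x, curvature_y, loss_off_minimum)
```

This only shows that such directions can exist. It does not bear on how often they occur in real loss landscapes, which is what the heuristic claim is about.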

Peter S. Park Jul 8, 2022, 2:03 AM
2 points
0
in reply to: Charlie Steiner’s comment on: Race Along Rashomon Ridge
Thanks so much, Charlie, for reading the post and for your comment! I really appreciate it.

I think both ways to prune neurons and ways to make the neural net more sparse are very promising steps towards constructing a simultaneously optimal and interpretable model.

I completely agree that alignment of the neuron basis with human-interpretable classifications of the data would really help interpretability. But if only a subset of the neuron basis is aligned with human-interpretable classifications, and the complement comprises a very large set of abstractions (which, necessarily, people would not be able to learn to interpret), then we haven’t made the model interpretable.
Suppose 100% is the level of interpretability we need for guaranteed alignment (which I am convinced of, because even 1% uninterpretability can screw you over). Then low-dimensionality seems like a necessary, but not sufficient, condition for interpretability. It is possible, but not always true, that each of a small number of abstractions will either already be familiar to people or be learnable by people in a reasonable amount of time.

Race Along Rashomon Ridge

Stephen Fowler, Peter S. Park and MichaelEinhorn

Jul 7, 2022, 3:20 AM

50 points

15 comments · 8 min read · LW link
