Isn’t the worst case one in which the AI optimizes exactly against human values?
Maybe Carl meant to link this one
it could be that the lack of alignment understanding is an inevitable consequence of our capabilities understanding not being there yet.
Could you say more about this hypothesis? To me, it feels likely that you can get crazy capabilities from a black box that you don’t understand and so whose behavior/properties you can’t verify to be acceptable. It’s not like once we build a deceptive model we will know what deceptive computation looks like and how to disincentivize it (which is one way your nuclear analogy could translate).
It’s possible, also, that this is about takeoff speeds, and that you think it’s plausible that e.g. we can disincentivize deception by punishing the negative consequences it entails (whereas if there’s a FOOM, we can’t, since we’d be dead).
One thing is that it seems like they are trying to build some of the world’s largest language models (“state of the art models”)
Hah! Thanks
It seems to me that it would be better to view the question as “is this frame the best one for person X?” rather than “is this frame the best one?”
Though, I haven’t fully read either of your posts, so excuse any mistakes/confusion.
Do you have an example of a set of 1-detail stories you now might tell (composed with “AND”)?
Ah — sorry if I missed that in the post, only skimmed
Random tip: If you want to restrict apps etc. on your iPhone without knowing the Screen Time PIN, I recommend the following simple system, which lets you not know the password but still unlock restrictions easily when needed:
Ask a friend to write a 4-digit PIN in a small notebook (dedicated only to this PIN)
Ask them to punch in the PIN on your phone when setting the Screen Time password
Keep the notebook in your backpack and never look inside it, ever
If you ever need your phone unlocked, you can walk up to someone, even a stranger, show them the notebook, and ask them to punch in the PIN on your phone
The system works because having a dedicated physical object that you commit to never looking inside is, for some reason, surprisingly doable.
Thanks for this list!
Though the list still doesn’t strike me as very novel—it feels like most of these conditions are ones we’ve been shooting for anyway.
E.g. conditions 1, 2, and 5 are about selecting for behavior we approve of, and condition 5 is just inspection with interpretability tools.
If you feel you have traction on conditions 3 and 4, though, that does seem novel (side note: condition 4 seems to be a subset of condition 3). I’m skeptical, however, since value extrapolation seems about as hard a problem as understanding machine generalization in general, and “the way a thing behaves in a large class of cases” seems like too complicated a concept to have confident beliefs about or understand. I don’t have a concrete argument for this, though.
Anyway, thanks for responding, and if you have any thoughts about the tractability of conditions 3 and 4, I’m pretty curious.
I (with some help) compiled some of the best rationality essays here.
Ping about my other comment—FYI, because I am currently concerned that you don’t have criteria for the innards in mind, I’m less excited about your agenda than about other alignment theory agendas (though this lack of excitement is weakly held, e.g. since I haven’t tried to digest your work much yet).
I’d be interested in you sharing any reasons why you think this might fall apart, e.g. any insights you’ve gained from deeper inspection.
Game that might improve research productivity
Oh I see—could you say more about what characteristics you want the innards to have?
How do you know when you have solved the value extrapolation problem?
One hypothesis I have for what you might say is something like “a training scheme solves the value extrapolation problem when the sequence of inputs the resulting AI sees in deployment leads it to produce outputs with positive outcomes by human lights,” though from what I can tell, that’s basically the same as having a training scheme that leads to an “impact aligned” AI*.
If it isn’t this, how is your answer different?
*[ETA: the definition of impact alignment that Evan gives in the linked post technically only refers to an AI “which doesn’t take actions that we would judge to be bad/problematic/dangerous/catastrophic,” but in my comment above, I meant to refer to what I think is the more relevant property for an AI to have, which I’ll call (impact aligned)_Jack: “an agent is (impact aligned)_Jack to the degree that, by human lights, it doesn’t take bad actions and does take good actions.” I think that this is more relevant because Evan’s definition doesn’t distinguish between a rock and an intuitively aligned AI.]
[Question] Have you noticed costs of being anticipatory?
I’m not sure you have addressed Richard’s point—if you keep your current definition of outer alignment, then memorizing the answers to the finite set of data is always a way to score perfect loss, but intuitively it doesn’t seem like that would be intent aligned. And if memorization were never intent aligned, then your definition of outer alignment would be impossible to satisfy.
[ETA: I’m not that sure of the below argument]
Thanks for the example, but it still seems to me that this sort of thing won’t work for advanced AI. If you are familiar with the ELK report, you should be able to see why. [Spoiler below]
Even if you manage to learn the properties of what looks like deception to humans, and instill those properties into a loss function, it seems like you are still more likely to get a system that tells you what humans think the truth is, avoiding anything humans would be able to notice as deception, rather than one that tells you what the truth actually seems to be (given what it knows). The reason is that, as AI develops, programs capable of the former have roughly constant complexity, while programs capable of the latter have complexity that grows with the complexity of the AI’s models of the world, so you should expect the former to be favored by SGD. See this part of the ELK document for a more detailed description of this failure mode.
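To make the complexity asymmetry concrete, here is a toy sketch of the argument (my own illustration, not from the ELK report; the sizes and the linear scaling rule are hypothetical):

```python
# Toy illustration of the inductive-bias argument above (hypothetical numbers).
# A "human simulator" reporter only needs a roughly fixed-size model of the
# human judge, while a "direct translator" must map the AI's world model onto
# human concepts, so its description length grows with the world model.

HUMAN_MODEL_SIZE = 1_000  # assumed fixed cost of modeling the human judge


def human_simulator_complexity(world_model_size: int) -> int:
    # Roughly constant in the size of the AI's world model.
    return HUMAN_MODEL_SIZE


def direct_translator_complexity(world_model_size: int) -> int:
    # Grows with the AI's world model (linear scaling assumed for illustration).
    return world_model_size


for world_model_size in (1_000, 100_000, 10_000_000):
    print(
        f"world model: {world_model_size:>10}  "
        f"human simulator: {human_simulator_complexity(world_model_size):>10}  "
        f"direct translator: {direct_translator_complexity(world_model_size):>10}"
    )
```

If SGD’s simplicity bias favors the lower-complexity reporter, then once the world model is large enough the human simulator wins, which is the failure mode described above.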