Stuart_Armstrong

Karma: 17,999

Stuart_Armstrong Jul 18, 2022, 8:42 PM
LW: 8 AF: 4
7
AF
on: Humans provide an untapped wealth of evidence about alignment
It is not that human values are particularly stable. It’s that humans themselves are pretty limited. Within that context, we identify the stable parts of ourselves as “our human values”.

If we lift that stability—if we allow humans arbitrary self-modification and intelligence increase—the parts of us that are stable will change, and will likely not include much of our current values. New entities, new attractors.

Stuart_Armstrong Jul 12, 2022, 10:13 AM
LW: 30 AF: 12
7
AF
on: On how various plans miss the hard bits of the alignment challenge
Hey, thanks for posting this!

And I apologise—I seem to have again failed to communicate what we’re doing here :-(

“Get the AI to ask for labels on ambiguous data”

Having the AI ask is a minor aspect of our current methods, one that I’ve repeatedly tried to de-emphasise (though it does turn out to have an unexpected connection with interpretability). What we’re trying to do is:
1. Get the AI to generate candidate extrapolations of its reward data, that include human-survivable candidates.
2. Select among these candidates to get a human-survivable ultimate reward function.
Possible selection processes include: being conservative (see here for how that might work: https://www.lesswrong.com/posts/PADPJ3xac5ogjEGwA/defeating-goodhart-and-the-closest-unblocked-strategy ); asking humans and then extrapolating what the process of human-answering should idealise to (some initial thoughts on this here: https://www.lesswrong.com/posts/BeeirdrMXCPYZwgfj/the-blue-minimising-robot-and-model-splintering); and removing some of the candidates on syntactic grounds (e.g. wireheading; I’ve written quite a bit on how it might be syntactically defined). There are some other approaches we’ve been considering, but they’re currently under-developed.
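As a toy illustration of the “being conservative” option only, here is a minimal sketch. It assumes each candidate extrapolation can be treated as a function that scores an outcome, and it uses a simple worst-case (max-min) rule, which is one basic form of conservatism and not necessarily the method in the linked post. The outcome names, the two candidate reward tables, and the helper names are all hypothetical.

```python
from typing import Callable, Dict, Sequence

RewardFn = Callable[[str], float]


def from_table(table: Dict[str, float]) -> RewardFn:
    """Wrap a lookup table as a reward function (illustrative helper)."""
    return lambda outcome: table[outcome]


# Two hypothetical candidate extrapolations. Both give reward 1.0 to the
# outcome seen in training ("smile_and_label"), but they disagree once the
# two features come apart.
reward_smile = from_table(
    {"smile_and_label": 1.0, "smile_only": 1.0, "label_only": 0.0}
)
reward_label = from_table(
    {"smile_and_label": 1.0, "smile_only": 0.0, "label_only": 1.0}
)


def conservative_choice(
    outcomes: Sequence[str], candidates: Sequence[RewardFn]
) -> str:
    """Pick the outcome whose worst score across all candidates is highest."""
    return max(outcomes, key=lambda o: min(r(o) for r in candidates))


OUTCOMES = ["smile_only", "label_only", "smile_and_label"]
choice = conservative_choice(OUTCOMES, [reward_smile, reward_label])
print(choice)
```

With both candidates in play, only the outcome they agree on scores well in the worst case. This only works if the candidate set contains a human-survivable extrapolation in the first place.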

But all those methods will fail if the AI can’t generate human-survivable extrapolations of its reward training data. That is what we are currently most focused on. And, given our current results on toy models and a recent literature review, my impression is that there has been almost no decent applicable research done in this area to date. Our current results on HappyFaces are a bit simplistic, but, depressingly, they seem to be the best in the world in reward-function-extrapolation (and not just for image classification) :-(

Stuart_Armstrong Jul 7, 2022, 1:25 PM
2 points
0
in reply to: Charbel-Raphaël’s comment on: Benchmark: goal misgeneralization/concept extrapolation
We ask them to not cheat in that way? That would be using their own implicit knowledge of what the features are.

Stuart_Armstrong Jul 7, 2022, 1:24 PM
3 points
0
in reply to: Charbel-Raphaël’s comment on: Benchmark: goal misgeneralization/concept extrapolation
I’d say do two challenges: one at a mix rate of 0.5, one at a mix rate of 0.1.

Stuart_Armstrong Jul 4, 2022, 2:10 PM
2 points
in reply to: Edward’s comment on: Assessing Kurzweil predictions about 2019: the results
Thanks!

Stuart_Armstrong Jun 15, 2022, 8:01 PM
2 points
in reply to: gwern’s comment on: Georgism, in theory
I was putting all those under “It would help the economy, by redirecting taxes from inefficient sources. It would help governments raise revenues and hence provide services without distorting the economy.”

And we have to be careful about a citizen’s dividend; with everyone richer, they can afford higher rents, so rents will rise. Not by the same amount, but it’s not as simple as “everyone is X richer”.

Stuart_Armstrong Jun 15, 2022, 7:27 PM
2 points
in reply to: Stephen Bennett’s comment on: Georgism, in theory
Glad to help. I had the same feeling when I was investigating this—where was the trick?

Stuart_Armstrong Jun 15, 2022, 7:20 PM
2 points
in reply to: Jonathan_Graehl’s comment on: Georgism, in theory
Deadweight loss of taxation with perfectly inelastic supply (i.e. no deadweight loss at all) and all the taxation allocated to the inelastic supply: https://en.wikipedia.org/wiki/Deadweight_loss#How_deadweight_loss_changes_as_taxes_vary

I added a comment on that in the main body of the post.
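To make the inelastic-supply point concrete, here is a minimal sketch under textbook assumptions that are mine, not the post’s: linear demand Q = a - b*P_c, linear supply Q = c + d*P_s, and a per-unit tax t with P_c = P_s + t. The parameter names and numbers are illustrative. Setting d = 0 makes supply perfectly inelastic, as with land.

```python
def producer_price(a: float, b: float, c: float, d: float, tax: float) -> float:
    """Price received by suppliers when demand equals supply under the tax."""
    # Solve a - b * (p_s + tax) = c + d * p_s for p_s.
    return (a - c - b * tax) / (b + d)


def quantity(a: float, b: float, c: float, d: float, tax: float) -> float:
    """Quantity traded in equilibrium under a per-unit tax."""
    return c + d * producer_price(a, b, c, d, tax)


def deadweight_loss(a: float, b: float, c: float, d: float, tax: float) -> float:
    """Area of the lost-trade triangle: 0.5 * tax * (fall in quantity)."""
    lost_trades = quantity(a, b, c, d, 0.0) - quantity(a, b, c, d, tax)
    return 0.5 * tax * lost_trades


# Elastic supply (d = 1): the tax shrinks the quantity traded.
elastic_dwl = deadweight_loss(a=100, b=1, c=20, d=1, tax=10)

# Perfectly inelastic supply (d = 0): the quantity cannot shrink.
inelastic_dwl = deadweight_loss(a=100, b=1, c=20, d=0, tax=10)

# With d = 0 the supplier's price falls by the full tax, so the price
# paid by the buyer is unchanged.
price_drop = producer_price(100, 1, 20, 0, 0.0) - producer_price(100, 1, 20, 0, 10)

print(elastic_dwl, inelastic_dwl, price_drop)
```

In this toy model the deadweight loss is 0.5 * t^2 * b*d / (b + d), which vanishes when d = 0, and the whole tax falls on the owner of the fixed supply.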

Stuart_Armstrong Jun 15, 2022, 7:11 PM
3 points
in reply to: Dagon’s comment on: Georgism, in theory

land were cheaper, landowners wouldn’t use more for themselves (private use) rather than creating and renting more usable housing.

Why would they do that? They still have to pay the land tax at the same rate; if they don’t rent, they have to pay that out of their own pocket.

Land is cheaper to buy, but more expensive to own.

Stuart_Armstrong Jun 5, 2022, 3:09 AM
3 points
in reply to: Joel Burget’s comment on: How to get into AI safety research
Glad you liked it :-)

Stuart_Armstrong May 21, 2022, 5:24 PM
5 points
on: Popular education in Sweden: much more than you wanted to know
Fascinating, thanks.

Stuart_Armstrong May 17, 2022, 8:46 PM
2 points
in reply to: Sam F. Brown’s comment on: GPT-3 and concept extrapolation
I tried to use that approach to teach GPT-3 to solve the problem at the top of this post. As you can see, it kinda worked: GPT-3 grasps that some things need to be reversed, but it then goes a bit off the rails. It adds a random “this is a great” to the end of my prompt, with the whole phrase reversed rather than each word. It then starts out reversing the individual words of the sentence, but ends up just completing the sentence instead, using the other common completion (“falls” rather than “stays”). Finally, when it tries to reverse each individual word, it fails completely, and just reorders/reworks the words a bit.

Reverse the word below:

Word: alphabet

Reasoning:

-Add spaces between letters: a l p h a b e t

-Add numbers: 1:a 2:l 3:p 4:h 5:a 6:b 7:e 8:t

-Reverse numbers and letters: 8:t 7:e 6:b 5:a 4:h 3:p 2:l 1:a

-Remove numbers: t e b a h p l a

-Merge the letters in groups of two: te ba hp la, teba hpla, tebahpla

-Final result: tebahpla

Reverse all words in the following phrase, complete the sentence, then reverse all words in the completed sentence:

ehT niar ni niapS syats ylniam ni eht taerg a si siht

The rain in Spain falls mainly on the plain.

The main plain falls rain in Spain on the.
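For comparison, the behaviour the prompt was aiming for can be sketched in a few lines of Python. This is only an illustration of the intended per-word reversal, mirroring the number-and-reverse steps in the worked “alphabet” example; the function names are hypothetical, and it says nothing about how GPT-3 processes the prompt internally.

```python
def reverse_word_stepwise(word: str) -> str:
    """Reverse one word by following the steps in the worked example."""
    letters = list(word)                     # add spaces between letters
    numbered = list(enumerate(letters, 1))   # add numbers: (1, 'a'), (2, 'l'), ...
    numbered.reverse()                       # reverse numbers and letters
    return "".join(letter for _, letter in numbered)  # remove numbers, merge


def reverse_each_word(phrase: str) -> str:
    """Reverse every word in place while keeping the word order."""
    return " ".join(reverse_word_stepwise(w) for w in phrase.split())


# The part of the prompt that was written by hand, before GPT-3's additions.
prompt = "ehT niar ni niapS syats ylniam ni eht"
decoded = reverse_each_word(prompt)
print(decoded)

# Re-encoding the decoded phrase recovers the original prompt.
print(reverse_each_word(decoded) == prompt)
```

The last step the prompt asked for would then be to complete the decoded sentence and apply `reverse_each_word` again to the completed version.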

Stuart_Armstrong May 17, 2022, 2:09 PM
2 points
in reply to: Sam F. Brown’s comment on: GPT-3 and concept extrapolation
Fascinating. Thanks!

Stuart_Armstrong May 5, 2022, 4:30 PM
2 points
in reply to: sj9999’s comment on: Generalised models as a category
Thanks so much! Typos have been corrected.

Stuart_Armstrong Apr 22, 2022, 11:10 AM
2 points
in reply to: gabrielrecc’s comment on: GPT-3 and concept extrapolation
Thanks; very interesting result.

Stuart_Armstrong Apr 20, 2022, 8:40 PM
LW: 5 AF: 5
AF
in reply to: Daniel Kokotajlo’s comment on: GPT-3 and concept extrapolation
The aim of this post is not to catch out GPT-3; it’s to see what concept extrapolation could look like for a language model.

Stuart_Armstrong Apr 20, 2022, 8:38 PM
3 points
in reply to: Jan’s comment on: GPT-3 and concept extrapolation
Possibly! Though it did seem to recognise that the words were spelt backwards. It must have some backwards-spelt words in its training data, just not that many.

Stuart_Armstrong Apr 20, 2022, 12:32 PM
2 points
in reply to: Quintin Pope’s comment on: Concept extrapolation: key posts
Thanks for that link. It does seem to correspond intuitively to a lot of the human condition. Though it doesn’t really explain value extrapolation, more the starting point from which humans can extrapolate values. Still a fascinating read, thanks!

Stuart_Armstrong Apr 19, 2022, 8:47 AM
3 points
in reply to: Heighn’s comment on: Different perspectives on concept extrapolation
There is certainly a decent chance.

Stuart_Armstrong Apr 19, 2022, 8:46 AM
2 points
in reply to: Heighn’s comment on: Different perspectives on concept extrapolation
There is certainly a chance.