Mathematician turned alignment researcher. Probably happy to chat about math, current ML, or long-term AI thoughts.
Nathaniel Monson (nmonson1.github.io)
“we don’t know if deceptive alignment is real at all (I maintain it isn’t, on the mainline).”
You think it isn’t a substantial risk of LLMs as they are trained today, or that it isn’t a risk of any plausible training regime for any plausible deep learning system? (I would agree with the first, but not the second)
I agree in the narrow sense of different from bio-evolution, but I think it captures something tonally correct anyway.
I like “evolveware” myself.
I’m not really sure how it ended up there—probably childhood teaching inducing that particular brain-structure? It’s just something that was a fundamental part of who I understood myself to be, and how I interpreted my memories/experiences/sense-data. After I stopped believing in God, I definitely also stopped believing that I existed. Obviously, this-body-with-a-mind exists, but I had not identified myself as being that object previously—I had identified myself as the-spirit-inhabiting-this-body, and I no longer believed that existed.
This is why I added “for the first few”. Let’s not worry about the location, just say “there is a round cube” and “there is a teapot”.
Before you can get to either of these axioms, you need some things like “there is a thing I’m going to call reality that it’s worth trying to deal with” and “language has enough correspondence to reality to be useful”. With those and some similar very low level base axioms in place (and depending on your definitions of round and cube and teapot), I agree that one or another of the axioms could reasonably be called more or less reasonable, rational, probable, etc.
I think when I believed in God, it was roughly third on the list? Certainly before usefulness of language. The first two were something like me existing in time, with a history and memories that had some accuracy, and sense-data being useful.
I don’t think I believe in God anymore—certainly not in the way I used to—but I think if you’d asked me 3 years ago, I would have said that I take it as axiomatic that God exists. If you have any kind of consistent epistemology, you need some base beliefs from which to draw the conclusions and one of mine was the existence of an entity that cared about me (and everyone on earth) on a personal level and was sufficiently more wise/intelligent/powerful/knowledgeable than me that I may as well think of it as infinitely so.
I think the religious people I know who’ve thought deeply about their epistemology take either the existence of God or the reliability of a sort of spiritual sensory modality as an axiom.
While I no longer believe in God, I don’t think I had a perspective any less epistemically rational then than I do now. I don’t think there’s a way to use rationality to pick axioms, the process is inherently arational (for the first few, anyway).
That’s fair. I guess I’m used to linkposts which are either full, or a short enough excerpt that I can immediately see they aren’t full.
I really appreciated both the original linked post and this one. Thank you, you’ve been writing some great stuff recently.
One strategy I have, as someone who simultaneously would like to be truth-committed and also occasionally jokes or teases loved ones (“the cake you made is terrible! No one else should have any, I’ll sacrifice my taste buds to save everyone!”) is to have triggers for entering Quaker-mode; if someone asks me a question involving “really” or “actually”, I try to switch my demeanour to clearly sincere, and give a literally honest answer. I… hope? that having an explicit mode of truth this way blunts some of the negatives of frequently functioning as an actor.
Would you say it’s … _cat_egorically impossible?
I actually fundamentally agree with most/all of it, I just wanted a cookie :)
I strongly disagreed with all of this!
.
.
.
(cookie please!)
Glad to, thanks for taking it well.
I think this would have been mitigated by something at the beginning saying “this is an excerpt of x words of a y word post located at url”, so I can decide at the outset to read here, read there, or skip.
Is the reason you didn’t put the entire thing here basically blog traffic numbers?
(I didn’t downvote, but here’s a guess) I enjoyed what there was of it, but I got really irritated by “This is not the full post—for the rest of it, including an in-depth discussion of the evidence for and against each of these theories, you can find the full version of this post on my blog”. I don’t know why this bothers me—maybe because I pay some attention to the “time to read” tag at the top, or because having to click through to a different page feels like an annoyance with no benefit to me.
If you click the link where OP introduces the term, it’s the Wikipedia page for psychopathy. Wiki lists 3 primary traits for it, one of which is DAE.
The statement seems like it’s assuming:

- we know roughly how to build AGI
- we decide when to do that
- we use the time between now and then to increase the chance of successful alignment
- if we succeed in alignment early enough, you and your loved ones won’t die

I don’t think any of these are necessarily true, and I think the ways they might be false are asymmetric in a manner that favors caution.
I appreciated your post, (indeed, I found it very moving) and found some of the other comments frustrating as I believe you did. I think, though, that I can see a part of where they are coming from. I’ll preface by saying I don’t have strong beliefs on this myself, but I’ll try to translate (my guess at) their world model.
I think the typical EA/LWer thinks that most charities are ineffective to the point of uselessness, and that this is due to them not being smart/rational about a lot of things (and they are very familiar with examples like the Millennium Villages). They probably believe it costs roughly 5000 USD to save a life, which makes your line “Many of us are used to the ads that boast of every 2-3 dollars saving a life...” read like you haven’t engaged much with their world. They agree that institutions matter a huge amount and that many forms of aid fail because of bad institutions.
They probably also believe the exact shape of the dose-response curve to treating poverty with direct aid is unknown, but have a prior of it being positively sloped but flatter than we wish. There is a popular rationalist technique of “if x seems like it is helping the problem, just not as much as you wish, try way way more x.” (Eg, light for SAD)
I would guess your post reads to them like someone finding out that the dose-response curve is very flat and that many charities are ineffective, and then writing “maybe the dose-response curve isn’t even positively sloped!” It reads to them like the claim “no (feasible) amount of direct aid will help with poverty” followed by evidence that the slope is not as steep as we all wish. I don’t think any of your evidence suggests aid cannot have a positive effect, just that the amount necessary for that effect to be permanent is quite high.
Add to this your ending, where you donate money to give directly, and it seems like you are either behaving irrationally, or you agree that it has some marginal positive impact and were preaching to the choir.
As I said, I appreciated it, and the work that goes into making your world model and preparing it for posting, and engaging with commenters. Thank you.
This is more a tangent than a direct response—I think I fundamentally agree with almost everything you wrote—but I don’t think virtue ethics requires tossing out the other two (although I agree each of the other two requires tossing out the other).
I view virtue ethics as saying, roughly, “the actually important thing almost always is not how you act in contrived edge case thought experiments, but rather how you habitually act in day to day circumstances. Thus you should worry less, probably much much less, about said thought experiments, and worry more about virtuous behavior in all the circumstances where deontology and utilitarianism have no major conflicts”. I take it as making a claim about correct use of time and thought-energy, rather than about perfectly correct morality. It thus can extend to “...and we think (D/U) ethics are ultimately best served this way, and please use (D/U) ethics if one of those corner cases ever shows up” for either deontology or (several versions of) utilitarianism, basically smoothly.
I agree with the first paragraph, but strongly disagree with the idea this is “basically just trying to align to human values directly”.
Human values are a moving target in a very high dimensional space, which needs many bits to specify. At a given time, “does the human want to press the button?” needs only one bit. A coinflip has a good shot. Also, to use your language, I think “human is trying to press the button” is likely to form a much cleaner natural abstraction than human values generally.
Finally, we talk about getting it wrong being really bad. But there’s a strong asymmetry—one direction is potentially catastrophic, the other is likely to only be a minor nuisance. So if we can bias it in favor of believing the humans probably want to press the button, it becomes even more safe.
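To make that asymmetry concrete, here is a toy expected-cost sketch in Python. The cost numbers and probabilities are made-up illustrative assumptions, not anything from the original discussion:

```python
# Toy illustration of the shutdown-button asymmetry: wrongly ignoring a wanted press
# is catastrophic, while wrongly allowing an unwanted press is only a minor nuisance.
# All numbers below are made up for illustration.

COST_IGNORE_WANTED_PRESS = 1_000_000  # catastrophic: human wanted shutdown, AI resisted
COST_ALLOW_UNWANTED_PRESS = 1         # minor nuisance: AI shuts down unnecessarily

def expected_cost(p_believed_wanted: float, p_actually_wanted: float) -> float:
    """Expected cost when the AI acts on its belief about whether the press is wanted."""
    if p_believed_wanted >= 0.5:
        # AI defers and allows shutdown; costly only if shutdown was not actually wanted
        return (1 - p_actually_wanted) * COST_ALLOW_UNWANTED_PRESS
    # AI resists shutdown; catastrophic if shutdown actually was wanted
    return p_actually_wanted * COST_IGNORE_WANTED_PRESS

# With genuine uncertainty (say 50/50), a belief biased toward "the press is wanted"
# costs almost nothing when wrong, while the opposite bias risks the catastrophic branch.
for belief in (0.3, 0.5, 0.7):
    print(belief, expected_cost(belief, p_actually_wanted=0.5))
```

The only point of the sketch is that when the two error costs differ by orders of magnitude, biasing the belief toward the cheap-to-be-wrong side is nearly free.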
If I had clear lines in my mind between AGI capabilities progress, AGI alignment progress, and narrow AI progress, I would be 100% with you on stopping AGI capabilities. As it is, though, I don’t know how to count things. Is “understanding why neural net training behaves as it does” good or bad? (SLT’s goal). Is “determining the necessary structures of intelligence for a given architecture” good or bad? (Some strands of mech interp). Is an LLM narrow or general?
How do you tell, or at least approximate? (These are genuine questions, not rhetorical)
Minor nitpicks:

- I read “1 angstrom of uncertainty in 1 atom” as the location being normally distributed with mean <center> and SD 1 angstrom, or as uniformly distributed in a solid sphere of radius 1 angstrom. Taken literally, though, “perturb one of the particles by 1 angstrom in a random direction” is distributed on the surface of the sphere (the particle is known to be exactly 1 angstrom from <center>).
- The answer will absolutely depend on the temperature (in a neighborhood of absolute zero, the final positions of the gas particles are very close to the initial positions).
- The answer also might depend on the exact starting configuration. While I think most configurations would end up at ~50/50 after 20 seconds, there are definitely configurations that would be stably strongly on one side.
Nothing conclusive below, but some things that might help:

- A back-of-envelope calculation said the single uncertain particle has ~(10 million * sqrt(temp in K)) collisions/sec.
- If I’m using MSD (mean squared displacement) right (big if!), then at STP, particles move from their initial position only by about 5 cm in 20 seconds (they cover a massive distance, but the Brownian motion cancels in expectation).
- I think that at standard temp, this would be at roughly 1/50 of standard pressure?
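As a rough cross-check of the ~5 cm figure, here is the back-of-envelope in Python, assuming a self-diffusion coefficient for air at STP of about 0.2 cm²/s (a textbook ballpark I am supplying, not a number from the thread) and the 3D random-walk relation ⟨r²⟩ = 6Dt:

```python
import math

# Back-of-envelope check of "particles move from their initial position only by
# about 5 cm in 20 seconds" at STP.
D = 0.2   # cm^2/s, assumed self-diffusion coefficient of air at STP (textbook ballpark)
t = 20.0  # seconds

rms_displacement = math.sqrt(6 * D * t)  # from <r^2> = 6*D*t in three dimensions
print(f"RMS displacement ~ {rms_displacement:.1f} cm")  # ~4.9 cm, consistent with ~5 cm
```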