I work at the Alignment Research Center (ARC). I write a blog on stuff I’m interested in (such as math, philosophy, puzzles, statistics, and elections): https://ericneyman.wordpress.com/
Eric Neyman
We’ve done some experiments with small reversible circuits. Empirically, a small circuit generated in the way you suggest has very obvious structure that makes it satisfy P (i.e. it is immediately evident from looking at the circuit that P holds).
This leaves open the question of whether this remains true as the circuits get large. Our reasons for believing it does are mostly based on the same “no-coincidence” intuition highlighted by Gowers: a naive heuristic estimate suggests that if there is no special structure in the circuit, the probability that it would satisfy P is doubly exponentially small. So probably if C does satisfy P, it’s because of some special structure.
Is this a correct rephrasing of your question?
It seems like a full explanation of a neural network’s low loss on the training set needs to rely on lots of pieces of knowledge that it learns from the training set (e.g. “Barack” is usually followed by “Obama”). How do random “empirical regularities” about the training set like this one fit into the explanation of the neural net?
Our current best guess about what an explanation looks like is something like modeling the distribution of neural activations. Such an activation model would end up having baked-in empirical regularities, like the fact that “Barack” is usually followed by “Obama”. So in other words, just as the neural net learned this empirical regularity of the training set, our explanation will also learn the empirical regularity, and that will be part of the explanation of the neural net’s low loss.
(There’s a lot more to be said here, and our picture of this isn’t fully fleshed out: there are some follow-up questions you might ask to which I would answer “I don’t know”. I’m also not sure I understood your question correctly.)
Yeah, I did a CS PhD in Columbia’s theory group and have talked about this conjecture with a few TCS professors.
My guess is that P is true for an exponentially small fraction of circuits. You could plausibly prove this with combinatorics (given that e.g. the first layer randomly puts inputs into gates, which means you could try to reason about the class of circuits that are the same except that the inputs are randomly permuted before being run through the circuit). I haven’t gone through this math, though.
Thanks, this is a good question.
My suspicion is that we could replace “99%” with “all but exponentially small probability in n”. I also suspect that you could replace it with 1 − ε, with the stipulation that the length of π (or the running time of V) will depend on ε. But I’m not exactly sure how I expect it to depend on ε -- for instance, it might be exponential in 1/ε.
My basic intuition is that the closer you make 99% to 1, the smaller the number of circuits that V is allowed to say “look non-random” (i.e. that are flagged for some advice π). And so V is forced to do more thorough checks (“is it actually non-random in the sort of way that could lead to P being true?”) before outputting 1.
99% is just a kind-of lazy way to sidestep all of these considerations and state a conjecture that’s “spicy” (many theoretical computer scientists think our conjecture is false) without claiming too much / getting bogged down in the details of how the “all but a small fraction of circuits” thing depends on n, or on the length of π, or on the runtime of V.
A computational no-coincidence principle
I think this isn’t the sort of post that ages well or poorly, because it isn’t topical, but I think it turned out pretty well. It gradually builds from preliminaries that most readers have probably seen before, into some pretty counterintuitive facts that aren’t widely appreciated.
At the end of the post, I listed three questions and wrote that I hope to write about some of them soon. I never did, so I figured I’d use this review to briefly give my takes.
This comment from Fabien Roger tests some of my modeling choices for robustness, and finds that the surprising results of Part IV hold up when the noise is heavier-tailed than the signal. (I’m sure there’s more to be said here, but I probably don’t have time to do more analysis by the end of the review period.)
My basic take is that this really is a point in favor of well-evidenced interventions, but that the best-looking speculative interventions are nevertheless better. This is because I think “speculative” here mostly refers to partial measurement rather than noisy measurement. For example, maybe you can only foresee the first-order effects of an intervention, but not the second-order effects. If the first-order effect is a (known) quantity X and the second-order effect is an (unknown) quantity Y, then modeling the second-order effect as zero (and thus estimating the quality of the intervention as X) isn’t a noisy measurement; it’s a partial measurement. It’s still your best guess given the information you have.
I haven’t thought this through very much. I expect good counter-arguments and counter-counter-arguments to exist here.
No—or rather, only if the measurement is guaranteed to be exactly correct. To see this, observe that the variance of a noisy, unbiased measurement is greater than the variance of the quantity you’re trying to measure (with equality only when the noise is zero), whereas the variance of a noiseless, partial measurement is less than the variance of the quantity you’re trying to measure.
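Here is a minimal numerical illustration of that variance comparison, using a toy Gaussian model with made-up numbers (X is the quantity of interest, A is the part that a partial measurement sees, and the noisy measurement adds independent Gaussian noise):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

# Toy model: the quantity of interest is X = A + B, but a "partial" measurement only sees A.
A = rng.normal(0.0, np.sqrt(0.5), n)
B = rng.normal(0.0, np.sqrt(0.5), n)
X = A + B                                # Var(X) = 1

noisy = X + rng.normal(0.0, 0.5, n)      # unbiased but noisy measurement of X
partial = A                              # noiseless but partial measurement of X

print(np.var(X))        # ~1.0
print(np.var(noisy))    # ~1.25 -- greater than Var(X): independent noise adds variance
print(np.var(partial))  # ~0.5  -- less than Var(X): the unmeasured part collapses to its mean
```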
Real-world measurements are absolutely partial. They are, like, mind-bogglingly partial. This point deserves a separate post, but consider for instance the action of donating $5,000 to the Against Malaria Foundation. Maybe your measured effect from the RCT is that it’ll save one life: 50 QALYs or so. But this measurement neglects the meat-eating problem: the expected-child you’ll save will grow up to eat expected-meat from factory farms, likely causing a great amount of suffering. But then you remember: actually there’s a chance that this child will have a one eight-billionth stake in determining the future of the lightcone. Oops, actually this consideration totally dominates the previous two. Does this child have better values than the average human? Again: mind-bogglingly partial!
(The measurements are also, of course, noisy! RCTs are probably about as un-noisy as it gets: for example, making your best guess about the quality of an intervention by drawing inferences from uncontrolled macroeconomic data is much more noisy. So the answer is: generally both noisy and partial, but in some sense, much more partial than noisy, though I’m not sure how much that comparison matters.)

The lessons of this post do not generalize to partial measurements at all! This post is entirely about noisy measurements. If you’ve partially measured the quality of an intervention, estimating the un-measured part using your prior will give you an estimate of intervention quality that you know is probably wrong, but the expected value of your error is zero.
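To spell out that last sentence with a worked equation (notation mine, not from the original post): write the true quality as Q = X + Y, where X is the measured part and Y is the unmeasured part, and estimate the unmeasured part by its prior mean. Then

$$\hat{Q} = X + \mathbb{E}[Y], \qquad \hat{Q} - Q = \mathbb{E}[Y] - Y,$$

so the error is nonzero whenever Y deviates from its prior mean (you know you’re almost certainly somewhat wrong), but $\mathbb{E}[\hat{Q} - Q] = 0$ (the expected value of your error is zero).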
Thanks for writing this. I think this topic is generally a blind spot for LessWrong users, and it’s kind of embarrassing how little thought this community (myself included) has given to the question of whether a typical future with human control over AI is good.
(This actually slightly broadens the question compared to yours, since you talk about “a human” taking over the world with AGI and make guesses about the personality of such a human after conditioning on them deciding to do that. But I’m not even confident that AGI-enabled control of the world by e.g. the US government would be good.)
Concretely, I think that a common perspective people take is: “What would it take for the future to go really really well, by my lights”, and the answer to that question probably involves human control of AGI. But that’s not really the action-relevant question. The action-relevant question, for deciding whether you want to try to solve alignment, is how the average world with human-controlled AGI compares to the average AGI-controlled world. And… I don’t know, in part for the reasons you suggest.
Cool, you’ve convinced me, thanks.
Edit: well, sort of. I think it depends on what information you’re allowing yourself to know when building your statistical model. If you’re not letting yourself make guesses about how the LW population was selected, then I still think the SAT thing and the height thing are reasonable. However, if you’re actually trying to figure out an estimate of the right answer, you probably shouldn’t blind yourself quite that much.
These both seem valid to me! Now, if you have multiple predictors (like SAT and height), then things get messy because you have to consider their covariance and stuff.
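In case it’s useful, here’s a small sketch of the cleanest version of the multiple-predictor case: if the trait and the predictors are jointly Gaussian and standardized, the best estimate is a covariance-weighted combination of the predictors. The correlation numbers below are made up purely for illustration.

```python
import numpy as np

# Made-up correlations; all variables standardized (mean 0, variance 1).
r_TS, r_TH, r_SH = 0.7, 0.3, 0.2   # trait-SAT, trait-height, SAT-height

Sigma_pp = np.array([[1.0, r_SH],
                     [r_SH, 1.0]])  # covariance of the predictors (SAT, height)
Sigma_tp = np.array([r_TS, r_TH])   # covariance of the trait with each predictor

weights = Sigma_tp @ np.linalg.inv(Sigma_pp)   # regression coefficients on the two predictors

s, h = 2.0, 1.0                                # observed z-scores: +2 SD SAT, +1 SD height
estimate = weights @ np.array([s, h])          # E[trait | SAT, height] under the joint-Gaussian model
print(weights, estimate)
```

The off-diagonal term r_SH is exactly the “covariance and stuff”: the more correlated the two predictors are, the less new information the second one adds.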
Yup, I think that only about 10-15% of LWers would get this question right.
Yeah, I wonder if Zvi used the wrong model (the non-thinking one)? It’s specifically the “thinking” model that gets the question right.
Just a few quick comments about my “integer whose square is between 15 and 30” question (search for my name in Zvi’s post to find his discussion):
The phrasing of the question I now prefer is “What is the least integer whose square is between 15 and 30”, because that makes it unambiguous that the answer is −5 rather than 4. (This is a normal use of the word “least”, e.g. in competition math, that the model is familiar with.) This avoids ambiguity about which of −5 and 4 is “smaller”, since −5 is less but 4 is smaller in magnitude.
This Gemini model answers −5 to both phrasings. As far as I know, no previous model ever said −5 regardless of phrasing, although someone said o1 Pro gets −5. (I don’t have a subscription to o1 Pro, so I can’t independently check.)
I’m fairly confident that a majority of elite math competitors (top 500 in the US, say) would get this question right in a math competition (although maybe not in a casual setting where they aren’t on their toes).
But also this is a silly, low-quality question that wouldn’t appear in a math competition.
Does a model getting this question right say anything interesting about it? I think a little. There’s a certain skill of being careful to not make assumptions (e.g. that the integer is positive). Math competitors get better at this skill over time. It’s not that straightforward to learn.
I’m a little confused about why Zvi says that the model gets it right in the screenshot, given that the model’s final answer is 4. But it seems like the model snatched defeat from the jaws of victory? Like if you cut off the very last sentence, I would call it correct.
Here’s the output I get:
Thank you for making this! My favorite ones are 4, 5, and 12. (Mentioning this in case anyone wants to listen to a few songs but not the full Solstice.)
Yes, very popular in these circles! At the Bay Area Secular Solstice, the Bayesian Choir (the rationalist community’s choir) performed Level Up in 2023 and Landsailor this year.
My Spotify Wrapped
Yeah, I agree that that could work. I (weakly) conjecture that they would get better results by doing something more like the thing I described, though.
My random guess is:
The dark blue bar corresponds to the testing conditions under which the previous SOTA was 2%.
The light blue bar doesn’t cheat (e.g. doesn’t let the model run many times and then see if it gets it right on any one of those times) but spends more compute than one would realistically spend (e.g. more than how much you could pay a mathematician to solve the problem), perhaps by running the model 100 to 1000 times and then having the model look at all the runs and try to figure out which run had the most compelling-seeming reasoning.
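If that guess is roughly right, the procedure would look something like the toy sketch below. The two inner functions are stand-ins for model calls (this is not a known API, just an illustration of the sample-many-runs-then-self-judge pattern):

```python
import random

def generate_solution(problem: str) -> str:
    # Stand-in for one independent model attempt at the problem.
    return f"candidate proof #{random.randint(0, 9)} for {problem!r}"

def pick_most_compelling(problem: str, candidates: list[str]) -> int:
    # Stand-in for asking the model which attempt has the most compelling reasoning.
    return random.randrange(len(candidates))

def best_of_n(problem: str, n: int = 256) -> str:
    # Sample many independent attempts, then let the model judge its own runs.
    candidates = [generate_solution(problem) for _ in range(n)]
    return candidates[pick_most_compelling(problem, candidates)]

print(best_of_n("sample FrontierMath-style problem", n=8))
```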
What’s your guess about the percentage of NeurIPS attendees from anglophone countries who could tell you what AGI stands for?
Do you have a link/citation for this quote? I couldn’t immediately find it.