You refer a couple of times to the fact that evals are often used with the aim of upper bounding capabilities. To my mind this is an essential difficulty that acts as a point of disanalogy with things like aviation. I’m obviously no expert but in the case of aviation, I would have thought that you want to give positive answers to questions like “can this plane safely do X thousand miles?”—ie produce absolutely guaranteed lower bounds on ‘capabilities’. You don’t need to find something like the approximately smallest number Y such that it could never under any circumstances ever fly more than Y million miles.
Hmm it might be questionable to suggest that it is “non-AI” though? It’s based on symbolic and algebraic deduction engines and afaict it sounds like it might be the sort of thing that used to be very much mainstream “AI” i.e. symbolic AI + some hard-coded human heuristics?
FWIW I did not interpret Thane as necessarily having “high confidence” in “architecture / internal composition” of AGI. It seemed to me that they were merely (and ~accurately) describing what the canonical views were most worried about. (And I think a discussion about whether or not being able to “model the world” counts as a statement about “internal composition” is sort of beside the point/beyond the scope of what’s really being said)
It’s fair enough if you would say things differently(!) but in some sense isn’t it just pointing out: ‘I would emphasize different aspects of the same underlying basic point’. And I’m not sure if that really progresses the discussion? I.e. it’s not like Thane Ruthenis actually claims that “scarily powerful artificial agents” currently exist. It is indeed true that they don’t exist and may not ever exist. But that’s just not really the point they are making so it seems reasonable to me that they are not emphasizing it.
----
I’d like to see justification of “under what conditions does speculation about ‘superintelligent consequentialism’ merit research attention at all?” and “why do we think ‘future architectures’ will have property X, or whatever?!”.
I think I would also like to see more thought about this. In some ways, after first getting into the general area of AI risk, I was disappointed that the alignment/safety community was not more focussed on questions like this. Like a lot of people, I’d been originally inspired by Superintelligence—significant parts of which relate to these questions imo—only to be told that the community had ‘kinda moved away from that book now’. And so I sort of sympathize with the vibe of Thane’s post (and worry that there has been a sort of mission creep)
Newtonian mechanics was systematized as a special case of general relativity.
One of the things I found confusing early on in this post was that systematization is said to be about representing the previous thing as an example or special case of some other thing that is both simpler and more broadly-scoped.
In my opinion, it’s easy to give examples where the ‘other thing’ is more broadly-scoped, and this is because ‘increasing scope’ corresponds to the usual way we think of generalisation, i.e. the latter thing applies to more settings or is ‘about a wider class of things’ in some sense. But in many cases, the more general thing is not simultaneously ‘simpler’ or more economical. I don’t think anyone would really say that general relativity is actually simpler. However, to be clear, I do think that there probably are some good examples of this, particularly in mathematics, though I haven’t got one to hand.
OK I think this will be my last message in this exchange but I’m still confused. I’ll try one more time to explain what I’m getting at.
I’m interested in what your precise definition of subjective probability is.
One relevant thing I saw was the following sentence:

If I say that a coin is 50% likely to come up heads, that’s me saying that I don’t know the exact initial conditions of the coin well enough to have any meaningful knowledge of how it’s going to land, and I can’t distinguish between the two options.
It seems to give something like a definition of what it means to say something has a 50% chance. i.e. I interpret your sentence as claiming that a statement like ‘The probability of A is 1⁄2’ means or is somehow the same as a statement a bit like
[*] ‘I don’t know the exact conditions and don’t have enough meaningful/relevant knowledge to distinguish between the possible occurrence of (A) and (not A)’
My reaction was: This can’t possibly be a good definition.
The airplane puzzle was supposed to be a situation where there is a clear ‘difference’ in the outcomes: either the last person is in the one seat that matches their ticket number, or they’re in one of the other 99 seats. It’s not as if it’s a clearly symmetric situation from the point of view of the outcomes. So it was supposed to be an example where statement [*] does not hold, but where the probability is 1⁄2. It seems you don’t accept that; it seems to me like you think that statement [*] does in fact hold in this case.
But tbh it feels sorta like you’re saying you can’t distinguish between the outcomes because you already know the answer is 1/2! Even if I accept that the outcomes are somehow indistinguishable, the example is sufficiently complicated on a first reading that there’s no way you’d just look at it and go “hmm, I guess I can’t distinguish, so it’s 1/2”. I.e. if your definition were OK, it could be used to justify the answer to the puzzle, but that doesn’t seem right to me either.
So my point is still: What is that thing? I think yes I actually am trying to push proponents of this view down to the metaphysics—If they say “there’s a 40% chance that it will rain tomorrow”, I want to know things like what it is that they are attributing 40%-ness to. And what it means to say that that thing “has probability 40%”. That’s why I fixated on that sentence in particular because it’s the closest thing I could find to an actual definition of subjective probability in this post.
I have in mind very simple examples. Suppose that first I roll a die. If it doesn’t land on a 6, I then flip a biased coin that lands on heads 3⁄5 of the time. If it does land on a 6 I just record the result as ‘tails’. What is the probability that I get heads?
This is contrived so that the probability of heads is
5⁄6 x 3⁄5 = 1⁄2.
But do you think that, in saying this, I mean something like “I don’t know the exact initial conditions… well enough to have any meaningful knowledge of how it’s going to land, and I can’t distinguish between the two options.” ?
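(As a quick sanity check on that arithmetic, here is a minimal simulation sketch; the function name and the Monte Carlo approach are just my own illustration, nothing from the original discussion.)

```python
import random

def trial() -> bool:
    """One run of the contrived experiment: roll a die; if it isn't a 6,
    flip a coin that lands heads 3/5 of the time; if it is a 6, record tails."""
    if random.randint(1, 6) == 6:
        return False                 # recorded as 'tails'
    return random.random() < 3 / 5   # biased coin: heads with probability 3/5

n = 1_000_000
print(sum(trial() for _ in range(n)) / n)   # hovers around 0.5 = 5/6 * 3/5
```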
Another example: Have you heard of the puzzle about the people randomly taking seats on the airplane? It’s a well-known probability brainteaser to which the answer is 1⁄2 but I don’t think many people would agree that saying the answer is 1⁄2 actually means something like “I don’t know the exact initial conditions… well enough to have any meaningful knowledge of how it’s going to land, and I can’t distinguish between the two options.”
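(If the puzzle is unfamiliar: 100 passengers board in order, the first sits in a uniformly random seat, and everyone else takes their own seat if it is free and a uniformly random free seat otherwise; the question is the probability that the last passenger ends up in their own seat. A rough simulation sketch, assuming this standard 100-seat version and with all names chosen just for illustration, does come out at about 1/2.)

```python
import random

def last_passenger_gets_own_seat(n: int = 100) -> bool:
    """One boarding: passenger 0 picks a random seat; passengers 1..n-2 take
    their own seat if free, else a random free seat. Return whether seat n-1
    (the last passenger's own seat) is still free at the end."""
    free = set(range(n))
    free.discard(random.randrange(n))               # passenger 0 sits at random
    for p in range(1, n - 1):
        if p in free:
            free.discard(p)                         # own seat is free: take it
        else:
            free.remove(random.choice(list(free)))  # otherwise pick a random free seat
    return (n - 1) in free

trials = 200_000
print(sum(last_passenger_gets_own_seat() for _ in range(trials)) / trials)  # ~0.5
```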
There needn’t be any ‘indistinguishability of outcomes’ or ‘lack of information’ for something to have probability 0.5; it can just… well… be the actual result of calculating two distinguishable complementary outcomes.
We might be using “meaning” differently then!
I’m fine with something being subjective, but what I’m getting at is more like: Is there something we can agree on about which we are expressing a subjective view?
I’m kind of confused what you’re asking me—like which bit is “accurate”, etc. Sorry, I’ll try to re-state my question again:
- Do you think that when someone says something has “a 50% probability” then they are saying that they do not have any meaningful knowledge that allows them to distinguish between two options?
I’m suggesting that you can’t possibly think that, because there are obviously other ways things can end up 50⁄50. e.g. maybe it’s just a very specific calculation, using lots of specific information, that ends up with the value 0.5 at the end. This is a different situation from having ‘symmetry’ and no distinguishing information.
Then I’m saying OK, assuming you indeed don’t mean the above thing, then what exactly does one mean in general when saying something is 50% likely?
Presumably you are not claiming that saying
...I don’t know the exact initial conditions of the coin well enough to have any meaningful knowledge of how it’s going to land, and I can’t distinguish between the two options...
is actually necessarily what it means whenever someone says something has a 50% probability? Because there are obviously myriad ways something can have a 50% probability and this kind of ‘exact symmetry between two outcomes’ + no other information is only one very special way that it can happen.
So what does it mean exactly when you say something is 50% likely?
The traditional interpretation of probability is known as frequentist probability. Under this interpretation, items have some intrinsic “quality” of being some % likely to do one thing vs. another. For example, a coin has a fundamental probabilistic essence of being 50% likely to come up heads when flipped.
Is this right? I would have said that what you describe is more like the classical, logical view of probability, which isn’t the same as the frequentist view. Even the wiki page you’ve linked seems to disagree with what you’ve written, i.e. it describes the frequentist view in the standard way, as being about relative frequencies in the long run. So it isn’t a coin having intrinsic “50%-ness”; you actually need the construction of the repeated experiment in order to define the probability.
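(For reference, the standard frequentist statement I have in mind is the limiting relative frequency over repeated trials, written roughly as follows; the notation is just mine.)

$$P(A) = \lim_{n \to \infty} \frac{n_A}{n}$$

where $n_A$ is the number of times $A$ occurs in $n$ repetitions of the experiment.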
My rejoinder to this is that, analogously to how a causal model can be re-implemented as a more complex non-causal model[2], a learning algorithm that looks at data that in some ways is saying something about causality, be it because the data contains information-decision-action-outcome units generated by agents, because the learning thing can execute actions itself and reflectively process the information of having done such actions, or because the data contains an abstract description of causality, can surely learn causality.
Short comment/feedback just to say: This sentence is making one of your main points but is very tricky! - perhaps too long/too many subclauses?
Ah OK, I think I’ve worked out where some of my confusion is coming from: I don’t really see any argument for why mathematical work may be useful, relative to other kinds of foundational conceptual work. e.g. you write (with my emphasis): “Current mathematical research could play a similar role in the coming years...” But why might it? Isn’t that the thing you need to be arguing?
The examples seem to be of cases where people have done some kind of conceptual foundational work which has later gone on to influence/inspire ML work. But early work on deception or Goodhart was not mathematical work; that’s why I don’t understand how these are examples.
Thanks for the comment Rohin, that’s interesting (though I haven’t looked at the paper you linked).
I’ll just record some confusion I had after reading your comment that stopped me replying initially: I was confused by the distinction between modular and non-modular because I kept thinking: If I add a bunch of numbers and don’t do any modding, then it is equivalent to doing modular addition modulo some large number (i.e. at least as large as the largest sum you get). And otoh if I tell you I’m doing ‘addition modulo 113’, but I only ever use inputs that add up to 112 or less, then you never see the fact that I was secretly intending to do modular addition. And these thoughts sort of stopped me having anything more interesting to add.
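(A toy version of the point, with 113 taken from the setup being discussed and everything else just for illustration: if the inputs are restricted so that sums never reach the modulus, then ‘plain’ addition and ‘addition modulo 113’ generate literally identical data, so nothing in the data reveals which one was intended.)

```python
P = 113  # the modulus in the modular-addition setup

# Restrict to input pairs whose sum stays below the modulus; on these,
# reduction mod P never fires, so the two "tasks" coincide exactly.
pairs = [(a, b) for a in range(P) for b in range(P) if a + b <= P - 1]
assert all((a + b) % P == a + b for a, b in pairs)
print(f"{len(pairs)} pairs; plain and mod-{P} addition agree on all of them")
```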
I’m still not sure I buy the examples. In the early parts of the post you seem to contrast ‘machine learning research agendas’ with ‘foundational and mathematical’/‘agent foundations’ type stuff. Mechanistic interpretability can be quite mathematical but surely it falls into the former category? i.e. it is essentially ML work as opposed to constituting an example of people doing “mathematical and foundational” work.
I can’t say much about the Goodhart’s Law comment but it seems at best unclear that its link to goal misgeneralization is an example of the kind you are looking for (i.e. in the absence of much more concrete examples, I have approximately no reason to believe it has anything to do with what one would call mathematical work).
Strongly upvoted.
I roughly think that a few examples showing that this statement is true will 100% make OP’s case. And that without such examples, it’s very easy to remain skeptical.
Currently, it takes a very long time to get an understanding of who is doing what in the field of AI Alignment and how good each plan is, what the problems are, etc.
Is this not ~normal for a field that is maturing? And by normal I also mean approximately unavoidable or ‘essential’. Like I could say ‘it sure takes a long time to get an understanding of who is doing what in the field of… computer science’, but I have no reason to believe that I can substantially ‘fix’ this situation in the space of a few months. It just really is because there is lots of complicated research going on by lots of different people, right? And ‘understanding’ what another researcher is doing is sometimes a really, really hard thing to do.
I think that perhaps as a result of a balance of pros and cons, I initially was not very motivated to comment (and haven’t been very motivated to engage much with ARC’s recent work). But I decided maybe it’s best to comment in a way that gives a better signal than silence.
I’ve generally been pretty confused about Formalizing the presumption of Independence and, as the post sort of implies, this is sort of the main advert that ARC have at the moment for the type of conceptual work that they are doing, so most of what I have to say is meta stuff about that.
Disclaimers: a) I have not spent a lot of time trying to understand everything in the paper; and b) as is often the case, this comment may come across as overly critical, but it seems highest leverage to discuss my biggest criticisms, i.e. the things that, if they were addressed, might cause me to update to the point where I would more strongly recommend that people apply etc.
I suppose the tldr is that the paper’s main claimed contribution is the framing of a set of open problems, but I did not find the paper able to convince me that the problems are useful ones or that they would be interesting to answer.

I can try to explain a little more: It seemed odd that the “potential” applications to ML were mentioned very briefly in the final appendix of the paper, when arguably the potential impact or usefulness of the paper really hinges on this. As a reader, it might seem natural to me that the authors would have already asked and answered—before writing the paper—questions like “OK so what if I had this formal heuristic estimator? What exactly can I use it for? What can I actually (or even practically) do with it?” Some of what was said in the paper was fairly vague stuff like:
If successful, it may also help improve our ability to verify reasoning about complex questions, like those emerging in modern machine learning, for which we expect formal proof to be impossible.
In my opinion, it’s also important to bear in mind that the criterion of a problem being ‘open’ is a poor proxy for things like usefulness/interestingness (obviously those famous number theory problems are open, but so are loads of random mathematical statements). The usefulness/interestingness of course comes because people recognize various other valuable things too, like: that the solution would seem to require new insights into X and therefore a proof would ‘have to be’ deeply interesting in its own right; or that the truth of the statement implies all sorts of other interesting things; or that the articulation of the problem itself has captured and made rigorous some hitherto messy confusion; and so on. Perhaps more of these things need to be made explicit in order to argue more effectively that ARC’s stating of these open problems about heuristic estimators is an interesting contribution in itself?
To be fair, in the final paragraph of the paper there are some remarks that sort of admit some of what I’m saying:
Neither of these applications [to avoiding catastrophic failures or to ELK] is straightforward, and it should not be obvious that heuristic arguments would allow us to achieve either goal.
But practically it means that when I ask myself something like: ‘Why would I drop whatever else I’m working on and work on this stuff?’ I find it quite hard to answer in a way that’s not basically just all deference to some ‘vision’ that is currently undeclared (or as the paper says “mostly defer[red]” to “future articles”).
Having said all this I’ll reiterate again that there are lots of clear pros to a job like this and I do think that there is important work to be done that is probably not too dissimilar from the kind being talked about in Formalizing the presumption of Independence and in this post.
How exactly can an org like this help solve what many people see as one of the main bottlenecks: the issue of mentorship? How would Catalyze actually tip the scales when it comes to ‘mentor matching’?
(e.g. see Richard Ngo’s first high-level point in this career advice post)
I for one would find it helpful if you included a link to at least one place that Eliezer had made this claim just so we can be sure we’re on the same page.
Roughly speaking, what I have in mind is that there are at least two possible claims. One is that ‘we can’t get AI to do our alignment homework’ because by the time we have a very powerful AI that can solve alignment homework, it is already too dangerous to use the fact it can solve the homework as a safety plan. And the other is the claim that there’s some sort of ‘intrinsic’ reason why an AI built by humans could never solve alignment homework.