kaarelh AT gmail DOT com
Kaarel
a thing i think is probably happening and significant in such cases: developing good ‘concepts/ideas’ to handle a problem, ‘getting a feel for what’s going on in a (conceptual) situation’
a plausibly analogous thing in humanity(-seen-as-a-single-thinker): humanity states a conjecture in mathematics, spends centuries playing around with related things (tho paying some attention to that conjecture), building up mathematical machinery/understanding, until a proof of the conjecture almost just falls out of the machinery/understanding
I find it surprising/confusing/confused/jarring that you speak of models-in-the-sense-of-mathematical-logic=:L-models as the same thing as (or as a precise version of) models-as-conceptions-of-situations=:C-models. To explain why these look to me like two pretty much entirely distinct meanings of the word ‘model’, let me start by giving some first brushes of a picture of C-models. When one employs a C-model, one likens a situation/object/etc of interest to a situation/object/etc that is already understood (perhaps a mathematical/abstract one), that one expects to be better able to work/play with. For example, when one has data about sun angles at a location throughout the day and one is tasked with figuring out the distance from that location to the north pole, one translates the question to a question about 3d space with a stationary point sun and a rotating sphere and an unknown point on the sphere and so on. (I’m not claiming a thinker is aware of making such a translation when they make it.) Employing a C-model is making an analogy. From inside a thinker, the objects/situations on each side of the analogy look like… well, things/situations; from outside a thinker, both sides are thinking-elements.[1] (I think there’s a large GOFAI subliterature trying to make this kind of picture precise but I’m not that familiar with it; here are two papers that I’ve only skimmed: https://www.qrg.northwestern.edu/papers/Files/smeff2(searchable).pdf , https://api.lib.kyushu-u.ac.jp/opac_download_md/3070/76.ps.tar.pdf .)
I’m not that happy with the above picture of C-models, but I think that it seeming like an even sorta reasonable candidate picture might be sufficient to see how C-models and L-models are very different, so I’ll continue in that hope. I’ll assume we’re already on the same page about what an L-model is ( https://en.wikipedia.org/wiki/Model_theory ). Here are some ways in which C-models and L-models differ that imo together make them very different things:
An L-model is an assignment of meaning to a language, a ‘mathematical universe’ together with a mapping from symbols in a language to stuff in that universe — it’s a semantic thing one attaches to a syntax. The two sides of a C-modeling-act are both things/situations which are roughly equally syntactic/semantic (more precisely: each side is more like a syntactic thing when we try to look at a thinker from the outside, and just not well-placed on this axis from the thinker’s internal point of view, but if anything, the already-understood side of the analogy might look more like a mechanical/syntactic game than the less-understood side, eg when you are aware that you are taking something as a C-model).
Both sides of a C-model are things/situations one can reason about/with/in. An L-model takes from a kind of reasoning (proving, stating) system to an external universe which that system could talk about.
An L-model is an assignment of a static world to a dynamic thing; the two sides of a C-model are roughly equally dynamic.
A C-model might ‘allow you to make certain moves without necessarily explicitly concerning itself much with any coherent mathematical object that these might be tracking’. Of course, if you are employing a C-model and you ask yourself whether you are thinking about some thing, you will probably answer that you are, but in general it won’t be anywhere close to ‘fully developed’ in your mind, and even if it were (whatever that means), that wouldn’t be all there is to the C-model. For an extreme example, we could maybe even imagine a case where a C-model is given with some ‘axioms and inference rules’ such that if one tried to construct a mathematical object ‘wrt which all these axioms and inference rules would be valid’, one would not be able to construct anything — one would find that one has been ‘talking about a logically impossible object’. Maybe physicists handling infinities gracefully when calculating integrals in QFT is a fun example of this? This is in contrast with an L-model which doesn’t involve anything like axioms or inference rules at all and which is ‘fully developed’ — all terms in the syntax have been given fixed referents and so on.
(this point and the ones after are in some tension with the main picture of C-models provided above but:) A C-model could be like a mental context/arena where certain moves are made available/salient, like a game. It seems difficult to see an L-model this way.
A C-model could also be like a program that can be run with inputs from a given situation. It seems difficult to think of an L-model this way.
A C-model can provide a way to talk about a situation, a conceptual lens through which to see a situation, without which one wouldn’t really be able to [talk about]/see the situation at all. It seems difficult to see an L-model as ever doing this. (Relatedly, I also find it surprising/confusing/confused/jarring that you speak of reasoning using C-models as a semantic kind of reasoning.)
(But maybe I’m grouping like a thousand different things together unnaturally under C-models and you have some single thing or a subset in mind that is in fact closer to L-models?)
All this said, I don’t want to claim that no helpful analogy could be made between C-models and L-models. Indeed, I think there is the following important analogy between C-models and L-models:
When we look for a C-model to apply to a situation of interest, perhaps we often look for a mathematical object/situation that satisfies certain key properties satisfied by the situation. Likewise, an L-model of a set of sentences is (roughly speaking) a mathematical object which satisfies those sentences.
(Acknowledgments. I’d like to thank Dmitry Vaintrob and Sam Eisenstat for related conversations.)
[1] This is complicated a bit by a thinker also commonly looking at the C-model partly as if from the outside, in particular when a thinker critiques the C-model to come up with a better one. For example, you might notice that the situation of interest has some property that the toy situation you are analogizing it to lacks, and then try to fix that. For instance, to guess the density of twin primes, you might start from a naive analogy to a probabilistic situation where each ‘prime’ p has probability (p-1)/p of not dividing each ‘number’ independently at random, but then realize that your analogy is lacking because really p not dividing n makes it a bit less likely that p doesn’t divide n+2, and adjust your analogy. This involves a mental move that also looks at the analogy ‘from the outside’ a bit.
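To make the twin-prime adjustment concrete, here is a rough Python sketch. It compares, prime by prime, the naive "independent" chance that p divides neither n nor n+2, namely ((p-1)/p)^2, with the actual chance (1/2 for p = 2, (p-2)/p otherwise), and multiplies the ratios into a single correction factor. The function names, the cutoff on primes, and the use of the sum of 1/log(n)^2 as the naive density are my choices for illustration, not something taken from the comment above.

```python
from math import log


def primes_up_to(n):
    """Sieve of Eratosthenes: list of all primes <= n."""
    sieve = bytearray([1]) * (n + 1)
    sieve[0:2] = b"\x00\x00"
    for i in range(2, int(n ** 0.5) + 1):
        if sieve[i]:
            # Zero out the multiples of i, starting at i*i.
            sieve[i * i :: i] = bytearray(len(range(i * i, n + 1, i)))
    return [i for i, flag in enumerate(sieve) if flag]


def correction_factor(prime_bound):
    """Product over primes p <= prime_bound of
    (true chance p divides neither n nor n+2) / (naive independent chance)."""
    factor = 1.0
    for p in primes_up_to(prime_bound):
        # For p = 2, n and n+2 have the same parity; otherwise n must avoid
        # two distinct residues mod p.
        true_chance = 1 - 1 / p if p == 2 else 1 - 2 / p
        naive_chance = (1 - 1 / p) ** 2
        factor *= true_chance / naive_chance
    return factor


def count_twin_primes(x):
    """Number of primes p <= x such that p + 2 is also prime."""
    primes = set(primes_up_to(x + 2))
    return sum(1 for p in primes if p <= x and p + 2 in primes)


def adjusted_estimate(x, factor):
    """Naive density 1/log(n)^2 summed up to x, times the correction factor."""
    return factor * sum(1 / log(n) ** 2 for n in range(3, x + 1))


if __name__ == "__main__":
    x = 10 ** 5
    factor = correction_factor(10 ** 4)
    print("correction factor:", factor)
    print("actual twin primes up to x:", count_twin_primes(x))
    print("adjusted estimate:", adjusted_estimate(x, factor))
```

The correction comes out larger than 1 only because of p = 2; every odd prime contributes a ratio slightly below 1, which is the "a bit less likely" effect described above.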
That said, the hypothetical you give is cool and I agree the two principles decouple there! (I intuitively want to save that case by saying the COM is only stationary in a covering space where the train has in fact moved a bunch by the time it stops, but idk how to make this make sense for a different arrangement of portals.) I guess another thing that seems a bit compelling for the two decoupling is that conservation of angular momentum is analogous to conservation of momentum but there’s no angular analogue to the center of mass (that’s rotating uniformly, anyway). I guess another thing that’s a bit compelling is that there’s no nice notion of a center of energy once we view spacetime as being curved ( https://physics.stackexchange.com/a/269273 ). I think I’ve become convinced that conservation of momentum is a significantly bigger principle :). But still, the two seem equivalent to me before one gets to general relativity. (I guess this actually depends a bit on what the proof of 12.72 is like — in particular, if that proof basically uses the conservation of momentum, then I’d be more happy to say that the two aren’t equivalent already for relativity/fields.)
here’s a picture from https://hansandcassady.org/David%20J.%20Griffiths-Introduction%20to%20Electrodynamics-Addison-Wesley%20(2012).pdf :
Given 12.72, uniform motion of the center of energy is equivalent to conservation of momentum, right? P is const ⇔ dR_e/dt is const.
(I’m guessing 12.72 is in fact correct here, but I guess we can doubt it — I haven’t thought much about how to prove it when fields and relativistic and quantum things are involved. From a cursory look at his comment, Lubos Motl seems to consider it invalid lol ( in https://physics.stackexchange.com/a/3200 ).)
The microscopic picture that Mark Mitchison gives in the comments to this answer seems pretty: https://physics.stackexchange.com/a/44533 — though idk if I trust it. The picture seems to be to think of glass as being sparse, with the photon mostly just moving with its vacuum velocity and momentum, but with a sorta-collision between the photon and an electron happening every once in a while. I guess each collision somehow takes a certain amount of time but leaves the photon unchanged otherwise, and presumably bumps that single electron a tiny bit to the right. (Idk why the collisions happen this way. I’m guessing maybe one needs to think of the photon as some electromagnetic field thing or maybe as a quantum thing to understand that part.)
And the loss mechanism I was imagining was more like something linear in the distance traveled, like causing electrons to oscillate but not completely elastically wrt the ‘photon’ inside the material.
Anyway, in your argument for the redshift as the photon enters the block, I worry about the following:
can we really think of 1 photon entering the block becoming 1 photon inside the block, as opposed to needing to think about some wave thing that might translate to photons in some other way or maybe not translate to ordinary photons at all inside the material (this is also my second worry from earlier)?
do we know that this photon-inside-the-material has energy ?
re redshift: Sorry, I should have been clearer, but I meant to talk about redshift (or another kind of energy loss) of the light that comes out of the block on the right compared to the light that went in from the left, which would cause issues with going from there being a uniformly-moving (here stationary) center of mass to the conclusion about the location of the block. (I’m guessing you were right when you assumed in your argument that redshift is 0 for our purposes, but I don’t understand light in materials well enough atm to see this at a glance.)
Note however, that the principle being broken (uniform motion of centre of mass) is not at all one of the “big principles” of physics, especially not with the extra step of converting the photon energy to mass. I had not previously heard of the principle, and don’t think it is anywhere near the weight class of things like momentum conservation.
I found these sentences surprising. To me, the COM moving at constant velocity (in an inertial frame) is Newton’s first law, which is one of the big principles (and I also have a mental equality between that and conservation of momentum).
I guess we can also reach your conclusion in that thought experiment arguing from conservation of momentum directly (though I guess the argument I’ll give just contains a proof of one direction of the equivalence to the conservation of momentum as a step). Ignoring relativity for a second, we could go into the center of mass frame as the particle approaches the piece of glass from the left, then note that the momentum in this frame needs to be zero forever (by conservation of momentum), then note $\int p \text{d}t=m\delta(x)$, where $\delta(x)$ is the distance moved by the center of mass, from which $\delta(x)=0$. I would guess that essentially the same argument also works when relativistic things like photons are involved (and when fields or quantum stuff is involved), as long as one replaces the center of mass by the center of energy ( https://physics.stackexchange.com/questions/742770/centre-of-energy-in-special-relativity ).
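Spelled out for the non-relativistic case only, the step above can be written as follows. The symbols are mine: $M$ is the total mass (the $m$ above), $X(t)$ is the center-of-mass position, $P(t)$ is the total momentum in the chosen frame, and $X(t_1)-X(t_0)$ plays the role of $\delta(x)$. This is just a sketch of the classical point-particle version; it says nothing about whether the center-of-energy analogue goes through with fields.

```latex
\begin{align*}
M X(t) &= \sum_i m_i x_i(t)
  && \text{(definition of the center of mass)} \\
M \dot X(t) &= \sum_i m_i \dot x_i(t) = P(t)
  && \text{(differentiate; the masses are constant)} \\
\int_{t_0}^{t_1} P(t)\,\mathrm{d}t &= M\bigl(X(t_1) - X(t_0)\bigr)
  && \text{(integrate over time)} \\
P(t) \equiv 0 \;&\Longrightarrow\; X(t_1) = X(t_0)
  && \text{(center-of-mass frame, momentum conserved)}
\end{align*}
```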
One thing that worries me about that thought experiment more than [whether Newton’s first law carries over to this context] is the assumption that (in ideal conditions) photons do not lose any energy to the material, i.e. that they don’t end up redshifted or something. (If photons got redshifted as they go through, then the photons would lose some energy and the block would end up with some momentum and heat, obviously causing issues with the broader argument.) Still, I guess it’s probably fine to say that frequency/energy of the light is indeed conserved ( https://physics.stackexchange.com/questions/810869/why-does-the-energy-and-thus-frequency-of-a-photon-entering-glass-stay-constan ), but I unfortunately don’t atm understand how to think about a light packet (or something) going through a (potentially moving) material well enough to decide for myself. (ChatGPT tells me of some standard argument involving the displacement field, but I haven’t decided if I’ll trust that argument in this context yet. I also tried to see whether such an effect would be higher-order in some parameter even if it existed but I didn’t see a good reason why that would be the case.)
A second thing that worries me about this argument even more is whether it even makes sense to talk about individual photons passing through materials — I think the argument doesn’t make sense if photon number is not conserved before vs after a light pulse enters a material (here I’m thinking of the light pulse having small horizontal extent compared to the material). But I really haven’t thought very carefully about this. (Also, I’d like to point out that if some kind of light packet number were conserved and we are operating with a notion of momentum such that all of it can be attributed to wave packets, then momentum conservation implies the momentum attributed to a given packet stays constant. But I guess some of it might be more naturally attributed to stuff in the block at some point. I’d need to think more about what kind of partition would be most natural.)
It additionally seems likely to me that we are presently missing major parts of a decent language for talking about minds/models, and developing such a language requires (and would constitute) significant philosophical progress. There are ways to ‘understand the algorithm a model is’ that are highly insufficient/inadequate for doing what we want to do in alignment — for instance, even if one gets from where interpretability is currently to being able to replace a neural net by a somewhat smaller boolean (or whatever) circuit and is thus able to translate various NNs to such circuits and proceed to stare at them, one probably won’t thereby be more than of the way to the kind of strong understanding that would let one modify a NN-based AGI to be aligned or build another aligned AI (in case alignment doesn’t happen by default) (much like how knowing the weights doesn’t deliver that kind of understanding). To even get to the point where we can usefully understand the ‘algorithms’ models implement, I feel like we might need to have answered sth like (1) what kind of syntax should we see thinking as having — for example, should we think of a model/mind as a library of small programs/concepts that are combined and updated and created according to certain rules (Minsky’s frames?), or as having a certain kind of probabilistic world model that supports planning in a certain way, or as reasoning in a certain internal logical language, or in terms of having certain propositional attitudes; (2) what kind of semantics should we see thinking as having — what kind of correspondence between internals of the model/mind and the external world should we see a model as maintaining(; also, wtf are values). 
I think that trying to find answers to these questions by ‘just looking’ at models in some ML-brained, non-philosophical way is unlikely to be competitive with trying to answer these questions with an attitude of taking philosophy (agent foundations) seriously, because one will only have any hope of seeing the cognitive/computational structure in a mind/model by staring at it if one stares at it already having some right ideas about what kind of structure to look for. For example, it’d be very tough to try to discover [first-order logic]/ZFC/[type theory] by staring at the weights/activations/whatever of the brain of a human mathematician doing mathematical reasoning, from a standpoint where one hasn’t already invented [first-order logic]/ZFC/[type theory] via some other route — if one starts from the low-level structure of a brain, then first-order logic will only appear as being implemented in the brain in some ‘highly encrypted’ way.
There’s really a spectrum of claims here that would all support the claim that agent foundations is good for understanding the ‘algorithm’ a model/mind is to various degrees. A stronger one than what I’ve been arguing for is that once one has these ideas, one needn’t stare at models at all, and that staring at models is unlikely to help one get the right ideas (e.g. because it’s better to stare at one’s own thinking instead, and to think about how one could/should think, sort of like how [first-order logic]/ZFC/[type theory] was invented), so one’s best strategy does not involve staring at models; a weaker one than what I’ve been arguing is that having more and better ideas about the structure of minds would be helpful when staring at models. I like TsviBT’s koan on this topic.
Confusion #2: Why couldn’t we make similar counting arguments for Turing machines?
I guess a central issue with separating NP from P with a counting argument is that (roughly speaking) there are equally many problems in NP and P. Each problem in NP has a polynomial-time verifier, so we can index the problems in NP by polytime algorithms, just like the problems in P.
in a bit more detail: We could try to use a counting argument to show that there is some problem with a (say) time verifier which does not have any (say) time solver. To do this, we’d like to say that there are more verifier problems than algorithms. While I don’t really know how we ought to count these (naively, there are of each), even if we had some decent notion of counting, there would almost certainly just be more algorithms than verifiers (since the verifiers are themselves algorithms).
To clarify, I think in this context I’ve only said that the claim “The minimax regret rule (sec 5.4.2 of Bradley (2012)) is equivalent to EV max w.r.t. the distribution in your representor that induces maximum regret” (and maybe the claim after it) was “false/nonsense” — in particular, because it doesn’t make sense to talk about a distribution that induces maximum regret (without reference to a particular action) — which I’m guessing you agree with.
I wanted to say that I endorse the following:
Neither of the two decision rules you mentioned is (in general) consistent with any EV max if we conceive of it as giving your preferences (not just picking out a best option), nor if we conceive of it as telling you what to do on each step of a sequential decision-making setup.
I think basically any setup is an example for either of these claims. Here’s a canonical counterexample for the version with preferences and the max_{actions} min_{probability distributions} EV (i.e., infrabayes) decision rule, i.e. with our preferences corresponding to the min_{probability distributions} EV ranking:
Let $A$ and $B$ be actions and let $C$ be flipping a fair coin and then doing $A$ or $B$ depending on the outcome. It is easy to construct a case where the max-min rule strictly prefers $C$ to $A$ and also strictly prefers $C$ to $B$, and indeed where this preference is strong enough that the rule still strictly prefers $C$ to a small enough sweetening of $A$ and also still prefers $C$ to a small enough sweetening of $B$ (in fact, a generic setup will have such a triple). Call these sweetenings $A^+$ and $B^+$ (think of these as $A$-but-you-also-get-one-cent or $B$-but-you-also-get-one-extra-moment-of-happiness or whatever; the important thing is that all utility functions under consideration should consider this one cent or one extra moment of happiness or whatever a positive). However, every EV max rule (that cares about the one cent) will strictly disprefer $C$ to at least one of $A^+$ or $B^+$, because if that weren’t the case, the EV max rule would need to weakly prefer $C$ over a coinflip between $A^+$ and $B^+$, but this is just saying that the EV max rule weakly prefers $C$ to $C^+$, which contradicts it caring about sweetening. So these min preferences are incompatible with maximizing any EV.
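Here is one toy instance of the counterexample above, as a Python sketch. Everything concrete in it is an assumption made for illustration: two world states, a representor of two distributions (probability 1/10 or 9/10 on the first state), a single fixed utility function, a one-cent sweetening worth 1/100 utility, and the names `A`, `B`, `C`, `A_plus`, `B_plus`. Exact fractions are used so the comparisons are not affected by floating-point rounding.

```python
from fractions import Fraction as F

# An action is a pair of utilities, one per world state.
A = (F(1), F(0))   # pays off only in state 0
B = (F(0), F(1))   # pays off only in state 1
EPS = F(1, 100)    # utility of the "one cent" sweetening (assumed value)


def sweeten(action, eps=EPS):
    """The same action, but you also get one cent in every state."""
    return tuple(u + eps for u in action)


def mix(a, b):
    """Fair coin flip between a and b (the coin is independent of the state)."""
    return tuple((x + y) / 2 for x, y in zip(a, b))


def ev(action, q):
    """Expected utility when state 0 has probability q."""
    return q * action[0] + (1 - q) * action[1]


def maxmin_value(action, representor):
    """Worst-case expected utility over the set of distributions."""
    return min(ev(action, q) for q in representor)


REPRESENTOR = [F(1, 10), F(9, 10)]
C = mix(A, B)
A_plus, B_plus = sweeten(A), sweeten(B)

# The max-min rule strictly prefers the coin flip to both sweetened actions.
maxmin_prefers_C = (
    maxmin_value(C, REPRESENTOR) > maxmin_value(A_plus, REPRESENTOR)
    and maxmin_value(C, REPRESENTOR) > maxmin_value(B_plus, REPRESENTOR)
)


def ev_max_agrees(q):
    """Does EV max under q weakly prefer C to both sweetened actions?"""
    return ev(C, q) >= ev(A_plus, q) and ev(C, q) >= ev(B_plus, q)


# Scan a grid of single distributions, including ones outside the representor.
some_ev_max_agrees = any(ev_max_agrees(F(k, 1000)) for k in range(1001))

print("max-min strictly prefers C:", maxmin_prefers_C)
print("some single distribution agrees:", some_ev_max_agrees)
```

The grid scan is only a numerical check; the averaging argument in the paragraph above is what rules out every distribution, on or off the grid.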
There is a canonical way in which a counterexample in preference-land can be turned into a counterexample in sequential-decision-making-land: just make the “sequential” setup really just be a two-step game where you first randomly pick a pair of actions to give the agent a choice between, and then the agent makes some choice. The game forces the max min agent to “reveal its preferences” sufficiently for its policy to be revealed to be inconsistent with EV maxing. (This is easiest to see if the agent is forced to just make a binary choice. But it’s still true even if you avoid the strictly binary choice being forced upon the agent by saying that the agent still has access to (internal) randomization.)
Regarding the Thornley paper you link: I’ve said some stuff about it in my earlier comments; my best guess for what to do next would be to prove some theorem about behavior that doesn’t make explicit use of a completeness assumption, but also it seems likely that this would fail to relate sufficiently to our central disagreements to be worthwhile. I guess I’m generally feeling like I might bow out of this written conversation soon/now, sorry! But I’d be happy to talk more about this synchronously — if you’d like to schedule a meeting, feel free to message me on the LW messenger.
Oh ok yea that’s a nice setup and I think I know how to prove that claim — the convex optimization argument I mentioned should give that. I still endorse the branch of my previous comment that comes after considering roughly that option though:
That said, if we conceive of the decision rule as picking out a single action to perform, then because the decision rule at least takes Pareto improvements, I think a convex optimization argument says that the single action it picks is indeed the maximal EV one according to some distribution (though not necessarily one in your set). However, if we conceive of the decision rule as giving preferences between actions or if we try to use it in some sequential setup, then I’m >95% sure there is no way to see it as EV max (except in some silly way, like forgetting you had preferences in the first place).
Sorry, I feel like the point I wanted to make with my original bullet point is somewhat vaguer/different than what you’re responding to. Let me try to clarify what I wanted to do with that argument with a caricatured version of the present argument-branch from my point of view:
your original question (caricatured): “The Sun prayer decision rule is as follows: you pray to the Sun; this makes a certain set of actions seem auspicious to you. Why not endorse the Sun prayer decision rule?”
my bullet point: “Bayesian expected utility maximization has this big red arrow pointing toward it, but the Sun prayer decision rule has no big red arrow pointing toward it.”
your response: “Maybe a few specific Sun prayer decision rules are also pointed to by that red arrow?”
my response: “The arrow does not point toward most Sun prayer decision rules. In fact, it only points toward the ones that are secretly bayesian expected utility maximization. Anyway, I feel like this does very little to address my original point that there is this big red arrow pointing toward bayesian expected utility maximization and no big red arrow pointing toward Sun prayer decision rules.”
(See the appendix to my previous comment for more on this.)
That said, I admit I haven’t said super clearly how the arrow ends up pointing to structuring your psychology in a particular way (as opposed to just pointing at a class of ways to behave). I think I won’t do a better job at this atm than what I said in the second paragraph of my previous comment.
The minimax regret rule (sec 5.4.2 of Bradley (2012)) is equivalent to EV max w.r.t. the distribution in your representor that induces maximum regret.
I’m (inside view) 99.9% sure this will be false/nonsense in a sequential setting. I’m (inside view) 99% sure this is false/nonsense even in the one-shot case. I guess the issue is that different actions get assigned their max regret by different distributions, so I’m not sure what you mean when you talk about the distribution that induces maximum regret. And indeed, it is easy to come up with a case where the action that gets chosen is not best according to any distribution in your set of distributions: let there be one action which is uniformly fine and also for each distribution in the set, let there be an action which is great according to that distribution and disastrous according to every other distribution; the uniformly fine action gets selected, but this isn’t EV max for any distribution in your representor. That said, if we conceive of the decision rule as picking out a single action to perform, then because the decision rule at least takes Pareto improvements, I think a convex optimization argument says that the single action it picks is indeed the maximal EV one according to some distribution (though not necessarily one in your set). However, if we conceive of the decision rule as giving preferences between actions or if we try to use it in some sequential setup, then I’m >95% sure there is no way to see it as EV max (except in some silly way, like forgetting you had preferences in the first place).
The maximin rule (sec 5.4.1) is equivalent to EV max w.r.t. the most pessimistic distribution.
I didn’t think about this as carefully, but >90% that the paragraph above also applies with minor changes.
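The one-shot counterexample described two paragraphs up (one uniformly fine action, plus one action per distribution that is great under that distribution and disastrous under the others) can be checked in a few lines of Python. The specifics are assumptions for illustration only: three states, a representor made of the three point-mass distributions, and the payoffs 8 (fine), 10 (great), and -100 (disastrous). In this toy case the same action is also what the maximin rule picks.

```python
N = 3  # number of states, and also of distributions in the representor

# Representor: point mass on each state (an assumption made for simplicity).
REPRESENTOR = [[1 if s == i else 0 for s in range(N)] for i in range(N)]

# Utilities per state for each action.
ACTIONS = {"fine": [8] * N}
for i in range(N):
    ACTIONS[f"great_{i}"] = [10 if s == i else -100 for s in range(N)]


def ev(utils, dist):
    """Expected utility of an action under one distribution."""
    return sum(p * u for p, u in zip(dist, utils))


def regret(name, dist):
    """Shortfall of an action relative to the best action under dist."""
    best = max(ev(u, dist) for u in ACTIONS.values())
    return best - ev(ACTIONS[name], dist)


def minimax_regret_choice():
    """Action whose worst-case regret over the representor is smallest."""
    return min(ACTIONS, key=lambda a: max(regret(a, d) for d in REPRESENTOR))


def maximin_choice():
    """Action whose worst-case expected utility is largest."""
    return max(ACTIONS, key=lambda a: min(ev(ACTIONS[a], d) for d in REPRESENTOR))


def ev_max_choice(dist):
    """Action with the highest expected utility under a single distribution."""
    return max(ACTIONS, key=lambda a: ev(ACTIONS[a], dist))


print("minimax regret picks:", minimax_regret_choice())
print("maximin picks:", maximin_choice())
print("EV max picks, per distribution:", [ev_max_choice(d) for d in REPRESENTOR])
```

Both imprecise rules select the uniformly fine action here, while EV max under each distribution in the representor selects that distribution's own great action instead.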
You might say “Then why not just do precise EV max w.r.t. those distributions?” But the whole problem you face as a decision-maker is, how do you decide which distribution? Different distributions recommend different policies. If you endorse precise beliefs, it seems you’ll commit to one distribution that you think best represents your epistemic state. Whereas someone with imprecise beliefs will say: “My epistemic state is not represented by just one distribution. I’ll evaluate the imprecise decision rules based on which decision-theoretic desiderata they satisfy, then apply the most appealing decision rule (or some way of aggregating them) w.r.t. my imprecise beliefs.” If the decision procedure you follow is psychologically equivalent to my previous sentence, then I have no objection to your procedure — I just think it would be misleading to say you endorse precise beliefs in that case.
I think I agree in some very weak sense. For example, when I’m trying to diagnose a health issue, I do want to think about which priors and likelihoods to use — it’s not like these things are immediately given to me or something. In this sense, I’m at some point contemplating many possible distributions to use. But I guess we do have some meaningful disagreement left — I guess I take the most appealing decision rule to be more like pure aggregation than you do; I take imprecise probabilities with maximality to be a major step toward madness from doing something that stays closer to expected utility maximization.
But the CCT only says that if you satisfy [blah], your policy is consistent with precise EV maximization. This doesn’t imply your policy is inconsistent with Maximality, nor (as far as I know) does it tell you what distribution with respect to which you should maximize precise EV in order to satisfy [blah] (or even that such a distribution is unique). So I don’t see a positive case here for precise EV maximization [ETA: as a procedure to guide your decisions, that is]. (This is also my response to your remark below about “equivalent to “act consistently with being an expected utility maximizer”.”)
I agree that any precise EV maximization (which imo = any good policy) is consistent with some corresponding maximality rule — in particular, with the maximality rule with the very same single precise probability distribution and the same utility function (at least modulo some reasonable assumptions about what ‘permissibility’ means). Any good policy is also consistent with any maximality rule that includes its probability distribution as one distribution in the set (because this guarantees that the best-according-to-the-precise-EV-maximization action is always permitted), as well as with any maximality rule that makes anything permissible. But I don’t see how any of this connects much to whether there is a positive case for precise EV maximization? If you buy the CCT’s assumptions, then you literally do have an argument that anything other than precise EV maximization is bad, right, which does sound like a positive case for precise EV maximization (though not directly in the psychological sense)?
ETA: as a procedure to guide your decisions, that is
Ok, maybe you’re saying that the CCT doesn’t obviously provide an argument for it being good to restructure your thinking into literally maintaining some huge probability distribution on ‘outcomes’ and explicitly maintaining some function from outcomes to the reals and explicitly picking actions such that the utility conditional on these actions having been taken by you is high (or whatever)? I agree that trying to do this very literally is a bad idea, eg because you can’t fit all possible worlds (or even just one world) in your head, eg because you don’t know likelihoods given hypotheses as you’re not logically omniscient, eg because there are difficulties with finding yourself in the world, etc — when taken super literally, the whole shebang isn’t compatible with the kinds of good reasoning we actually can do and do do and want to do. I should say that I didn’t really track the distinction between the psychological and behavioral question carefully in my original response, and had I recognized you to be asking only about the psychological aspect, I’d perhaps have focused on that more carefully in my original answer. Still, I do think the CCT has something to say about the psychological aspect as well — it provides some pro tanto reason to reorganize aspects of one’s reasoning to go some way toward assigning coherent numbers to propositions and thinking of decisions as having some kinds of outcomes and having a schema for assigning a number to each outcome and picking actions that lead to high expectations of this number. This connection is messy, but let me try to say something about what it might look like (I’m not that happy with the paragraph I’m about to give and I feel like one could write a paper at this point instead). The CCT says that if you ‘were wise’ — something like ‘if you were to be ultimately content with what you did when you look back at your life’ — your actions would need to be a particular way (from the outside). 
Now, you’re pretty interested in being content with your actions (maybe just instrumentally, because maybe you think that has to do with doing more good or being better). In some sense, you know you can’t be fully content with them (because of the reasons above). But it makes sense to try to move toward being more content with your actions. One very reasonable way to achieve this is to incorporate some structure into your thinking that makes your behavior come closer to having these desired properties. This can just look like the usual: doing a bayesian calculation to diagnose a health problem, doing an EV calculation to decide which research project to work on, etc.
(There’s a chance you take there to be another sense in which we can ask about the reasonableness of expected utility maximization that’s distinct from the question that broadly has to do with characterizing behavior and also distinct from the question that has to do with which psychology one ought to choose for oneself — maybe something like what’s fundamentally principled or what one ought to do here in some other sense — and you’re interested in that thing. If so, I hope what I’ve said can be translated into claims about how the CCT would relate to that third thing.)
Anyway, if the above did not provide a decent response to what you said, then it might be worthwhile to also look at the appendix (which I ended up deprecating after understanding that you might only be interested in the psychological aspect of decision-making). In that appendix, I provide some more discussion of the CCT saying that [maximality rules which aren’t behaviorally equivalent to expected utility maximization are dominated]. I also provide some discussion recentering the broader point I wanted to make with that bullet point: that CCT-type stuff is a big red arrow pointing toward expected utility maximization, whereas no remotely-as-big red arrow is known for [imprecise probabilities + maximality].
e.g. if one takes the cost of thinking into account in the calculation, or thinks of oneself as choosing a policy
Could you expand on this with an example? I don’t follow.
For example, preferential gaps are sometimes justified by appeals to cases like: “you’re moving to another country. you can take with you your Fabergé egg xor your wedding album. you feel like each is very cool, and in a different way, and you feel like you are struggling to compare the two. given this, it feels fine for you to flip a coin to decide which one (or to pick the one on the left, or to ‘just pick one’) instead of continuing to think about it. now you remember you have 10 dollars inside the egg. it still seems fine to flip a coin to decide which one to take (or to pick the one on the left, or to ‘just pick one’).”. And then one might say one needs preferential gaps to capture this. But someone sorta trying to maximize expected utility might think about this as: “i’ll pick a randomization policy for cases where i’m finding two things hard to compare. i think this has good EV if one takes deliberation costs into account, with randomization maybe being especially nice given that my utility is concave in the quantities of various things.”.
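A toy version of the "randomization policy with deliberation costs" reading, as a hedged sketch: assume two equally likely worlds (in one the egg is worth more to you, in the other the album is), a fixed cost for deliberating until you know which world you are in, and an optional small sweetening of the egg. The utilities, the cost, and the function names are all made up for illustration.

```python
HI, LO = 1000.0, 900.0        # hypothetical utilities of the better / worse item
DELIBERATION_COST = 80.0      # hypothetical cost of thinking until the comparison resolves


def worlds(bonus):
    """Two equally likely worlds, each given as (egg value, album value)."""
    return [(HI + bonus, LO), (LO + bonus, HI)]


def ev_coin_flip(bonus=0.0):
    """Policy: flip a fair coin between the two items without thinking further."""
    return sum(0.5 * (0.5 * egg + 0.5 * album) for egg, album in worlds(bonus))


def ev_deliberate(bonus=0.0):
    """Policy: pay the deliberation cost, learn the world, take the better item."""
    return sum(0.5 * max(egg, album) for egg, album in worlds(bonus)) - DELIBERATION_COST


for bonus in (0.0, 10.0):
    print(bonus, ev_coin_flip(bonus), ev_deliberate(bonus))
```

In this toy, the coin flip beats continued deliberation both with and without the 10 dollars, because the value of resolving the comparison is smaller than its cost. (Always grabbing the sweetened egg would score slightly higher still here; the sketch is only about cheap picking policies versus further deliberation.)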
Maximality and imprecision don’t make any reference to “default actions,”
I mostly mentioned defaultness because it appears in some attempts to precisely specify alternatives to bayesian expected utility maximization. One concrete relation is that one reasonable attempt at specifying what it is that you’ll do when multiple actions are permissible is that you choose the one that’s most ‘default’ (more precisely, if you have a prior on actions, you could choose the one with the highest prior). But if a notion of defaultness isn’t relevant for getting from your (afaict) informal decision rule to a policy, then nvm this!
I also don’t understand what’s unnatural/unprincipled/confused about permissibility or preferential gaps. They seem quite principled to me: I have a strict preference for taking action A over B (/ B is impermissible) only if I’m justified in beliefs according to which I expect A to do better than B.
I’m not sure I understand. Am I right in understanding that permissibility is defined via a notion of strict preferences, and the rest is intended as an informal restatement of the decision rule? In that case, I still feel like I don’t know what having a strict preference or permissibility means — is there some way to translate these things to actions? If the rest is intended as an independent definition of having a strict preference, then I still don’t know how anything relates to action either. (I also have some other issues in that case: I anticipate disliking the distinction between justified and unjustified beliefs being made (in particular, I anticipate thinking that a good belief-haver should just be thinking and acting according to their beliefs); it’s unclear to me what you mean by being justified in some beliefs (eg is this a non-probabilistic notion); are individual beliefs giving you expectations here or are all your beliefs jointly giving you expectations or is some subset of beliefs together giving you expectations; should I think of this expectation that A does better than B as coming from another internal conditional expected utility calculation). I guess maybe I’d like to understand how an action gets chosen from the permissible ones. If we do not in fact feel that all the actions are equal here (if we’d pay something to switch from one to another, say), then it starts to seem unnatural to make a distinction between two kinds of preference in the first place. (This is in contrast to: I feel like I can relate ‘preferences’ kinda concretely to actions in the usual vNM case, at least if I’m allowed to talk about money to resolve the ambiguity between choosing one of two things I’m indifferent between vs having a strict preference.)
Anyway, I think there’s a chance I’d be fine with sometimes thinking that various options are sort of fine in a situation, and I’m maybe even fine with this notion of fineness eg having certain properties under sweetenings of options, but I quite strongly dislike trying to make this notion of fineness correspond to this thing with a universal quantifier over your probability distributions, because it seems to me that (1) it is unhelpful because it (at least if implemented naively) doesn’t solve any of the computational issues (boundedness issues) that are a large part of why I’d entertain such a notion of fineness in the first place, (2) it is completely unprincipled (there’s no reason for this in particular, and the split of uncertainties is unsatisfying), and (3) it plausibly gives disastrous behavior if taken seriously. But idk maybe I can’t really even get behind that notion of fineness, and I’m just confusing it with the somewhat distinct notion of fineness that I use when I buy two different meals to distribute among myself and a friend and tell them that I’m fine with them having either one, which I think is well-reduced to probably having a smaller preference than my friend. Anyway, obviously whether such a notion of fineness is desirable depends on how you want it to relate to other things (in particular, actions), and I’m presently sufficiently unsure about how you want it to relate to these other things to be unsure about whether a suitable such notion exists.
basically everything becomes permissible, which seems highly undesirable
This is a much longer conversation, but briefly: I think it’s ad hoc / putting the cart before the horse to shape our epistemology to fit our intuitions about what decision guidance we should have.
It seems to me like you were like: “why not regiment one’s thinking xyz-ly?” (in your original question), to which I was like “if one regiments one’s thinking xyz-ly, then it’s an utter disaster” (in that bullet point), and now you’re like “even if it’s an utter disaster, I don’t care”. And I guess my response is that you should care about it being an utter disaster, but I guess I’m confused enough about why you wouldn’t care that it doesn’t make a lot of sense for me to try to write a library of responses.
Appendix with some things about CCT and expected utility maximization and [imprecise probabilities] + maximality that got cut
Precise EV maximization is a special case of [imprecise probabilities] + maximality (namely, the special case where your imprecise probabilities are in fact precise, at least modulo some reasonable assumptions about what things mean), so unless your class of decision rules turns out to be precisely equivalent to the class of decision rules which do precise EV maximization, the CCT does in fact say it contains some bad rules. (And if it did turn out to be equivalent, then I’d be somewhat confused about why we’re talking about it your way, because it’d seem to me like it’d then just be a less nice way to describe the same thing.) And at least on the surface, the class of decision rules does not appear to be equivalent, so the CCT indeed does speak against some rules in this class (and in fact, all rules in this class which cannot be described as precise EV maximization).
If you filled in the details of your maximality-type rule enough to tell me what your policy is — in particular, hypothetically, maybe you’d want to specify sth like the following: what it means for some options to be ‘permissible’ or how an option gets chosen from the ‘permissible options’, potentially something about how current choices relate to past choices, and maybe just what kind of POMDP, causal graph, decision tree, or whatever game setup we’re assuming in the first place — such that your behavior then looks like bayesian expected utility maximization (with some particular probability distribution and some particular utility function), then I guess I’ll no longer be objecting to you using that rule (to be precise: I would no longer be objecting to it for being dominated per the CCT or some such theorem, but I might still object to the psychological implementation of your policy on other grounds).
That said, I think the most straightforward ways [to start from your statement of the maximality rule and to specify some sequential setup and to make the rule precise and to then derive a policy for the sequential setup from the rule] do give you a policy which you would yourself consider dominated though. I can imagine a way to make your rule precise that doesn’t give you a dominated policy that ends up just being ‘anything is permissible as long as you make sure you looked like a bayesian expected utility maximizer at the end of the day’ (I think the rule of Thornley and Petersen is this), but at that point I’m feeling like we’re stressing some purely psychological distinction whose relevance to matters of interest I’m failing to see.
But maybe more importantly, at this point, I’d feel like we’ve lost the plot somewhat. What I intended to say with my original bullet point was more like: we’ve constructed this giant red arrow (i.e., coherence theorems; ok, it’s maybe not that giant in some absolute sense, but imo it is as big as presently existing arrows get for things this precise in a domain this messy) pointing at one kind of structure (i.e., bayesian expected utility maximization) to have ‘your beliefs and actions ultimately correspond to’, and then you’re like “why not this other kind of structure (imprecise probabilities, maximality rules) though?” and then my response was “well, for one, there is the giant red arrow pointing at this other structure, and I don’t know of any arrow pointing at your structure”, and I don’t really know how to see your response as a response to this.
Here are some brief reasons why I dislike things like imprecise probabilities and maximality rules (somewhat strongly stated, medium-strongly held because I’ve thought a significant amount about this kind of thing, but unfortunately quite sloppily justified in this comment; also, sorry if some things below approach being insufficiently on-topic):
I like the canonical arguments for bayesian expected utility maximization ( https://www.alignmentforum.org/posts/sZuw6SGfmZHvcAAEP/complete-class-consequentialist-foundations ; also https://web.stanford.edu/~hammond/conseqFounds.pdf seems cool (though I haven’t read it properly)). I’ve never seen anything remotely close for any of this other stuff — in particular, no arguments that pin down any other kind of rule compellingly. (I associate with this the vibe here (in particular, the paragraph starting with “To the extent that the outer optimizer” and the paragraph after it), though I guess maybe that’s not a super helpful thing to say.)
The arguments I’ve come across for these other rules look like pointing at some intuitive desiderata and saying these other rules sorta meet these desiderata whereas canonical bayesian expected utility maximization doesn’t, but I usually don’t really buy the desiderata and/or find that bayesian expected utility maximization also sorta has those desired properties, e.g. if one takes the cost of thinking into account in the calculation, or thinks of oneself as choosing a policy.
When specifying alternative rules, people often talk about things like default actions, permissibility, and preferential gaps, and these concepts seem bad to me. More precisely, they seem unnatural/unprincipled/confused/[I have a hard time imagining what they could concretely cache out to that would make the rule seem non-silly/useful]. For some rules, I think that while they might be psychologically different than ‘thinking like an expected utility maximizer’, they give behavior from the same distribution — e.g., I’m pretty sure the rule suggested here (the paragraph starting with “More generally”) and here (and probably elsewhere) is equivalent to “act consistently with being an expected utility maximizer”, which seems quite unhelpful if we’re concerned with getting a differently-behaving agent. (In fact, it seems likely to me that a rule which gives behavior consistent with expected utility maximization basically had to be provided in this setup given https://web.stanford.edu/~hammond/conseqFounds.pdf or some other canonical such argument, maybe with some adaptations, but I haven’t thought this through super carefully.) (A bunch of other people (Charlie Steiner, Lucius Bushnaq, probably others) make this point in the comments on https://www.lesswrong.com/posts/yCuzmCsE86BTu9PfA/there-are-no-coherence-theorems; I’m aware there are counterarguments there by Elliott Thornley and others; I recall not finding them compelling on an earlier pass through these comments; anyway, I won’t do this discussion justice in this comment.)
I think that if you try to get any meaningful mileage out of the maximality rule (in the sense that you want to “get away with knowing meaningfully less about the probability distribution”), basically everything becomes permissible, which seems highly undesirable. This is analogous to: as soon as you try to get any meaningful mileage out of a maximin (infrabayesian) decision rule, every action looks really bad — your decision comes down to picking the least catastrophic option out of options that all look completely catastrophic to you — which seems undesirable. It is also analogous to trying to find an action that does something or that has a low probability of causing harm ‘regardless of what the world is like’ being imo completely impossible (leading to complete paralysis) as soon as one tries to get any mileage out of ‘regardless of what the world is like’ (I think this kind of thing is sometimes e.g. used in davidad’s and Bengio’s plans https://www.lesswrong.com/posts/pKSmEkSQJsCSTK6nH/an-open-agency-architecture-for-safe-transformative-ai?commentId=ZuWsoXApJqD4PwfXr , https://www.youtube.com/watch?v=31eO_KfkjRQ&t=1946s ). In summary, my inside view says this kind of knightian thing is a complete non-starter. But outside-view, I’d guess that at least some people that like infrabayesianism have some response to this which would make me view it at least slightly more favorably. (Well, I’ve only stated the claim and not really provided the argument I have in mind, but that would take a few paragraphs I guess, and I won’t provide it in this comment.)
To add: it seems basically confused to talk about the probability distribution on probabilities or probability distributions, as opposed to some joint distribution on two variables or a probability distribution on probability distributions or something. It seems similarly ‘philosophically problematic’ to talk about the set of probability distributions, to decide in a way that depends a lot on how uncertainty gets ‘partitioned’ into the set vs the distributions. (I wrote about this kind of thing a bit more here: https://forum.effectivealtruism.org/posts/Z7r83zrSXcis6ymKo/dissolving-ai-risk-parameter-uncertainty-in-ai-future#vJg6BPpsG93iyd7zo .)
I think it’s plausible there’s some (as-of-yet-undeveloped) good version of probabilistic thinking+decision-making for less-than-ideal agents that departs from canonical bayesian expected utility maximization; I like approaches to finding such a thing that take aspects of existing messy real-life (probabilistic) thinking seriously but also aim to define a precise formal setup in which some optimality result could be proved. I have some very preliminary thoughts on this and a feeling that it won’t look at all like the stuff I’ve discussed disliking above. Logical induction ( https://arxiv.org/abs/1609.03543 ) seems cool; a heuristic estimator ( https://arxiv.org/pdf/2211.06738 ) would be cool. That said, I also assign significant probability to nothing very nice being possible here (this vaguely relates to the claim: “while there’s a single ideal rationality, there are many meaningfully distinct bounded rationalities” (I’m forgetting whom I should attribute this to)).
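To make the "basically everything becomes permissible" worry from the list above a bit more concrete, here is a small sketch. It assumes the common reading of maximality (an action is permissible unless some other action has strictly higher expected utility under every distribution in the set), a two-state world, and a set of distributions given by an interval for the probability of the first state. The payoffs and the helper names are hypothetical.

```python
def expected_utility(dist, payoffs):
    """Expected utility of per-state payoffs under a distribution over states."""
    return sum(p * u for p, u in zip(dist, payoffs))


def permissible(actions, credal_set):
    """Actions not strictly beaten by a single rival under every distribution in the set."""
    result = []
    for name, payoffs in actions.items():
        dominated = any(
            all(expected_utility(d, rival) > expected_utility(d, payoffs) for d in credal_set)
            for rival_name, rival in actions.items()
            if rival_name != name
        )
        if not dominated:
            result.append(name)
    return result


def interval(lo, hi):
    """Endpoints of {P(state 0) in [lo, hi]}; expected utility is linear in P(state 0),
    so checking strict dominance at the two endpoints covers the whole interval."""
    return [(lo, 1.0 - lo), (hi, 1.0 - hi)]


# Hypothetical payoffs (state 0, state 1) for four actions.
actions = {"a": (10.0, 0.0), "b": (0.0, 10.0), "c": (4.0, 4.0), "d": (6.0, 3.0)}

print(permissible(actions, interval(0.7, 0.7)))  # precise probabilities
print(permissible(actions, interval(0.6, 0.8)))  # mildly imprecise
print(permissible(actions, interval(0.2, 0.8)))  # meaningfully imprecise
```

With a single distribution the rule reduces to picking the expected utility maximizer, matching the special-case claim in the appendix. Once the interval is wide, all four actions come out permissible, including "c", which is not the expected utility maximizer under any single distribution in the set.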
I think most of the quantitative claims in the current version of the above comment are false/nonsense/[using terms non-standardly]. (Caveat: I only skimmed the original post.)
“if your first vector has cosine similarity 0.6 with d, then to be orthogonal to the first vector but still high cosine similarity with d, it’s easier if you have a larger magnitude”
If by ‘cosine similarity’ you mean what’s usually meant, which I take to be the cosine of the angle between two vectors, then the cosine only depends on the directions of vectors, not their magnitudes. (Some parts of your comment look like you meant to say ‘dot product’/‘projection’ when you said ‘cosine similarity’, but I don’t think making this substitution everywhere makes things make sense overall either.)
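A quick numerical illustration of the point, using only the standard library (the vectors are arbitrary examples): rescaling a vector changes its dot product with d but not its cosine with d.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def cosine(u, v):
    """Cosine of the angle between u and v."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))


d = (1.0, 0.0, 0.0)
v = (0.6, 0.8, 0.0)
v_big = tuple(100.0 * x for x in v)  # same direction, 100x the magnitude

print(cosine(v, d), cosine(v_big, d))  # equal up to rounding
print(dot(v, d), dot(v_big, d))        # these differ by the factor of 100
```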
“then your method finds things which have cosine similarity ~0.3 with d (which maybe is enough for steering the model for something very common, like code), then the number of orthogonal vectors you will find is huge as long as you never pick a single vector that has cosine similarity very close to 1”
For 0.3 in particular, the number of orthogonal vectors with at least that cosine with a given vector d is actually small. Assuming I calculated correctly, the number of e.g. pairwise-dot-prod-less-than-0.01 unit vectors with that cosine with a given vector is at most 12 (the ambient dimension does not show up in this upper bound). I provide the calculation later in my comment.
“More formally, if theta0 = alpha0 d + (1 - alpha0) noise0, where d is a unit vector, and alpha0 = cosine(theta0, d), then for theta1 to have alpha1 cosine similarity while being orthogonal, you need alpha0alpha1 + <noise0, noise1>(1-alpha0)(1-alpha1) = 0, which is very easy to achieve if alpha0 = 0.6 and alpha1 = 0.3, especially if nosie1 has a big magnitude.”
This doesn’t make sense. For alpha1 to be cos(theta1, d), you can’t freely choose the magnitude of noise1.
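To spell this out, here is a short worked version under a normalization that is my assumption rather than the quoted comment's: take theta0 and theta1 to be unit vectors and split each into its component along d plus a unit vector n_i orthogonal to d.

```latex
\begin{align*}
\theta_i &= \alpha_i d + \sqrt{1-\alpha_i^2}\, n_i,
  \qquad \|n_i\| = 1,\quad n_i \perp d,\quad \alpha_i = \cos(\theta_i, d), \\
\langle \theta_0, \theta_1 \rangle
  &= \alpha_0 \alpha_1 + \sqrt{(1-\alpha_0^2)(1-\alpha_1^2)}\, \langle n_0, n_1 \rangle, \\
\langle \theta_0, \theta_1 \rangle = 0
  &\iff \langle n_0, n_1 \rangle = -\frac{\alpha_0 \alpha_1}{\sqrt{(1-\alpha_0^2)(1-\alpha_1^2)}}, \\
\alpha_0 = 0.6,\ \alpha_1 = 0.3
  &\implies \langle n_0, n_1 \rangle = -\frac{0.18}{0.8 \cdot \sqrt{0.91}} \approx -0.236 .
\end{align*}
```

So once the cosines are fixed, the size of the component orthogonal to d is fixed too (it is the square root of 1 - alpha_i^2 for a unit vector). Rescaling theta1 as a whole scales both components together and changes neither alpha1 nor whether theta1 is orthogonal to theta0.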
How many nearly-orthogonal vectors can you fit in a spherical cap?
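Proposition. Let $d \in \mathbb{R}^n$ be a unit vector and let $v_1, \ldots, v_N \in \mathbb{R}^n$ also be unit vectors such that they all sorta point in the $d$ direction, i.e., $v_i \cdot d \geq \delta$ for a constant $\delta > 0$ (I take you to have taken $\delta = 0.3$), and such that the $v_i$ are nearly orthogonal, i.e., $|v_i \cdot v_j| \leq \epsilon$ for all $i \neq j$, for another constant $\epsilon > 0$. Assume also that $\epsilon < \delta^2$. Then $N \leq 1 + \frac{1-\delta^2}{\delta^2-\epsilon}$.
Proof. We can decompose $v_i = \alpha_i d + \sqrt{1-\alpha_i^2}\, u_i$, with $u_i$ a unit vector orthogonal to $d$; then $\alpha_i = v_i \cdot d \geq \delta$. Given $\epsilon < \delta^2$, it’s a 3d geometry exercise to show that pushing all vectors to the boundary of the spherical cap around $d$ can only decrease each pairwise dot product; doing this gives a new collection of unit vectors $v_i' = \delta d + \sqrt{1-\delta^2}\, u_i$, still with $v_i' \cdot v_j' \leq \epsilon$ for all $i \neq j$. This implies that $u_i \cdot u_j \leq -\frac{\delta^2-\epsilon}{1-\delta^2}$. Note that since $\epsilon < \delta^2$, the RHS is some negative constant. Consider $\left\| \sum_i u_i \right\|^2$. On the one hand, it has to be nonnegative. On the other hand, expanding it, we get that it’s at most $N - N(N-1)\frac{\delta^2-\epsilon}{1-\delta^2}$. From this, $N - 1 \leq \frac{1-\delta^2}{\delta^2-\epsilon}$, whence $N \leq 1 + \frac{1-\delta^2}{\delta^2-\epsilon}$.
(acknowledgements: I learned this from some combination of Dmitry Vaintrob and https://mathoverflow.net/questions/24864/almost-orthogonal-vectors/24887#24887 )
For example, for $\delta = 0.3$ and $\epsilon = 0.01$, this gives $N \leq 12$.
(I believe this upper bound for the number of almost-orthogonal vectors is actually basically exactly met in sufficiently high dimensions — I can probably provide a proof (sketch) if anyone expresses interest.)
Remark. If $\epsilon > \delta^2$, then one starts to get exponentially many vectors in the dimension again, as one can see by picking a bunch of random vectors on the boundary of the spherical cap.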
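A small numerical sanity check of this section, as a sketch rather than a proof. It assumes the bound reads N <= 1 + (1 - delta^2) / (delta^2 - eps), which is what the proof sketch above yields when written out, with delta the cosine threshold and eps the cap on pairwise dot products. It also builds one explicit configuration (my choice of construction, a regular simplex placed on the boundary of the cap) to see how close the bound is to being attained for delta = 0.3 and eps = 0.01.

```python
import math


def cap_bound(delta, eps):
    """Largest integer N allowed by N <= 1 + (1 - delta^2) / (delta^2 - eps)."""
    if not eps < delta ** 2:
        raise ValueError("the bound needs eps < delta^2")
    return math.floor(1 + (1 - delta ** 2) / (delta ** 2 - eps))


def simplex_cap_vectors(n, delta):
    """n unit vectors in R^(n+1), each with dot product delta with d = e_0,
    whose components orthogonal to d are the vertices of a regular simplex."""
    scale = math.sqrt(1 - delta ** 2)
    norm = math.sqrt(1 - 1 / n)  # length of (e_i - centroid) in R^n
    vectors = []
    for i in range(n):
        u = [((1.0 if j == i else 0.0) - 1.0 / n) / norm for j in range(n)]
        vectors.append([delta] + [scale * x for x in u])
    return vectors


def max_pairwise_dot(vectors):
    return max(
        sum(a * b for a, b in zip(vectors[i], vectors[j]))
        for i in range(len(vectors))
        for j in range(i + 1, len(vectors))
    )


print(cap_bound(0.3, 0.01))                             # 12
print(max_pairwise_dot(simplex_cap_vectors(12, 0.3)))   # about 0.0073, within eps
print(max_pairwise_dot(simplex_cap_vectors(13, 0.3)))   # about 0.0142, exceeds eps
```

For these particular values the simplex construction fits exactly 12 vectors and fails at 13, consistent with the bound being tight here once the dimension is large enough to hold the simplex.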
What about the philosophical point? (low-quality section)
Ok, the math seems to have issues, but does the philosophical point stand up to scrutiny? Idk, maybe — I haven’t really read the post to check relevant numbers or to extract all the pertinent bits to answer this well. It’s possible it goes through with a significantly smaller cosine threshold $\delta$ or if the vectors weren’t really that orthogonal or something. (To give a better answer, the first thing I’d try to understand is whether this behavior is basically first-order — more precisely, is there some reasonable loss function on perturbations on the relevant activation space which captures perturbations being coding perturbations, and are all of these vectors first-order perturbations toward coding in this sense? If the answer is yes, then there just has to be such a vector — it’d just be the gradient of this loss.)
how many times did the explanation just “work out” for no apparent reason
From the examples later in your post, it seems like it might be clearer to say something more like “how many things need to hold about the circuit for the explanation to describe the circuit”? More precisely, I’m objecting to your “how many times” because it could plausibly mean “on how many inputs” which I don’t think is what you mean, and I’m objecting to your “for no apparent reason” because I don’t see what it would mean for an explanation to hold for a reason in this case.
a few thoughts on hyperparams for a better learning theory (for understanding what happens when a neural net is trained with gradient descent)
Having found myself repeating the same points/claims in various conversations about what NN learning is like (especially around singular learning theory), I figured it’s worth writing some of them down. My typical confidence in a claim below is like 95%[1]. I’m not claiming anything here is significantly novel. The claims/points:
local learning (eg gradient descent) strongly does not find global optima. insofar as running a local learning process from many seeds produces outputs with ‘similar’ (train or test) losses, that’s a law of large numbers phenomenon[2], not a consequence of always finding the optimal neural net weights.[3][4]
if your method can’t produce better weights: were you trying to produce better weights by running gradient descent from a bunch of different starting points? getting similar losses this way is a LLN phenomenon
maybe this is a crisp way to see a counterexample instead: train, then identify a ‘lottery ticket’ subnetwork after training like done in that literature. now get rid of all other edges in the network, and retrain that subnetwork either from the previous initialization or from a new initialization — i think this literature says that you get a much worse loss in the latter case. so training from a random initialization here gives a much worse loss than possible
dynamics (kinetics) matter(s). the probability of getting to a particular training endpoint is highly dependent not just on stuff that is evident from the neighborhood of that point, but on there being a way to make those structures incrementally, ie by a sequence of local moves each of which is individually useful.[5][6][7] i think that this is not an academic correction, but a major one — the structures found in practice are very massively those with sensible paths into them and not other (naively) similarly complex structures. some stuff to consider:
the human eye evolving via a bunch of individually sensible steps, https://en.wikipedia.org/wiki/Evolution_of_the_eye
(given a toy setup and in a certain limit,) the hardness of learning a boolean function being characterized by its leap complexity, ie the size of the ‘largest step’ between its fourier terms, https://arxiv.org/pdf/2302.11055
imagine a loss function on a plane which has a crater somewhere and another crater with a valley descending into it somewhere else. the local neighborhoods of the deepest points of the two craters can look the same, but the crater with a valley descending into it will have a massively larger drainage basin. to say more: the crater with a valley is a case where it is first loss-decreasing to build one simple thing (ie in this case to fix the value of one parameter), and once you’ve done that, loss-decreasing to build another simple thing (ie in this case to fix the value of another parameter); getting to the isolated crater is more like having to build two things at once. i think that with a reasonable way to make things precise, the drainage basin of a ‘k-parameter structure’ with no valley descending into it will be exponentially smaller than that of eg a ‘k-parameter structure’ with ‘a k/2-parameter valley’ descending into it, which will be exponentially smaller still than a ‘k-parameter structure’ with a sequence of valleys of slowly increasing dimension descending into it
it seems plausible to me that the right way to think about stuff will end up revealing that in practice there are basically only systems of steps where a single [very small thing]/parameter gets developed/fixed at a time
i’m further guessing that most structures basically have ‘one way’ to descend into them (tho if you consider sufficiently different structures to be the same, then this can be false, like in examples of convergent evolution) and that it’s nice to think of the probability of finding the structure as the product over steps of the probability of making the right choice on that step (of falling in the right part of a partition determining which next thing gets built)
one correction/addition to the above is that it’s probably good to see things in terms of there being many ‘independent’ structures/circuits being formed in parallel, creating some kind of ecology of different structures/circuits. maybe it makes sense to track the ‘effective loss’ created for a structure/circuit by the global loss (typically including weight norm) together with the other structures present at a time? (or can other structures do sufficiently orthogonal things that it’s fine to ignore this correction in some cases?) maybe it’s possible to have structures which were initially independent be combined into larger structures?[8]
everything is a loss phenomenon. if something is ever a something-else phenomenon, that’s logically downstream of a relation between that other thing and loss (but this isn’t to say you shouldn’t be trying to find these other nice things related to loss)
grokking happens basically only in the presence of weight regularization, and it has to do with there being slower structures to form which are eventually more efficient at making logits high (ie more logit bang for weight norm buck)
in the usual case that generalization starts to happen immediately, this has to do with generalizing structures being stronger attractors even at initialization. one consideration at play here is that
nothing interesting ever happens during a random walk on a loss min surface
it’s not clear that i’m conceiving of structures/circuits correctly/well in the above. i think it would help to have a library of like >10 well-understood toy models (as opposed to like the maybe 1.3 we have now), and to be very closely guided by them when developing an understanding of neural net learning
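The crater-with-a-valley picture from the list above can be poked at numerically. The following is a toy sketch under assumptions of my own choosing: a specific 2d loss with an isolated Gaussian crater at (-2, -2) and an equally deep crater at (2, 2) that sits at the bottom of a valley running along the line x = 2, plain gradient descent with a fixed step size and step budget, and a coarse grid of starting points on [-4, 4]^2. It does not try to make the two minima's neighborhoods identical; it only counts how many starting points drain into each crater.

```python
import math

S2 = 0.25  # squared width of the craters and of the valley walls (arbitrary choice)


def grad(x, y):
    """Gradient of
    loss(x, y) = -exp(-((x+2)^2 + (y+2)^2) / (2*S2))
                 - exp(-(x-2)^2 / (2*S2)) * (1 - 0.02*(y-2)^2)."""
    # isolated crater centred at (-2, -2)
    iso = math.exp(-((x + 2) ** 2 + (y + 2) ** 2) / (2 * S2))
    gx = iso * (x + 2) / S2
    gy = iso * (y + 2) / S2
    # valley along x = 2 whose floor slopes gently down toward y = 2
    wall = math.exp(-((x - 2) ** 2) / (2 * S2))
    depth = 1 - 0.02 * (y - 2) ** 2
    gx += wall * depth * (x - 2) / S2
    gy += wall * 0.04 * (y - 2)
    return gx, gy


def descend(x, y, lr=0.1, steps=3000):
    """Plain gradient descent with a fixed learning rate."""
    for _ in range(steps):
        gx, gy = grad(x, y)
        x -= lr * gx
        y -= lr * gy
    return x, y


def basin_counts(grid_step=0.5, radius=0.3):
    """Count grid starting points that end within `radius` of each crater's minimum."""
    counts = {"isolated": 0, "valley": 0, "neither": 0}
    n = int(round(8 / grid_step)) + 1
    for i in range(n):
        for j in range(n):
            x, y = descend(-4 + i * grid_step, -4 + j * grid_step)
            if math.hypot(x + 2, y + 2) < radius:
                counts["isolated"] += 1
            elif math.hypot(x - 2, y - 2) < radius:
                counts["valley"] += 1
            else:
                counts["neither"] += 1
    return counts


counts = basin_counts()
print(counts)
```

In this setup the starting points that reach the valley crater form a strip spanning the whole domain (first x gets fixed, then y), while the ones reaching the isolated crater form a disk, so the valley count comes out larger. The exact counts depend on the arbitrary step budget and grid, so they shouldn't be read as meaningful in themselves.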
some related (more meta) thoughts
to do interesting/useful work in learning theory (as of 2024), imo it matters a lot that you think hard about phenomena of interest and try to build theory which lets you make sense of them, as opposed to holding fast to an existing formalism and trying to develop it further / articulate it better / see phenomena in terms of it
this is somewhat downstream of current formalisms imo being bad, it imo being appropriate to think of them more as capturing preliminary toy cases, not as revealing profound things about the phenomena of interest, and imo it being feasible to do better
but what makes sense to do can depend on the person, and it’s also fine to just want to do math lol
and it’s certainly very helpful to know a bunch of math, because that gives you a library in terms of which to build an understanding of phenomena
it’s imo especially great if you’re picking phenomena to be interested in with the future going well around ai in mind
(* but it looks to me like learning theory is unfortunately hard to make relevant to ai alignment[9])
acknowledgments
these thoughts are sorta joint with Jake Mendel and Dmitry Vaintrob (though i’m making no claim about whether they’d endorse the claims). also thank u for discussions: Sam Eisenstat, Clem von Stengel, Lucius Bushnaq, Zach Furman, Alexander Gietelink Oldenziel, Kirke Joamets
with the important caveat that, especially for claims involving ‘circuits’/‘structures’, I think it’s plausible they are made in a frame which will soon be superseded or at least significantly improved/clarified/better-articulated, so it’s a 95% given a frame which is probably silly
train loss in very overparametrized cases is an exception. in this case it might be interesting to note that optima will also be off at infinity if you’re using cross-entropy loss, https://arxiv.org/pdf/2006.06657
also, gradient descent is very far from doing optimal learning in some solomonoff sense — though it can be fruitful to try to draw analogies between the two — and it is also very far from being the best possible practical learning algorithm
by it being a law of large numbers phenomenon, i mean sth like: there are a bunch of structures/circuits/pattern-completers that could be learned, and each one gets learned with a certain probability (or maybe a roughly given total number of these structures gets learned), and loss is roughly some aggregation of indicators for whether each structure gets learned — an aggregation to which the law of large numbers applies
to say more: any concept/thinking-structure in general has to be invented somehow — there in some sense has to be a ‘sensible path’ to that concept — but any local learning process is much more limited than that still — now we’re forced to have a path in some (naively seen) space of possible concepts/thinking-structures, which is a major restriction. eg you might find the right definition in mathematics by looking for a thing satisfying certain constraints (eg you might want the definition to fit into theorems characterizing something you want to characterize), and many such definitions will not be findable by doing sth like gradient descent on definitions
ok, (given an architecture and a loss,) technically each point in the loss landscape will in fact have a different local neighborhood, so in some sense we know that the probability of getting to a point is a function of its neighborhood alone, but what i’m claiming is that it is not nicely/usefully a function of its neighborhood alone. to the extent that stuff about this probability can be nicely deduced from some aspect of the neighborhood, that’s probably ‘logically downstream’ of that aspect of the neighborhood implying something about nice paths to the point.
also note that the points one ends up at in LLM training are not local minima — LLMs aren’t trained to convergence
i think identifying and very clearly understanding any toy example where this shows up would plausibly be better than anything else published in interp this year. the leap complexity paper does something a bit like this but doesn’t really do this
i feel like i should clarify here though that i think basically all existing alignment research fails to relate much to ai alignment. but then i feel like i should further clarify that i think each particular thing sucks at relating to alignment after having thought about how that particular thing could help, not (directly) from some general vague sense of pessimism. i should also say that if i didn’t think interp sucked at relating to alignment, i’d think learning theory sucks less at relating to alignment (ie, not less than interp but less than i currently think it does). but then i feel like i should further say that fortunately you can just think about whether learning theory relates to alignment directly yourself :)
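A toy version of the law-of-large-numbers picture from the footnote above, as a sketch under made-up assumptions: 1000 independent 'structures', each learned with probability 0.5 on a given seed, and a loss equal to the fraction of structures that were not learned. None of these numbers come from any experiment.

```python
import random


def final_loss(rng, n_structures=1000, p_learn=0.5):
    """Loss as the fraction of structures that did not get learned on this run."""
    learned = sum(rng.random() < p_learn for _ in range(n_structures))
    return 1.0 - learned / n_structures


losses = [final_loss(random.Random(seed)) for seed in range(20)]
spread = max(losses) - min(losses)
best_possible = 0.0  # the loss if every structure had been learned

print(min(losses), max(losses), spread, best_possible)
```

Across seeds the losses land close to one another, yet all of them sit far above the best possible loss in this toy. That is the sense in which similar losses across seeds need not indicate that an optimum is being found.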