That’s very cool, maybe I should try to do that for important talks. Though I suppose you almost always have slides as an aid, so it may not be worth the time investment.
Adrià Garriga-alonso
Maybe being a guslar is not so different from telling a joke 2294 lines long
That’s a very good point! I think the level of ability required is different but it seems right.
The guslar’s songs are (and were of course already in the 1930-1950s) also printed, so the analogy may be closer than you thought.
Is there a reason I should want to?
I don’t know, I can’t tell you that. If I had to choose I also strongly prefer literacy.
But I didn’t know there was a tradeoff there! I thought literacy was basically unambiguously positive—whereas now I think it is net highly positive.
Also I strongly agree with frontier64 that the skill that is lost is rough memorization + live composition, which is a little different.
It’s definitely not exact memorization, but it’s almost more impressive than that, it’s rough memorization + composition to fit the format.
They memorize the story, with particular names; and then sing it with consistent decasyllabic metre and rhyme. Here’s an example song transcribed with its recording: Ropstvo Janković Stojana (The Captivity of Janković Stojan)
the collection: https://mpc.chs.harvard.edu/lord-collection-1950-51/
Does literacy remove your ability to be a bard as good as Homer?
Folks generally don’t need polyamory to enjoy this benefit, but I’m glad you get it from that!
If you’re still interested in this, we have now added Appendix N to the paper, which explains our final take.
Sure, but then why not just train a probe? If we don’t care about much precision what goes wrong with the probe approach?
Here’s a reasonable example where naively training a probe fails. The model lies if any of N features is “true”. One of the features is almost always activated at the same time as some others, such that in the training set it never solely determines whether the model lies.
Then, a probe trained on the activations may not pick up on that feature. Whereas if we can look at model weights, we can see that this feature also matters, and include it in our lying classifier.
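To make this failure mode concrete, here is a minimal sketch with synthetic “activations” (the feature setup, data, and logistic-regression probe are all my own illustration, not anything from the paper): in training, the last feature only ever fires together with the first one, so the probe can fit the labels while placing little weight on it.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
N_FEATURES = 5  # the model "lies" iff any of these features is active

def make_data(n, last_feature_independent):
    X = rng.integers(0, 2, size=(n, N_FEATURES))
    if not last_feature_independent:
        # In training, the last feature only fires together with feature 0,
        # so it never solely determines whether the model lies.
        X[:, -1] &= X[:, 0]
    y = X.any(axis=1).astype(int)  # lie if any feature is active
    return X, y

X_train, y_train = make_data(2000, last_feature_independent=False)
X_test, y_test = make_data(2000, last_feature_independent=True)

probe = LogisticRegression().fit(X_train, y_train)
print("probe weights:", probe.coef_.round(2))  # last feature gets much less weight
print("train accuracy:", probe.score(X_train, y_train))

# The interesting case: test inputs where *only* the last feature is active.
only_last = (X_test[:, :-1].sum(axis=1) == 0) & (X_test[:, -1] == 1)
if only_last.any():
    print("accuracy when only the last feature is active:",
          probe.score(X_test[only_last], y_test[only_last]))
```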
This particular case can also be solved by adversarially attacking the probe though.
Thomas Kwa’s research journal
Thank you, that makes sense!
Indefinite integrals would make a lot more sense this way, IMO
Why so? I thought they already made sense: they’re “antiderivatives”, i.e. a function such that taking its derivative gives you the original function. Do you need anything further to define them?
(I know about the Riemann and Lebesgue definitions of the definite integral, but I thought indefinite integrals were much easier in comparison.)
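For reference, the definition I have in mind is just the standard one:

```latex
% An antiderivative F of f satisfies F' = f; the indefinite integral is the
% whole family of antiderivatives, unique up to an additive constant.
\int f(x)\,dx = F(x) + C \quad \text{where } F'(x) = f(x),
\qquad \text{e.g. } \int 2x\,dx = x^2 + C .
```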
In such a case, I claim this is just sneaking in Bayes’ rule without calling it by name, and this is not a very smart thing to do, because the Bayesian frame gives you a bunch more leverage on analyzing the system.
I disagree. An inductive bias is not necessarily a prior distribution. What’s the prior?
I don’t think I understand your model of why neural networks are so effective. It sounds like you’re saying that, on the one hand, neural networks have lots of parameters, so you should expect them to be terrible; but on the other hand they are actually very good because SGD is such a shitty optimizer that it acts as an implicit regularizer.
Yeah, that’s basically my model. How it regularizes I don’t know. Perhaps the volume of “simple” functions is the main driver of this, rather than gradient descent dynamics. I think the randomness of it is important; full-gradient descent (no stochasticity) would not work nearly as well.
This seems false if you’re interacting with a computable universe, and don’t need to model yourself or copies of yourself
Reasonable people disagree. Why should I care about the “limit of large data” instead of finite-data performance?
OK, let’s look through the papers you linked.
This one is interesting. It argues that the regularization properties are not in SGD, but rather in the NN parameterization, and that non-gradient optimizers also find simple solutions which generalize well. They talk about Bayes only in a paragraph on page 3. They say that literature that argues that NNs work well because they’re Bayesian is related (which is true—it’s also about generalization and volumes). But I see little evidence that the explanation in this paper is an appeal to Bayesian thinking. A simple question for you: what prior distribution do the NNs have, according to the findings in this paper?
This paper finds that the probability that SGD finds a function is correlated with the posterior probability of a Gaussian process conditioned on the same data. Except that if you use that same Gaussian process to make predictions, it does not work as well as the NN. So you can’t explain the NN working well by appealing to its similarity to this particular Bayesian posterior.
SLT; “Dynamical versus Bayesian Phase Transitions in a Toy Model of Superposition”
I have many problems with SLT and a proper comment will take me a couple extra hours. But also I could come away thinking that it’s basically correct, so maybe this is the one.
In short, the probability distribution you choose contains lots of interesting assumptions about which states are more likely, assumptions you didn’t necessarily intend. As a result, most of the possible hypotheses have vanishingly small prior probability and you can never reach them, even though a frequentist approach would keep them on the table.
For example, let us consider trying to learn a function with 1-dimensional numerical input and output (e.g. $f : \mathbb{R} \to \mathbb{R}$). Correspondingly, your hypothesis space is the set of all such functions. There are very many functions (infinitely many if the input ranges over $\mathbb{R}$, otherwise a crazy number).
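To put a number on “a crazy number”, here is the count for a discretized version (my own illustrative arithmetic, assuming 32-bit inputs and outputs):

```latex
% Functions from a finite set A to a finite set B number |B|^{|A|}.
% With 32-bit inputs and outputs, |A| = |B| = 2^{32}, so there are
\left(2^{32}\right)^{2^{32}} = 2^{\,32 \cdot 2^{32}} \approx 2^{\,1.4 \times 10^{11}}
% possible functions; with real-valued inputs the set is uncountably infinite.
```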
You could use the Solomonoff prior (on a discretized version of this), but that way lies madness. It’s uncomputable, and most of the functions that fit the data may contain agents that try to get you to do their bidding, all sorts of problems.
What other prior probability distribution can we place on the hypothesis space? The obvious choice in 2023 is a neural network with random weights. OK, let’s think about that. What architecture? The most sensible thing is to randomize over architectures somehow. Let’s hope the distribution on architectures is as simple as possible.
How wide, how deep? You don’t want to choose an arbitrary distribution or (god forbid) an arbitrary number, so let’s make it infinitely wide and deep! It turns out that an infinitely wide network just collapses to a random process without any internal features. You could try an infinitely deep network instead, but that collapses to a stationary distribution which doesn’t depend on the input. Oops.
Okay, let’s give up and place some arbitrary distribution (e.g. geometric distribution) on the width.
What about the prior on weights? uh idk, zero-mean identity covariance Gaussian? Our best evidence says that this sucks.
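To see what prior over functions all these choices actually induce, here is a minimal sketch (my own illustration; the architecture, width, depth, and weight scale are arbitrary, which is exactly the point): sample a few MLPs with zero-mean Gaussian weights and look at the functions they define.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_mlp_function(width=64, depth=3, weight_std=1.0):
    """Draw one function f: R -> R from the 'prior' induced by random Gaussian weights."""
    sizes = [1] + [width] * depth + [1]
    # Zero-mean Gaussian weights; 1/sqrt(fan_in) scaling keeps activations O(1).
    weights = [rng.normal(0.0, weight_std / np.sqrt(m), size=(m, n))
               for m, n in zip(sizes[:-1], sizes[1:])]

    def f(x):
        h = np.asarray(x, dtype=float).reshape(-1, 1)
        for W in weights[:-1]:
            h = np.tanh(h @ W)
        return (h @ weights[-1]).ravel()

    return f

xs = np.linspace(-3, 3, 7)
for i in range(3):
    f = sample_mlp_function()
    print(f"prior sample {i}:", np.round(f(xs), 2))
# Every choice above (architecture, width, depth, weight scale, tanh) changes which
# functions are a priori likely -- that is the prior we never wrote down explicitly.
```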
At this point you’ve made so many choices, which have to be informed by what empirically works well, that it’s a strange Bayesian reasoner you end up with. And you haven’t even specified your prior distribution yet.
I just remembered the main way in which NNs are frequentist. They belong to a very illustrious family of frequentist estimators: the maximum likelihood estimators.
Think about it: NNs have a bunch of parameters. Their loss is basically always a negative log-likelihood $-\log p_\theta(\text{data})$ (e.g. mean-squared error for a Gaussian $p$, cross-entropy for a categorical $p$). They get trained by minimizing the loss (i.e. maximizing the likelihood).
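Spelling out the Gaussian case (standard algebra; a fixed noise scale $\sigma$ is assumed):

```latex
% Negative log-likelihood of y under a Gaussian centered at the network output:
-\log p_\theta(y \mid x)
  = -\log \mathcal{N}\!\bigl(y;\, f_\theta(x),\, \sigma^2\bigr)
  = \frac{1}{2\sigma^2}\bigl(y - f_\theta(x)\bigr)^2 + \tfrac{1}{2}\log\bigl(2\pi\sigma^2\bigr),
% so minimizing mean-squared error over \theta is exactly maximizing the likelihood;
% the same computation with a categorical p gives the cross-entropy loss.
```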
In classical frequentist analysis they’re likely to be a terrible, overfitted estimator, because they have many parameters. And I think this is true if you find the parameters $\theta^*$ that actually maximize the likelihood.
But SGD is kind of a shitty optimizer. It turns out the two mistakes cancel out, and NNs are very effective.
First, “probability is in the world” is an oversimplification. Quoting from Wikipedia, “probabilities are discussed only when dealing with well-defined random experiments”. Since most things in the world are not well-defined random experiments, probability is reduced to a theoretical tool for analyzing things that works when real processes are similar enough to well-defined random experiments.
it doesn’t seem to trump the “but that just sounds really absurd to me though” consideration
Is there anything that could trump that consideration? One of my main objections to Bayesianism is that it prescribes that an ideal agent’s beliefs must be probability distributions, which sounds even more absurd to me.
first at least seems pretty subjectivist to me,
Estimators in frequentism have ‘subjective beliefs’, in the sense that their output/recommendations depends on the evidence they’ve seen (i.e., the particular sample that’s input into it). The objectivity of frequentist methods is aspirational: the ‘goodness’ of an estimator is decided by how good it is in all possible worlds. (Often the estimator which is best in the least convenient world is preferred, but sometimes that isn’t known or doesn’t exist. Different estimators will be better in some worlds than others, and tough choices must be made, for which the theory mostly just gives up. See e.g. “Evaluating estimators”, Section 7.3 of “Statistical Inference” by Casella and Berger).
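In symbols (standard definitions, roughly following Casella and Berger’s notation):

```latex
% Risk of an estimator \hat{\theta}: its expected loss in the world where \theta is true.
R(\theta, \hat{\theta}) = \mathbb{E}_{X \sim P_\theta}\!\left[\, L\bigl(\theta, \hat{\theta}(X)\bigr) \,\right]
% "Good in all possible worlds" means comparing the whole risk function over \theta.
% Preferring the estimator that is best in the least convenient world is the minimax choice:
\hat{\theta}_{\mathrm{minimax}} = \arg\min_{\hat{\theta}} \; \sup_{\theta} \; R(\theta, \hat{\theta}) .
```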
wouldn’t a frequentist think the probability of logical statements, being the most deterministic system, should have only 1 or 0 probabilities?
Indeed, in reality logical statements are either true or false, and thus their probabilities are either 1 or 0. But the estimator-algorithm is free to assign whatever belief it wants to it.
I agree that logical induction is very much Bayesianism-inspired, precisely because it wants to assign weights from zero to 1 that are as self-consistent as possible (i.e. basically probabilities) to statements. But it is frequentist in the sense that it’s examining “unconditional” properties of the algorithm, as opposed to properties assuming the prior distribution is true. (It can’t do the latter because, as you point out, the prior probability of logical statements is just 0 or 1).
But also, assigning probabilities of 0 or 1 to things is not exclusively a Bayesian thing. You could think of a predictor that outputs numbers between 0 and 1 as an estimator of whether a statement will be true or false. If you were to evaluate this estimator you could choose, say, mean-squared error. The best estimator is the one with the least MSE. And indeed, that’s how probabilistic forecasts are typically evaluated.
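Concretely, that evaluation is the Brier score: the mean-squared error of the forecast probabilities against the 0/1 outcomes.

```latex
% Forecasts p_i \in [0, 1], realized outcomes o_i \in \{0, 1\}:
\mathrm{Brier} = \frac{1}{n} \sum_{i=1}^{n} \bigl(p_i - o_i\bigr)^2
% The forecaster with the lowest score is the better estimator; no prior is needed.
```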
Daniel states he considers these frequentist because:
I call logical induction and boundedly rational inductive agents ‘frequentist’ because they fall into the family of “have a ton of ‘experts’ and play them off against each other” (and crucially, don’t constrain those experts to be ‘rational’ according to some a priori theory of good reasoning).
and I think indeed not prescribing that things must think in probabilities is more of a frequentist thing. I’m not sure I’d call them decidedly frequentist (logical induction is very much a different beast than classical statistics) but they’re not in the other camp either.
They don’t seem like a success of any statistical theory to me
In absolute terms you’re correct. In relative terms, they’re an object that at least frequentist theory can begin to analyze (as you point out, statistical learning theory did, somewhat unsuccessfully).
Whereas Bayesian theory would throw up its hands and say it’s not a prior that gets updated, so it’s not worth considering as a statistical estimator. This seems even wronger.
More recent theory can account for them working, somewhat. But it’s about analyzing their properties as estimators (i.e. frequentism) as opposed to framing them in terms of prior/posterior (though there are plenty of attempts at the latter going around).
That’s a lot of things done, congratulations!