tgb

Karma: 1,854

tgb Mar 26, 2025, 6:26 PM
2 points
0
in reply to: TsviBT’s comment on: Tabula Bio: towards a future free of disease (& looking for collaborators)
I’m not as concerned about your points because there are a number of projects already doing something similar and (if you believe them) succeeding at it. Here’s a paper comparing some of them: https://www.biorxiv.org/content/10.1101/2025.02.11.637758v2.full

tgb Mar 26, 2025, 2:53 PM
2 points
0
in reply to: Yejun Y.’s comment on: Tabula Bio: towards a future free of disease (& looking for collaborators)
ML arguments can take more data as input. In particular, the genomic sequence is not a predictor used in LASSO regression models: the variants are just arbitrarily coded as 0,1, or 2 alternative allele count. The LASSO models have limited ability to pool information across variants or across data modes. ML models like this one can (in theory) predict effects of variants just based off their sequence on data like RNA-sequencing (which shows which genes are actively being transcribed). That information is effectively pooled across variants and ties genomic sequence to another data type (RNA-seq). If you include that information into a disease-effect prediction model, you might improve upon the LASSO regression model. There are a lot of papers claiming to do that now, for example the BRCA1 supervised experiment in the EVO-2 paper. Of course, a supervised disease-effect prediction layer could be LASSO itself and just include some additional features derived from the ML model.

tgb Nov 22, 2024, 3:50 PM
2 points
0
in reply to: notfnofn’s comment on: A very strange probability paradox
This is a lovely little problem, so thank you for sharing it. I thought at first it would be [a different problem](https://www.wolfram.com/mathematica/new-in-9/markov-chains-and-queues/coin-flip-sequences.html) that’s similarly paradoxical.

tgb Aug 21, 2024, 1:53 PM
2 points
0
in reply to: Richard_Kennaway’s comment on: Critique of ‘Many People Fear A.I. They Shouldn’t’ by David Brooks.
Again, why wouldn’t you want to read things addressed to other sorts of audiences if you thought altering public opinion on that topic was important? Maybe you don’t care about altering public opinion but a large number of people here say they do care.

tgb Aug 20, 2024, 1:50 AM
2 points
0
in reply to: Richard_Kennaway’s comment on: Critique of ‘Many People Fear A.I. They Shouldn’t’ by David Brooks.
He’s influential and it’s worth knowing what his opinion is because it will become the opinion of many of his readers. Hes also representative of what a lot of other people are (independently) thinking.

What’s Scott Alexander qualified to comment on? Should we not care about the opinion of Joe Biden because he has no particular knowledge about AI? Sure, I’m doubt we learn anything from rebutting his arguments, but once upon a time LW cared about changing the public opinion on this matter and so should absolutely care about reading that public opinion.

Honestly, I embarrassed for us that this needs to be said.

tgb Jun 26, 2024, 12:27 PM
4 points
2
on: Childhood and Education Roundup #6: College Edition
But you don’t need grades to separate yourself academically. You take harder classes to do that. And incentivizing GPA again will only punish people for taking actual classes instead of sticking to easier ones they can get an A in.
Concretely, everyone in my math department that was there to actually get an econ job took the basic undergrad sequences and everyone looking to actually do math started with the honors (“throw you in the deep end until you can actually write a proof”) course and rapidly started taking graduate-level courses. The difference on their transcript was obvious but not necessarily on their GPA.
What system would turn that into a highly legible number akin to GPA? I’m not sure, some sort of ELO system?

tgb Feb 23, 2024, 2:32 PM
1 point
0
on: Do sparse autoencoders find “true features”?
I was confused until I realized that the “sparsity” that this post is referring to is activation sparsity not the more common weight sparsity that you get from L1 penalization of weights.

tgb Feb 17, 2024, 1:04 PM
4 points
0
in reply to: GoteNoSente’s comment on: I played the AI box game as the Gatekeeper — and lost
Wait why do you think inmates escaping is extremely rare? Are you just referring to escapes where guards assisted the escape? I work in a hospital system and have received two security alerts in my memory where a prisoner receiving medical treatment ditched their escort and escaped. At least one of those was on the loose for several days. I can also think of multiple escapes from prisons themselves, for example: https://abcnews.go.com/amp/US/danelo-cavalcante-murderer-escaped-pennsylvania-prison-weeks-facing/story?id=104856784 notable since the prisoner was an accused murderer and likely to be dangerous and armed. But there was also another escape from that same jail earlier that year: https://www.dailylocal.com/2024/01/08/case-of-chester-county-inmate-whose-escape-showed-cavalcante-the-way-out-continued/amp/

tgb Feb 12, 2024, 3:24 PM

13 points

on: How do you actually obtain and report a likelihood function for scientific research?

i have some reservations about the practicality of reporting likelihood functions and have never done this before, but here are some (sloppy) examples in python. Primarily answering number 1 and 3.

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
import matplotlib
import pylab

np.random.seed(100)

## Generate some data for a simple case vs control example
# 10 vs 10 replicates with a 1 SD effect size
controls = np.random.normal(size=10)
cases = np.random.normal(size=10) + 1
data = pd.DataFrame(
    {
        "group": ["control"] * 10 + ["case"] * 10,
        "value": np.concatenate((controls, cases)),
    }
)

## Perform a standard t-test as comparison
# Using OLS (ordinary least squares) to model the data
results = smf.ols("value ~ group", data=data).fit()
print(f"The p-value is {results.pvalues['group[T.control]']}")

## Report the (log)-likelihood function
# likelihood at the fit value (which is the maximum likelihood)
likelihood = results.llf
# or equivalently
likelihood = results.model.loglike(results.params)

## Results at a range of parameter values:
# we evaluate at 100 points between -2 and 2
control_case_differences = np.linspace(-2, 2, 100)
likelihoods = []
for cc_diff in control_case_differences:
    params = results.params.copy()
    params["group[T.control]"] = cc_diff
    likelihoods.append(results.model.loglike(params))

## Plot the likelihood function
fig, ax = pylab.subplots()
ax.plot(
    control_case_differences,
    likelihoods,
)
ax.set_xlabel("control - case")
ax.set_ylabel("log likelihood")


## Our model actually has two parameters, the intercept and the control-case difference
# We only varied the difference parameter without changing the intercept, which denotes the
# the mean value across both groups (since we are balanced in case/control n's)
# Now lets vary both parameters, trying all combinations from -2 to 2 in both values
mean_values = np.linspace(-2, 2, 100)
mv, ccd = np.meshgrid(mean_values, control_case_differences)
likelihoods = []
for m, c in zip(mv.flatten(), ccd.flatten()):
    likelihoods.append(
        results.model.loglike(
            pd.Series(
                {
                    "Intercept": m,
                    "group[T.control": c,
                }
            )
        )
    )
likelihoods = np.array(likelihoods).reshape(mv.shape)

# Plot it as a 2d grid
fig, ax = pylab.subplots()
h = ax.pcolormesh(
    mean_values,
    control_case_differences,
    likelihoods,
)
ax.set_ylabel("case - control")
ax.set_xlabel("mean")
fig.colorbar(h, label="log likelihood")

The two figures are:

I think this code will extend to any other likelihood-based model in statsmodels, not just OLS, but I haven’t tested.

It’s also worth familiarizing yourself with how the likelihoods are actually defined. For OLS we assume that residuals are normally distributed. For data points y_i at X_i the likelihood for a linear model with independent, normal residuals is:

$L = \prod_{i = 1}^{n} e x p (- (y_{i} - X_{i} β) / 2 σ^{2}) / \sqrt{2 π σ^{2}}$

where $β$ is the parameters of the model, $σ^{2}$ is the variance of the residuals, and $n$ is the number of datapoints. So the likelihood function here is this value as a function of $β$ (and maybe also $σ^{2}$ , see below).

So if we want to tell someone else our full likelihood function and not just evaluate it at a grid of points, it’s enough to tell them $y$ and $X$ . But that’s the entire dataset! To get a smaller set of summary statistics that capture the entire information, you look for ‘sufficient statistics’. Generally for OLS those are just $X^{T} y$ and $X^{T} X$ . I think that’s also enough to recreate the likelihood function up to a constant?

Note that $σ^{2}$ matters for reporting the likelihood but doesn’t matter for traditional frequentist approaches like MLE and OLS since it ends up cancelling out when you’re doing finding the maximum or reporting likelihood ratios. This is inconvenient for reporting likelihood functions and I think the code I provided is just using the estimated $σ^{2}$ from the MLE estimate. However, at the end of the day, someone using your likelihood function would really only be using it to extract likelihood ratios and therefore the $σ^{2}$ probably doesn’t matter here either?

tgb Jan 17, 2024, 4:48 PM
5 points
0
on: Medical Roundup #1
But yes, working out is mostly unpleasant and boring as hell as we conceive of it and we need to stop pretending otherwise. Once we agree that most exercise mostly bores most people who try it out of their minds, we can work on not doing that.
I’m of the nearly opposite opinion: we pretend that exercise ought to be unpleasant. We equate exercise with elite or professional athletes and the vision of needing to push yourself to the limit, etc. In reality, exercise does include that but for most people should look more like “going for a walk” than “doing hill sprints until my legs collapse”.

On boredom specifically, I think strenuousness affects that more than monotony. When I started exercising, I would watch a TV show on the treadmill and kept feeling bored, but the moment I toned down to a walking speed to cool off, suddenly the show was engaging and I’d find myself overstaying just to watch it. Why wasn’t it engaging while I was running? The show didn’t change. Monotony wasn’t the deciding factor, but rather the exertion.

Later, I switched to running outside and now I don’t get bored despite using no TV or podcast or music. And it requires no willpower! If you’re two miles from home, you can’t quit. Quitting just means running two miles back which isn’t really quitting so you might as well keep going. But on a treadmill, you can hop off at any moment, so there’s a constant drain on willpower. So again, I think the ‘boredom’ here isn’t actually about the task being monotonous and finding ways to make it less monotonous won’t fix the perceived boredom.

I do agree with the comment of playing tag for heart health. But that already exists and is socially acceptable in the form of pickup basketball/soccer/flag-football/ultimate. Lastly, many people do literally find weightlifting fun, and it can be quite social.

tgb Jan 17, 2024, 4:24 PM
6 points
0
in reply to: mike_hawke’s comment on: Medical Roundup #1
The American Heart Association (AHA) Get with the Guidelines–Heart Failure Risk Score predicts the risk of death in patients admitted to the hospital.⁹ It assigns three additional points to any patient identified as “nonblack,” thereby categorizing all black patients as being at lower risk. The AHA does not provide a rationale for this adjustment. Clinicians are advised to use this risk score to guide decisions about referral to cardiology and allocation of health care resources. Since “black” is equated with lower risk, following the guidelines could direct care away from black patients.
From the NEJM article. This is the exact opposite of Zvi’s conclusions (“Not factoring this in means [blacks] will get less care”).

I confirmed the NEJM’s account by using an online calculator for that score. https://www.mdcalc.com/calc/3829/gwtg-heart-failure-risk-score Setting a patient with black=No gives higher risk than black=yes. Similarly so for a risk score from the AHA,: https://static.heart.org/riskcalc/app/index.html#!/baseline-risk

Is Zvi/NYT referring to a different risk calculator? There are a lot of them out there. The NEJM also discuses a surgical risk score that has the opposite directionality, so maybe that one? Though there the conclusion is also about less care for blacks: “When used preoperatively to assess risk, these calculations could steer minority patients, deemed to be at higher risk, away from surgery.” Of course, less care could be a good thing here!

I agree that this looks complicated.

tgb Jan 17, 2024, 3:45 PM
2 points
2
on: Medical Roundup #1
Wegovy (a GLP-1 antagonist)
Wegovy/Ozempic/Semaglutide are GLP-1 receptor agonists, not GLP-1 antagonists. This means they activate the GLP-1 receptor, which GLP-1 also does. So it’s more accurate to say that they are GLP-1 analogs, which makes calling them “GLP-1s” reasonable even though that’s not really accurate either.

tgb Jun 28, 2023, 1:43 AM
2 points
0
in reply to: DirectedEvolution’s comment on: Another medical miracle
Broccoli is higher in protein content per calorie than either beans or pasta and is a very central example of a vegetable, though you’d also want to mix it with beans or something for a better protein quality. 3500 calories of broccoli is 294g protein, if Google’s nutrition facts are to be trusted. Spinach, kale, and cauliflower all also have substantially better protein per calories than potatoes and better PDCAAS scores than I expected (though I’m not certain I trust them—does spinach actually get a 1?). I think potatoes are a poor example (and also not one vegetarians turn to for protein).
Though I tend to drench my vegetables in olive oil so these calories per gram numbers don’t mean much to me in practice, and good luck eating such a large volume of any of these.

tgb Jun 20, 2023, 1:51 PM
5 points
1
in reply to: Ege Erdil’s comment on: My impression of singular learning theory
In my view, it’s a significant philosophical difference between SLT and your post that your post talks only about choosing macrostates while SLT talks about choosing microstates. I’m much less qualified to know (let alone explain) the benefits of SLT, though I can speculate. If we stop training after a finite number of steps, then I think it’s helpful to know where it’s converging to. In my example, if you think it’s converging to $(0, 1)$ , then stopping close to that will get you a function that doesn’t generalize too well. If you know it’s converging to $(0, 0)$ then stopping close to that will get you a much better function—possibly exactly equally as good as you pointed out due to discretization.
Now this logic is basically exactly what you’re saying in these comments! But I think if someone read your post without prior knowledge of SLT, they wouldn’t figure out that it’s more likely to converge to a point near $(0, 0)$ than near $(0, 1)$ . If they read an SLT post instead, they would figure that out. In that sense, SLT is more useful.

I am not confident that that is the intended benefit of SLT according to its proponents, though. And I wouldn’t be surprised if you could write a simpler explanation of this in your framework than SLT gives, I just think that this post wasn’t it.

tgb Jun 20, 2023, 11:37 AM
4 points
0
in reply to: Ege Erdil’s comment on: My impression of singular learning theory
Everything I wrote in steps 1-4 was done in a discrete setting (otherwise $| A^{- 1} (f_{0}) |$ is not finite and whole thing falls apart). I was intending $θ$ to be pairs of floating point numbers and $A$ to be floats to floats.

However, using that I think I see what you’re trying to say. Which is that $θ_{1} θ_{2}$ will equal zero for some cases where $θ_{1}$ and $θ_{2}$ are both non-zero but very small and will multiply down to zero due to the limits of floating point numbers. Therefore the pre-image of $A^{- 1} (f_{0})$ is actually larger than I claimed, and specifically contains a small neighborhood of $(0, 0)$ .

That doesn’t invalidate my calculation that shows that $(0, 0)$ is equally likely as $(0, 1)$ though: they still have the same loss and $A$ -complexity (since they have the same macrostate). On the other hand, you’re saying that there are points in parameter space that are very close to $(0, 0)$ that are also in this same pre-image and also equally likely. Therefore even if $(0, 0)$ is just as likely as $(0, 1)$ , being near to $(0, 0)$ is more likely than being near to $(0, 1)$ . I think it’s fair to say that that is at least qualitatively the same as SLT gives in the continous version of this.

However, I do think this result “happened” due to factors that weren’t discussed in your original post, which makes it sound like it is “due to” $A$ -complexity. $A$ -complexity is a function of the macrostate, which is the same at all of these points and so does not distinguish between $(0, 0)$ and $(0, 1)$ at all. In other words, your post tells me which $f$ is likely while SLT tells me which $θ$ is likely—these are not the same thing. But you clearly have additional ideas not stated in the post that also help you figure out which $θ$ is likely. Until that is clarified, I think you have a mental theory of this which is very different from what you wrote.

tgb Jun 19, 2023, 10:44 PM
3 points
0
in reply to: Ege Erdil’s comment on: My impression of singular learning theory
the worse a singularity is, the lower the $A$ -complexity of the corresponding discrete function will turn out to be
This is where we diverge. Please let me know where you think my error is in the following. Returning to my explicit example (though I wrote $f (θ)$ originally but will instead use $A (θ)$ in this post since that matches your definitions).

1. Let $f_{0} (x) = 0 x$ be the constant zero function and $S = A^{- 1} (f_{0}) .$

2. Observe that $S$ is the minimal loss set under our loss function and also $S$ is the set of parameters $θ = (θ_{1}, θ_{2})$ where $θ_{1} = 0$ or $θ_{2} = 0$ .
3. Let $α, β \in S$ . Then $A^{- 1} (α) = f_{0} = A^{- 1} (β)$ by definition of $S$ . Therefore, $c (A (α)) = c (A (β)) .$
4. SLT says that $θ = (0, 0)$ is a singularity of $S$ but that $θ = (0, 1) \in S$ is not a singularity.

5. Therefore, there exists a singularity (according to SLT) which has identical $A$ -complexity (and also loss) as a non-singular point, contradicting your statement I quote.

tgb Jun 19, 2023, 3:45 PM
4 points
0
in reply to: Ege Erdil’s comment on: My impression of singular learning theory
The effective dimension of the singularity near the origin is much higher, e.g. because near every other minimal point of this loss function the Hessian doesn’t vanish, while for the singularity at the origin it does vanish. If you discretized this setup by looking at it with a lattice of mesh $ε$ , say, you would notice that the origin is surrounded by many parameters that give nearly identical loss, while near other parts of the space the number of such parameters is far fewer.
As I read it, the arguments you make in the original post depend only on the macrostate $f$ , which is the same for both the singular and non-singular points of the minimal loss set (in my example), so they can’t distinguish these points at all. I see that you’re also applying the logic to points near the minimal set and arguing that the nearly-optimal points are more abundant near the singularities than near the non-singularities. I think that’s a significant point not made at all in your original point that brings it closer to SLT, so I’d encourage you to add it to the post.

I think there’s also terminology mismatch between your post and SLT. You refer to singularities of $A$ (i.e. its derivative is degenerate) while SLT refers to singularities of the set of minimal loss parameters. The point $θ = (0, 1)$ in my example is not singular at all in SLT but $A (θ)$ is singular. This terminology collision makes it sound like you’ve recreated SLT more than you actually have.

tgb Jun 19, 2023, 11:38 AM
5 points
2
in reply to: Ege Erdil’s comment on: My impression of singular learning theory
Here’s a concrete toy example where SLT and this post give different answers (SLT is more specific). Let $f (θ) (x) = θ_{1} θ_{2} x$ . And let $L (f (θ)) = | f (θ) (1) |^{2} = θ_{1}^{2} θ_{2}^{2}$ . Then the minimal loss is achieved at the set of parameters where $θ_{1} = 0$ or $θ_{2} = 0$ (note that this looks like two intersecting lines, with the singularity being the intersection). Note that all $θ$ in that set also give the same exact $f (θ)$ . The theory in your post here doesn’t say much beyond the standard point that gradient descent will (likely) select a minimal or near-minimal $θ$ , but it can’t distinguish between the different values of $θ$ within that minimal set.

SLT on the other hand says that gradient descent will be more likely to choose the specific singular value $θ_{1} = 0 = θ_{2}$ .

Now I’m not sure this example is sufficiently realistic to demonstrate why you would care about SLT’s extra specificity, since in this case I’m perfectly happy with any value of $θ$ in the minimal set—they all give the exact same $f (θ)$ . If I were to try to generalize this into a useful example, I would try to find a case where $L (f (θ))$ has a minimal set that contains multiple different $f (θ)$ . For example, $L$ only evaluates $f (θ)$ on a subset of points (the ‘training data’) but different choices of minimal $θ$ give different values outside of that subset of training data. Then we can consider which $f (θ)$ has the best generalization to out-of-training data—do the parameters predicted by SLT yield $f (θ)$ that are best at generalizing?

Disclaimer: I have a very rudimentary understanding of SLT and may be misrepresenting it.

tgb May 1, 2023, 11:39 PM
3 points
1
in reply to: jessicata’s comment on: Hell is Game Theory Folk Theorems
I guess the unstated assumption is that the prisoners can only see the temperatures of others from the previous round and/or can only change their temperature at the start of a round (though one tried to do otherwise in the story). Even with that it seems like an awfully precarious equilibrium since if I unilaterally start choosing 30 repeatedly, you’d have to be stupid to not also start choosing 30, and the cost to me is really quite tiny even while no one else ever ‘defects’ alongside me. It seems to be too weak a definition of ‘equilibrium’ if it’s that easy to break—maybe there’s a more realistic definition that excludes this case?

tgb May 1, 2023, 3:47 PM
3 points
0
on: Hell is Game Theory Folk Theorems
I don’t think the ‘strategy’ used here (set to 99 degrees unless someone defects, then set to 100) satisfies the “individual rationality condition”. Sure, when everyone is setting it to 99 degrees, it beats the minmax strategy of choosing 30. But once someone chooses 30, the minmax for everyone else is now to also choose 30 - there’s no further punishment that will or could be given. So the behavior described here, where everyone punishes the 30, is worse than minmaxing. At the very least, it would be an unstable equilibrium that would have broken down in the situation described—and knowing that would give everyone an incentive to ‘defect’ immediately.