Sorry to be blunt, but any distraction filter that can be disabled through the Chrome extension menu is essentially worthless. Speaking from experience, for most people this will work for exactly 3 days, until they find a website they really want to visit and just “temporarily” disable the extension in order to see it.
For #5, I think the answer would be to make the AI produce the AI safety ideas which not only solve alignment, but also yield some aspect of capabilities growth along an axis that the big players care about, and in a way where the capabilities are not easily separable from the alignment. I can imagine this being the case if the AI safety idea somehow makes the AI much better at instruction-following using the spirit of the instruction (which is after all what we care about). The big players do care about having instruction-following AIs, and if the way to do that is to use the AI safety book, they will use it.
Do you expect LeCun to have been assuming that the entire field of RL stops existing in order to focus on his specific vision?
Very many things wrong with all of that:
RL algorithms don’t minimize costs, but maximize expected reward, which can well be unbounded, so it’s wrong to say that the ML field only minimizes cost.
LLMs minimize the expected negative log probability of the correct token (the cross-entropy loss), which is indeed bounded below by zero, but achieving zero in that case means perfectly predicting every single token on the internet.
The boundedness of the thing you’re minimizing is totally irrelevant, since maximizing f(x) is exactly the same as maximizing g(f(x)) where g is a monotonically increasing function. You can trivially turn a bounded function into an unbounded one (or vice versa) without changing the solution set.
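To make the monotone-transform point concrete, here’s a minimal numerical sketch (the one-dimensional reward function is made up): squashing an unbounded objective through a sigmoid bounds it in (0, 1) without moving the maximizer at all.

```python
import numpy as np

# Hypothetical 1-D "reward" landscape, unbounded below, maximum at x = 3.
def reward(x):
    return -(x - 3.0) ** 2

def squashed_reward(x):
    # Monotone transform into (0, 1): bounded, but same maximizer.
    return 1.0 / (1.0 + np.exp(-reward(x)))

xs = np.linspace(-10, 10, 10001)
print(xs[np.argmax(reward(xs))])           # ~3.0
print(xs[np.argmax(squashed_reward(xs))])  # ~3.0, identical argmax
```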
Even if utility is bounded between 0 and 1, an agent maximizing the expected utility will still never stop, because you can always decrease the probability you were wrong. Quadruple-check every single step and turn the universe into computronium to make sure you didn’t make any errors.
This is very dumb, LeCun should know better, and I’m sure he *would* know better if he spent 5 minutes thinking about any of this.
The word “privilege” has been so tainted by its association with guilt that it’s almost an infohazard to think you’ve got privilege at this point: it makes you lower your head in shame at having more than others, and brings about a self-flagellating sort of attitude. It elicits an instinct to lower yourself rather than bring others up. The proper reaction to all the things you’ve listed is gratitude for your circumstances and compassion towards those who don’t have them. And certainly everyone should be very wary of any instinct they have to publicly “acknowledge their privilege”… it’s probably your status-raising instincts having found a good opportunity to boast about your intelligence and good looks while appearing like you’re being modest.
Weird side effect to beware for retinoids: they make dry eyes worse, and in my experience this can significantly decrease your quality of life, especially if it prevents you from sleeping well.
Basically, this shows that every term in a standard Bayesian inference, including the prior ratio, can be re-cast as a likelihood term in a setting where you start off unsure about what words mean, and have a flat prior over which set of words is true.
If the possible meanings of your words are a continuous one-dimensional variable x, a flat prior over x will not be a flat prior if you change variables to y = f(x) for an arbitrary bijection f, and the construction would be sneaking in a specific choice of function f.
Say the words are utterances about the probability of a coin falling heads: why should the flat prior be over the probability p, instead of over the log-odds log(p/(1-p))?
Most of the weird stuff involving priors comes into being when you want posteriors over a continuous hypothesis space, where you get in trouble because reparametrizing your space changes the form of your prior, so a uniform “natural” prior is really a particular choice of parametrization. Using a discrete hypothesis space avoids big parts of the problem.
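A quick sketch of the coin example above (purely illustrative, just a change-of-variables check): a prior that is flat over p is visibly not flat once re-expressed in log-odds.

```python
import numpy as np

rng = np.random.default_rng(0)

# Flat prior over the coin's heads-probability p.
p = rng.uniform(0.0, 1.0, size=1_000_000)

# The same prior expressed in log-odds coordinates.
log_odds = np.log(p / (1.0 - p))

# If the prior were "flat" in log-odds too, equal-width bins would get
# equal mass; instead the mass concentrates near log-odds = 0.
hist, edges = np.histogram(log_odds, bins=[-4, -2, 0, 2, 4])
print(hist / hist.sum())   # roughly [0.10, 0.40, 0.40, 0.10], not uniform
```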
Wait, why doesn’t the entropy of your posterior distribution capture this effect? In the basic example where we get to see samples from a Bernoulli process, the posterior is a Beta distribution that gets ever sharper around the truth. If you compute the entropy of the posterior, you might say something like “I’m unlikely to change my mind about this, my posterior only has 0.2 bits to go until zero entropy”. That’s already a quantity which estimates how much future evidence will influence your beliefs.
Surely something like the expected variance of your future probability would be a much simpler way of formalising this, no? The probability over time is just a stochastic process, and OP is expecting the variance of this process to be very high in the near future.
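For what it’s worth, here’s a minimal sketch of both quantities for the Bernoulli/Beta example (the true bias 0.7 and the flip counts are made up): the differential entropy of the posterior and the expected one-step change in the posterior mean both shrink as evidence accumulates, which is the sense in which you “don’t expect to update much anymore”.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
true_p = 0.7            # made-up true bias of the coin
a, b = 1.0, 1.0         # Beta(1, 1) prior over the bias

for n in [0, 10, 100, 1000]:
    heads = int((rng.random(n) < true_p).sum())
    post = stats.beta(a + heads, b + (n - heads))
    # Differential entropy of the posterior: shrinks as it sharpens.
    ent = post.entropy()
    # Expected squared change of the posterior mean after one more flip,
    # a crude stand-in for "how much I expect my belief to move".
    m = post.mean()
    exp_sq_update = m * (1 - m) / (a + b + n + 1) ** 2
    print(n, round(float(ent), 3), round(float(np.sqrt(exp_sq_update)), 5))
```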
Unfortunately the entire complexity has just been pushed one level down into the definition of “simple”. The L2 norm can’t really be what we mean by simple, because scaling the weights in one layer by A and the weights in the next layer by 1/A leaves the output of the network invariant (assuming ReLU activations), yet you can obtain arbitrarily high L2 norms just by choosing A large enough.
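A few-line sketch of the rescaling argument (random weights, one hidden ReLU layer, everything here made up for illustration): the function computed is unchanged while the L2 norm blows up.

```python
import numpy as np

rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(16, 8)), rng.normal(size=(4, 16))
x = rng.normal(size=8)

def net(w1, w2):
    return w2 @ np.maximum(w1 @ x, 0.0)   # one hidden ReLU layer

A = 1e6
# Rescale one layer up and the next one down: the output is unchanged
# (ReLU is positively homogeneous), but the L2 norm explodes.
print(np.allclose(net(W1, W2), net(A * W1, W2 / A)))    # True
print(np.sum(W1**2) + np.sum(W2**2))                    # modest
print(np.sum((A * W1)**2) + np.sum((W2 / A)**2))        # ~10^12 times larger
```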
Unfortunately, if OpenAI the company is destroyed, all that happens is that all of its employees get hired by Microsoft, they change the lettering on the office building, and sama’s title changes from CEO to whatever high-level manager position he’ll occupy within Microsoft.
Hmm, but here the set of possible world states would be the domain of the function we’re optimising, not the function itself. Like, No-Free-Lunch states (from Wikipedia):
Theorem 1: Given a finite set V and a finite set S of real numbers, assume that f : V → S is chosen at random according to uniform distribution on the set S^V of all possible functions from V to S. For the problem of optimizing f over the set V, then no algorithm performs better than blind search.
Here V is the set of possible world arrangements, which is admittedly much smaller than all possible data structures, but the theorem still holds because we’re averaging over all possible value functions on this set of worlds, a set which is not physically restricted by anything. I’d be very interested if you can find Byrnes’ writeup.
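Here’s a toy simulation of that averaging (everything about it is made up: a 32-point “world”, i.i.d. uniform values standing in for “a function chosen uniformly at random”): a blind random search and a deterministic-looking sweep find equally good points on average, because over random value functions no query order is better than any other.

```python
import numpy as np

rng = np.random.default_rng(0)
N, TRIALS, BUDGET = 32, 20_000, 10       # tiny "world", random value functions

def best_after(order, f, k):
    return f[order[:k]].max()            # best value found in the first k queries

blind, sweep = [], []
for _ in range(TRIALS):
    f = rng.random(N)                    # value function drawn at random
    # Blind search: query points in a uniformly random order.
    blind_order = rng.permutation(N)
    # A "clever-looking" deterministic sweep from a random starting point.
    start = rng.integers(N)
    sweep_order = (start + np.arange(N)) % N
    blind.append(best_after(blind_order, f, BUDGET))
    sweep.append(best_after(sweep_order, f, BUDGET))

print(np.mean(blind), np.mean(sweep))    # essentially identical averages (~0.91)
```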
Obviously LLMs memorize some things; the easy example is that the pretraining dataset of GPT-4 probably contained lots of cryptographically hashed strings which are impossible to infer from the overall patterns of language. Predicting those accurately absolutely requires memorization; there’s literally no other way unless the LLM can invert the hash function. Then there are in-between things like Barack Obama’s age, which might be possible to infer from other language (a president is probably not 10 years old or 230), but within the plausible range you also just need to memorize it.
There is no optimization pressure from “evolution” at all. Evolution isn’t tending toward anything. Thinking otherwise is an illusion.
Can you think of any physical process at all where you’d say that there is in fact optimization pressure? Of course at the base layer it’s all just quantum fields changing under unitary evolution with a given Hamiltonian, but you can still identify subparts of the system that are isomorphic with a process we’d call “optimization”. Evolution doesn’t have a single time-independent objective it’s optimizing, but it does seem to me that it’s basically doing optimization on a slowly time-changing objective.
Why would you want to take such a child and force them to ‘emotionally develop’ with dumber children their own age?
Because you primarily make friends in school with people in your grade, and if you skip too many grades, the physical difference between the gifted kid and other kids will prevent them from building a social circle based on physical play, and probably make any sort of dating much harder.
Predicting the ratio at t=20s is hopeless. The only sort of thing you can predict is the variance of the ratio over time: the ratio as a function of time is r(t) = r_eq + ε(t), where ε(t) is a small zero-mean fluctuation about the equilibrium value r_eq. Here the large number of atoms lets you predict the variance of ε(t), but the exact number after 20 seconds is chaotic. To get an exact answer for how much initial perturbation still leads to a predictable state, you’d need to compute the Lyapunov exponents of an interacting classical gas system, and I haven’t been able to find a paper that does this within 2 min of searching. (Note that if the atoms are non-interacting the problem stops being chaotic, of course, since they’re just bouncing around on the walls of the box.)
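For what it’s worth, here’s a crude sketch of the “predictable variance” part, assuming the quantity in question is the fraction of atoms in one half of a box (and using non-interacting particles, so per the caveat above it isn’t actually chaotic; it only illustrates the size of the fluctuations): the observed standard deviation of the fraction comes out around 1/(2√N).

```python
import numpy as np

rng = np.random.default_rng(0)
N, L, T, dt = 100_000, 1.0, 20.0, 0.01     # particles, box length, run time, step

x = rng.uniform(0.0, L, N)                 # initial positions in the box
v = rng.normal(0.0, 1.0, N)                # initial velocities (arbitrary units)

left_fraction = []
for _ in range(int(T / dt)):
    # Bounces off the walls are equivalent to free motion on a ring of
    # circumference 2L, folded back into [0, L].
    x = (x + v * dt) % (2.0 * L)
    pos = np.minimum(x, 2.0 * L - x)
    left_fraction.append(np.mean(pos < L / 2))

print(np.std(left_fraction), 0.5 / np.sqrt(N))   # same order of magnitude, ~1.6e-3
```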
I’ll try to say the point some other way: you define “goal-complete” in the following way:
By way of definition: An AI whose input is an arbitrary goal, which outputs actions to effectively steer the future toward that goal, is goal-complete.
Suppose you give me a specification of a goal as a function from a state space to a binary output. Is the AI which just tries out uniformly random actions in perpetuity until it hits one of the goal states “goal-complete”? After all, no matter the goal specification this AI will eventually hit it, though it might take a very long time.
I think the interesting thing you’re trying to point at is contained in what it means to “effectively” steer the future, not in goal-arbitrariness.
E.g. I claim humans are goal-complete General Intelligences because you can give us any goal-specification and we’ll very often be able to steer the future closer toward it.
If you’re thinking of “goals” as easily specified natural-language things, then I agree with you, but the point is that Turing-completeness is a rigorously defined concept, and if you want to get the same level of rigour for “goal-completeness”, then most goals will be of the form “atom 1 is at location x, atom 2 is at location y, …” for all atoms in the universe. And when averaged across all such goals, literally just acting randomly performs as well as a human or a monkey trying their best to achieve the goal.
I suspect the expert judges would need to resort to known jailbreaking techniques to distinguish the LLMs. A fairly interesting test might be against expert-but-not-in-ML judges.