This may be a naive question/observation, but it seems to me that treating the trace as a vector conflates the behavior before and after convergence. My intuition is that SGLD is basically an MCMC technique, so if I'm misunderstanding, that's probably why.
After convergence, the samples should be viewed as draws from the stationary distribution, and ideally they have low autocorrelation, so treating them as a vector doesn't seem to make sense: there should be many equivalent traces. I'd be interested to see density estimates per sample point, not just the per-timestep ones you have.
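To make the density-estimate idea concrete, here's a rough sketch of pooling post-burn-in SGLD losses and estimating their density. The toy trace, the burn-in cutoff, and the histogram estimator are all my own assumptions for illustration, not anything from the post:

```python
import numpy as np

# Sketch: after discarding burn-in, treat the remaining SGLD losses as
# (approximately) i.i.d. draws from the stationary distribution and look
# at their density rather than their ordering.
rng = np.random.default_rng(0)
trace = np.concatenate([
    np.linspace(3.0, 1.0, 100),             # pre-convergence decay (toy)
    1.0 + 0.05 * rng.standard_normal(900),  # stationary samples (toy)
])
burn_in = 100  # assumed cutoff; in practice this needs a diagnostic
stationary = trace[burn_in:]

# a simple histogram density estimate of the stationary loss distribution
density, edges = np.histogram(stationary, bins=30, density=True)
mode_loss = edges[np.argmax(density)]  # where the stationary mass concentrates
```

The point is just that once the chain has mixed, the distribution of losses carries the information, and any reordering of the post-burn-in samples would be an equally valid trace.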
Before convergence, it does make sense to look at the vector, but it's still noisy, so I'd also be interested in summaries of it: for example, time to convergence and the slope of the loss (which is a function of time and of the mean loss of the stationary distribution).
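As a sketch of the kind of summaries I mean, here's one way to extract a convergence time and a pre-convergence slope from a loss trace. The tolerance-based convergence heuristic and the toy trace are my own assumptions, not anything from the post:

```python
import numpy as np

def summarize_trace(losses, tol=0.1):
    """Heuristic summaries of an SGLD loss trace.

    Convergence time is taken as the first step after which the loss stays
    within a relative tolerance of the tail mean (a stand-in for the
    stationary mean loss); the slope is a least-squares fit over the
    pre-convergence segment. Both definitions are assumptions.
    """
    losses = np.asarray(losses, dtype=float)
    tail_mean = losses[len(losses) // 2:].mean()  # proxy for stationary mean
    within = np.abs(losses - tail_mean) <= tol * abs(tail_mean)
    t_conv = len(losses)
    for t in range(len(losses)):
        if within[t:].all():  # stays within tolerance from t onward
            t_conv = t
            break
    # slope of the pre-convergence segment (0 if it converged immediately)
    slope = np.polyfit(np.arange(t_conv), losses[:t_conv], 1)[0] if t_conv >= 2 else 0.0
    return t_conv, slope, tail_mean

# toy trace: exponential decay toward a noisy stationary value
rng = np.random.default_rng(0)
trace = 2.0 * np.exp(-np.arange(200) / 30) + 1.0 + 0.01 * rng.standard_normal(200)
t_conv, slope, stat_mean = summarize_trace(trace)
```

These scalar summaries would be much less sensitive to sampling noise than the raw vector, at the cost of throwing away the shape of the descent.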
I also find myself wondering about other, more tangential things:
- Could you use a per-sample gradient trace (rather than the loss trace) of the SGLD to learn something?
- Does it make sense to, e.g., run multiple chains?
- Does it make sense to change the temperature throughout the run (like simulated annealing) rather than just running at each temperature separately?
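On the multiple-chains question, the standard Gelman–Rubin R-hat diagnostic is the sort of thing I have in mind: running a few independent SGLD chains and checking whether they agree on the stationary distribution. A minimal sketch on synthetic loss traces (my own illustration, not code from the post):

```python
import numpy as np

def gelman_rubin(chains):
    """Gelman-Rubin R-hat over post-burn-in loss traces.

    `chains` is an (m, n) array: m independent chains, n samples each.
    R-hat near 1 suggests the chains agree on the stationary distribution.
    """
    chains = np.asarray(chains, dtype=float)
    m, n = chains.shape
    chain_means = chains.mean(axis=1)
    B = n * chain_means.var(ddof=1)        # between-chain variance
    W = chains.var(axis=1, ddof=1).mean()  # within-chain variance
    var_plus = (n - 1) / n * W + B / n     # pooled variance estimate
    return np.sqrt(var_plus / W)

rng = np.random.default_rng(1)
# three chains sampling the same (toy) stationary loss distribution
good = 1.0 + 0.05 * rng.standard_normal((3, 500))
rhat = gelman_rubin(good)  # should be close to 1 for well-mixed chains
```

Multiple chains would also give a direct handle on the "many equivalent traces" point above: if the traces really are exchangeable after convergence, the chains should be indistinguishable.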
Thanks!
I did notice by the third read-through that we were comparing traces at the same parameter values, so I appreciate the clarification. I think the thing that would have made this clear to me is an explicit mention that it only makes sense to compare traces within the same run.