TheManxLoiner

Karma: 154

TheManxLoiner Apr 9, 2025, 11:11 AM
11 points
0
on: TheManxLoiner’s Shortform
In Sakana AI’s paper on AI Scientist v-2, they claim that the system is independent of human code. Based on a quick skim, I think this is wrong/deceptive. I wrote up my thoughts here: https://lovkush.substack.com/p/are-sakana-lying-about-the-independence

Main trigger was this line in the system prompt for idea generation: “Ensure that the proposal can be done starting from the provided codebase.”

TheManxLoiner Mar 13, 2025, 4:54 PM
6 points
0
on: Top AI safety newsletters, books, podcasts, etc – new AISafety.com resource
Substacks:
- https://aievaluation.substack.com/
- https://peterwildeford.substack.com/
- https://www.exponentialview.co/
- https://milesbrundage.substack.com/

Podcasts:
- Cognitive Revolution. https://www.cognitiverevolution.ai/tag/episodes/
- Doom debates. https://www.youtube.com/@DoomDebates
- AI policy podcast. https://www.csis.org/podcasts/ai-policy-podcast
Worth checking this too: https://forum.effectivealtruism.org/posts/5Hk96JqpEaEAyCEud/how-do-you-follow-ai-safety-news

TheManxLoiner Mar 9, 2025, 11:47 PM
1 point
0
on: Conditional Importance in Toy Models of Superposition
Vague thoughts/intuitions:
- I think using the word “importance” is misleading, or at least makes it harder to reason about the connection between this toy scenario and real text data. In real comedy/drama, there are patterns in the data that let me/the model deduce whether it is comedy or drama, and hence focus on the conditionally important features.
- Phrasing the task as follows helps me: You will be given 20 random numbers x_1 to x_20. I want you to find projections that can recover x_1 to x_20. Half the time I will ignore your answers for x_1 to x_10, and the other half of the time I will ignore x_11 to x_20. It’s totally random which half of the numbers I will ignore. x_i and x_{10+i} get the same reward, and the reward decreases for bigger i. Now I find it easier to understand the model: the “obvious” strategy is to make sure I can reproduce x_1 and x_11, then x_2 and x_12, and so on, putting little weight on x_10 and x_20. Alternatively, this is equivalent to having fixed importance of (0.7, 0.49, ..., 0.7, 0.49, ...) without any conditioning.
- A follow-up I’d be interested in is what happens if the conditional importance is deducible from the data, e.g. x is a “comedy” if x_1 + … + x_20 > 0, or if x_1 > 0. With the same architecture, I’d predict getting the same results though...? Not sure how the model could make use of this pattern.
- And contrary to Charlie, I personally found the experiment crucial to understanding the informal argument. Shows how different ppl think!
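The equivalence claimed in the second bullet can be checked numerically. Below is a minimal sketch, assuming an importance-weighted squared-error loss and importances 0.7^i (read off the “(0.7, 0.49, ...)” above). All function names are hypothetical, not taken from the post. It averages the loss over the two equally likely “ignore” cases and compares the result with the fixed-importance loss.

```python
import random

def importance(n_pairs=10, decay=0.7):
    # Assumed importances: decay**i for i = 1..n_pairs, repeated for both halves.
    half = [decay ** i for i in range(1, n_pairs + 1)]
    return half + half

def weighted_sq_error(x, x_hat, weights):
    # Importance-weighted squared reconstruction error.
    return sum(w * (a - b) ** 2 for w, a, b in zip(weights, x, x_hat))

def conditional_loss(x, x_hat, imp, ignore_first_half):
    # Loss when one half of the coordinates is ignored (weight set to zero).
    n = len(x) // 2
    if ignore_first_half:
        mask = [0.0] * n + [1.0] * n
    else:
        mask = [1.0] * n + [0.0] * n
    weights = [m * w for m, w in zip(mask, imp)]
    return weighted_sq_error(x, x_hat, weights)

def expected_conditional_loss(x, x_hat, imp):
    # Each half is ignored with probability 1/2, independently of x.
    return 0.5 * (conditional_loss(x, x_hat, imp, True)
                  + conditional_loss(x, x_hat, imp, False))

if __name__ == "__main__":
    rng = random.Random(0)
    x = [rng.uniform(-1, 1) for _ in range(20)]
    x_hat = [rng.uniform(-1, 1) for _ in range(20)]
    imp = importance()
    print(expected_conditional_loss(x, x_hat, imp))
    print(0.5 * weighted_sq_error(x, x_hat, imp))
```

The two printed values agree up to floating-point error: the expected conditional loss is exactly half the fixed-importance loss, so the two objectives share the same minimizers. This relies on the ignored half being chosen independently of x. If the condition were deducible from x, as in the follow-up bullet, the averaging step would no longer factor out this way.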

TheManxLoiner Mar 9, 2025, 10:20 PM
1 point
0
on: Thoughts on Toy Models of Superposition
“there are features such as X_1 which are perfectly recovered”
Just to check: in the toy scenario, do we assume the features in R^n are the coordinates in the default basis, so that we have n features X_1, …, X_n?
Separately, do you have intuition for why they allow the network to learn b too? Why not set b to zero?

TheManxLoiner Feb 27, 2025, 3:14 PM
2 points
1
on: [PAPER] Jacobian Sparse Autoencoders: Sparsify Computations, Not Just Activations
If you’d like to increase the probability of me writing up a “Concrete open problems in computational sparsity” LessWrong post
I’d like this!

Two flaws in the Machiavelli Benchmark

TheManxLoiner Feb 12, 2025, 7:34 PM

23 points

0 comments 3 min read LW link

Liron Shapira vs Ken Stanley on Doom Debates. A review

TheManxLoiner Jan 24, 2025, 6:01 PM

9 points

0 comments 14 min read LW link

TheManxLoiner Jan 7, 2025, 2:17 PM
4 points
1
on: Shallow review of technical AI safety, 2024
I think this is missing from the list: the Whole Brain Architecture Initiative. https://wba-initiative.org/en/25057/

TheManxLoiner Dec 20, 2024, 10:30 AM
1 point
0
on: TheManxLoiner’s Shortform
Should LessWrong have an anonymous mode? When reading a post or comments, is it useful to have the username or does that introduce bias?

I had this thought after reading this review of LessWrong: https://nathanpmyoung.substack.com/p/lesswrong-expectations-vs-reality

TheManxLoiner’s Shortform

TheManxLoiner Dec 20, 2024, 10:30 AM

3 points

6 comments LW link

TheManxLoiner Dec 14, 2024, 12:24 AM
3 points
1
in reply to: Roman Malov’s comment on: Visual demonstration of Optimizer’s curse
Sounds sensible to me!

TheManxLoiner Dec 13, 2024, 7:10 PM
1 point
0
on: Visual demonstration of Optimizer’s curse
What do we mean by $U - V$?
I think the setting is:
- We have a true value function $V$.
- We have a process to learn an estimate of $V$. We run this process once and we get $U$.
- We then ask an AI system to act so as to maximize $U$ (its estimate of human values).
So in this context, $U - V$ is just a fixed function measuring the error between the learnt values and true values.

I think the confusion could come from using the symbol $U$ to represent both a single instance and the random variable/process.
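To make the two readings of $U$ concrete, here is a small simulation sketch. It assumes standard-normal true values and additive Gaussian estimation noise. Those distributions and the name `simulate_gap` are illustrative assumptions, not taken from the post. Within one trial, $U - V$ is a fixed function over the options. Averaging over many trials treats $U$ as a random variable and measures the expected error at the option that maximizes $U$.

```python
import random

def simulate_gap(n_options=50, noise_sd=1.0, n_trials=2000, seed=0):
    # Average of U(best) - V(best) over trials, where best = argmax of U.
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_trials):
        # True values V for each option (assumed standard normal).
        v = [rng.gauss(0.0, 1.0) for _ in range(n_options)]
        # One run of the learning process: U = V + noise, fixed for this trial.
        u = [vi + rng.gauss(0.0, noise_sd) for vi in v]
        # The AI system picks the option with the highest estimated value.
        best = max(range(n_options), key=lambda i: u[i])
        total += u[best] - v[best]
    return total / n_trials

if __name__ == "__main__":
    print("many options:", simulate_gap(n_options=50))
    print("single option:", simulate_gap(n_options=1))
```

With many options, the average gap at the selected option comes out clearly positive, even though the noise has mean zero at every individual option. With a single option there is no selection, and the average gap stays near zero.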

How to make evals for the AISI evals bounty

TheManxLoiner Dec 3, 2024, 10:44 AM

9 points

0 comments 5 min read LW link

TheManxLoiner Nov 21, 2024, 10:45 AM
1 point
0
AF
on: Deep Forgetting & Unlearning for Safely-Scoped LLMs
Thanks for this post! Very clear and great reference.

- You appear to use the term ‘scope’ in a particular technical sense. Could you give a one-line definition?
- Do you know if this agenda has been picked up since you made this post?

TheManxLoiner Nov 17, 2024, 7:20 PM
1 point
0
in reply to: Max H’s comment on: Scattered thoughts on what it means for an LLM to believe
But in this Eiffel Tower example, I’m not sure what is correlating with what
The physical object Eiffel Tower is correlated with itself.

However, I think the basic ability of an LLM to correctly complete the sentence “the Eiffel Tower is in the city of…” is not very strong evidence of having the relevant kinds of dispositions.
It is highly predictive of the ability of the LLM to book flights to Paris, when I create an LLM-agent out of it and ask it to book a trip to see the Eiffel Tower.

I think the question about whether current AI systems have real goals and beliefs does indeed matter
I don’t think we disagree here. To clarify, my belief is that there are threat models / solutions that are not affected by whether the AI has ‘real’ beliefs, and there are other threats/solutions where it does matter.
I think CGP Grey perspective puts more weight on Definition 3.
I actually do not understand the distinction between Definition 2 and Definition 3. We don’t need to resolve it here. I’ve edited the post to include my uncertainty on this.

TheManxLoiner Nov 17, 2024, 7:02 PM
1 point
0
on: Are we dropping the ball on Recommendation AIs?
Zvi’s latest newsletter has a section on this topic! https://thezvi.substack.com/i/151331494/good-advice

TheManxLoiner Nov 12, 2024, 9:05 PM
1 point
0
on: Emergence, The Blind Spot of GenAI Interpretability?
+1 to you continuing with this series.

TheManxLoiner Nov 12, 2024, 8:40 PM
1 point
0
on: Automation collapse
1. Pedantic point. You say “Automating AI safety means developing some algorithm which takes in data and outputs safe, highly-capable AI systems.” I do not think semi-automated interpretability fits into this, as the output of interpretability (currently) is not a model but an explanation of existing models.
2. Unclear why Level (1) does not break down into the ‘empirical’ vs ‘human checking’. In particular, how would this belief be obtained: “The humans are confident the details provided by the AI systems don’t compromise the safety of the algorithm.”
3. Unclear (but good chance I just need to think more carefully through the concepts) why Level (3) does not collapse to Level (1) too, using same reasoning. Might be related to Martin’s alternative framing.

TheManxLoiner Nov 11, 2024, 7:44 PM
1 point
0
on: Are we dropping the ball on Recommendation AIs?
Couple of thoughts:
1. I recently found out about this new-ish social media platform. https://www.heymaven.com/. Good chance they are researching alternative recommendation algorithms.
2. What particular actions do you think the rationality/EA community could take that other big efforts, e.g. projects by Tristan Harris or Jaron Lanier, have not done enough of?

TheManxLoiner Nov 10, 2024, 5:50 PM
2 points
0
in reply to: RohanS’s comment on: Scattered thoughts on what it means for an LLM to believe
Thanks for the feedback! Have edited the post to include your remarks.
