I think this is missing from the list: the Whole Brain Architecture Initiative. https://wba-initiative.org/en/25057/
TheManxLoiner
Two flaws in the Machiavelli Benchmark
Liron Shapira vs Ken Stanley on Doom Debates. A review
Should LessWrong have an anonymous mode? When reading a post or comments, is it useful to have the username or does that introduce bias?
I had this thought after reading this review of LessWrong: https://nathanpmyoung.substack.com/p/lesswrong-expectations-vs-reality
TheManxLoiner’s Shortform
Sounds sensible to me!
What do we mean by the error term here?

I think the setting is:

- We have a true value function $V$.
- We have a process that learns an estimate of $V$. We run this process once and we get $\hat{V}$.
- We then ask an AI system to act so as to maximize $\hat{V}$ (its estimate of human values).

So in this context, the error term is just a fixed function measuring the gap between the learnt values $\hat{V}$ and the true values $V$.

I think the confusion could be using the same term to represent both a single instance and the random variable/process.
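To illustrate the instance-vs-process distinction, here is a minimal toy sketch in Python (the scalar setup, the Gaussian noise model, and names like `learn_estimate` are illustrative assumptions on my part, not from the original post):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "true value function": a fixed vector of values over a few outcomes.
V_true = np.array([1.0, 0.5, -0.3])

def learn_estimate(rng):
    """One run of the noisy value-learning process: returns a single V_hat."""
    return V_true + rng.normal(scale=0.1, size=V_true.shape)

def error(V_hat):
    """A fixed function measuring the gap between learnt and true values."""
    return np.linalg.norm(V_hat - V_true)

# Single instance: run the learning process once, get one V_hat,
# and the error is just a number.
V_hat = learn_estimate(rng)
print("error of this particular V_hat:", error(V_hat))

# Process view: the same error term, considered over many possible runs
# of the learning process, is a random variable with a distribution.
errors = [error(learn_estimate(rng)) for _ in range(1000)]
print("mean error over the learning process:", np.mean(errors))
```

The first print treats the error as a fixed quantity for one particular $\hat{V}$; the second treats it as a random variable induced by the learning process.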
How to make evals for the AISI evals bounty
Thanks for this post! Very clear and great reference.
- You appear to use the term ‘scope’ in a particular technical sense. Could you give a one-line definition?
- Do you know if this agenda has been picked up since you made this post?
> But in this Eiffel Tower example, I’m not sure what is correlating with what.
The physical object Eiffel Tower is correlated with itself.
However, I think the basic ability of an LLM to correctly complete the sentence “the Eiffel Tower is in the city of…” is not very strong evidence of having the relevant kinds of dispositions.
It is highly predictive of the LLM’s ability to book flights to Paris when I create an LLM agent out of it and ask it to book a trip to see the Eiffel Tower.
> I think the question about whether current AI systems have real goals and beliefs does indeed matter.
I don’t think we disagree here. To clarify, my belief is that there are threat models/solutions that are not affected by whether the AI has ‘real’ beliefs, and other threats/solutions where it does matter.
I think CGP Grey’s perspective puts more weight on Definition 3.
I actually do not understand the distinction between Definition 2 and Definition 3, but we don’t need to resolve it here. I’ve edited the post to include my uncertainty on this.
Zvi’s latest newsletter has a section on this topic! https://thezvi.substack.com/i/151331494/good-advice
+1 to you continuing with this series.
Pedantic point. You say “Automating AI safety means developing some algorithm which takes in data and outputs safe, highly-capable AI systems.” I do not think semi-automated interpretability fits into this, as the output of interpretability (currently) is not a model but an explanation of existing models.
Unclear why Level (1) does not break down into the ‘empirical’ vs ‘human checking’ cases. In particular, how would this belief be obtained: “The humans are confident the details provided by the AI systems don’t compromise the safety of the algorithm.”
Unclear (though there is a good chance I just need to think more carefully through the concepts) why Level (3) does not collapse into Level (1) as well, by the same reasoning. This might be related to Martin’s alternative framing.
Couple of thoughts:
1. I recently found out about this new-ish social media platform: https://www.heymaven.com/. There is a good chance they are researching alternative recommendation algorithms.
2. What particular actions do you think the rationality/EA community could take that other big efforts, e.g. projects by Tristan Harris or Jaron Lanier, have not already done enough of?
Thanks for the feedback! I have edited the post to include your remarks.
Scattered thoughts on what it means for an LLM to believe
The ‘evolutionary pressures’ being discussed by CGP Grey are not the direct gradient descent used to train an individual model. Instead, he is referring to the whole set of incentives we as a society put on AI models. Similar to memes: there is no gradient descent on memes.
(Apologies if you already understood this, but it seems your post and Steven Byrnes’s post focus on the training of individual models.)
AI as a powerful meme, via CGP Grey
What is the status of this project? Are there any estimates of timelines?
I’d like this!