Trying to get into alignment. Have a low bar for reaching out!
I think[1] people[2] probably trust individual tweets way more than they should.
Like, just because someone sounds very official and serious, and it’s a piece of information that’s in line with your worldview, doesn’t mean it’s actually true. Or maybe it is true, but missing important context. Or it’s saying A causes B when it’s more like A and C and D all cause B together, and actually most of the effect is from C but now you’re laser-focused on A.
Also you should be wary that the tweets you’re seeing are optimized for piquing the interest of people like you, not for truth.
I’m definitely not the first person to say this, but it feels worth saying again.
Sorry, is there a timezone for when the applications would close by, or is it AoE?
Man, politics really is the mind killer
I think knowing the karma and agreement is useful, especially to help me decide how much attention to pay to a piece of content, and I don’t think there’s that much distortion from knowing what others think. (i.e., overall benefits>costs)
Thanks for putting this up! Just to double check—there aren’t any restrictions against doing multiple AISC projects at the same time, right?
Is there no event on Oct 29th?
Wait a minute, “agentic” isn’t a real word? It’s not on dictionary.com or Merriam-Webster or Oxford English Dictionary.
I agree that if you put more limitations on what heuristics are and how they compose, you end up with a stronger hypothesis. I think it’s probably better to leave that out and try to do some more empirical work before making a claim there though (I suppose you could say that the hypothesis isn’t actually making a lot of concrete predictions yet at this stage).
I don’t think (2) necessarily follows, but I do sympathize with your point that the post is perhaps a more specific version of the hypothesis that “we can understand neural network computation by doing mech interp.”
Thanks for reading my post! Here’s how I think this hypothesis is helpful:
It’s possible that we wouldn’t be able to understand what’s going on even if we had some perfect way to decompose a forward pass into interpretable constituent heuristics. I’m skeptical that this would be the case, mostly because I think (1) we can get a lot of juice out of auto-interp methods and (2) we probably wouldn’t need to understand that many heuristics simultaneously (which is the case for your logic gate example for modern computers). At a minimum, I would argue that the decomposed bag of heuristics is likely to be much more interpretable than the original model itself.
If the hypothesis is true, it at least suggests that interpretability researchers should put more effort into finding and studying individual heuristics/circuits, as opposed to working in the current, more “feature-centric” framework. I don’t know exactly how this would manifest, but it felt worth saying. I believe that some of the empirical work I cited suggests that we might make more incremental progress if we focused on heuristics more right now.
I think there’s something wrong with the link :/ It was working fine earlier but seems to be down now
I think those sound right to me. It still feels like prompts with weird suffixes obtained through greedy coordinate search (or other jailbreaking methods like h3rm4l) are good examples for “model does thing for anomalous reasons.”
Sorry, I linked to the wrong paper! Oops, just fixed it. I meant to link to Aaron Mueller’s Missed Causes and Ambiguous Effects: Counterfactuals Pose Challenges for Interpreting Neural Networks.
You could also use \text{}
since people often treat heuristics as meaning that it doesn’t generalize at all.
Yeah and I think that’s a big issue! I feel like what’s happening is that once you chain a huge number of heuristics together you can get behaviors that look a lot like complex reasoning.
I see, I think that second tweet thread actually made a lot more sense, thanks for sharing!
McCoy’s definitions of heuristics and reasoning are sensible, although I personally would still avoid “reasoning” as a word since people probably have very different interpretations of what it means. I like the ideas of “memorizing solutions” and “generalizing solutions.”
I think where McCoy and I depart is that he’s modeling the entire network computation as a heuristic, while I’m modeling the network as compositions of bags of heuristics, which in aggregate would display behaviors he would call “reasoning.”
The explanation I gave above (heuristics that shift the letter forward by one, with limited ability to compose) is still a heuristics-based explanation. Maybe this set of composing heuristics would fit your definition of an “algorithm.” I don’t think there’s anything inherently wrong with that.
However, the heuristics-based explanation gives concrete predictions of what we can look for in the actual network: individual heuristics that increment a to b, b to c, etc., and other parts of the network that compose the outputs.
This is what I meant when I said that this could be a useful framework for interpretability :)
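To make that picture a bit more concrete, here is a toy Python sketch of what I have in mind. Everything in it is made up for illustration: `INCREMENT` stands in for the individual letter-incrementing heuristics, and `MAX_COMPOSITIONS` is an arbitrary cap standing in for “limited composing abilities.” It is not a claim about what any real network implements.

```python
import string

ALPHABET = string.ascii_lowercase

# Hypothetical "bag of heuristics": one independent rule per letter
# (a -> b, b -> c, ..., z -> a). Each entry is its own tiny heuristic.
INCREMENT = {c: ALPHABET[(i + 1) % 26] for i, c in enumerate(ALPHABET)}

# Assumed cap on how many times the one-step heuristic can be chained.
MAX_COMPOSITIONS = 5


def shift_letter(letter, n):
    """Apply the one-step heuristic n times, or return None past the cap."""
    if n > MAX_COMPOSITIONS:
        return None
    for _ in range(n):
        letter = INCREMENT[letter]
    return letter


def shift_word(word, n):
    """Shift every lowercase letter in word forward by n; None if any letter fails."""
    shifted = [shift_letter(c, n) for c in word]
    return None if None in shifted else "".join(shifted)
```

In this framing, the interpretability question becomes: can we find the analogue of each `INCREMENT` entry in the network, and the analogue of the loop that chains them?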
Yeah that’s true. I meant this more as “Hinton is proof that AI safety is a real field and very serious people are concerned about AI x-risk.”
Thanks for the pointer! I skimmed the paper. Unless I’m making a major mistake in interpreting the results, the evidence they provide for “this model reasons” is essentially “the models are better at decoding words encrypted with rot-5 than they are at rot-10.” I don’t think this empirical fact provides much evidence one way or another.
To summarize, the authors decompose a model’s ability to decode shift ciphers (e.g., Rot-13 text: “fgnl” Original text: “stay”) into three categories: probability, memorization, and noisy reasoning.
Probability just refers to a somewhat unconditional probability that a model assigns to a token (specifically, ‘The word is “WORD”’). The model is more likely to decode words that are more likely a priori—this makes sense.
Memorization is defined as how often the type of rotational cipher shows up. rot-13 is the most common one by far, followed by rot-3. The model is better at decoding rot-13 than any other cipher, which makes sense since there’s more of it in the training data, and the model probably has specialized circuitry for rot-13.
What they call “noisy reasoning” is how many rotations are needed to get to the outcome. According to the authors, the fact that GPT-4 does better on shift ciphers with fewer shifts compared to ciphers with more shifts is evidence of this “noisy reasoning.”
I don’t see how you can jump from this empirical result to make claims about the model’s ability to reason. For example, an alternative explanation is that the model has learned some set of heuristics that allows it to shift letters from one position to another, but this set of heuristics can only be combined in a limited manner.
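Here is a toy Python sketch of why I don’t think the rot-5 vs rot-10 gap tells the two stories apart. `rot_decode` is just an ordinary shift-cipher decoder. `toy_word_accuracy` is a made-up model in which each one-step shift heuristic fires correctly with some independent probability `p_step`; that number is an assumption of mine, not something from the paper.

```python
import string

ALPHABET = string.ascii_lowercase


def rot_decode(text, shift):
    """Undo a rot-`shift` cipher on lowercase letters; leave other characters alone."""
    out = []
    for ch in text:
        if ch in ALPHABET:
            out.append(ALPHABET[(ALPHABET.index(ch) - shift) % 26])
        else:
            out.append(ch)
    return "".join(out)


def toy_word_accuracy(shift, word_len, p_step=0.98):
    """Chance that every one-step heuristic application succeeds.

    Assumes shift * word_len independent applications, each correct with
    probability p_step (an arbitrary illustrative value).
    """
    return p_step ** (shift * word_len)


print(rot_decode("fgnl", 13))  # stay
for shift in (5, 10):
    print(shift, round(toy_word_accuracy(shift, word_len=4), 3))
```

Under these assumptions, accuracy drops as the shift grows, which is the same qualitative pattern the authors attribute to “noisy reasoning.”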
Generally though, I think what constitutes a “heuristic” is a somewhat fuzzy concept. However, what constitutes “reasoning” seems even less well defined.
I think it’s mostly because he’s well known and has (especially after the Nobel Prize) credentials recognized by the public and elites. Hinton legitimizes the AI safety movement, maybe more than anyone else.
If you watch his Q&A at METR, he says something along the lines of “I want to retire and don’t plan on doing AI safety research. I do outreach and media appearances because I think it’s the best way I can help (and because I like seeing myself on TV).”
And he’s continuing to do that. The only real topic he discussed in his first phone interview after receiving the prize was AI risk.
This chapter on AI follows immediately after the year in review. I went and checked the previous few years’ annual reports to see what the comparable chapters were about. They are:
2023: China’s Efforts To Subvert Norms and Exploit Open Societies
2022: CCP Decision-Making and Xi Jinping’s Centralization Of Authority
2021: U.S.-China Global Competition (Section 1: The Chinese Communist Party’s Ambitions and Challenges at its Centennial)
2020: U.S.-China Global Competition (Section 1: A Global Contest For Power and Influence: China’s View of Strategic Competition With the United States)
And this year it’s Technology And Consumer Product Opportunities and Risks (Chapter 3: U.S.-China Competition in Emerging Technologies)
Reminds me of when Richard Ngo said something along the lines of “We’re not going to be bottlenecked by politicians not caring about AI safety. As AI gets crazier and crazier everyone would want to do AI safety, and the question is guiding people to the right AI safety policies.”