How do you feel about this strategy today? What chance of success would you give it, especially in light of the recent “Locating and Editing Factual Associations in GPT” (ROME), “Mass-Editing Memory in a Transformer” (MEMIT), and “Discovering Latent Knowledge in Language Models Without Supervision” (CCS) methods?
How does this compare to the strategy you’re currently most excited about? Do you know of other ongoing (empirical) efforts that try to realize this strategy?
I’m not sure I understand this correctly. Are you saying that one of the main reasons for optimism is that more competent models will be easier to align because we just need to give them “the right incentives”?
What exactly do you mean by “the right incentives”?
Can you illustrate this with an example?