Some naive thoughts in case they are useful:
A) Is the structured annotation format more useful than a gamemaster/writer thinking aloud while recording themselves (possibly with an audience)?
A think-aloud recording could be the closest thing to a full transcript of the human process, which downstream tasks could condense as needed. An adopted annotation format (prescribed or not) could potentially cause thoughts to be filtered or reinterpreted, or even steer the human's generation.
One key argument against fixed-format annotation, I think, is that human gamemasters and writers do not spend approximately constant effort per player action. They do a lot of up-front work to plan the story, go on auto-pilot for many of the interactions, and think hard about critical parts of the story. Language models that generate stories today notoriously seem to lack this red thread, and filling out a form summarizing the writer's thoughts may fail to capture this process.
The unstructured approach may also be closer to what pretrained models have learned and therefore require less data.
It could perhaps also provide a highly interesting dataset for another task relevant to the application, namely metareasoning in generation: should the agent output the next part of the story or keep thinking about the generation?
Alternatively, one could record all thoughts as they come, but follow up each output with a few standardized questions, if some are critical to the application.
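To make that variant concrete, here is a minimal sketch of what one record could look like: the raw think-aloud transcript kept unfiltered, plus answers to a few fixed follow-up questions. All field names and questions here are hypothetical, not something the project prescribes.

```python
# Hypothetical record format for "think aloud + standardized follow-up questions".
from dataclasses import dataclass, field

@dataclass
class AnnotatedTurn:
    player_action: str   # what the player did or said
    think_aloud: str     # raw, unfiltered transcript of the GM's spoken thoughts
    output: str          # the story continuation the GM actually gave
    followup_answers: dict = field(default_factory=dict)  # answers to a few fixed questions

turn = AnnotatedTurn(
    player_action="The party opens the sealed door.",
    think_aloud="I planned the vault reveal two sessions ago, so this is mostly auto-pilot...",
    output="The door grinds open onto a vault lined with empty pedestals.",
    followup_answers={"critical_decision": "no", "changed_plan": "no"},
)
```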
B) I am curious whether sufficiently strong language models would be able to fake the explanations post hoc.
At least, looking at the forms, I am not sure I could tell competent explanations apart. If that is the case, the dataset may not get us very far in interpretability and may instead point to more specific needs. It might be worth trying to answer that question too.
E.g., before the dataset is made public, you could hide the thoughts in a crafted run and let another team fill in thoughts post hoc, rewarding them for swaying evaluators into accepting theirs as the original. This could also answer whether even humans can tell the genuine motivations behind a decision apart from made-up explanations, and it would provide another task dataset.
(C) Probably clear already, but models like GPT-3 can generate responses/stories while reflecting/talking to themselves, and some people already use them this way, outputting only the end result, although that is probably not operating at the desired level. Fine-tuning is also fairly cheap, so I don't think one has to settle for GPT-2. If the goal is interpretability of each generated token, perhaps the thoughts should also be derived from intermediate layers rather than being part of the sequence.)
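For what it's worth, a minimal sketch of that "reflect, then only output the end result" pattern is below. `complete` stands in for any text-completion model call (a GPT-3-style API, say); the prompt wording and the `Thoughts:`/`Story:` delimiters are my own assumptions, just to illustrate the idea.

```python
# Generate hidden reflection plus a continuation, return only the continuation.
REFLECT_PROMPT = (
    "You are the gamemaster. First think through the situation under 'Thoughts:'.\n"
    "Then write the next part of the story under 'Story:'.\n\n"
    "Player action: {action}\n\nThoughts:"
)

def reflect_then_answer(complete, action: str) -> str:
    """Ask the model to think aloud, then keep only the part after 'Story:'."""
    full = complete(REFLECT_PROMPT.format(action=action))
    _, _, story = full.partition("Story:")
    # Fall back to the whole completion if the delimiter is missing.
    return story.strip() or full.strip()

# Toy stand-in for a model call, just so the sketch runs end to end.
def fake_complete(prompt: str) -> str:
    return " The lock is trapped, but they rolled well...\nStory: The lock clicks open."

print(reflect_then_answer(fake_complete, "The party picks the lock."))
```

The point being that the reflection already happens in sequence space today; deriving it from intermediate layers instead would be a different (and harder) interpretability target.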
I think it is worth noting that we do not quite know the reasons for these events, and it may be too soon to say that the safety situation at OpenAI has worsened.
Manifold does not seem to consider safety conflicts a likely cause: