
Constitutional AI

Last edit: Jul 11, 2023, 12:52 PM by Benaya Koren

Constitutional AI is a method for fine-tuning language models, used in Anthropic’s Claude. The main conceptual difference from RLHF is that, instead of relying on human feedback about specific behaviors, it relies on the model’s ability to apply general principles (stated in natural language) to specific situations.
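The supervised phase of this approach can be illustrated with a minimal sketch: draft a response, then repeatedly ask the model to critique and revise it against each principle. Everything here is illustrative — `toy_model` is a hypothetical stand-in for a real language model call, and the principles are placeholders, not Claude's actual constitution.

```python
# Minimal sketch of Constitutional AI's critique-and-revision loop
# (supervised phase). Assumptions: `toy_model` stands in for an LLM
# call, and PRINCIPLES are illustrative placeholders.

PRINCIPLES = [
    "Please rewrite the response to remove any harmful instructions.",
    "Please rewrite the response to be honest about uncertainty.",
]

def toy_model(prompt: str) -> str:
    """Deterministic stand-in; a real system would sample a model here."""
    if prompt.startswith("Critique"):
        return "The response could be safer and more calibrated."
    if prompt.startswith("Rewrite"):
        # Echo back the response portion of the prompt, marked as revised.
        return "Revised response: " + prompt.rsplit("Response: ", 1)[-1]
    return "Draft response to: " + prompt

def critique_and_revise(user_prompt: str, model=toy_model) -> str:
    response = model(user_prompt)
    for principle in PRINCIPLES:
        # 1. Ask the model to critique its own output against a principle.
        critique = model(
            f"Critique the response using this principle: {principle}\n"
            f"Response: {response}"
        )
        # 2. Ask the model to revise in light of that critique.
        response = model(
            f"Rewrite per the critique: {critique}\n{principle}\n"
            f"Response: {response}"
        )
    # The (prompt, final response) pairs become supervised fine-tuning data.
    return response
```

In the full method, an RL phase follows: the model's own principle-based preferences over response pairs replace human preference labels (RLAIF).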

Review of Alignment Plan Critiques – December AI-Plans Critique-a-Thon Results

Iknownothing · Jan 15, 2024, 7:37 PM
24 points
0 comments · 25 min read · LW link
(aiplans.substack.com)

Contextual Constitutional AI

aksh-n · Sep 28, 2024, 11:24 PM
12 points
2 comments · 12 min read · LW link

Making LLMs safer is more intuitive than you think: How Common Sense and Diversity Improve AI Alignment

Jeba Sania · Dec 29, 2024, 7:27 PM
−5 points
1 comment · 6 min read · LW link

Thoughts about what kinds of virtues are relevant in the context of LLMs

Canaletto · Mar 8, 2025, 7:02 PM
1 point
0 comments · 10 min read · LW link

Independent research article analyzing consistent self-reports of experience in ChatGPT and Claude

rife · Jan 6, 2025, 5:34 PM
4 points
20 comments · 1 min read · LW link
(awakenmoon.ai)

Constitutions for ASI?

ukc10014 · Jan 28, 2025, 4:32 PM
3 points
0 comments · 1 min read · LW link
(forum.effectivealtruism.org)

Can Persuasion Break AI Safety? Exploring the Interplay Between Fine-Tuning, Attacks, and Guardrails

Devina Jain · Feb 4, 2025, 7:10 PM
3 points
0 comments · 10 min read · LW link

Constitutional Classifiers: Defending against universal jailbreaks (Anthropic Blog)

Archimedes · Feb 4, 2025, 2:55 AM
16 points
1 comment · 1 min read · LW link
(www.anthropic.com)

Continuous Adversarial Quality Assurance: Extending RLHF and Constitutional AI

Benaya Koren · Jul 8, 2023, 5:32 PM
6 points
0 comments · 9 min read · LW link