JanB

Karma: 732

How to Catch an AI Liar: Lie Detection in Black-Box LLMs by Asking Unrelated Questions

28 Sep 2023 18:53 UTC
183 points
37 comments · 3 min read · LW link

Your posts should be on arXiv

JanB · 25 Aug 2022 10:35 UTC
150 points
44 comments · 3 min read · LW link

I don’t find the lie detection results that surprising (by an author of the paper)

JanB · 4 Oct 2023 17:10 UTC
97 points
8 comments · 3 min read · LW link

The expected value of extinction risk reduction is positive

JanB · 9 Jun 2019 15:49 UTC
22 points
0 comments · 61 min read · LW link

Research request (alignment strategy): Deep dive on “making AI solve alignment for us”

JanB · 1 Dec 2022 14:55 UTC
16 points
3 comments · 1 min read · LW link

Looking for an alignment tutor

JanB · 17 Dec 2022 19:08 UTC
15 points
2 comments · 1 min read · LW link

[LINK] - ChatGPT discussion

JanB · 1 Dec 2022 15:04 UTC
13 points
8 comments · 1 min read · LW link
(openai.com)

Some ideas for follow-up projects to Redwood Research’s recent paper

JanB · 6 Jun 2022 13:29 UTC
10 points
0 comments · 7 min read · LW link

[Question] What is the difference between robustness and inner alignment?

JanB · 15 Feb 2020 13:28 UTC
9 points
2 comments · 1 min read · LW link