It spits out much scarier information than a Google search supplies. Much.
I see a sense in which GPT-4 is completely useless for serious programming in the hands of a non-programmer who wouldn’t be capable/inclined to become a programmer without LLMs, even as it’s somewhat useful for programming (especially with unfamiliar but popular libraries/tools). So the way in which a chatbot helps needs qualification.
One possible measure is how much a chatbot increases the fraction of some demographic that’s capable of some achievement within some amount of time. All these “changes the difficulty by 4x” or “by 1.25x” claims need to mean something specific; otherwise there is a hopeless motte-and-bailey that allows credible reframing of any data as fearmongering. That is, even when it’s only intuitive guesses, the intuitive guesses should be about a particular meaningful thing rather than a level of scariness. Something prediction-marketable.
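To make that concrete, here's a minimal sketch of the kind of quantity I have in mind (all names and numbers below are hypothetical, just to illustrate the shape of the measure, not real data):

```python
# Rough sketch of the proposed measure (all numbers hypothetical).
# "Capable" here means: completed the benchmark task within the time limit.

def capability_fraction(outcomes):
    """Fraction of participants who completed the task within the time limit.

    `outcomes` is a list of booleans, one per participant.
    """
    return sum(outcomes) / len(outcomes)

# Hypothetical trial results for one demographic (e.g. bio undergrads):
without_chatbot = [True, False, False, False, True, False, False, False]
with_chatbot    = [True, True,  False, True,  True, False, False, True]

baseline = capability_fraction(without_chatbot)   # 0.25
assisted = capability_fraction(with_chatbot)      # 0.625

# The "uplift" is then a concrete, checkable quantity rather than a
# vague "4x scarier" claim:
print(f"absolute increase: {assisted - baseline:+.2f}")
print(f"relative increase: {assisted / baseline:.2f}x")
```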
I was trying to say “cost in time/money goes down by that factor for some group”.
Yes, I quite agree. Do you have suggestions for what a credible objective eval might consist of? What sort of test would seem convincing to you, if administered by a neutral party?
Here’s my guess (which is maybe the obvious thing to do).
Take bio undergrads and have them do synthetic biology research projects (ideally ones involving many of the things which seem required for bioweapons), randomized into two groups, where one is allowed to use LLMs (e.g. GPT-4) and one isn’t. The projects should have a reasonable duration (at least a week, ideally more than 4 weeks). Also, for both groups, provide high-level research advice/training on how to use the research tools they are given (in the LLM case, advice about how best to use LLMs).
Then, have experts in the field assess the quality of projects.
For a weaker preliminary version, you could run 2-4 hour sessions in which participants do a quick synth bio lab task with the same approximate setup (though the shortened duration introduces complications).
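And here's a rough sketch of how the two arms might then be compared, assuming experts rate each project on some numeric scale (the ratings, the scale, and the choice of a permutation test are all just illustrative assumptions, not part of the proposal):

```python
# Sketch of how the two arms might be compared once experts have scored
# the projects. Ratings are made up; the analysis (difference in mean
# expert rating, one-sided permutation test) is one reasonable choice,
# not the only one.
import random

# Hypothetical expert ratings of project quality on a 1-10 scale.
llm_arm    = [7.5, 6.0, 8.0, 5.5, 7.0, 6.5]
no_llm_arm = [5.0, 6.5, 4.5, 6.0, 5.5, 5.0]

observed_diff = sum(llm_arm) / len(llm_arm) - sum(no_llm_arm) / len(no_llm_arm)

# Permutation test: how often does a random relabeling of participants
# produce a difference at least as large as the observed one?
pooled = llm_arm + no_llm_arm
n_llm = len(llm_arm)
n_iter = 10_000
count = 0
rng = random.Random(0)
for _ in range(n_iter):
    rng.shuffle(pooled)
    diff = sum(pooled[:n_llm]) / n_llm - sum(pooled[n_llm:]) / (len(pooled) - n_llm)
    if diff >= observed_diff:
        count += 1

print(f"observed uplift in mean rating: {observed_diff:.2f}")
print(f"one-sided permutation p-value: {count / n_iter:.3f}")
```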