In a controlled human study, responses from LIMA are either equivalent or strictly preferred to GPT-4 in 43% of cases;
Iām not sure how well this metric tracks what people care about ā performance on particular downstream tasks (e.g. passing a law exam, writing bugless code, automating alignment research, etc)
Iām not sure how well this metric tracks what people care about ā performance on particular downstream tasks (e.g. passing a law exam, writing bugless code, automating alignment research, etc)