Thank you for the references! I certainly agree that LLMs are very useful in many parts of the stack even if they cannot do the full stack autonomously. I also expect they can do better with better prompting, and probably much better on this task with prompting + agent scaffolding + RAG etc., along the lines of the work you linked in the other comment. My experiments are more asking the question: say you have some dataset, can you simply give a description of it to an LLM and get a good ML model back (possibly after a few iterations)? My experiments suggest this might not be reliable unless the dataset is very simple.
If the LLMs had done very well on my first challenge, that would suggest that someone not very familiar with ML could get an LLM to basically create a good ML model for them, even with a fairly complicated dataset. I guess it is partly a question of how much work you have to put in versus how much work you get out of the LLM, and what the barrier to entry is.
I suspect the picture is a bit more complicated than this: they're very likely not yet good at doing the full stack of SOTA ML research, but they seem to be showing signs of life in many subdomains (including parts of [prosaic] alignment!); see e.g. https://arxiv.org/abs/2402.17453, https://sakana.ai/llm-squared/, https://multimodal-interpretability.csail.mit.edu/maia/. And including ML reviewing (https://arxiv.org/abs/2310.01783) and coming up with research ideas (https://arxiv.org/abs/2404.07738).