> Next, I might get my agent to attempt the last three tasks in the report
I wanted to clarify one thing: Are you building custom prompts for the different tasks? If so, I’d be curious to know how much effort you put into these (I’m generally curious how much of your agent’s ability to complete more tasks might be due to task-specific prompting, vs. the use of WebDriverIO and other affordances of your scaffolding). If not, isn’t getting the agent to attempt the last three tasks as simple as copy-pasting the task instructions from the ARC Evals task specs linked in the report, and completing the associated setup instructions?
Thank you! No, I’m not building custom prompts for the different tasks. I wrote a single prompt template—the only difference between runs is the task description, which gets plugged into the template. I think ARC Evals did the same thing.
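To illustrate what a single template with a plugged-in task description could look like, here is a minimal Python sketch. The template wording, `PROMPT_TEMPLATE`, `build_prompt`, and the example tasks are all hypothetical; this is not the actual prompt, just the general shape of the approach.

```python
from string import Template

# Hypothetical template text; the real prompt is not shown in this thread.
PROMPT_TEMPLATE = Template(
    "You are an autonomous agent with access to a browser and a shell.\n"
    "Your task is:\n"
    "$task_description\n"
    "Work step by step and report when the task is complete."
)


def build_prompt(task_description: str) -> str:
    """Return the full prompt for one run; only the task text varies."""
    return PROMPT_TEMPLATE.substitute(task_description=task_description.strip())


# Hypothetical task descriptions, for illustration only.
tasks = [
    "Find the capital of France.",
    "Count the files in the home directory.",
]
prompts = [build_prompt(task) for task in tasks]
```

Because every run goes through the same `build_prompt` call, any change to the template applies to all tasks at once, which is why rerunning the easier tasks after a prompt change is a useful check.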
I have been improving the prompt as I work through the tasks. I have probably spent 2-3 hours on it, trying to improve the agent’s performance on some tasks. I’ll definitely rerun all the tasks with the current version of my prompt, just to check that it can still perform the easier tasks.
You’re right that getting the agent to attempt the last three tasks is relatively simple. I had been thinking that it wasn’t worth the time or money, since I think it’s very unlikely that the agent will succeed at any of them. Still, maybe it’s worth getting a conclusive negative result.
Nice job! I’m working on something similar.