Subtracting out the “web-based” part as a first class requirement, while focusing on the bridge made of code as a “middle” from which to work “outwards” towards raw inputs and final results...
...I tend to do the first ~20 data entry actions as variable constants in my code that I tweak by hand, then switch to the CSV format for the next 10^2 to 10^5 data entry tasks that my data labelers work on, based on how I think it might work best (while giving them space for positive creativity).
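A minimal sketch of that hand-off, in case it helps make it concrete: the first few labels live in code as constants, and the same record shape is later read from a CSV. The field names `item_id` and `label`, and the helper names `load_labels` and `dump_labels`, are made up for illustration and are not from any particular project.

```python
import csv
import io

# Stage 1: the first ~20 labels live in code as hand-tweaked constants.
# (The field names "item_id" and "label" are hypothetical.)
SEED_LABELS = [
    {"item_id": "a1", "label": "cat"},
    {"item_id": "a2", "label": "dog"},
]

FIELDNAMES = ["item_id", "label"]


def dump_labels(rows, csv_file):
    """Write label records to an open text file object as CSV."""
    writer = csv.DictWriter(csv_file, fieldnames=FIELDNAMES)
    writer.writeheader()
    writer.writerows(rows)


def load_labels(csv_file):
    """Stage 2: read the same record shape back from a CSV file object."""
    return [dict(row) for row in csv.DictReader(csv_file)]


if __name__ == "__main__":
    # Round-trip the seed constants through CSV, using an in-memory buffer.
    buf = io.StringIO()
    dump_labels(SEED_LABELS, buf)
    buf.seek(0)
    print(load_labels(buf))
```

Because the constants and the CSV rows share one shape, the downstream code does not need to change when the labelers take over the data entry.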
A semi-common transitional pattern during the CSV stage involves using cloud spreadsheets, where multiple people can log in, edit together, and watch each other edit (which makes it sorta web-based, and also lets you use data labelers anywhere on the planet). It ends with a copypasta out of the cloud and into a CSV that can be checked into git. Data entry… leads to crashes… which leads to validation code… which leads to automated tooling to correct common human errors <3
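As a rough sketch of what the "validation code" and "automated tooling to correct common human errors" steps can look like: the column names, the allowed label set, and the typo map below are all hypothetical stand-ins, and real projects will have their own rules.

```python
import csv
import io

# Hypothetical label vocabulary and typo map, purely for illustration.
ALLOWED_LABELS = {"cat", "dog"}
COMMON_FIXES = {"cta": "cat", "dgo": "dog"}


def clean_row(row):
    """Strip whitespace, lowercase the label, and apply known typo fixes."""
    label = (row.get("label") or "").strip().lower()
    label = COMMON_FIXES.get(label, label)
    return {"item_id": (row.get("item_id") or "").strip(), "label": label}


def validate(csv_file):
    """Return (good_rows, errors) instead of crashing on the first bad row.

    Errors are (line_number, message) pairs; the header is line 1, and
    the numbering assumes no quoted field spans multiple lines.
    """
    good, errors = [], []
    for line_no, row in enumerate(csv.DictReader(csv_file), start=2):
        fixed = clean_row(row)
        if not fixed["item_id"]:
            errors.append((line_no, "missing item_id"))
        elif fixed["label"] not in ALLOWED_LABELS:
            errors.append((line_no, f"unknown label {fixed['label']!r}"))
        else:
            good.append(fixed)
    return good, errors


if __name__ == "__main__":
    sample = "item_id,label\na1, Cat \na2,cta\n,dog\na4,bird\n"
    good, errors = validate(io.StringIO(sample))
    print(good)
    print(errors)
```

Reporting line numbers back to the labelers, rather than just failing, is what tends to turn each crash into a new entry in the typo map.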
If the label team does more than ~10^4 data entry actions, and the team is still using CSV, then I feel guilty about having failed to upgrade a step in the full pipeline (including the human parts): if a step is being used that much, its desire path is calling out for an infrastructure upgrade. If they get to 10^5 labeling actions with that system and those resources then upper management is confused somehow (maybe headcount maxxing instead of result maxxing?) and fixing that confusion is… complicated.
This CSV growth stage is not perfect, but it is highly re-usable during exploratory sketch work on blue water projects because most of the components can be accomplished with a variety of non-trivial tools.
If you know of something better for these growth stages, I’d love to hear about your workflows. My own standard methods are mostly self-constructed.