Eye tracking could also mean face/expression tracking. I figure there are probably some areas (stage, audience) where it’s important for you to look without issuing commands, and other areas (floor? above audience?) where you won’t gain useful data by looking. It’s those not-helpful-to-look areas where I’m wondering if you could get enough precision to essentially visualize a matrix of buttons, look at the position of the imagined button you want in order to “select” it, blink or do a certain mouth movement to “click” it, etc.
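To make that concrete, here’s a minimal sketch in Python of what the imaginary-button-grid idea could look like. The normalized gaze coordinates, the command region, and the blink flag are all made-up placeholders for whatever your tracker actually reports:

```python
# Minimal sketch of the imaginary-button-grid idea. The normalized gaze
# coordinates and the blink flag are hypothetical inputs -- whatever the
# eye/face tracker actually reports would need adapting to this shape.

GRID_ROWS, GRID_COLS = 3, 4   # 12 imaginary buttons

# The not-helpful-to-look region reserved for commands (say, a strip above
# the audience), in normalized 0..1 field-of-view coordinates. Made up.
REGION = {"left": 0.1, "right": 0.9, "top": 0.0, "bottom": 0.25}

def gaze_to_button(x, y):
    """Map a normalized gaze point to a button index, or None if the gaze
    is outside the command region (stage, audience, etc.)."""
    if not (REGION["left"] <= x <= REGION["right"]
            and REGION["top"] <= y <= REGION["bottom"]):
        return None
    col = int((x - REGION["left"]) / (REGION["right"] - REGION["left"]) * GRID_COLS)
    row = int((y - REGION["top"]) / (REGION["bottom"] - REGION["top"]) * GRID_ROWS)
    return min(row, GRID_ROWS - 1) * GRID_COLS + min(col, GRID_COLS - 1)

def on_tracker_frame(x, y, blinked):
    """Call once per tracker frame; 'blinked' stands in for whatever
    gesture (blink, mouth movement) the tracker reports reliably."""
    button = gaze_to_button(x, y)
    if button is not None and blinked:
        print(f"button {button} clicked")

on_tracker_frame(0.5, 0.1, blinked=True)   # -> button 6 clicked
```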
Your confidence in the quality of your mic updates my hope that audio processing might actually be feasible. The lazy approach I’d take to finding music-ish noises which can be picked out of an audio stream from that mic would be to play some appropriate background noise and then kinda freestyle beatbox into the mic in a way that feels compatible with the music, while recording. I’d then throw that track into whatever signal processing software I was already using to see whether it already had any filters that could garner a level of meaning from the music-compatible mouth-noises. A similar process could be to put on background music and rap music-compatible nonsense syllables to it, and see what speech-to-text can do with the result.
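If it helps, here’s roughly what that first pass could look like in Python, with librosa’s onset detector standing in for “whatever filters the signal processing software already has” (the filename is made up):

```python
# Rough first pass at the "what survives the background music?" test,
# assuming the beatbox-over-music take is saved as a WAV file.
import librosa

y, sr = librosa.load("beatbox_over_music.wav", sr=None)   # hypothetical filename

# Where does the detector think percussive events happen?
onsets = librosa.onset.onset_detect(y=y, sr=sr, units="time")
print(f"{len(onsets)} onsets detected")
print("first few (seconds):", onsets[:10])

# Comparing this list against where you *know* you made each noise tells
# you which mouth-sounds the machine can pick out of the mix and which
# ones vanish into the music.
```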
(As a listener, I’m also selfish in proposing nonsense noises/sounds over English words, because my brain insists on parsing all language in music that I hear. This makes me expect that some portion of your audience would have a worse time listening to you if the music you’re trying to play was mixed with commands that the listeners would be meant to ignore.)
I expect that by brute-forcing the “what can this software hear clearly and easily?” problem in this way, you’ll discover that the systems you’re using do well at discerning certain noises and poorly at discerning others. It’s almost like working with an animal that has great hearing in some ranges we consider normal and poor hearing in others. When my family members who farm with working dogs need to name a puppy, they actually test lists of monosyllabic names in a similar way, to make sure that no current dog will confuse the puppy’s name for its own, before teaching the puppy what its name is.
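A rough sketch of that puppy-name-style test, assuming you’ve recorded each candidate sound a couple of times (the filenames and the distance threshold are invented, and averaged MFCCs are just a crude stand-in for whatever the recognizer actually hears):

```python
# Puppy-name test for candidate command sounds: fingerprint each candidate
# and flag pairs the machine might not tell apart.
import numpy as np
import librosa

candidates = {                      # hypothetical recordings of each sound
    "click": ["click_1.wav", "click_2.wav"],
    "hum":   ["hum_1.wav", "hum_2.wav"],
    "pop":   ["pop_1.wav", "pop_2.wav"],
}

def fingerprint(path):
    """Average MFCC vector as a crude 'what the software hears' summary."""
    y, sr = librosa.load(path, sr=None)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
    return mfcc.mean(axis=1)

avg = {name: np.mean([fingerprint(p) for p in paths], axis=0)
       for name, paths in candidates.items()}

names = list(avg)
for i, a in enumerate(names):
    for b in names[i + 1:]:
        dist = np.linalg.norm(avg[a] - avg[b])
        flag = "  <-- too close, pick a different sound" if dist < 20 else ""
        print(f"{a} vs {b}: distance {dist:.1f}{flag}")
```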
After building your alphabet of easy-to-process sounds, you can map combinations of those sounds to commands in any way that you like, and never have to worry about stumbling across a word that the speech-to-text just can’t handle in the noisy context.
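The mapping layer itself can then be almost trivially simple, something like this (the sound labels and command names are placeholders):

```python
# Map short sequences of recognized sound tokens to commands with a dict.
COMMANDS = {
    ("click", "click"): "start_loop",
    ("click", "hum"):   "stop_loop",
    ("pop",):           "next_patch",
}

def dispatch(tokens):
    """tokens: tuple of sound labels the recognizer just produced."""
    action = COMMANDS.get(tuple(tokens))
    if action is None:
        print(f"no command for {tokens}, ignoring")
    else:
        print(f"trigger: {action}")

dispatch(("click", "hum"))   # -> trigger: stop_loop
```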
The less lazy way, of course, would be to choose your vocabulary of commands and then customize the software until it can handle them. That’s valid and arguably cooler; it just strikes me as a potentially unbounded amount of work.
I’m wondering if you could get enough precision to essentially visualize a matrix of buttons, look at the position of the imagined button you want to “select” it, blink or do a certain mouth movement to “click” it, etc
Maybe! This would definitely be nice if it worked. Probably better for switching the system between modes than triggering sounds in real time, though?
This makes me expect that some portion of your audience would have a worse time listening to you if the music you’re trying to play was mixed with commands that the listeners would be meant to ignore.
When using the mic in this mode I wouldn’t be sending it out to the hall. It wouldn’t be audible offstage.
see what speech-to-text can do with the result
I do think that’s worth doing, though only if I get far enough along to have speech-to-text running at all. Right now I think I probably am just trying to use hardware that isn’t up to the task.
Thanks for explaining!