You guys were using an AI that generated the music fully formed (as PCM), right?
It ticks me off that this is how it works. It’s “good”, but you see the problems:
Poor audio quality [edit: the YouTube version is poor quality, but the “Suno” versions are not. Why??]
You can’t edit the music afterward or re-record the voices
You had to generate 3,000-4,000 tracks to get 15 good ones
Is there some way to convince AI people to make the following?
An AI (or two) that takes a spectral decomposition of PCM music as input (I’m guessing exponentially-spaced wavelets will work better than an FFT) and separates it into instrumental tracks plus voice track(s) that sum back to the original waveform (while also detecting which tracks are the voice tracks). Train it on (i) tracker and MIDI archives, which are inherently pre-separated into different instruments; (ii) AI-generated tracker music with noisy instrument timing (the instrument samples should be high-quality and varied, but the music itself probably doesn’t have to be good for this to work, so a quick & dirty AI could be used to make the training data); and (iii) whatever real-world stem decompositions can be found.
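For concreteness, here’s a rough sketch of the shape that first AI could take, assuming a constant-Q transform stands in for the wavelet decomposition and a mask-based separator trained against stems rendered from MIDI/tracker files; the network, layer sizes, and the four-way source split are placeholders, not a real design:

```python
# Hypothetical sketch: mask-based source separation on an exponentially
# spaced time-frequency representation (a constant-Q transform here,
# standing in for the wavelet decomposition described above).
import numpy as np
import librosa
import torch
import torch.nn as nn

N_SOURCES = 4            # e.g. drums / bass / other / voice (assumed split)
BINS_PER_OCTAVE = 24
N_BINS = 7 * BINS_PER_OCTAVE

def cqt_mag(path: str, sr: int = 22050) -> np.ndarray:
    """Exponentially spaced spectrogram of a PCM file (magnitude only)."""
    y, _ = librosa.load(path, sr=sr, mono=True)
    C = librosa.cqt(y, sr=sr, n_bins=N_BINS, bins_per_octave=BINS_PER_OCTAVE)
    return np.abs(C).astype(np.float32)           # (freq, time)

class SeparatorNet(nn.Module):
    """Predicts one soft mask per source; softmax makes the masks sum to 1,
    so the masked spectrograms sum back to (approximately) the mixture."""
    def __init__(self, n_bins: int = N_BINS, n_sources: int = N_SOURCES):
        super().__init__()
        self.rnn = nn.GRU(n_bins, 256, num_layers=2, batch_first=True)
        self.head = nn.Linear(256, n_bins * n_sources)
        self.n_bins, self.n_sources = n_bins, n_sources

    def forward(self, mix):                       # mix: (batch, time, freq)
        h, _ = self.rnn(mix)
        masks = self.head(h).view(mix.shape[0], mix.shape[1],
                                  self.n_sources, self.n_bins)
        masks = torch.softmax(masks, dim=2)       # sources compete per bin
        return masks * mix.unsqueeze(2)           # per-source spectrograms

# Training loop idea: stems rendered from MIDI/tracker archives give exact
# ground-truth sources; minimize L1 between predicted and true stems.
def loss_fn(pred_stems, true_stems):
    return torch.mean(torch.abs(pred_stems - true_stems))
```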
An AI that takes these instrumental tracks and decomposes each one into (i) a “music sheet” (a series of notes with stylistic information) and (ii) a set of instrument samples, where each sample is a C note (middle C ± one or two octaves; drums exempt). The goal is to minimize the number of instrument samples needed to represent an instrument while still reproducing the input faithfully. If a large number of samples is needed, it’s probably a voice track or a difficult instrument such as guitar, but some voice tracks are repetitive and can still be deduplicated this way, and in any case the decomposition into notes is the important part. [Alternate version of this AI: use a fixed set of instrument samples, so the AI’s job is not to decompose but to select samples, making it more like speech-to-text than a decomposition tool. That approach can’t handle voice tracks, though.]
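A toy version of this second idea, for the monophonic case only: transcribe a stem into (onset, pitch) events and try to re-render it tracker-style by pitch-shifting a single C4 sample. The librosa-based transcription and the residual-error test below are stand-ins for illustration, not a real decomposition algorithm:

```python
# Hypothetical sketch of the "notes + minimal sample set" idea: transcribe a
# monophonic stem into (onset, pitch) events, then try to re-render it by
# pitch-shifting one C4 sample.  A high residual would suggest the stem
# needs more samples (voice, guitar, etc.).
import numpy as np
import librosa

def transcribe(stem: np.ndarray, sr: int):
    """Very rough 'music sheet': one (onset_time, midi_pitch) pair per onset."""
    onsets = librosa.onset.onset_detect(y=stem, sr=sr, units='time')
    f0, _, _ = librosa.pyin(stem, sr=sr,
                            fmin=librosa.note_to_hz('C2'),
                            fmax=librosa.note_to_hz('C7'))
    times = librosa.times_like(f0, sr=sr)
    events = []
    for t in onsets:
        idx = np.argmin(np.abs(times - t))
        if not np.isnan(f0[idx]):
            events.append((t, int(round(librosa.hz_to_midi(f0[idx])))))
    return events

def render(events, c4_sample: np.ndarray, sr: int, length: int) -> np.ndarray:
    """Re-synthesize the stem from a single C4 sample by pitch-shifting it."""
    out = np.zeros(length, dtype=np.float32)
    for t, midi in events:
        shifted = librosa.effects.pitch_shift(c4_sample, sr=sr,
                                              n_steps=midi - 60)  # 60 = C4
        start = int(t * sr)
        end = min(start + len(shifted), length)
        out[start:end] += shifted[:end - start]
    return out

def residual(stem, events, c4_sample, sr):
    """Reconstruction error: a proxy for 'does one sample suffice?'."""
    approx = render(events, c4_sample, sr, len(stem))
    return float(np.mean((stem - approx) ** 2))
```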
Use the MIDI and tracker libraries, together with the output of running the first two AIs over a music library, to train a third AI whose job is to generate tracker music plus a voice track (I haven’t thought through how the lyrics should drive the generation process). Train it on the world’s top 30,000 songs or whatever.
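On the data side of that third AI, flattening MIDI/tracker files into event tokens for an ordinary autoregressive sequence model might look something like this (the token scheme, the 50 ms grid, and the use of pretty_midi are assumptions for illustration, not a fixed format):

```python
# Hypothetical sketch: flatten MIDI/tracker files into event tokens
# (instrument, pitch, coarse timing) so a standard sequence model can be
# trained to emit new tracker-style scores.
import pretty_midi

STEP = 0.05   # 50 ms time grid (assumed quantization)

def tokenize(path: str) -> list[str]:
    midi = pretty_midi.PrettyMIDI(path)
    events = []
    for inst in midi.instruments:
        tag = 'drum' if inst.is_drum else f'prog{inst.program}'
        for note in inst.notes:
            events.append((note.start, f'{tag}_note{note.pitch}'))
    events.sort(key=lambda e: e[0])
    tokens, clock = [], 0.0
    for t, tok in events:
        steps = int(round((t - clock) / STEP))
        if steps > 0:
            tokens.append(f'shift{steps}')   # advance the clock
            clock += steps * STEP
        tokens.append(tok)
    return tokens

# These token sequences (plus, eventually, some lyric conditioning) would be
# the training corpus; the model's output converts straight back into a
# tracker module, which is what makes the result editable in post.
```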
And voilà, the generated music is now editable “in post” and has better sound quality. I also conjecture that, if high-quality training data can be found, this AI can either (i) generate better music on average than whatever was used for “I Have Been a Good Bing” or (ii) require less compute, because the task it performs is simpler. And while the third AI is the goal, the first two AIs are highly useful in their own right and would be much appreciated by artists.
When I was working on my AI music project (melodies.ai) a couple of years ago, I ended up focusing on catchy melodies for this reason. Even back then, singing-voice synthesis software was already quite good, so I didn’t see the need to do everything end-to-end. This approach is much more flexible for professional musicians, and I still think it’s the better idea overall. We can describe images with text far more easily than we can describe music, and yet for professional use, AI-generated images still require fine-scale editing.