idea for music recognition, conversion and composition using artificial neural networks
I had this idea while walking the kids to school. Starting from a simple network that can classify music styles as rock/metal/classical/folk/etc, I think that it would be possible to adapt the same algorithm to convert a music file from one style to another, and even write music from scratch in whatever style you want. And if I’m right, I think it would be very simple to write.
This is the simplest task. To recognise the style of a music file, all you need is a feed-forward network with a few thousand inputs, at least one hidden layer, and one output for each style you want to recognise.
A standard data rate for recorded music is 160kbps. That means that every second, there are 10,240 separate wave heights (160*1024/16) that need to be examined. Of course, you can recognise music using lower bps values, but let’s use the same setting for the whole process (160 will be wanted for later parts).
So, the input layer would need 10,240*n inputs, where n is the number of seconds you want the network to sample in order to determine the style. In some cases (metal/classical), you may get away with sampling just a single second, but for better results, you might want a larger value. I’ll be setting n to 300, so it samples the entire song in most cases. This makes it easier to be accurate about the result, but will also be useful in a later stage.
The output layer needs to have one node per tag you want to measure. For example, you might have an output that measures how “rock” a song is, and another that measures how “baroque” it is. You could use output nodes that return a simple Yes/No result, but there is a good reason to return a more linear certainty instead (which we’ll get to).
The hidden network needs at least one neuron, obviously, but I don’t think there is any way to say exactly how many it needs, so it would be better to use a network model which grows automatically as it learns (I don’t know the technical term – I just build the things!).
After building the network, you need to train it. This is the easiest part – you just need a large database of music, and tags for every one of those tunes.
One handy idea: if you’re training a 5 second network (for example), then a 3 minute song has at least 36 completely separate training sets for you to sample – all you need to do is start linking to the inputs at second 0, 1, 2, .5, etc, and the network will see what it thinks (initially) is a completely different data set.
After training this for a while, you should be able to run a few seconds of a song through the network and have fairly accurate results of how “funk” or “jazz” a song is.
After figuring out the above, I started thinking of alternative uses for the idea, and one surprising idea took hold.
Let’s say that you have a “folk” song played on guitar and violin. How would you go about making it “metal”? You could start by fuzzing the violin and distorting the guitar, and maybe adding some drums in.
I think it would be possible to write a program which lets you convert a song from one style to another literally at the click of a button.
Remember I mentioned that the output neurons should say how metal/classical/etc a song is, not just that it is or is not.
If the network is written with enough precision, then adjusting one or more of the input values should give a different value in the outputs.
As an example, let’s say you have a folk tune that you want to convert to neo-punk. Adjusting the inputs such that the sounds are more distorted (clipping high values, for example), or faster (shifting later inputs to the left, maybe) might change the tune’s “neo-punk” output from 0.00024 to 0.00025.
If you repeat this over and over (automatically, obviously), discarding changes that reduce the output and repeating changes that increase the output, until the “neo-punk” output reaches an acceptable threshold such as .9, then you have just created an automatic way to convert a tune from one style to another.
I think this has a lot of applications. For example, let’s say you want to convert a piano tune to guitar? You train your network to recognise what piano and guitar tunes sound like, and then simply convert as above!
This may be the simplest of the lot.
After creating the above programs, try inputting a sound sample of pure static into the conversion program, and tell it to convert the static to piano. I think it would come up with some interesting tunes. Maybe not completely accurate tunes, but they would be interesting.
I think the network would automatically learn rules about harmony and rhythm, but don’t think it would learn about structure. For example, you could train a network to recognise a 3/4 rhythm, but I don’t know if you could write something that recognises a fugue.