How To Deep Learn Music: Two Suggestions

3 June, 2020

Two suggestions for how to deep learn music better:

1. Solve an easier problem, with more limited scope: select a single backing track, and create a dataset of improvisations against that backing track.

2. Start with a hypothesis about what music is, or might be, and use that to drive the deep learning strategy.

by Philip Dorrell

Deep Learning Music: Current Status

Deep Learning is the new-ish hot thing in machine learning.

And there have been attempts to apply deep learning to the problem of composing music.

I could review some of the more well-known recent attempts to do this, and I could give my opinion on how good the results are.

You, the reader, could read my review, listen to the same compositions, and form your own opinion on how good the compositions are, and on whether or not you agree with me.

But actually, if deep learning had already "solved" the problem of making music, then I think we would already know about it.

I haven't yet heard of deep learning being used to create commercial-quality music.

So I'm going to say that the problem of deep learning music has not yet been solved.

Why is it so hard?

Deep learning has been spectacularly successful at solving certain types of learning problem.

Many of these are problems that humans can solve without too much difficulty, but that computers had not been able to solve prior to deep learning.

Like looking at photos, and deciding which animals are cats and which animals are dogs.

In some cases where deep learning succeeds, it still requires much more learning data than a human would require.

Like millions and millions of photos of cats and dogs.

In the case of music, we should note that many people are exposed to thousands of musical items over their lifetime, and this does not result in those people learning how to compose good quality original music.

Even people who learn how to play music do not necessarily learn how to compose music as good as the music that they know how to play.

If the problem is hard for people to solve, even with a large amount of available data, it's going to be even harder for deep learning to solve.

It might be that deep learning can only succeed in learning how to compose music if we have available a learning set that is far larger than all the music that exists so far in the world.

Also, we should note that every deep learning solution involves some kind of choice of neural network structure, and this structure will typically include some type of assumption about the structure of the problem being solved and/or the structure of a successful solution.

For example, convolutional neural networks contain built-in assumptions about the translation invariance of object recognition.
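To make that built-in assumption concrete, here is a minimal sketch in plain Python (the signal, the kernel and the helper names are all made up for illustration). It shows that sliding the same small filter across an input gives a response that simply moves along when the input moves, and that taking the maximum over positions gives the same answer wherever the pattern sits.

```python
def conv1d(signal, kernel):
    """Slide one kernel across the signal (valid-mode cross-correlation)."""
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(len(signal) - k + 1)]

def shift_right(values, n):
    """Shift a list right by n places, padding with zeros on the left."""
    return [0] * n + values[:len(values) - n]

# A small "bump" near the start of an otherwise empty signal (illustrative data).
signal = [0, 0, 1, 3, 1, 0, 0, 0, 0, 0]
kernel = [1, 2, 1]

response = conv1d(signal, kernel)                          # [1, 5, 8, 5, 1, 0, 0, 0]
shifted_response = conv1d(shift_right(signal, 3), kernel)  # [0, 0, 0, 1, 5, 8, 5, 1]

# Moving the input by 3 moves the response by 3 and changes nothing else.
print(shifted_response == shift_right(response, 3))  # True
# Max-pooling over all positions then gives a position-independent answer.
print(max(response), max(shifted_response))          # 8 8
```

The network never has to learn separately that a cat in the top-left corner and a cat in the bottom-right corner are both cats, because that knowledge is part of the structure. For music we do not know what the equivalent structural assumption should be.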

In the case of music, we don't have any a priori understanding of what music actually is, so any guess we make about the structure of the problem or a possible solution could be way off the mark.

Having identified these difficulties, I will now proceed to give a couple of suggestions that might overcome them.

Suggestion 1: Limit scope by choosing a fixed backing track

There are musicians who can endlessly improvise over a fixed repetitive backing track.

Also, there are certain chord sequences that are fairly easy to improvise against, like Am F C G, or a 12-bar blues.

So, a good way to provide a constrained dataset for deep learning would be to record the melodies improvised by one or more musicians against a single endlessly repetitive backing track.

The advantage of such a dataset is that now you are trying to solve a more limited problem. Instead of solving the problem of how to construct "music", you are only trying to solve the problem of how to construct melody over the top of one specific Am F C G backing track.

By deliberately limiting the scope of the learning problem, you have made it an easier problem to solve, and you might succeed in actually solving it.
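As a rough sketch of what such a dataset might look like in practice, the Python below assumes that each improvised melody has already been transcribed into notes with an onset, a duration and a MIDI pitch, and that the backing track is a four-bar Am F C G loop in 4/4 with one chord per bar. The `Note` class, the constants and the function names are all hypothetical; the point is only that every note can be described by where it falls within the loop, because the backing track never changes.

```python
from dataclasses import dataclass

# Assumptions for this sketch: 4/4 time, one chord per bar, loop repeats forever.
LOOP_CHORDS = ["Am", "F", "C", "G"]
BEATS_PER_BAR = 4
LOOP_BEATS = BEATS_PER_BAR * len(LOOP_CHORDS)

@dataclass(frozen=True)
class Note:
    onset_beat: float      # beats since the start of the recording
    duration_beats: float
    midi_pitch: int        # e.g. 69 = A above middle C

def loop_context(note):
    """Return (position within the loop in beats, chord sounding at the onset)."""
    position = note.onset_beat % LOOP_BEATS
    chord = LOOP_CHORDS[int(position // BEATS_PER_BAR)]
    return position, chord

def to_training_rows(notes):
    """Turn a transcribed improvisation into (position, chord, duration, pitch) rows."""
    return [(*loop_context(n), n.duration_beats, n.midi_pitch) for n in notes]

# A few made-up notes from one hypothetical improvisation.
improvisation = [
    Note(0.0, 1.0, 69),
    Note(5.5, 0.5, 72),
    Note(18.0, 2.0, 64),   # second pass through the loop, back on Am
    Note(44.0, 1.0, 67),   # third pass through the loop, on G
]

for row in to_training_rows(improvisation):
    print(row)
```

Because the harmonic context at any loop position is identical across every recording, the learner does not have to infer the harmony at all. It only has to learn which melodic choices work at which point in the loop.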

Suggestion 2: Test the Non-Spontaneity Superstimulus Hypothesis

My first suggestion did not depend on any assumptions about what music actually is. My second suggestion does.

In my article "Music Is A Superstimulus For The Perception Of Non-Spontaneous Non-Conversational Speech", I proposed the hypothesis that music is a superstimulus for the perception that the speech of a speaker is not spontaneous conversational speech.

We can test this hypothesis by breaking up the deep learning problem into two stages:

1. Deep learn the difference between spontaneous and non-spontaneous (i.e. rehearsed or planned) speech.

2. Apply the result of step 1 to the problem of deep learning music.

Deep learning Spontaneous vs Non-Spontaneous Speech

To deep learn the difference between spontaneous and non-spontaneous speech, you need a suitable dataset, and there are two ways I can think of to create such a dataset:

1. Contrive circumstances where people spontaneously converse, and record their conversations. Then contrive circumstances where people say things that are rehearsed and/or pre-planned, and record that speech.

2. Or, find an existing dataset of speech samples, and task a human army to tag which samples sound spontaneous, and which samples sound non-spontaneous.

I'm guessing that option 2 is probably the cheaper option. Also, in the case of option 1, the moment you tell people to do something, it's not spontaneous anymore (that old joke about "Everyone be spontaneous!").
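If you go with option 2, different taggers will sometimes disagree about the same sample, so the raw tags need to be combined somehow. The sketch below shows one simple way to do that, a majority vote with a minimum level of agreement. The function name, the clip identifiers, the label strings and the 0.7 threshold are all assumptions made for illustration, not part of any existing tool or dataset.

```python
from collections import Counter

def aggregate_tags(votes_by_sample, min_agreement=0.7):
    """Keep a sample only if enough taggers agree on its label.

    votes_by_sample maps a sample id to the list of labels given by taggers.
    Samples with no votes, or without a clear majority, are left out.
    """
    labels = {}
    for sample_id, votes in votes_by_sample.items():
        if not votes:
            continue
        label, count = Counter(votes).most_common(1)[0]
        if count / len(votes) >= min_agreement:
            labels[sample_id] = label
    return labels

# Made-up votes from a hypothetical tagging exercise.
votes = {
    "clip_001": ["spontaneous", "spontaneous", "spontaneous"],
    "clip_002": ["non-spontaneous", "non-spontaneous", "spontaneous"],
    "clip_003": ["non-spontaneous"] * 4 + ["spontaneous"],
}

print(aggregate_tags(votes))
# {'clip_001': 'spontaneous', 'clip_003': 'non-spontaneous'}
```

Here clip_002 is dropped because only two out of three taggers agree, which is below the threshold. Throwing away the ambiguous samples costs some data, but it means the network is trained on cases where humans themselves clearly perceive the distinction.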

Apply the Spontaneous/Non-Spontaneous Deep Learning to Music

There are two different ways that you could apply the result of deep learning the spontaneous/non-spontaneous distinction to deep learning music:

1. Feed music as input to the network trained for the spontaneous/non-spontaneous distinction, and observe the output, i.e. does the network think that music is extremely non-spontaneous?

2. Alternatively, train a new network to produce audio that maximises the 'non-spontaneity' output from the first trained network, and then observe whether the result of that training produces audio that sounds musical.
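The two approaches above can be shown in miniature. In the sketch below, a tiny logistic function stands in for the trained spontaneous/non-spontaneous network, and short lists of numbers stand in for audio features. The weights, the feature vectors and the function names are all invented for illustration. Note also that the second part simplifies things: instead of training a second network to generate audio, it adjusts the input directly by gradient ascent, which is the simplest version of "maximise the output of the first network".

```python
import math

def non_spontaneity(features, weights, bias):
    """Toy stand-in for the trained network: returns a score between 0 and 1."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return 1.0 / (1.0 + math.exp(-z))

def ascend(features, weights, bias, step=0.5, iterations=50, limit=1.0):
    """Nudge the input uphill on the score, keeping each feature in [-limit, limit]."""
    x = list(features)
    for _ in range(iterations):
        s = non_spontaneity(x, weights, bias)
        # Gradient of the logistic score with respect to each input feature.
        grad = [s * (1.0 - s) * w for w in weights]
        x = [max(-limit, min(limit, xi + step * g)) for xi, g in zip(x, grad)]
    return x

# Hypothetical "trained" parameters and hypothetical feature vectors.
weights, bias = [2.0, -1.0, 0.5], -0.5
speech_clip = [0.1, 0.4, 0.0]
music_clip = [0.8, -0.3, 0.6]

# Approach 1: feed music to the trained scorer and compare it with speech.
print("speech:", non_spontaneity(speech_clip, weights, bias))
print("music: ", non_spontaneity(music_clip, weights, bias))

# Approach 2 (simplified): start from a neutral input and maximise the score.
start = [0.0, 0.0, 0.0]
optimised = ascend(start, weights, bias)
print("before:", non_spontaneity(start, weights, bias))
print("after: ", non_spontaneity(optimised, weights, bias))
```

With a real network the interesting question is the one the toy cannot answer: whether the optimised input, turned back into audio, sounds anything like music.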

This hypothesis-based approach could be generalised to any other superstimulus hypothesis that you might have about music.

For example, if you think that music is a superstimulus for X vs Y, then train a network to distinguish X vs Y, and apply one of the two approaches above: either observe the results of applying that trained network to music, or train a second network to maximise the output of the first network.