What do we hope to gain from the application of Machine Learning to music?
The idea behind Machine Learning is that we can write computer software which is programmed to learn how to solve some problem.
Instead of writing an algorithm to solve a problem, we write an algorithm which somehow searches for an algorithm that will solve the problem. Usually the search involves the use of large datasets relevant to the problem we want to solve.
In the case of music, the problems we might want to solve are:
- Given an item of music, is it good music, or not? (ie how strong is it?)
- Is there a formula which can be used to compose new original items of music which people want to listen to?
There is an obvious commercial motive to solve problem No 2: write an algorithm that learns to compose music, use the algorithm to compose number one hits, collect royalties, retire somewhere nice.
People like to complain about "formulaic music", but I would suggest that no such formula has yet been discovered. If it had been, then whoever found it would be using it to compose endless amounts of new music as good as or better than existing "human-composed" music, and they would be totally dominating the music industry. As far as I can tell, this has not yet happened.
From a scientific point of view, solving either of these two problems by machine learning will help us solve the scientific problem of music, if we can extract a comprehensible "understanding" of the problem solution that the machine learning has generated.
The extraction problem is a general problem in all forms of machine learning. For example, if an algorithm can learn to distinguish cats from dogs, can the algorithm "explain" to us humans how it does it?
(So actually someone might solve the problem of music perception and/or music composition using machine learning, but the answer would exist only as a set of weights in a neural network, and it could require a non-trivial amount of analysis to determine what those weights actually "mean".)
Has Machine Learning already "solved" music?
Every now and then there are announcements of new advances in applying machine learning to the generation of music, and these announcements are accompanied by examples of compositions, and some of these are quite "interesting".
However, at the time of writing, the business of composing music is still heavily dependent on the efforts of human composers – as I just explained above, the "formula" for composing music has not yet been discovered.
We can conclude that music is not like chess, or Go, where the computers have already learned to beat humans at their own game.
A Priori Assumptions about Solutions to Machine Learning Problems
In one sense Machine Learning promises us "magic".
Supposedly, we don't have to write a program to solve a problem. Instead we just find a large relevant dataset and throw the data at the Machine Learning algorithm, and, somehow, it magically converts the dataset into a program for solving the problem relating to the dataset.
In practice, although the Machine Learning algorithm writes the final problem-solving algorithm, someone still has to write the Machine Learning algorithm.
And it turns out that different Machine Learning algorithms suit different problems.
Most modern machine learning research uses neural networks. A neural network defines an algorithm in terms of a set of numerical weights that specify the function converting input data to output data, and the learning algorithm uses standard back-propagation (gradient descent on the weights) to search for a "good" set of weights that provides a "good" solution to the problem.
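As a minimal sketch of what that means in practice, here is a tiny network (plain Python and numpy, with a toy problem chosen purely for illustration) whose whole "algorithm" is just two weight matrices, trained by back-propagation:

```python
# A tiny two-layer network: the "program" is nothing but the numbers in
# W1 and W2, and training searches for good values of those numbers.
import numpy as np

rng = np.random.default_rng(0)

# Toy problem: learn XOR, which no single linear layer can represent.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

W1 = rng.normal(size=(2, 8))
W2 = rng.normal(size=(8, 1))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for step in range(5000):
    # Forward pass: the weights define the input-to-output function.
    h = np.tanh(X @ W1)
    out = sigmoid(h @ W2)

    # Backward pass (back-propagation): gradients of the squared error.
    d_out = (out - y) * out * (1 - out)
    d_W2 = h.T @ d_out
    d_h = (d_out @ W2.T) * (1 - h ** 2)
    d_W1 = X.T @ d_h

    # Gradient descent: nudge the weights towards a "better" program.
    W1 -= 0.1 * d_W1
    W2 -= 0.1 * d_W2

print(np.round(out, 2))  # should end up close to [[0], [1], [1], [0]]
```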
With neural networks, the problem of choosing the best machine learning algorithm for a particular problem is reduced to the problem of choosing the best neural network structure for that problem.
Typically the chosen structure may contain some implicit assumptions about the likely nature of the solution.
For example, convolutional neural networks contain an implicit assumption that the solution to a visual recognition problem will be invariant under geometric translation, ie if I move the kitten higher up in the picture, it still has to look like a kitten. CNNs also contain an implicit assumption that the relationships between immediately neighbouring pixels in an image matter more than the relationships between more distant pixels, at least in the initial stages of analysis.
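Here is a small illustrative sketch of the translation assumption, in pure numpy with a toy "image" (strictly speaking a convolution layer is translation-equivariant – a shifted input produces a correspondingly shifted response – and later pooling layers supply the invariance):

```python
# Shifting the "kitten" shifts the filter's response, nothing else:
# the same small filter is applied at every position in the image.
import numpy as np

def conv2d(image, kernel):
    """Valid-mode 2D convolution (really cross-correlation, as in CNNs)."""
    kh, kw = kernel.shape
    ih, iw = image.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

kitten = np.zeros((8, 8))
kitten[1:4, 1:4] = 1.0                       # a 3x3 "kitten" near the top

difference_filter = np.array([[1.0, -1.0]])  # a tiny edge-like filter

shifted = np.roll(kitten, 3, axis=0)         # move the kitten lower down

a = conv2d(kitten, difference_filter)
b = conv2d(shifted, difference_filter)

# The response to the shifted kitten is a shifted copy of the original
# response: the filter "sees" the kitten wherever it happens to be.
assert np.allclose(np.roll(a, 3, axis=0), b)
```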
Another example of how neural networks contain implicit assumptions about solution structure is the Long Short-Term Memory (LSTM) network as applied to language comprehension. LSTM networks have a structure reflecting the structural relationships that we would expect to exist between words and groups of words in an extended body of text – some aspects of the meanings of words relate to other words nearby, and other aspects of meaning relate to words or groups of words further away.
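And a sketch of the corresponding structural assumption in an LSTM, assuming PyTorch is available (all the sizes are arbitrary toy values): the network reads its input one step at a time, carrying a state forward, so each step's output can depend on inputs arbitrarily far back:

```python
# An LSTM processes a sequence step by step, threading a state (h, c)
# through, so the output at each step is computed from the current
# input plus a memory of everything that came before it.
import torch
import torch.nn as nn

torch.manual_seed(0)

lstm = nn.LSTM(input_size=16, hidden_size=32, batch_first=True)

# A "sentence" of 10 word embeddings (batch of 1, 10 steps, 16 dims).
sentence = torch.randn(1, 10, 16)

outputs, (h, c) = lstm(sentence)   # outputs: shape (1, 10, 32)

# Processing word by word with the carried state gives the same result,
# which is exactly the structural assumption described above.
state = None
stepwise = []
for t in range(10):
    out_t, state = lstm(sentence[:, t:t + 1, :], state)
    stepwise.append(out_t)

assert torch.allclose(outputs, torch.cat(stepwise, dim=1), atol=1e-6)
```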
If the best solution to a problem requires connections between certain items of information (either inputs, or derived values computed from the inputs), and those connections don't exist in the neural network, then the learning algorithm is going to struggle to find a good solution using that network.
Assumptions about Music
With music, we don't really know what music is, at all. So in looking for solutions to problems of music perception and composition, we actually have no idea what type of structure the solution is going to have.
We can make intelligent guesses based on the apparent structure that musical items have.
But the fact that neither music perception nor composition has yet been "solved" by neural networks suggests that something vital may be missing from all the guesses that have been made so far in the field of musical machine learning.
For example, music might be defined in some manner by its relationship to speech, and by how it affects neural networks in the brain which form part of the mechanics of speech perception.
So, for example, solving music with machine learning might require two steps (there is a code sketch of this idea after the summary below):
- Create and train a neural network which solves certain problems of speech perception, using a suitable speech dataset.
- Fix the weights of that first network. Then train a second network, which takes as input the musical dataset combined with the outputs that the first network produces when the musical dataset is fed into it.
This approach could be summarized as:
- Step 1: learn to perceive speech.
- Step 2: learn to perceive music, taking into account the perception of the music, as if it were speech, as learned in step 1.
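Here is a rough sketch, in PyTorch, of what those two steps might look like. Everything in it is hypothetical: the layer shapes, the feature sizes, and the assumption that a network trained on speech already exists:

```python
# A sketch of the two-step idea. The shapes and names are all
# hypothetical; `speech_net` stands in for a network already trained
# on a speech-perception task (step 1).
import torch
import torch.nn as nn

N_FEATURES = 64      # assumed size of an audio feature frame
N_SPEECH_OUT = 32    # assumed size of the speech network's output

# Step 1 (already done, by assumption): a network trained on speech.
speech_net = nn.Sequential(
    nn.Linear(N_FEATURES, 128), nn.ReLU(), nn.Linear(128, N_SPEECH_OUT)
)

# Fix the weights of the first network: no further training.
for p in speech_net.parameters():
    p.requires_grad = False

# Step 2: a second network whose input is the raw musical features
# concatenated with the frozen speech network's response to them.
music_net = nn.Sequential(
    nn.Linear(N_FEATURES + N_SPEECH_OUT, 128), nn.ReLU(), nn.Linear(128, 1)
)

def music_model(music_frames):
    """music_frames: a (batch, N_FEATURES) tensor of audio features."""
    heard_as_speech = speech_net(music_frames)       # frozen "speech" view
    combined = torch.cat([music_frames, heard_as_speech], dim=1)
    return music_net(combined)

# Only the second network's weights would be trained on the music data:
optimizer = torch.optim.Adam(music_net.parameters(), lr=1e-3)
```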
A Way to Generate Better Musical Datasets
Anyone trying to do machine learning for music is faced with the problem of choosing a musical dataset.
One approach is to turn on the radio, and then tell the neural network to learn, somehow, from the constant stream of musical sound.
This approach seems fairly logical:
- We want the machine learning algorithm to understand "music" as a whole
- Therefore we should feed the algorithm all of the music.
However this approach may be too ambitious.
For a start, humans have not solved the problem of what music is, as a whole.
So, perhaps, we should pick a simpler musical problem that humans can learn, and see if the machine can learn that.
Improvisation against Chord Loops
Composing music from scratch is not a straightforward thing to do.
However, I have found that it is possible to learn to improvise melody against a fixed accompaniment.
Such improvisation is a limited form of composition, because you are not playing exactly the same melody each time.
Also, you will learn to play past "mistakes", so that most of the time there is no such thing as a "mistake", just different variations on the same tune.
One prerequisite for such improvisation is that you, as a musician, need to be very fluent, in the sense of being able to play scales and variations of scales quickly and easily on your instrument.
On traditional instruments, it can take years and years of practice to reach the required level of fluency.
But I have discovered an easier alternative, which is the iOS ThumbJam app.
This app makes it very easy to improvise, especially if you choose instrument sounds in the app which work best with the screen interface – specifically, the Flute and Violin instrument sounds.
You can also get good results using the Electric Guitar instrument sound, if you program variable filter effects and combine them with external filter apps. (I use ToneStack to provide a "fuzz" effect, and Aum to connect ThumbJam and ToneStack together.)
There are also some suitable chord-loop apps available on iOS, so that you can do everything on the same iPhone. I use:
- Chordbot, and
- the SessionBand family of apps
Although all these apps can run happily together on an iPhone 6 or later, you will find that the sound coming out of an iPhone speaker is too "tinny", and it will be too difficult to coordinate your improvisation of the melody against the backing track. To properly hear what you are doing, you will need headphones or a small cable-connected speaker (cable-connected, because Bluetooth has too much latency).
A few examples are the improvisations I have made over the chord loops D A Bm F#m G D G A (from Pachelbel's Canon in D), Dm B♭ F C, and Am Em Dm Am.
Summary of Plan
So, to generate a suitable constrained musical dataset, do the following:
- Determine a chord loop to improvise against.
- Determine an instrument setting in ThumbJam.
- Practice for at least a few months.
- Set up suitable data capture. I have not yet found the most efficient way of doing this, but options would include:
- record the combined sound of the improvised melody and the accompaniment, or,
- separately record just the sound of the improvised melody, or,
- directly record all the performance details of the melody from ThumbJam (eg as MIDI data, although I'm not sure if all the performance inputs enabled by ThumbJam can be externally recorded).
- Feed the recorded data into whatever machine learning algorithm you want to apply (a sketch of reading a MIDI capture into a simple event sequence follows this list).
- If you get tired of generating hours and hours of the same improvisation, create a larger dataset by getting other musicians to improvise using the same instrument setting and the same chord loop.
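As a sketch of what the data-capture and feeding steps might involve, assuming the melody was captured as a standard MIDI file (the mido library and the file name below are my own illustrative choices, not something ThumbJam mandates):

```python
# Convert a captured MIDI file into a plain event sequence suitable as
# input to a sequence-learning model. Requires: pip install mido
import mido

def midi_to_events(path):
    """Convert a MIDI file into (pitch, start_seconds, duration) tuples."""
    mid = mido.MidiFile(path)
    events, active, now = [], {}, 0.0
    for msg in mid:          # iterating merges tracks; msg.time is in seconds
        now += msg.time
        if msg.type == 'note_on' and msg.velocity > 0:
            active[msg.note] = now                   # note starts sounding
        elif msg.type in ('note_off', 'note_on'):    # note_on vel 0 == off
            if msg.note in active:
                start = active.pop(msg.note)
                events.append((msg.note, start, now - start))
    return events

# Hypothetical capture file; the resulting list is a simple sequence
# that any sequence-learning model could be trained on.
melody = midi_to_events('improvisation-take1.mid')
```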
A Crowd-Sourced Musical Improvisation Dataset Plan
Expanding on the last item in the previous list, it may be that a very large dataset will be required – too large for just one musician to create.
A bigger, more ambitious version of the previous plan is:
- Promote the plan so that thousands of musicians might be interested in taking part. Actually, ThumbJam is so easy to use, no one really has to be a "musician" a priori.
- Choose, by some means, a chord loop to improvise against. Possibly start with a few choices, and find out which gives the best results for the most musicians.
- Determine an instrument setting in ThumbJam. (An alternative version of the plan might not require that everyone use the same instrument setting, but for the moment I am going to run with the assumption that success depends on everyone being constrained to use exactly the same setup.)
- Determine how everyone should do the data capture.
- Everyone practice for at least a few weeks.
- Everyone improvise, and save at least a few minutes of music every day.
- Save all the data into a common location, and make it available as a public machine learning dataset (one possible record format is sketched below).
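To make the "common location" idea slightly more concrete, here is one possible record format for contributions, so that every recording carries the constraints it was made under; every field name and value is just a hypothetical suggestion:

```python
# One hypothetical manifest record per contributed improvisation,
# appended to a shared JSON-lines file alongside the audio/MIDI files.
import json

record = {
    "musician_id": "anon-0042",
    "date": "2023-06-01",
    "chord_loop": ["D", "A", "Bm", "F#m", "G", "D", "G", "A"],
    "tempo_bpm": 100,
    "instrument": "ThumbJam Violin",
    "capture": "midi",                     # or "audio-melody", "audio-mix"
    "file": "anon-0042/2023-06-01-take1.mid",
}

with open("manifest.jsonl", "a") as f:     # one JSON record per line
    f.write(json.dumps(record) + "\n")
```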
Once enough data is available, the machine learning experts can get to work.