The Ancestor Of Music Raised Questions That Needed Answering

4 January, 2023

Music-based language came first, so that the speaker could express an emotion. But what was the emotion about? To help answer this question, words evolved, as a supplement to the music-based language. In the end the words became more powerful, and the musical component of language ceased to be useful. Nowadays we only know word-based language, and music has evolved to become something else (or maybe nothing at all).

by Philip Dorrell

In the Very Beginning

The very earliest ancestor of modern human music was a language of emotion that an individual used to express an emotion that they felt as a result of something.

Those listening to this proto-musical language would not necessarily know what the "something" was that gave rise to the emotion being communicated.

The listeners might figure it out, or they might not.

The listeners might not actually care what was causing the emotion being communicated, because it wasn't their emotion – it was the emotion of the speaker.

We can't know what this ancestral form of music actually sounded like. It may have had some of the features of modern human music, but not all of them.

This earliest form of music probably existed more than 2 million years ago (there is a reason for that date estimate, which I will get to in the next section).

What Happened Next

The thing that happened next is that our ancestors experienced a sharp increase in the level of in-group cooperation, as a result of a change in life-style.

A very strong candidate for such a change in life-style is Confrontational Scavenging.

Derek Bickerton's paper, which presents the concept of Confrontational Scavenging, refers to fossil evidence that this type of scavenging may have occurred about 2 million years ago.

(It should be noted that Bickerton suggests that the adoption of confrontational scavenging was a major trigger for the evolution of human language, but he does not make any mention of music – so what I am presenting here is a combination of Bickerton's hypothesis about confrontational scavenging and my own hypotheses about the evolution of music and word-based language.)

This change in life-style, and the increase in level of in-group cooperation, had a very important effect on the proto-musical language.

Whereas before the proto-music expressed the speaker's emotion, in response to something, now the proto-music also made a statement about the listeners' emotions, in response to something.

In other words, before it was "There is something which is making me happy", but now, because of an assumed high level of common interests within the group, it was "There is something which is making me happy and it should be making us all happy."

So now there was more motivation for the listeners to try to figure out what the "something" was, because it was a something that, presumably, they cared about just as much as the speaker did.

However, the proto-musical language, by itself, did not include any specific information about what the something actually was.

The music created the question, ie "what is the something?"

But additional components of language needed to evolve in order to provide the answer. (And eventually one of those additional components would be words and sentences.)

Figuring out the "Something"

Given that the speaker had used proto-musical language to communicate an emotion about "something" to the group, how could the listeners determine what the "something" was?

There is a secondary issue that is relevant, which is, to what extent was the speaker motivated to provide additional information about what the "something" was to the listeners?

We could hypothesize that these human ancestors had sufficient "theory of mind" to have a concept that other individuals wanted to know something, and therefore it would be good for a speaker to help the listeners to figure it out.

Alternatively, we could assume that initially there might be various means by which the speaker could accidentally or unintentionally provide additional information about what the "something" was, and that could be enough to get the evolutionary ball rolling.

There are four major methods of providing this extra information (at least there are four that I can think of), and these methods can be assigned a likely order, both in terms of time of appearance and also in terms of the level of sophistication and possible speaker intention required.

These four methods are:

1. The "something" is obvious, because it exists in the immediate environment of the speaker and the listeners.
2. The "something" is somewhere near, and the speaker gives a physical indication of the actual location, by looking at it, or pointing at it, or moving towards it (or maybe away from it, if it's something bad).
3. The proto-musical language expresses emotion as a function of the "melody" of the music, but for a given emotion there may be different melodies to express the same emotion, and a given speaker might happen to use different melodies as a function of the specific nature of the "something" causing the emotion being expressed.
4. Words.

Of the four methods, we can see that the first two are fairly direct, and the last two are more symbolic.

Also, here I am giving a first hint as to how the evolution of proto-musical language gave rise to the evolution of modern word-based language. (Nowadays, because music is not really a language, we usually use the word "language" to refer to word-based language. But in the context of the current hypothesis I refer to "word-based" language so that there is no confusion.)

In particular, I am hypothesizing that word-based language initially evolved as a supplement to an existing music-based language.

We will also see, when considering later stages in the evolution of music and word-based language, that the word-based language eventually dominated and effectively replaced the music-based language, and the music continued to exist only as a form of not-quite-language (details of which I will give).

(It is also worth mentioning, in relation to method 3, that even in modern times we have a very strong sense of "melodic identity", whereby for some melodies we strongly perceive the identities of those melodies, and we can easily learn to associate specific melodies with specific things, such as "Happy Birthday", or a wedding march. This can be considered evidence that our brains have evolved an ability to assign meanings to specific melodies, and that such an ability may have played a more significant role in communication in the distant past.)

Expressiveness and Efficiency of Music versus Words

I will soon give some more details of specific steps likely involved in the evolution of word-based language and the changing relationship between music and words.

But first I want to consider questions of expressiveness and efficiency.

On the one hand, the music-based language allowed a degree of expressiveness somewhat greater than that of any other animal language (even compared to the languages of other hominid relatives at the time).

On the other hand, the music-based language was fundamentally limited in how many different things it could express, and it was limited in how efficiently it could express them. And these limitations applied even when the music-based language included method 3 listed above whereby melodic identity was used to express specific meanings.

The overall picture suggested is that the music-based language evolved because it was better than existing alternatives, and at the same time the music-based language allowed the word-based language to evolve, but finally the word-based component of the language became so sophisticated and expressive and efficient that the music-based component had to be "retired" in some sense, because the limitations of the music-based language were holding us back from enjoying the full power of the word-based language.

Units of Meaning

In linguistics there is the concept of the morpheme which is a unit of meaning.

In word-based language a morpheme can be a word, or a part of a word, or very rarely, it might be two or more words.

In terms of length, a morpheme in word-based language can be just one syllable, or two or three or more, and sometimes zero (eg "'s" in "my mother's cat").

Based on my hypothesis of proto-musical language, it's difficult to measure exactly how long a unit of meaning was, but quite likely it was the length of a melody, or at least a substantial component of a melody, like a verse or a chorus of a song.

Even very short melodies are at least 10 syllables, and 20 or 30 is more common.

In word-based languages, the units of meaning are pretty much as short as they possibly can be.

In the proto-musical language, the units of meaning were much larger, and there was no way for them to evolve to be shorter, because melodies are defined by a certain minimum amount of structure.

The only way for a language based on melodic identity to have a greater information density is to have a larger number of melodies.

But there is only a limited number of melodies that any human being can remember.

In the modern world there exist millions of items of music, but most people are probably only familiar with a few thousand at most.

And without modern technology, the number of melodies that can be created and remembered and passed down from one generation to the next is probably much smaller, probably no more than a few hundred.

Also, different melodies cannot be combined freely according to rules of syntax in the way that words can be combined.

It is true that music has a form of "syntax" that has a discernible tree-like structure not totally dissimilar to the syntax trees of word-based language. But the musical syntax is contained inside individual melodies, ie inside the units of meaning, and it cannot be used to create larger units of meaning from smaller units of meaning in the way that happens with word-based language.

Another limitation is that musical melody generally has to be repeated, to get maximum effect.

This is not the case with word-based language.

All of these differences mean that a music-based language was both less expressive and less efficient than word-based language would be, ie:

Music-based language was less expressive because at most it could represent distinct meanings based on a few hundred (or maybe a few thousand) melodies.
Music-based language was less efficient because each unit of meaning was at least 10 syllables, and it had to be repeated, whereas word-based language has an average size of unit of meaning somewhere between 1 and 2 syllables.
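
To make the comparison above more concrete, here is a rough back-of-the-envelope sketch. It treats each unit of meaning as one choice from a fixed inventory and divides the resulting number of bits by the number of syllables needed to say it. The figures of a few hundred melodies, at least 10 syllables per melody, and between 1 and 2 syllables per morpheme come from the text. Everything else is an assumption made only for illustration: the uniform-choice model, the specific value of 300 melodies, the repetition count of 2, and the inventory of 10,000 morphemes.

```python
import math


def bits_per_syllable(inventory_size, syllables_per_unit, repetitions=1):
    """Upper bound on information carried per syllable.

    Assumes (for illustration only) that each unit of meaning is an
    equally likely choice from an inventory of the given size.
    """
    bits_per_unit = math.log2(inventory_size)
    return bits_per_unit / (syllables_per_unit * repetitions)


# Music-based language: "a few hundred" melodies of at least 10 syllables,
# each repeated. The values 300 and 2 are illustrative assumptions.
melody_rate = bits_per_syllable(
    inventory_size=300, syllables_per_unit=10, repetitions=2
)

# Word-based language: between 1 and 2 syllables per unit of meaning.
# The inventory of 10,000 morphemes is an assumption, not a figure
# taken from the text.
word_rate = bits_per_syllable(inventory_size=10_000, syllables_per_unit=1.5)

print(f"melody-based: {melody_rate:.2f} bits per syllable")
print(f"word-based:   {word_rate:.2f} bits per syllable")
print(f"ratio:        {word_rate / melody_rate:.1f}x")
```

Under these assumed numbers the word-based language comes out at something like twenty times the information per syllable, and that is before counting the extra expressiveness that comes from combining words using syntax. The exact ratio depends entirely on the assumed inventory sizes, so it should be read as an order-of-magnitude illustration only.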

Also, the music-based language could only express assertions of shared emotion, ie "we should all feel such-and-such an emotion because of this thing". Whereas word-based language can express many different things, some of which do not necessarily involve any shared emotions or any emotion at all.

Based on the general logic of evolutionary theory, we can surmise that if music-based language existed, and then word-based language evolved as an additional component of this music-based language, the general superiority of the word-based language would cause it to eventually fully supersede the music-based language.

In support of this, we can observe that music as we know it isn't really used as a language. And we observe that even when music contains words, it is still not used to communicate information in the way that normal non-musical word-based language is used to communicate.

This does leave one thing unexplained though. If music is an obsolete form of language, how come it still continues to exist at all?

The Secondary Function of Music

One possibility is that music evolved to not be a form of language in such a way that it managed to acquire some secondary function.

However, given that no one knows of any convincing evidence that music has any biological function at all, one must presume that even this secondary function has, over time, ceased to be relevant.

The question arises as to how much evidence we have that music as we know it has existed for long enough that it must, in its current form, either have some biological function, or have had some biological function until very recently, where "very recently" means not long enough ago for a function that is no longer useful to have evolved into nothing.

The best evidence for the existence of music in its current non-language form comes from evidence for the existence of musical instruments, particularly musical instruments capable of playing notes from fixed scales.

This evidence includes 42,000-year-old flutes and it also potentially includes the famous Neanderthal Flute. If the Neanderthals had flutes, this possibly pushes the potential to appreciate music played on a fixed scale of notes back to the common ancestor of modern humans and Neanderthals, which lived at least 500,000 years ago.

(For a more detailed analysis and a possible hypothesis about the "recent" secondary biological function of music, see this earlier article of mine.)

The Evolution of Word-Based Language

One major problem in the evolution of human language is the chicken-and-egg problem, ie why would speakers want to speak if no one was yet listening, and why would anyone want to learn how to listen if no one was yet speaking?

There is also the problem of words and syntax. Did words evolve first, by themselves? If they did, how could individual words be useful? And how then did syntax evolve?

The hypothesis of words-within-music potentially solves these problems, ie:

The music-based language already existed, and was being "spoken" by speakers and "listened to" by attentive listeners.
A motive already existed to incrementally add information to the content of music-based language, ie to add detail to the descriptions of the "somethings" that caused the emotions expressed by the music-based language.
Even if melodic identity was used to supply additional information, this component of language was not sufficiently expressive to say everything that needed to be said, so additional words would still be useful.
When multiple words were being added to music-based melodies, there would come a point where the relationships between those words might supply additional information, and this would be the initial evolution of syntax as we know it.

A couple more observations:

As mentioned above, music already has a tree-like syntax. This syntax could have provided some type of "scaffolding" for the development of tree-based syntax in word-based language. At the very least, the existence of musical syntax implied that our ancestors had an ability to perceive and process syntax trees.
In non-tonal languages, almost all the information content of words is contained in consonant and vowel distinctions, and, to a first approximation, such distinctions are not relevant to the musical quality of a melody (ie you can sing "lalala" or "doodoodoo" and it's still the same tune). Word-based language used these distinctions to represent information, because it had to evolve as something inside the music, and the consonant and vowel distinctions were the only variability available to represent extra information while still being part of the melody.

The Un-Language-ification of Music

The hypothesis so far is that the ancestor of music evolved as a form of language, and then word-based language evolved inside the music, ie like song lyrics.

But now we live in a world where music isn't a language in the normal sense (even though it still feels like it's some kind of language), and normal language used for normal communication is always words without music.

The hypothesis must account for how music ceased to be a component of language.

So, what happened?

The strongest clue, I think, comes from the hypothetical nature of the emotions expressed by music.

In the modern world, music expresses emotions about things that are hypothetical.

The hypothetical content that the musical emotion applies to can be the words of a song. It can also be the content of a film (where the music is the score), and for some people it can be their own daydreams (ie for those people identified as "maladaptive daydreamers").

For example, if I perform the song "Scarborough Fair", I am not telling my audience that I actually have an ex-lover living in Scarborough, and none of my listeners understand me to be saying that I have an ex-lover living there.

What the song expresses is that, if, hypothetically, I had an ex living in Scarborough, then this is what I would say and also this is the emotion that I would feel, and which my audience would equally feel, assuming that they (the audience) have my interests strongly at heart.

If I did want to tell a friend of mine that I have an ex-lover living in Scarborough, I would have to use normal speech, without any music. My friend might not believe me, but they would at least interpret my speech as an actual claim about something real.

Winning Arguments

As the word-based component of language became more complex and more expressive, it would have reached a point where some things could be said using only the words, and there would be no reason to include music, because what was being said did not have any emotional component.

So sometimes language would be a mixture of music and words, where the right kind of emotions were involved, and other times the language would be words only, because the content was not emotional at all.

Some situations might arise where there was a discussion on some topic, and some of the discussion was music and words, and some of the discussion was words only.

The problem might then arise that the emotional combination of words and music would always win the argument, even though the non-emotional arguments expressed using only words were actually more persuasive, if the emotional element was left out of the discussion.

So it might have happened that some of our ancestors at the time evolved a tendency to downgrade their response to musical content, to give the words-only content a fairer hearing.

And it might have happened that this downgrading was achieved by hypotheticalising the content of any musical speech (as opposed to, for instance, ignoring such content altogether).

As a consequence of this hypotheticalisation, music could still be used to make statements about what emotions one would feel if certain things were hypothetically true.

But, if you wanted to make a claim that something was actually true, and your listeners were hypotheticalising all musical content, then you had no choice but to make your claims using words only, and not including any music, if you wanted to be taken seriously by those listeners.

Song Lyrics

In the modern world, we tend to think of music as being one thing, and word-based language as being another thing, and songs are something you get when you put the two of them together.

But according to the hypothesis I am presenting here, words were not just "put together" with the music. Rather, what actually happened was:

Music was a thing in itself, a form of language.
Words developed as a thing, inside the music.
At some later time, the music was removed and "un-language-ified", leaving only the words.

To put it another way, it wasn't:

Words + Music = Songs with lyrics

Instead it was:

Music => Songs with lyrics => Words without music

This has implications for anyone trying to write song lyrics.

Because when you write song lyrics, you are not just writing words understood by modern humans and then adding those words to music as appreciated by modern humans.

Actually you are writing a combination of music and words, as understood by an audience of human ancestors that lived hundreds of thousands of years ago.