A Theory Of Language Evolution That Explains Song Lyrics

7 August, 2021

First there was proto-music, a language that expressed abstract emotional meanings.
Then this was supplemented with the first word-based language, where the words were embedded in the proto-music, a bit like song lyrics (except they weren't actually songs because it wasn't actually music).
Then the word-based language evolved into a form that completely replaced proto-music, and proto-music evolved into music, which no longer acts as a form of communication.

by Philip Dorrell

The Mystery of Music and Communication

One of the mysteries of music is that music feels like it's some kind of communication, but actually it isn't communication.

The mystery increases when we consider song. The most popular form of music is music that has spoken language embedded within it, ie songs with lyrics.

The primary purpose of spoken language is to communicate, and including spoken language within music increases the feeling that music is a form of communication.

But, even with words embedded, music is not a form of communication. If we have something that we want to tell to another person or group of people, we do this by speaking – we never do it by singing.

A Possible Evolutionary Explanation: Proto-Music

One possible solution to this mystery is that music feels like communication because it has evolved from some ancestral predecessor that was a form of communication, but, over time, this predecessor has evolved to be something else, ie not a form of communication.

For the sake of discussion, I will call this hypothetical communicative ancestor of music proto-music, or proto-musical language.

("Proto-music" is a term typically used to describe some hypothetical ancestral form of music. Of course each author who uses this term has their own ideas about what such an ancestral form might have consisted of, and what purpose it might have served. This is a good place also for me to explain that I sometimes use the expression "modern music", by which I just mean "music", as opposed to "proto-music". "Modern music" is probably at least 42,000 years old; "proto-music" could be as old as 2 million years.)

Questions

If we consider the hypothesis that proto-music was a communicative language, then a few questions can be asked:

1. What information did proto-music communicate?
2. How was this information represented?
3. Why did proto-music evolve to be something that isn't communication?
4. If music, which is the descendant of proto-music, isn't a form of communication, then what is it?

Why Proto-Music Evolved to be Non-Communicative

I will start with question 3 in the list, because it's not that hard to think of a possible answer.

(Later in this article I will attempt to answer question 1 fairly fully, and I will say a little about question 2. About question 4 I will not say much, at least not in this article, but I do deal with it at length in many other articles in this blog.)

Proto-Music Got Replaced

So, if proto-music was a form of communication, why did it evolve to not be a form of communication?

A very simple answer to this question suggests itself, starting with the following observation:

For modern humans, the dominant form of communication is word-based spoken language.

Spoken language is the one thing that makes us human: more than any other single characteristic, it is what most distinguishes us from all other animals.

Human spoken language is, as far as we know, more powerful, more expressive and more comprehensive than any other animal "language", in as much as other animals can be considered to have "languages" that they use to communicate with members of their own species.

The power of human language is the one thing that gives humans the ability to completely dominate the rest of life on the planet.

So it's quite plausible to suppose that, if proto-music was a form of language, it was not as powerful, expressive or comprehensive as modern human word-based language.

Which gives us a straightforward explanation of why proto-music ceased to be used as a means of communication – it got completely replaced by the much superior word-based language.

Words

I use the term "word-based", to describe modern human spoken language, because that highlights the most significant feature of modern human spoken language in relation to this hypothesis.

Words are the specific feature of spoken language that did not exist at all within the earliest form of proto-music, and it is the properties of words, in particular how they represent information, that would cause spoken language to eventually completely replace the earlier proto-music as the primary means of communication in modern humans.

In particular, words are, to a first approximation, units of meaning, and communications in word-based language consist, for the most part, of linear sequences of words strung together. (I say "to a first approximation" – we can consider the example of "running", which is one word identifiable as two units of meaning consisting of "run" and "ing".)

Initially, word-based language had to evolve in a manner such that it supplemented and co-existed with the proto-musical language.

But, over time, word-based language evolved to a point where it was superior to proto-musical language in all respects.

Also, word-based language evolved into a form where it could no longer co-exist with proto-musical language – so the proto-music had to go.

The Meanings Expressed by Proto-Music

To fully understand how the transition from proto-musical language to word-based language occurred, we have to return to question 1 in the list above, ie:

What information did proto-music communicate?

The exact form and function of a hypothetical proto-musical language no longer exists – it is lost in the depths of prehistoric time.

Our best clue is to consider what types of meaning modern music has.

Unfortunately, any attempt to determine the "meaning" of music is very subjective, especially given that music is not a means of communication, so there is no objective method to determine what meanings music would be communicating if it was a form of communication.

At this point I will state my own current hypothesis about the type of meaning that proto-musical language expressed, based on my own subjective evaluation of actual music that I listen to, combined with a certain amount of intelligent guesswork:

Proto-music was used to assert shared emotions to a group, that is, the communicating individual was expressing the opinion that all members of the group should be feeling the specified emotion.
This assertion was based on the existence of some particular thing or situation that the emotion was in response to, which the communicating individual knew about, possibly something that had happened, or something that was happening, or something that was going to happen. (But the proto-music did not directly express any details of what this something was, not even if it was past, present or future.)
The proto-music also expressed a degree of intentionality, ie the communicating individual expressed an opinion that the members of the group should all make some level of effort to act in response to this thing or situation (whatever it was), where that level might range from "urgent" to "don't bother doing anything at all".

The meaning expressed by an utterance in the proto-musical language was very abstract, and it did not contain any details of what was the thing or situation that the recommended emotion and degree of intentionality was in response to.

The communication of this type of abstract emotional meaning would make sense if it occurred within a species consisting of social groups that had a strong tendency to work together to achieve common goals. Quite likely this level of cooperation was higher than what we experience as modern humans in any modern human society (and just to be clear, by "modern" I mean during the last ~100,000 years, not just the last 100 years).

Because of the abstractness of this type of communication, every utterance in the proto-musical language was a puzzle that had to be solved by the listeners.

That is, it is not much use knowing that you should feel a certain emotion about something and have a certain level of willingness to act, if you don't know what the something is, and, therefore, you don't know what sort of action you would take.

To make sense of the abstract meaning communicated by a proto-musical utterance, listeners would have had to make use of any available clue they could, to determine the details of what that utterance was referring to.

In some cases, the thing or situation would be visible or otherwise perceptible in the immediate environment of the group, and the communication would have had the effect of directing the attention of the listeners to the possible presence of some "thing" that merited the recommended emotional response.

In other cases the communicating individual might have initiated action in response to the "thing", for example heading off to take advantage of a food gathering opportunity, and the listeners, if inclined to trust the communicator's judgement, would have followed.

Some of the time the listeners might just have failed to figure it out at all, in which case the communication would have failed in its immediate purpose. Although, even then, it might have become apparent at some later time what the fuss was all about.

The Initial Evolution of Word-Based Language

If word-based language evolved, starting from a point where words as such did not exist at all, then it had to initially exist in a very simple form, and for this initial evolution to occur, this simplest form of word-based language had to be useful, that is, useful to both speaker and listener.

Modern spoken language consists of sentences constructed from sequences of words taken from large vocabularies, where those sentences (usually) have an implicit tree-like grammatical structure.

If word-based language developed initially as something simple, then we might suppose that utterances in that language consisted initially of just single words.

But then what does a single word mean?

A single word does not have any grammatical structure.

In modern human spoken language, a single word might have a discernible meaning, if it was the answer to a previously posed question.

But of course that question would have to consist of much more than just one word.

So, considering our human ancestors, in the process of developing word-based language from very simple beginnings – why, in any particular situation, would a speaker say a single word, and how would a listener assign a full meaning to that word?

If we allow for the prior existence of a functional but abstract language like the proto-musical language that I have hypothesised, then this difficulty can be overcome.

Let us suppose that one of our ancestors communicates an abstract emotional meaning in proto-music. If the speaker can add even just one word into the utterance somehow, then that one word can help the listeners to determine what the abstract meaning is referring to. And if a second word can be added, then even better. And it doesn't necessarily matter what order the words are in.

The important conclusion is that each individual word added provides immediate value to the listeners, and the words don't need to be part of any grammatical structure. And if the listeners understand better, that also immediately benefits the speaker, because the speaker's goal in communicating is to motivate the whole group to take part in some common action with respect to a common goal of the group.

I will give a more specific example, which relates to the idea of confrontational scavenging, which I have already identified in a previous article as a possible trigger for the evolution of proto-musical language (and see this paper by Derek Bickerton, who first identified confrontational scavenging as something that might have contributed strongly to the early evolution of human language).

If confrontational scavenging was a significant component of food gathering, for an ancestral human species, then it would have been advantageous for individuals of that species to develop a means of asserting shared emotions in relation to shared group goals, in a manner that could encourage urgent co-operative action from all members of a group when a suitable scavenging opportunity arose.

For example, an individual might have come from the scene of the carcass of a recently dead antelope being scavenged by hyenas.

If modern word-based language was available, then this individual might have said something like:

I've just come now from the river and there's the carcass of an antelope which has not been dead for very long and it's being actively scavenged by hyenas. We should all pick up some rocks and go now together to scare the hyenas away and take the meat for ourselves!

But, actually, if the individual could only use proto-musical language to express abstract emotional meanings, then they would have expressed something like the following:

There's something exciting, if a little scary, but a big opportunity for us. We should all act strongly right now!

Those listening might have supposed that this utterance related to an opportunity for confrontational scavenging, or they might have thought it was about something else.

If the communicating individual could have added even a few words into the utterance, then this would have been enough to clarify the likely details of the situation. Words such as:

carcass
antelope
hyenas
confrontational scavenging (not one word for us, but maybe it would have been a single word for members of a species that did confrontational scavenging on a regular basis)
river

Even just one of these words would have provided a significant clue to the listeners about what the situation was, and each additional word would have been an additional useful clue. Also, there would be no particular need to impose any grammatical structure, because the original proto-musical utterance would give enough information to determine the likely role that each word played in the situation, ie

The carcass is the opportunity.
More specifically, the opportunity is a potential source of antelope meat.
The hyenas are what makes the situation scary (but not so many hyenas that we can't deal with them).
We should do some confrontational scavenging right now!
The river is where the carcass is (and where we need to go).
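
The role-assignment idea above can be sketched as a toy model. This is only an illustration under assumptions of my own: the slot names, the word categories and the function `interpret` are all hypothetical, and nothing here is meant as a claim about how our ancestors actually processed such utterances. The point it shows is that, once the abstract utterance supplies the "slots", unordered words can fill them without any grammar.

```python
# Hypothetical toy model: a proto-musical utterance supplies abstract "slots",
# and each embedded word fills whichever slot its everyday meaning suits.
# The slot names and word categories below are illustrative assumptions.

# Abstract meaning: "exciting, a little scary, a big opportunity, act now"
UTTERANCE_SLOTS = ["opportunity", "threat", "action", "place"]

# What kind of thing each word usually denotes (assumed shared knowledge).
WORD_KIND = {
    "carcass": "opportunity",
    "antelope": "opportunity",
    "hyenas": "threat",
    "confrontational-scavenging": "action",
    "river": "place",
}


def interpret(words):
    """Assign each word to a slot; word order is deliberately ignored."""
    filled = {slot: [] for slot in UTTERANCE_SLOTS}
    for word in sorted(words):  # sorting makes the result order-independent
        kind = WORD_KIND.get(word)
        if kind in filled:
            filled[kind].append(word)
    return filled


# The same three words in two different orders give the same interpretation.
print(interpret(["hyenas", "river", "carcass"]))
print(interpret(["carcass", "hyenas", "river"]))
```

Each extra word fills in one more slot, and a single word is already useful by itself, which is the property that matters for the argument.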

A Proposed Evolutionary History of Proto-Music and Word-Based Language

So we can see that, given the prior existence of a proto-musical language capable of expressing abstract emotional meanings, word-based language could have evolved starting from very simple beginnings.

The full history of human language would then have consisted of three major stages:

Only proto-music, capable of expressing abstract emotional meanings.
Proto-music, with words embedded, where the words give additional clues about the situation that the abstract emotional meanings refer to.
Word-based language, which has evolved to the point where it completely replaces the proto-musical language, and proto-music ceases to have any use as a language, but proto-music gets repurposed as music, which serves a different purpose (of some kind).

There are various constraints that would have determined both the evolution of words embedded within the proto-music, and, finally, the complete disappearance of proto-music.

(There is a story of ungratefulness here – proto-music enabled the birth of word-based language, but eventually the word-based language killed the proto-musical language, and proto-music only survived by turning into not-a-language music.)

The first constraint relevant to the evolution of word-based language related to consonant and vowel distinctions.

Music as we know it does contain consonant-like sounds and vowel-like sounds, even in the case of non-vocal music.

However, consonant and vowel distinctions play a relatively minor role in determining musical quality. One can sing a tune as "la la la", and one can sing it as "doo doo doo", and it is still recognisable as essentially the same tune.

On the other hand, for word-based languages, consonant and vowel distinctions carry almost all of the information.

Conversely, musical quality is substantially determined by melody and rhythm.

And although melody and rhythm (in a non-musical form) do exist as perceptible components of speech, they carry relatively little information compared to the information contained in the sequence of consonant and vowel sounds. For example, you can talk in a monotonous deadpan voice, and listeners will still understand the meaning of what you say. (And of course reading and writing do not require the transcription or annotation of any melodic or rhythmic aspect of the written content.)

A plausible reason for this separation of components of information representation is that word-based language was constrained to make use of those possible variations in the expression of proto-musical content that were not important to the representation of the information that was conveyed by the proto-music.

In other words, consonant and vowel variation was available as an option when communicating proto-musically, but it was not actually being used by the proto-music to express meaning to any significant degree, therefore the evolving word-based language could grab that possible variation for itself and use it to express components of meaning additional to the meaning already being expressed by the proto-music.

The conclusion is that word-based language initially evolved as what were in effect song lyrics, except they weren't actually songs in the sense that we know of songs, because it wasn't actually music in the sense that we know of music.

Superstimulus Features of Music

In previous articles I have identified music as a superstimulus for some aspect of speech perception.

Two specific features of music that relate to it being a superstimulus are pitch scales and nested regular beat.

Both of these features relate to the competitive nature of musical performance, ie it takes practice to sing in tune and on-scale, and it is also quite difficult to keep precise regular time.

The identification of these superstimulus aspects of music is not of central importance to the current article. However, I do want to make the point that proto-music probably did not sound like modern music does, and if you are imagining our distant ancestors talking to each other in the key of C major and 4/4 time, then you are probably imagining it wrong.

If modern humans could hear the ancestral proto-music, my guess is that they would perhaps recognise the emotional "feel" of the proto-music, but they would not consider it to be worth listening to for the purpose of entertainment (or for any other purpose that modern human cultures use music for).

Replacement and Obsolescence

Word-based language initially evolved as a simple supplement to the proto-musical language, providing hints as to what specific details were behind the abstract emotional meanings expressed by the proto-music.

But, over time, word-based language evolved to become more complex and sophisticated.

Eventually word-based language could completely express all aspects of meaning by itself, and the proto-music was no longer necessary.

Also, word-based language evolved into a form that conflicted with the form of the proto-music, so the proto-music had to be dropped altogether.

Words and the Decline of Cooperativity

One consequence of the evolution of word-based language is that it would have changed the very nature of social relations within the group.

The original proto-musical language made sense for a species that lived a very co-operative lifestyle with common group goals.

But the evolution of more sophisticated word-based language made more nuanced forms of cooperation possible.

It was no longer necessary to assert shared emotions, and indeed there was no longer any point in doing so.

A speaking individual could communicate all the specific details of the situation, and each listener could independently form their own opinion about what emotional response they had to that situation, and what degree of urgency or otherwise might be required to act in response.

Individuals expecting cooperation from other individuals would be forced to engage in more complex negotiations if they wanted others to cooperate in some endeavour.

The final result was that the base assumption of shared emotion and shared goals was no longer a reality, even though it might have continued to exist as a social fantasy (a fantasy which continues to this very day, where a group of listeners might publicly pretend to agree with an asserted shared emotion, but, privately, many or most of the individual listeners have only a lukewarm enthusiasm for the asserted shared emotion in question).

Information Density

Word-based language also expressed meaning more densely than proto-music.

As I have already mentioned, words are approximately equivalent to units of meaning. A sequence of words in word-based language consists of a sequence of units of meaning – usually one unit per word, sometimes a bit more, or a bit less – and most of those units of meaning fit into only 1, 2 or maybe 3 syllables.

We can compare this to music, where even a simple musical item requires at least a few bars, consisting of, at the very least, a dozen notes, to define the "feeling" of that musical item. Treating a "note" as more-or-less equivalent to a syllable, and equating this to proto-music, we have that a dozen syllables would have been required to express just one abstract emotional meaning.

Also a dozen notes is a very short tune – most tunes are at least 20 to 40 notes long.

The implication is that modern word-based language has an information density at least 10 times higher than that of proto-musical language.
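
The arithmetic behind this estimate can be sketched as follows. It is a rough check only, resting on the assumptions already stated above: one note is treated as roughly one syllable, one tune carries one abstract meaning, and a unit of meaning in word-based language takes about 2 syllables (the upper end of the range). The function name `density_ratio` is just a label chosen for this sketch.

```python
def density_ratio(notes_per_tune, syllables_per_unit):
    """Units of word-based meaning that fit in the time taken by one tune.

    Assumes 1 note is roughly 1 syllable, and 1 tune carries 1 meaning.
    """
    return notes_per_tune / syllables_per_unit


SYLLABLES_PER_UNIT = 2  # assumed average length of one unit of meaning

# A bare-minimum tune (12 notes), and the typical range (20 to 40 notes).
for notes in (12, 20, 40):
    print(notes, "notes:", density_ratio(notes, SYLLABLES_PER_UNIT))
```

On these assumptions a bare-minimum dozen-note tune gives a ratio of 6, and the more typical 20 to 40 notes gives 10 to 20, so the "at least 10 times" figure corresponds to tunes of typical length rather than to the shortest possible tune.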

Context: Static and Dynamic

There is also the question of context.

All modern spoken language expresses meaning within a context. If a speaker has started speaking, then the initial context depends on the general current circumstances of the speaker and the listener (or listeners).

But once the speech has started, the words being spoken are themselves changing that same context.

This applies whether it is just one person speaking, or whether other people are also taking turns to speak.

In the case of music, as we know it, especially with regard to simple musical items typical of modern pop music (and traditional songs and folk music), each item expresses one unique musical quality, and this quality lasts effectively for the duration of the item.

This suggests that the context of proto-musical communication was much more static than is the case for modern spoken language.

Any given item of proto-musical communication only expressed a single meaning, and lasted for a certain period of time.

When words started to appear in the proto-music, these words did not change the context as they do in modern spoken language – rather they served to clarify the static context being created by the proto-musical communication.

But as word-based language evolved, and became more sophisticated, there would have developed a tension between the words, which wanted the freedom to create and respond to a fast-changing dynamic context, and the proto-music, which was confined to expressing meanings to create a more static context.

As the words became more powerful, and more valuable, the proto-music had to stop being part of the communicative process, because it was just getting in the way.

However, for some reason, proto-music got repurposed as something else, and that is what music is today.

(The whole question of what the purpose of music might be, if indeed it has a purpose, is a major field of speculation in itself, and goes beyond the scope of the theory that I am presenting in this article. If you want to read about my own speculative efforts to understand what the purpose of music is, you could start with this.)

Summary

This article contains quite a few assumptions, observations, deductions, wild guesses etc.

Here is a summary of the main points of the theory that I have developed:

Music feels like it is a form of communication, but it is not actually a form of communication – we do not use music when we have information that we need or want to communicate to another person.
The most popular form of music is song, ie music with words embedded.
Even with words embedded, music is not a form of communication – we don't converse with other people by singing to them.
I propose that music evolved from an earlier ancestral proto-music which was a form of communication – a language of some type. This explains why music feels like it is communication, even though it isn't.
Proto-music became obsolete when it was replaced by modern spoken word-based language.
In the long run, proto-music was replaced by word-based language because word-based language could evolve to be more flexible, powerful and efficient in ways that proto-music could not.
However, before this could happen, word-based language had to evolve initially embedded within the proto-musical language.
Considering how they each represent information, music primarily uses melody and rhythm to express the quality that it expresses, whereas word-based language primarily uses consonant and vowel distinctions to encode information.
Word-based language was constrained to encode information in consonant and vowel distinctions, because that was the major source of variation available that was not used by proto-music to represent information.
Proto-music expressed the assertion of shared emotions in relation to a situation known to the speaker, and also expressed a shared degree of intention to act in response to that situation.
These assertions of shared emotion and intention made sense for a species consisting of small social groups that lived a very co-operative lifestyle – more co-operative than is the case for modern humans.
The meaning expressed by a proto-musical utterance was very abstract, that is, it did not include any details of what the relevant situation was, and it did not include any details of what action should (or should not) be taken in response to that situation. Listeners had to figure those details out by themselves, from whatever clues were available.
Words initially evolved to provide specific details to give listeners additional clues about the abstract meanings expressed by proto-music. Used in this way, words did not need to be part of complex grammatical structures.
The pre-existence of the abstract proto-musical language enabled the initial evolution of word-based language from very simple beginnings, ie even one word by itself could usefully add to the meaning of a proto-musical utterance.
Modern word-based language represents meaning very densely – the average unit of meaning is less than 2 syllables long.
With proto-music, each "tune" represented a single abstract meaning, and these tunes were similar in length to simple modern musical tunes – much longer than the equivalent of 2 syllables (given that 1 syllable is roughly equivalent to 1 note). So modern word-based language is probably at least 10 times more information dense than proto-music was.
Word-based language depends on the existence of a context, and this context is in turn constantly updated by the utterances of those speaking – the context is very dynamic.
With proto-music, the context was more static, and persisted for the length of the "tune". Words embedded in the proto-music did not change the context, rather they served to clarify it.
As word-based language evolved to become more sophisticated, and to represent all the types of meaning that modern spoken language can represent, the words could no longer function embedded in proto-music – the constraints of context conflicted, and also words could express meanings that were unrelated to the meanings that proto-music could express.
Over time the lifestyle of the speakers of word-based language became less co-operative, and the assertion of shared emotions no longer made as much sense, or at least made sense much less frequently. This change may itself have been enabled by the development of word-based language.
Eventually the word-based language "broke free", and proto-music ceased to be part of the communicative language used by our human ancestors.
Nevertheless, proto-music did not completely fade away – instead it was repurposed and became modern human music, the actual purpose of which remains somewhat mysterious to us ...
The repurposing of proto-music to become music likely involved changes in some aspects of its structure. For example, the modern musical aspects of scale and regular nested beat quite possibly did not exist as part of proto-music.

And in Conclusion, How to Write Song Lyrics

Can these hypotheses about the evolution of word-based language tell us anything about modern song lyrics?

Can this theory tell us how to write better song lyrics?

These are questions that I think need exploring, but I'm not yet ready to come up with any definitive answers.

So I will finish with some observations, some of which can tentatively be related to my hypotheses about the evolution of word-based language:

Song lyrics are in some ways very similar to "ordinary" prose or speech. They usually consist of grammatical sentences constructed from valid words.
Deviations from grammaticality are usually very minor. On the other hand, sometimes the words are complete nonsense.
Individual sentences in song lyrics usually have a coherent meaning.
But, there is a sense in which song lyrics do not consist of ordinary communicative speech or writing. That is, a song lyric is usually distinct in some way from what a person would actually write or say if they had a specific intention to communicate.
Non sequiturs are quite common in song lyrics.
Song lyrics usually rhyme.
Song lyrics are more likely to "paint a picture", rather than tell an unfolding story.
When song lyrics do tell a story, they sometimes do so by repeating a sequence of almost similar situations (so it's kind of dynamic, but at the same time it's kind of static).

A very vague hypothesis can be stated, which is that song lyrics represent a form of recapitulation of the evolutionary history of music and language, in particular that words in song lyrics function less like how words function in modern spoken language, and more like how words originally functioned when they evolved embedded within proto-music.

Of course recapitulation is not a perfect theory, because the present is not actually exactly the same as the past – modern word-based language is more grammatically sophisticated than the presumed initial form of word-based language, and modern music is different in various ways from the presumed communicative proto-musical language.

Nevertheless, I'm guessing that the evolutionary history of proto-music, music and word-based language might still explain something about how song lyrics are different from normal non-musical spoken language, and if I'm very hopeful, I might believe that this could inform the development of techniques by which one could more easily write better song lyrics.

Appendix: Two Existing Theories about the Origin of Music (and Language)

There have been many people who have thought about the evolution of music and language and possible relationships between those two things.

I will not attempt a full literature review here, but I will mention two recent theories whose analyses may overlap with mine. In both cases, however, their approaches either differ from mine in important details or are sufficiently vague in those details that I can't say for sure.

The two authors are:

Steven Brown, who coined the term "musilanguage" to describe a presumed common ancestor of music and language.
Joseph Jordania, who has developed a theory of the evolution of music that specifically assigns a role to confrontational scavenging.

"Why Do People Sing? Music in Human Evolution" by Joseph Jordania

Joseph Jordania is an evolutionary musicologist who has developed a theory that relates the evolution of music to the lifestyle of our early ancestors, in particular the strategy of confrontational scavenging, and the requirement for members of a group to act cooperatively to engage in this food gathering strategy.

A full description of this theory can be found in his book Why Do People Sing? Music in Human Evolution (2011).

The technical details of his theory are somewhat different to those of the theory I explain here. Jordania's theory is less concerned with explaining specific details of the relationships between music and word-based language, and he seems to assume that ancestral music was similar in its nature and function to modern day music.

With my theory there is a sharper distinction between proto-musical language, which was a communicative language, and modern music, which does not function as a communicative language.

"A Joint Prosodic Origin of Language and Music" by Steven Brown

Steven Brown (currently a Professor in the NeuroArts Lab in Hamilton, Ontario) has developed the "musilanguage" hypothesis. He published a book in 2000 to present this hypothesis.

More recently, in 2017, A Joint Prosodic Origin of Language and Music was published in Frontiers in Psychology.

I assume that the latter paper best represents the current state of Brown's theory, and I will include some quotes here, to help clarify how similar or different his hypothesis is from mine (and to what extent I consider issues which he does not deal with).

Firstly, on the subject of embedding words in the music (or proto-music):

If we contend that the vocal expression of emotion was the precursor to speech, then the evolution of the phonemic combinatorial mechanism had to find a way to create words (strings of segmental units) and phrases in the context of communicating emotional meanings by filling out a prosodic scaffold.

This is somewhat similar to my hypothesis that word-based language initially evolved embedded in proto-music.

Secondly, presenting the rationale for assuming that there was a "bifurcation" from a common ancestor:

I argued in Brown (2000a) that, given that language and music possess both shared and distinct features, it would be most parsimonious to propose that their shared features evolved first, and that their domain-specific features evolved later as part of a branching process (see also Mithen, 2005), making language and music homologous functions (Brown, 2001). This idea would stand in contrast to models contending that music evolved from speech (Spencer, 1857), that speech evolved from music (Darwin, 1871; Jespersen, 1922; Fitch, 2010), or that music and language's similarities arose independently by convergent evolution.

(and further on ...)

With this description in mind of two sequential precursor-stages shared by language and music, we can now examine the bifurcation process to form full-fledged language and music as distinct, though homologous, functions, as well as their (re)unification in the form of songs with words, including call-and-response chorusing.

Brown's logic is that music and language are similar in some ways, therefore they had a common ancestor, which later split into two separate things.

So Brown's theory is:

Musilanguage exists as one something – one language? one system of "communication"?
Musilanguage evolves and splits into two somethings: music and spoken (ie word-based) language.

My theory is:

Proto-music exists as a language that uses some kind of melody and rhythm to express emotional abstract meanings.
(Optionally, see this for details) Proto-music evolves melodic identity, which supplements the abstract meanings with culturally assigned meanings.
Proto-music evolves words as yet another means of supplementing the abstract meanings.
The word-based language embedded within the proto-music evolves to become more sophisticated.
The embedded word-based language evolves into a form where the original proto-musical aspects of the language become both obsolete and irrelevant.
But, the now obsolete proto-musical aspects get repurposed to form music, which is not a language.

When music appears, it contains altered versions of all the components of the final form of the proto-musical language system:

Melody and rhythm to express emotional abstract meanings. Melody and rhythm acquire the properties of pitch scales and regular beat, which is what distinguishes music from communicative language. Those properties are superstimuli for an aspect of speech perception intended to mute the listener's response to forms of speech that are not genuine honest communication from the speaker.
Melodic identities, but meanings can no longer be assigned to those identities.
Words embedded in the melody – but the words are no longer communicative (because the listener no longer responds to the words as though they are part of a communication).

In my theory there is a split, but it is not a split from one language to two languages, rather it is a split from one language to two things, where one of those two things is a language, and the other thing is not a language.

A high level view of the difference between my theory and Brown's musilanguage theory is that we have two completely different starting points:

I start with observations about music and the odd features that it has, ie that it seems like a form of communication, but actually it isn't. Why could this be?
Brown starts with the observation that there are two things, music and (spoken) language, and that these two things share certain features. Why could this be?

Brown's approach has a certain symmetry to it, and it gives "equal time" to music and language.

The only problem with that approach is that our understanding of music and language is not symmetrical.

Music is much more mysterious than language.

It is true that we do not know the details of how language evolved, and also we do not fully understand how the human brain acquires and processes language (just like all sorts of other things that we don't understand about the human brain). But, we don't have any difficulty understanding what is the function of word-based language. And we don't need any deep understanding of the details of human language to know this – it is fairly obvious that any human being without the ability to create or consume word-based language is someone who has a fairly serious disability.

In the case of music, we don't understand what the function of music is. We are not at all sure that music actually has a function. There are people who do not respond emotionally to music, and it is not obvious that those people have any disability at all (other than the fact that they don't share that particular common interest with all the other people who do respond emotionally to music).

My meta-hypothesis is this: if you treat music and language as equal things with an equal air of mystery about them, and make informed guesses about the prehistoric evolution of music and language that treat those two things equally, you're probably going to guess wrong.