The Superstimulus Theory
In 2005, in my book "What is Music? Solving a Scientific Mystery", I proposed a hypothesis about music, the Superstimulus Hypothesis: that music is a superstimulus for some aspect of speech perception.
In the book, I showed how the superstimulus hypothesis plausibly explains some of the analogies that exist between speech and music, such as the relationship of speech rhythm to musical rhythm and the relationship of speech melody to musical melody.
However, the hypothesis is incomplete, because, as it stands, there is a major difficulty, one which I never properly confronted, that it leaves unaddressed:
Music is used in some contexts where speech might occur, but there are also some contexts where music is never used.
For example, people do not sing to each other in lieu of making conversation.
To show why this is a problem for the superstimulus hypothesis, I will consider a different example of a superstimulus relating to human perception.
An Example of a Superstimulus: Makeup
Makeup can be considered a superstimulus.
In particular, women can apply facial makeup to make themselves appear more attractive.
A woman who applies makeup makes herself more attractive by applying it in a manner that acts on our natural, instinctive recognition of female attractiveness.
A recent scientific paper on this subject is "Cosmetics as a Feature of the Extended Human Phenotype: Modulation of the Perception of Biologically Important Facial Signals".
One consequence of the general effectiveness of makeup as an attractiveness enhancer is that, in almost any type of situation or environment, at least some women will wear makeup, some or all of the time.
Indeed, if there is any situation where women won't wear makeup, it will only be because there is some legal prohibition, or because there is some very pragmatic reason why it might be harmful to wear makeup in that particular situation.
As I have already observed, music, by contrast, is not used in all possible situations where speech occurs.
Positive vs Negative Aspects of Perception
This comparison with makeup as an example of a superstimulus suggests that music is not a superstimulus in the sense of music being a "better speech than speech".
But, what if we suppose the opposite?
What if we suppose that music is a "worse speech than speech"?
What if we suppose that music is a superstimulus for some aspect of speech perception which involves a negative perception of the quality or usefulness of perceived speech?
The observation that music is never used to enhance or replace conversational speech might be a vital clue.
I propose the following hypothesis: music is a superstimulus for the perception of non-spontaneous, non-conversational speech.
This hypothesis could explain how music alters the brain state of the listener, if we assume the following (a toy sketch of these assumptions appears after the list):
- There is a brain module that processes conversational speech. (Note: here I use the word "module" loosely, to refer to any subset of the physical brain involved in performing a certain function, without making any specific assumptions about physical locality.)
- When less conversational speech is perceived, the conversation-processing module is less activated.
- Music, as a superstimulus for non-spontaneous non-conversational speech, completely suppresses the activation of the conversation-processing module.
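To make these assumptions a little more concrete, here is a toy sketch in Python. Everything in it is my own illustrative assumption: the function name, the single-number scale for "how conversational" a stimulus seems, and the specific way suppression is modelled are not part of the hypothesis itself.

```python
def conversation_module_activation(perceived_conversationality: float) -> float:
    """Toy model of the conversation-processing module's activation level.

    perceived_conversationality is a made-up scale:
       1.0 = fully spontaneous conversational speech
       0.5 = partly rehearsed or pre-planned speech
       0.0 = clearly non-spontaneous, non-conversational speech
      -1.0 = music, acting as a superstimulus for non-conversational speech
    """
    # Assumption 2: less conversational speech -> less activation.
    # Assumption 3: a sufficiently extreme non-conversational stimulus (music)
    # suppresses the module's activation completely.
    return max(0.0, perceived_conversationality)


for stimulus, score in [("spontaneous conversation", 1.0),
                        ("rehearsed speech", 0.5),
                        ("music (superstimulus)", -1.0)]:
    print(f"{stimulus}: activation = {conversation_module_activation(score)}")
```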
The Concept of "Non-Spontaneity"
I have included the concept of spontaneity as an important component of what constitutes valid "conversational" speech.
The reason for this is that genuine conversation consists of a constant back and forth of information flow between two people, where the context of the conversation is constantly changing as a result of what is being said in the conversation.
It follows that a speaker cannot plan in advance what will be said in a genuine conversation, because there is no way of knowing in advance what will be said, and therefore no way of knowing what the context will be at any point in the conversation.
The most valuable speech in a conversation is speech that is completely the result of the speaker's thoughts based on the current state of the conversation.
Any speech that is pre-planned is speech that is already "out-of-date".
Of course there are always reasons why people need to pre-plan some of what they want to say to other people. So a listener has to tolerate some degree of non-spontaneity in the speech of a speaker.
But, it remains the case that, other things being equal, more spontaneous speech is of more value to the listener than less spontaneous speech.
Spontaneity also relates to the characteristics of music. Indeed, almost all characteristics of music require pre-planning.
A Simple Example: Rhyming
A simple example of an attribute that implies non-spontaneity is rhyming.
It is arguable whether rhyming is actually a "musical" attribute, although certainly it is very commonly and consistently used in song, including modern popular music.
But I want to give it as a simple example that demonstrates the general principle that the appearance of a certain characteristic in putative conversational speech implies non-spontaneity on the part of the speaker.
Rhyming is an example of a feature which is not normally used in conversational speech, and it is easy to explain why rhyming implies non-spontaneity.
Indeed, to include a rhyme in something you say requires an effort to think ahead of time about exactly what you are going to say and how it is going to sound. The occurrence of rhymes in your speech will alert your listener to the possibility that whatever you just said is not a spontaneous translation of your current thought processes. In effect, the listener wonders: "Does your rhyming speech come from what you are thinking right now, or does your speech rhyme because you have planned it out so that it rhymes?"
If the listener decides, as seems reasonable, that rhyming implies non-spontaneity, then the listener will not be inclined to respond to rhyming speech as if it was genuine conversational speech.
Under my hypothesis that music is a superstimulus for the perception of non-spontaneity in speech, rhyming, as a marker of non-spontaneity, actually is a musical characteristic, which is why it is such a common feature of musical speech (ie song lyrics).
(I will not go into any more detail in this article about other specific characteristics of music and how they might indicate non-spontaneity, because that is a whole major topic of analysis and research in itself, including a whole lot of research that I haven't yet done. Also, the "characteristics" of music are not themselves fully understood – for example there is no known mathematical formula that reliably distinguishes melody from non-melody. Explaining how all known characteristics of music might be a superstimulus for something is difficult to do if we don't properly know what those characteristics actually are.)
Processing Conversational Speech
To develop the non-conversational superstimulus hypothesis further, I need to make some assumptions about how the brain processes conversational speech. These assumptions will include assumptions about what special difficulties are involved in the processing of conversational speech, and how those difficulties might be overcome.
A fundamental issue with conversational speech is that both speaker and listener face uncertainty about the value of the interaction.
I will analyse this issue by considering how a listener processes the speech of a speaker.
Of course the very nature of conversation means that the roles of speaker and listener are constantly changing back and forth.
But it will be useful to start with a detailed analysis of how the listener processes one utterance spoken by the speaker and directed to that listener.
The specific issues with regard to uncertainty about the value of a conversational transaction include:
- The listener may wonder if what the speaker says is a genuine expression of the speaker's current thought processes.
- The speaker may wonder if the listener actually believes the truth of what they, the speaker, just said.
- If the listener goes to the effort of determining what they believe about the truth of what the speaker said, and responds to the speaker with some indication of this belief state, but the speaker's speech was not an honest indication of the speaker's private thought processes, then the listener has given information to the speaker without getting anything of value in return.
These considerations reflect two basic aspects of cost and value:
- the effort required to process information,
- the value of information, and the downside of a first party providing information about their own private mental state to a second party without receiving quid pro quo.
In an ideal situation, a conversation would be mutually beneficial to both parties.
But, in practice, any transaction between self-interested parties requires each party to make efforts to ensure, as much as possible, that the transaction is as beneficial to them as it is to the other party.
The listener's sensitivity to possible characteristics of a speaker's speech implying non-spontaneity is what helps the listener to make sure they are getting a "fair deal" out of the whole conversational transaction.
How Music Could be a Superstimulus for the Perception of Non-Spontaneous Non-Conversational Speech
The reader may be wondering exactly how it is that music could be a superstimulus for the perception of non-spontaneous non-conversational speech.
I can speculate on what the detailed answer to this question might be – perhaps starting with the example of rhyming discussed above, and proceeding to look for plausible accounts of how all the complex features of melody, harmony and rhythm could be superstimuli for the perception of non-spontaneity.
However I will start by pointing out that we can legitimately hypothesize that something is a superstimulus for something else, without necessarily being able to state the details up front about how this happens.
For example, in the case of makeup, we can readily believe, from our experience of seeing women made up and not made up, that makeup involves superstimuli for the attractiveness of the female face, even though we might not be able, in the first instance, to explain the detailed mechanics of how makeup succeeds in causing us to perceive a woman as being more attractive.
Taking this consideration into account, the following is an abstract outline of how it is that music could be a superstimulus for the perception of non-spontaneity in speech directed at the listener:
- When a speaker spontaneously says something to a listener that reflects their current and honest thoughts, their speech has certain characteristics.
- When a speaker is saying something that they have previously rehearsed, or they are saying something while being strongly aware of how the resulting speech will sound, or in some other respect their speech is the result of deliberate planning, then that speech has characteristics different from those of genuine spontaneous conversational speech.
- Music has characteristics which are an extreme version of the characteristics of speech which is not spontaneous conversational speech.
Modelling how the Listener Processes and Responds to Conversational Speech
To develop the superstimulus hypothesis further, I need to construct a model of how a listener processes putative conversational speech uttered by a speaker, in a manner that reflects the listener's need to deal with these uncertainties about the cost and value of processing and responding to the speaker's speech.
I propose the following model (a code sketch of the model follows the description below):
1. Every spoken utterance has a meaning, which the listener needs to determine.
2. The meaning needs to be determined before the listener can take any action to evaluate its possible truth value.
3. Thus the meaning exists in an initial state of being purely hypothetical.
4. To justify the effort that might be made to evaluate the truth of the meaning, the listener needs to make some estimate of the significance of the meaning of the utterance, should that meaning turn out to be true.
5. This significance can be encoded in terms of the emotional response that the listener would have.
6. Initially, this emotional response exists in a form that is purely hypothetical, since we are still at a point where no effort has been made to evaluate the truth value of the utterance.
7. Any decision about whether it is worth making the effort to evaluate the truth of the utterance will be based on two things:
   - the intensity of the emotional response, ie the significance of the meaning of the utterance should it turn out to be true – the more intense the emotional response, the more effort it is worth making to evaluate the truth value of the utterance
   - the perception that the speech is genuine conversational speech reflecting the current thought processes of the speaker – the more that the speech is judged to be genuine conversational speech, the more effort it is worth making to evaluate the truth value of the utterance
Prior to actual evaluation of the truth value, the emotional response exists in a purely hypothetical form based on the determined meaning of the utterance.
After an estimation of the truth value has occurred, the intensity of the emotional response will be reduced according to the degree to which the utterance is determined to be possibly not true.
This reduction of emotional response happens because it would be incorrect for the listener's brain to retain a representation of intense emotion with respect to an assertion made by another party which has been judged to be probably not true.
Finally, once the listener has evaluated the truth value of the utterance, the listener can respond to the speaker, where the listener's response includes an indication of their belief about the truth of the original utterance.
(At this point the listener now becomes the speaker, and the speaker becomes the listener, and the process starts again, with the roles reversed.)
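To make the model easier to follow, here is a minimal sketch of the listener's side of it in Python. All of the names, the 0-to-1 numeric scales, and the decision threshold are my own illustrative assumptions; the helper steps (determining meaning, estimating emotional significance, judging spontaneity, evaluating truth) stand in for brain processes that the model does not attempt to specify in detail.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ListenerState:
    meaning: str                     # items 1-3: the (initially hypothetical) meaning
    emotional_intensity: float       # items 4-6: response to the meaning, as if it were true
    truth_estimate: Optional[float]  # None while no truth evaluation has been made
    response: Optional[str]          # what the listener says back, if anything


def process_utterance(utterance: str,
                      hypothetical_intensity: float,
                      perceived_spontaneity: float,
                      truth_if_evaluated: float,
                      effort_threshold: float = 0.2) -> ListenerState:
    """One pass of the listener-side model for a single utterance.

    hypothetical_intensity: 0..1, emotional significance of the meaning, if it were true
    perceived_spontaneity:  0..1, how genuinely conversational the speech seems
    truth_if_evaluated:     0..1, the truth estimate the listener would arrive at
    """
    meaning = f"meaning of {utterance!r}"    # items 1-3
    intensity = hypothetical_intensity       # items 4-6

    # Item 7: effort is justified only if the (hypothetical) emotional response is
    # intense enough AND the speech seems like genuine conversational speech.
    if intensity * perceived_spontaneity <= effort_threshold:
        # Truth evaluation is suppressed: the meaning stays hypothetical, the
        # emotional response keeps its full intensity, and no reply is produced.
        return ListenerState(meaning, intensity, truth_estimate=None, response=None)

    # Otherwise: evaluate the truth, reduce the emotional response according to
    # how likely the utterance is to be untrue, and reply with the belief state.
    truth = truth_if_evaluated
    intensity *= truth
    reply = f"I judge that to be true with probability {truth:.2f}"
    return ListenerState(meaning, intensity, truth_estimate=truth, response=reply)
```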
Fitting Music Into The Model
I will now describe how I believe music interacts with this proposed model of conversational speech processing.
In the first instance, I will assume that the music contains actual speech, ie it consists of a song with lyrics.
(The more general case, where music might not include any words, will depend on extensions to the model that include the secondary evolution of music as a function in itself, ie not just a side effect of how conversational speech is processed – I will elaborate on this later on in this article. Also, there is an in-between case where the lyrics have a meaning that is partly indeterminate, which is a state halfway between lyrics that fully specify a meaning and no lyrics at all.)
Recall my hypothesis that music acts as a negative superstimulus: a superstimulus for the perception of non-spontaneous, non-conversational speech.
In effect, music tells the listener: "This speech is not the spontaneous translation of the speaker's current thought processes into speech."
As a result, item 7 in the model applies, and the listener's brain suppresses all effort to evaluate the truth of the meaning of the spoken utterance (ie the song lyrics).
A further consequence of this suppression is that the intensity of the emotional response to the hypothetical meaning remains unchanged.
Thus, the listener's brain remains stuck in an intermediate state where:
- The meaning of a spoken utterance is represented in the brain.
- The listener's beliefs about the truth of the utterance remain undetermined.
- The emotional response to this hypothetical meaning is represented in the brain, and it is represented at full strength, undiminished by any possible evaluation of the truth value of the utterance.
It is this mental state that accounts for the observed emotional effects of music, and it also accounts for the preferential association of music with hypothetical or fictional situations, as opposed to real situations.
A Note on Deferred Truth Evaluation
The effect of music does not necessarily prevent the listener from evaluating the truth of an utterance forever.
In practice, even when a listener does not immediately evaluate the truth of a speaker's utterance, and respond to the speaker with information about that evaluation, the listener may still remember what was said, and, at some later time, think about what the speaker said, and whether or not it was believable.
My hypothesis about music is more about the listener's perception of the value of putative conversational speech at the time that it happens: whether it is worth the listener's while to make an immediate effort to evaluate the truth of what has just been said, and whether the speaker is perceived to have a commitment to spontaneous, on-going conversation that justifies the listener being happy to provide honest, spontaneous and immediate feedback to the speaker.
Summary: Processing of Speech, Without Music vs With Music
Without Music
You, as speaker, say something to me, the listener, and your speech has no musical qualities:
- I determine the meaning of what you said.
- I determine what my emotional response would be to that meaning, if that meaning was true.
- I determine my beliefs about the truth value of what you said.
- The intensity of my emotional response is reduced in accordance with my estimate of the truth value.
- I respond to what you say, and my response includes an indication of my beliefs about the truth value of what you said.
With Music
You, as speaker, say something to me, the listener, and your speech is musical (both cases are illustrated in the code sketch after this list):
- I determine the meaning of what you said.
- I determine what my emotional response would be to that meaning, if that meaning was true.
- Because of the musical quality of your speech, I do not make any effort to evaluate the truth value of what you said.
- Processing stops: my brain remains in an intermediate state, where the meaning of what you said is represented as a hypothetical meaning with no estimated truth value, and my emotional response continues to be represented with full intensity, as if the meaning was true.
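Using the process_utterance sketch from the model section above, and purely illustrative numbers, the two cases look like this:

```python
# Without music: the speech seems spontaneous, so the truth gets evaluated,
# the emotional intensity is scaled down, and a response is produced.
no_music = process_utterance("I saw your brother today",
                             hypothetical_intensity=0.8,
                             perceived_spontaneity=0.9,
                             truth_if_evaluated=0.5)
print(no_music)   # truth_estimate set, intensity reduced, response produced

# With music: the musical qualities signal non-spontaneity, truth evaluation
# never starts, and the emotional response stays at full intensity.
sung = process_utterance("I saw your brother today (sung)",
                         hypothetical_intensity=0.8,
                         perceived_spontaneity=0.0,
                         truth_if_evaluated=0.5)
print(sung)       # truth_estimate is None, intensity unchanged, no response
```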
The Secondary Evolution of Music, as a Function
The model I have provided accounts for some of the basic properties of music.
However it proposes that the effect of music on the listener exists purely as a side-effect of the detailed logic of how the brain processes conversational speech.
If music is just an accidental side-effect of something, and not at all useful in itself, should there not be evolutionary pressure for this effect to disappear?
After all, we know that music uses up substantial amounts of time, money and effort in the lives of those people who enjoy listening to music.
Also, looking at the model described above, we can see that, in the superstimulus case, executing the initial steps to determine the meaning and emotional response is wasteful, because if the superstimulus is sufficiently intense, then, no matter what the level of the emotional response, the decision will be made not to execute any more steps in the process. It would follow that there should be evolutionary pressure not even to start the sequence of steps to process perceived speech, in those cases where the perceived non-conversationality is sufficiently extreme.
So if music is totally useless, it would be beneficial for the human species to evolve a way to avoid the associated waste of time, money and effort.
Dating the Origin of Music
A secondary question, relevant to explaining why music has not evolved away, is how long music has existed as an aspect of human behaviour.
For example, if music came into existence a very short time ago, on the scale of evolutionary history, then it may be the case that music is essentially useless, and the reason that it still exists is because it hasn't had enough time to evolve into non-existence.
The evidence for the prehistoric existence of human music is rather limited.
However, we can be reasonably sure that music has existed for at least 42,000 years. The evidence for this comes from flutes discovered in the Geissenkloesterle Cave in Germany.
Even 42,000 years is quite a long time – if we assume 20 years per generation, that's 2100 generations, and quite a lot of evolution can occur in that number of generations if there is sufficient selective pressure.
The Intrinsic Benefit Provided by Music
Given that music is still a thing, at least 42,000 years after its origin, and it hasn't evolved into non-existence, we must conclude that there is some intrinsic benefit to the mental state that music induces, ie a mental state where, according to the model that I am proposing:
- A hypothetical meaning is represented in the listener's brain.
- An emotional response is determined for the meaning.
- No attempt is made to evaluate the possible truth of said meaning.
What benefit could such a mental state provide?
At this point I will assume, without giving any specific reasons, that such a mental state does provide a benefit.
In other words, for some reason, it is beneficial for a person to be able to consider hypothetical meanings, and the emotional responses attached to those hypothetical meanings, all without considering the actual likely truth value of those hypothetical meanings.
(Actually, we could speculate a little bit about what the benefit might be. For example, it might be beneficial to imagine things without any prior reason to believe they will come true, and to understand one's full emotional responses to those imagined things, as if they were true, just for the sake of better understanding one's own emotions.)
If there is such a benefit, then there is no particular reason why the hypothetical meanings should have to be derived from actual speech (ie the song lyrics), and we can formulate a plausible secondary hypothesis: whenever music is heard, the evaluation of the truth of any hypothetical meaning currently represented in the listener's brain is suppressed, whatever the source of that meaning, and the emotional response to that meaning therefore retains its full intensity.
This secondary hypothesis releases us from my initial assumption made above, that the emotional effect of music only applies to the emotional implications of the contents of song lyrics.
Instead it is sufficient to assume that music is being heard, and that something is suggesting a hypothetical meaning to be represented in the brain of the listener, be it from song lyrics, or the listener's own thoughts, or something else (like the content of a film, or of a music video).
In this more general case, just as in the more specific case where the meaning comes from the content of lyrics, the emotional response to the hypothetical meaning is determined, and the music acts to suppress the estimation of the truth value of the hypothetical meaning, and as a consequence, the emotional response retains its full intensity in the brain of the listener.
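As a rough illustration of this more general case, the earlier sketch can be loosened so that the hypothetical meaning is supplied by any source at all while music is heard. Again, every name and number here is my own illustrative assumption, not part of the hypothesis:

```python
from dataclasses import dataclass


@dataclass
class MusicInducedState:
    meaning_source: str          # lyrics, the listener's own thoughts, a film scene, ...
    meaning: str                 # the hypothetical meaning currently represented
    emotional_intensity: float   # full-strength response, as if the meaning were true
    truth_evaluated: bool        # stays False for as long as the music has its effect


def hold_meaning_while_hearing_music(meaning_source: str,
                                     meaning: str,
                                     hypothetical_intensity: float) -> MusicInducedState:
    """While music is heard, represent a meaning from ANY source, with its full
    emotional response, and suppress any evaluation of its truth."""
    return MusicInducedState(meaning_source, meaning, hypothetical_intensity,
                             truth_evaluated=False)


for source, meaning in [("song lyrics", "my love has left me"),
                        ("listener's own thoughts", "one day I will leave this town"),
                        ("film scene", "the hero is about to be betrayed")]:
    print(hold_meaning_while_hearing_music(source, meaning, hypothetical_intensity=0.9))
```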