I have developed the hypothesis that music is a super-stimulus for the glial perception of slow speech tempo.
The main assumption of this hypothesis is that glial cells perceive speech tempo and use that perception to regulate all neuronal processing that occurs "downstream" of speech.
This regulation can be regarded as a form of synchronisation: if someone speaks faster, all the downstream neuronal processing involved in parsing and understanding the speech speeds up, and if someone speaks slower, the downstream processing slows down.
A secondary assumption implicit in the hypothesis is that this type of regulation is quite specific to speech.
Speech is a complex dynamic phenomenon, and when we listen to speech our brains devote considerable resources and machinery to the problem of efficiently and correctly processing speech.
But what is it about speech that is different? What makes speech so special that it needs this particular type of neural modulation?
Most other dynamic scenarios that our brains need to process involve physical motion. For example -
- Being chased by someone or something, for example a large dangerous predator.
- Chasing someone or something, for example hunting a prey animal.
- Being in a fight.
- In general, moving oneself from one place to another, whether it be running, jumping or climbing.
The most fundamental difference between perception of speech and perception of physical motion has to do with time-scaling invariance.
In the case of speech, a person can talk faster or slower, but the speech tempo is not directly relevant to the meaning of the speech.
It is true that speech tempo may tell us something about the identity or mental state of the speaker.
But for the purpose of parsing and understanding the meaning of the speech, the tempo is completely irrelevant.
The mechanics of speech perception become simpler and more robust if the overall speech tempo can somehow be factored out altogether. In other words, if the speech tempo could be estimated, and the corresponding time scaling applied globally to all the downstream neural processing systems, then no further adjustment would be required to ensure correct processing of perceived speech, independently of variation in speech tempo.
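The idea of factoring out tempo can be sketched as a toy calculation. This is purely illustrative of the time-scaling concept, not a model of any actual neural mechanism; the function name and the numbers are my own inventions. The tempo is estimated as the mean interval between syllable onsets, and onset times are rescaled by that estimate, so the same rhythmic pattern spoken at different rates maps to the same normalised representation:

```python
# Toy sketch of tempo normalisation (illustrative only; the function
# name and numbers are invented, and this is not a claim about how
# glial cells or neurons actually do it).

def normalise_tempo(onset_times):
    """Rescale syllable onset times so the mean inter-onset interval is 1."""
    intervals = [b - a for a, b in zip(onset_times, onset_times[1:])]
    mean_interval = sum(intervals) / len(intervals)  # crude "tempo" estimate
    t0 = onset_times[0]
    # Round to suppress floating-point noise in the comparison below.
    return [round((t - t0) / mean_interval, 6) for t in onset_times]

# The same rhythmic pattern spoken slowly and then twice as fast:
slow = [0.0, 0.4, 0.6, 1.2, 1.4]   # syllable onsets in seconds
fast = [0.0, 0.2, 0.3, 0.6, 0.7]   # same pattern, half the duration

print(normalise_tempo(slow) == normalise_tempo(fast))  # True
```

After normalisation the two utterances are indistinguishable, which is exactly the property a downstream parser would want: one representation per sentence, regardless of speaking rate.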
Perception of physical motion, by contrast, is not time-scaling invariant.
That is, absolute speed is always relevant. Absolute time intervals are always relevant. Most physical motion is influenced and strongly constrained by gravity, and gravity, on the surface of the Earth, has a single fixed value.
Absolute speed almost always matters in physical situations. A sentence spoken at 100 bpm or 150 bpm is the same sentence in either case, with the same syntax and the same meaning. But a large carnivore chasing you at 15 km/hr is not the same thing as a large carnivore chasing you at 10 km/hr, and the optimal action on your part might be quite different in each case. For example, the difference between "Yes, I can make it to the tree" and "I'm not going to reach the tree, and I need to look for a weapon because my only option is to confront it".
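The asymmetry in the chase scenario can be made concrete with a back-of-the-envelope calculation. All the numbers and the helper function here are invented for illustration; the point is only that changing one absolute speed flips the outcome, whereas rescaling a sentence's tempo changes nothing about its meaning:

```python
# Illustrative chase arithmetic (invented numbers, not real data):
# does changing the predator's absolute speed change the best action?

def reaches_tree_first(my_speed_kmh, predator_speed_kmh,
                       my_distance_m, predator_distance_m):
    """True if I reach the tree before the predator does."""
    my_time = my_distance_m / (my_speed_kmh / 3.6)            # seconds
    predator_time = predator_distance_m / (predator_speed_kmh / 3.6)
    return my_time < predator_time

# I am 50 m from the tree, running at 8 km/hr (takes 22.5 s).
# The predator is 80 m from the tree.
print(reaches_tree_first(8, 10, 50, 80))  # True:  predator needs 28.8 s
print(reaches_tree_first(8, 15, 50, 80))  # False: predator needs 19.2 s
```

The same 50% tempo change that is meaningless for a sentence (100 bpm versus 150 bpm) is, for the chase, the difference between climbing to safety and being caught.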
It follows that the evolution of speech and speech perception created a situation where it was beneficial to evolve a system for applying overall time-scaling to the neural processing of speech and all the downstream processing. Prior to the evolution of speech, such a system would not have been of any benefit at all with regard to the neural perception of physical motion.
The implication is that if the human species has evolved such a system specific to the efficient perception and processing of speech, then we would not expect it to exist in non-human animals, because no other animal has anything like human speech.
And if music is a side-effect of the existence of such a system, then this explains why non-human animals do not respond in any significant way to music.