Pitch is determined by naturally occurring periodic sounds

"Pitch is determined by naturally occurring periodic sounds" is another important paper by two of the authors of "The Statistical Structure of Human Speech Sounds Predicts Musical Universals". (Unfortunately the full text of "Pitch is determined ..." is not currently available online, unless you pay US$30.)

Although "Pitch is determined ..." has less direct relevance to my own theory than "Statistical Structure ...", it is still of interest for a number of reasons. These reasons fall more or less into two groups: questions that the paper answers, and questions that it leaves unanswered.

"Pitch is determined ..." is a continuation of a general theory of perception initiated by Dale Purves. His theory suggests that human perception is often a function of expectations, where the expectations are derived empirically from experience. Most of the development of this theory has related to vision, and can be read about at Purves' website, and in his book Why We See What We Do (which I admit I have not yet read).

Applying the general theory to pitch perception goes like this:

The conclusion is that the perceived pitch of a sound is equivalent to the estimated fundamental frequency of that sound, on the assumption that the sound is actually a voiced speech sound. Applied to the perception of musical pitch, this conclusion implies a direct link between the perception of music and the perception of speech.

What makes Schwartz and Purves' analysis convincing is that they are able to explain a number of features of human pitch perception, including:

So what are the unanswered questions? Schwartz and Purves' method for calculating the estimated fundamental frequency is derived purely from theoretical considerations, and they specifically do not claim to be stating a theory about what calculation is actually performed when people perceive pitch, or even about whether that calculation is carried out by evolution or by the brain.

But if I do calculation X, and calculation Y, and these calculations always give the same result, then in some sense calculation X and calculation Y are the same calculation. Of course there are often different ways of doing what we might consider to be the same calculation. For example, we can add two numbers by starting at the first number and counting on a number of steps equal to the second number until we reach the answer (a bit like doing addition on your fingers). Or we can use the normal method of adding digits from the right, carrying where necessary, to get the answer; this is quicker, but it gives the same answer. Another way to do addition is to precalculate the answers for a certain set of numbers, creating a lookup table, and then do the actual calculations by looking them up in the table. This works as long as the numbers we are required to add have corresponding entries in the table.
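
To make the point concrete, here is a small illustration (in Python, my own example rather than anything from the paper) of three procedures that implement the same calculation in different ways:

```python
def add_by_counting(a, b):
    """Start at a and count on b steps -- like adding on your fingers."""
    result = a
    for _ in range(b):
        result += 1
    return result

def add_by_digits(a, b):
    """Add digit by digit from the right, carrying where necessary."""
    digits_a = [int(d) for d in reversed(str(a))]
    digits_b = [int(d) for d in reversed(str(b))]
    result_digits, carry = [], 0
    for i in range(max(len(digits_a), len(digits_b))):
        da = digits_a[i] if i < len(digits_a) else 0
        db = digits_b[i] if i < len(digits_b) else 0
        total = da + db + carry
        result_digits.append(total % 10)
        carry = total // 10
    if carry:
        result_digits.append(carry)
    return int("".join(str(d) for d in reversed(result_digits)))

# Precalculated lookup table: only works for numbers that have entries.
ADDITION_TABLE = {(a, b): a + b for a in range(10) for b in range(10)}

def add_by_lookup(a, b):
    """Look the answer up in a precalculated table."""
    return ADDITION_TABLE[(a, b)]

# All three methods agree wherever they are all defined.
assert add_by_counting(7, 5) == add_by_digits(7, 5) == add_by_lookup(7, 5) == 12
```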

With respect to the calculation described in the paper, we can note the following:

The first problem with this model is that there would appear to be no time-domain representation of sound anywhere in the human brain, other than the initial physical vibrations that enter the ear. Sounds that come into the ear are immediately converted into:

However, a cross-correlation is essentially a convolution (of one signal with a time-reversed copy of the other), and a convolution can be transformed via Fourier analysis from the time domain to a frequency/phase domain. This transformation is specified by the Cross-Correlation Theorem.
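
To spell this out (my own restatement, using one common convention, where F and G are the Fourier transforms of the real-valued signals f and g): cross-correlating f with g is the same as convolving a time-reversed copy of f with g,

\[(f \star g)(t) = \int^\infty_{-\infty} f(\tau)\,g(t+\tau)\,d\tau = \int^\infty_{-\infty} \tilde{f}(t-\tau)\,g(\tau)\,d\tau, \qquad \tilde{f}(\tau) = f(-\tau),\]

which is why the Fourier machinery that applies to convolutions applies to cross-correlations as well.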

Looking at the derivation of the theorem, of particular interest is the second last line (of the first set of equations):

\[\int^\infty_{-\infty} \overline{F}(\nu)G(\nu)e^{-2\pi i \nu t}d\nu\]

The integrand \(\overline{F}(\nu)G(\nu)e^{-2\pi i \nu t}\) can be interpreted as the amplitude of F at frequency \(\nu\) (i.e. \(|F(\nu)|\)), multiplied by the amplitude of G at that frequency (\(|G(\nu)|\)), multiplied by a factor \(\frac{\overline{F(\nu)}}{|F(\nu)|}\frac{G(\nu)}{|G(\nu)|}e^{-2\pi i \nu t}\) representing the relative phase between F and G at that frequency for a given time offset t. (Remember that both F and G are complex functions of \(\nu\), even though their time-domain equivalents f and g are real functions.)
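
As a quick numerical sanity check of this frequency-domain formulation (a discrete, circular analogue of the continuous integral above; the array sizes and random test signals are arbitrary choices of mine), the same cross-correlation values come out whether we work in the time domain or multiply \(\overline{F}\) by \(G\) and transform back:

```python
import numpy as np

rng = np.random.default_rng(0)
f = rng.standard_normal(256)
g = rng.standard_normal(256)

# Circular cross-correlation computed directly in the time domain:
# r[k] = sum_n f[n] * g[n + k]  (indices taken mod N)
direct = np.array([np.sum(f * np.roll(g, -k)) for k in range(len(f))])

# The same values via the cross-correlation theorem:
# take conj(F) * G in the frequency domain, then inverse-transform.
F = np.fft.fft(f)
G = np.fft.fft(g)
via_fft = np.fft.ifft(np.conj(F) * G).real

print(np.allclose(direct, via_fft))  # True
```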

The formulation of the cross-correlation in the frequency domain in terms of amplitudes and relative phase factors highlights the question of what effect relative phase has on the human perception of sound and of pitch in particular.

With regard to Schwartz and Purves' model of pitch perception, this raises the following questions:

The expression "phase deafness" refers to the observation that most aspects of human hearing do not depend on relative phases of harmonics within a given sound. We might therefore want to consider altering the model of pitch perception to ignore relative phase. This is easy enough to do; we can alter the frequency domain correlation formula to be:

\[\int^\infty_{-\infty} |F(\nu)| |G(\nu)| d\nu\]

This value is an upper bound for the original cross-correlation formula, whatever the offset t, and it is easier to calculate because it does not depend on t. It would be interesting to apply this simplified formula to Schwartz and Purves' data to see if it also successfully reproduced the various observed features of pitch perception modelled by the original formula. A more refined model might attempt to take into account the more limited representation of phase that does exist in the cortex, i.e. neurons phase-locked to vibrations at frequencies up to 4000 Hz.
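
A quick check of the upper-bound claim (again a discrete, circular analogue, and again my own illustration rather than anything in the paper): the phase-blind quantity bounds the cross-correlation at every offset, and it is unchanged if the phases of either signal are scrambled.

```python
import numpy as np

rng = np.random.default_rng(1)
f = rng.standard_normal(512)
g = rng.standard_normal(512)

F = np.fft.fft(f)
G = np.fft.fft(g)

# Circular cross-correlation at every offset t, via the cross-correlation theorem.
cross_corr = np.fft.ifft(np.conj(F) * G).real

# Phase-blind version: keep only the amplitudes.  (The 1/N factor matches
# numpy's ifft normalisation of the cross-correlation above.)
phase_blind = np.sum(np.abs(F) * np.abs(G)) / len(f)

# The phase-blind value bounds the cross-correlation at every offset...
print(np.all(np.abs(cross_corr) <= phase_blind))  # True

# ...and it is unchanged if the phases of one signal are scrambled.
scrambled_G = G * np.exp(2j * np.pi * rng.random(len(G)))
print(np.isclose(np.sum(np.abs(F) * np.abs(scrambled_G)) / len(f), phase_blind))  # True
```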

Calculation: Evolutionary or "In-Brain"?

Purves' general theory of cognitive expectation does not make any assumption about whether the empirical data used to generate the expectations are accumulated by evolution (via natural selection), or whether they are accumulated in the brain over the individual organism's lifetime. Every time the authors of "Pitch is determined ..." make some reference to how such expectations are calculated, they carefully allow for it to be done either by evolution or by the individual person's (or animal's) brain. However, there are very good reasons for supposing that, for any significant accumulation of empirical data, almost all the data is accumulated over the individual's lifetime. In fact there are two major reasons:

Of course evolution is able to record some data in the genome, especially when that is the only way that experience can be processed for the benefit of a species, but the brain, and in particular the human brain, is enormously faster and bigger than natural selection acting on the genome. To give specific comparisons:

Assuming that most of the data accumulation does occur in-brain, and that evolution's role is restricted to defining how the accumulation occurs, we are left with the question of how to represent a database of speech fragments in some part of the human brain, in a form which can be used to perform some equivalent of the multiple cross-correlation calculations required by the pitch perception model.

One problem with the basic model is that the number of data items accumulated in the database and used for each pitch calculation is proportional to the number of speech samples observed. In practice we would expect the brain to devote a finite portion of itself to such a purpose. Furthermore, whereas a computerised database with finite storage would just "fill up" until it "ran out of room", we would expect the neural network equivalent to use all its neurons to store all the information received, in some fuzzy distributed manner, and in such a way that new information always improved the performance of the network. And we would expect that loss of information due to "overflow" would occur in a gradual manner and in such a way that the information lost came from all the accumulated data, and not just from (for instance) the most or least recent data.

It would be a useful exercise to construct a mathematical model which had these properties, including bounded memory usage, and which retained the predictive properties of Schwartz and Purves' current model of pitch perception.
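
As a toy sketch of what such a model might look like (entirely my own construction, not anything proposed in the paper: I assume incoming sounds are reduced to fixed-length amplitude spectra, and that the "database" is a single running summary of those spectra with exponential forgetting), memory usage is bounded, every new observation adjusts the whole summary, and old information fades gradually from all of it rather than being dropped wholesale:

```python
import numpy as np

class SpectralAccumulator:
    """Fixed-size summary of observed amplitude spectra.

    Memory use is constant (one vector of n_bins numbers) however many
    spectra are observed, and forgetting is gradual: every update slightly
    down-weights *all* previously accumulated information.
    """

    def __init__(self, n_bins: int, forgetting: float = 0.001):
        self.summary = np.zeros(n_bins)
        self.forgetting = forgetting

    def observe(self, amplitude_spectrum: np.ndarray) -> None:
        # Exponential moving average: old data decays, new data is blended in.
        self.summary *= (1.0 - self.forgetting)
        self.summary += self.forgetting * amplitude_spectrum

    def similarity(self, amplitude_spectrum: np.ndarray) -> float:
        # Phase-blind comparison of a new sound against the stored summary,
        # in the spirit of the simplified |F||G| formula above.
        return float(np.sum(self.summary * amplitude_spectrum))


# Example: feed in spectra of simulated periodic sounds with a few harmonics.
rng = np.random.default_rng(2)
acc = SpectralAccumulator(n_bins=1024)
for _ in range(5000):
    spectrum = np.zeros(1024)
    f0 = int(rng.integers(80, 300))      # fundamental frequency bin
    for h in range(1, 1024 // f0):       # harmonics with decaying amplitude
        spectrum[h * f0] = 1.0 / h
    acc.observe(spectrum)
```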

Pitch Perception vs. "Consonance Perception"

There is one significant difference between the "Pitch is determined ..." paper and the "Statistical Structure ..." paper, which has to do with the relationship, in each case, between the empirical data that is accumulated and the perception that the data is used for.

In the case of pitch perception, the empirical data consists of speech samples and estimates of the fundamental frequencies of those speech samples, which are then used to calculate an estimated fundamental frequency (i.e. the pitch) of a given sound (where in many cases the sound in question is itself a speech sound).

In the case of consonance perception, empirical data about the co-occurrence of harmonic frequencies within individual sounds is used to calculate the estimated "consonance" of the interval between the pitch values of two different sounds, where the two different sounds may or may not be occurring simultaneously. So the thing being measured is quite distinct from the measurements contained in the accumulation of empirical data being used to measure it. Comparing this to Purves' work on visual perception, the model of pitch perception fits exactly into the general model, whereas the model of consonance perception is something different: the calibration of one perception by means of historical accumulation of information from a different perception.

One consequence of this distinction is that it is not at all clear what perceived consonance means, since it is not really an expected value of anything. My own theory solves this problem by showing that consonance is not really a property of intervals, but is actually a property of pairs of pitch values, which the brain then uses to help it determine which intervals are to be deemed equal to which other intervals (by means of continuity: if (1) the interval \(X_1\) to \(Y_1\) is consonant, (2) the interval \(X_2\) to \(Y_2\) is consonant, (3) \(X_1\) is close to \(X_2\), and (4) \(Y_1\) is close to \(Y_2\), then the two intervals are deemed to be equal).
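
For what it's worth, the continuity rule in that parenthesis can be written down very directly. Here is a toy encoding of it (the is_consonant predicate and the closeness tolerance are placeholders of mine, not part of the theory as stated):

```python
def deemed_equal(x1, y1, x2, y2, is_consonant, tolerance=0.5):
    """Toy version of the continuity rule: the intervals x1->y1 and x2->y2
    are deemed equal if both pitch pairs are consonant and corresponding
    endpoints are close (here, within `tolerance` of each other)."""
    return (is_consonant(x1, y1) and is_consonant(x2, y2)
            and abs(x1 - x2) <= tolerance
            and abs(y1 - y2) <= tolerance)
```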