Pitch is determined by naturally occurring periodic sounds

"Pitch is determined by naturally occurring periodic sounds" is another important paper by two of the authors of "The Statistical Structure of Human Speech Sounds Predicts Musical Universals". (Unfortunately the full text of "Pitch is determined ..." is not currently available online, unless you pay US$30.)

Although "Pitch is determined ..." has less direct relevance to my own theory than "Statistical Structure ...", it is still of interest for a number of reasons. These reasons fall more or less into two groups: questions that the paper answers, and questions that it leaves unanswered.

"Pitch is determined ..." is a continuation of a general theory of perception initiated by Dale Purves. His theory suggests that human perception is often a function of expectations, where the expectations are derived empirically from experience. Most of the development of this theory has related to vision, and can be read about at Purves' website, and in his book Why We See What We Do (which I admit I have not yet read).

Applying the general theory to pitch perception goes like this:

The conclusion is that the perceived pitch of a sound is equivalent to the estimated fundamental frequency of that sound, on the assumption that the sound is actually a voiced speech sound. Applied to the perception of musical pitch, this conclusion implies a direct link between the perception of music and the perception of speech.

What makes Schwartz and Purves' analysis convincing is that they are able to explain a number of features of human pitch perception, including:

So what are the unanswered questions? Schwartz and Purves' method for calculating the estimated fundamental frequency is derived purely from theoretical considerations, and they specifically do not claim to be stating a theory about what calculation is actually performed when people perceive pitch, or even about whether that calculation is carried out by evolution or by the brain.

But if I do calculation X, and calculation Y, and these calculations always give the same result, then in some sense calculation X and calculation Y are the same calculation. Of course there are often different ways of doing what we might consider to be the same calculation. For example, we can add two numbers by starting at the first number and counting on a number of steps equal to the second number until we reach the answer (a bit like doing addition on your fingers). Or we can use the normal method of adding digits from the right, carrying where necessary, to get the answer; this is quicker, but it gives the same answer. Another way to do addition is to precalculate the answers for a certain set of numbers, creating a lookup table, and then do the actual calculations by looking them up in the table. This works as long as the numbers we are required to add have corresponding entries in the table.
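
To make the point concrete, here is a small illustration (in Python, my own example rather than anything from the paper) of three procedures that implement the same calculation in different ways:

```python
def add_by_counting(a, b):
    """Start at a and count on b steps -- like adding on your fingers."""
    result = a
    for _ in range(b):
        result += 1
    return result

def add_by_digits(a, b):
    """Add digit by digit from the right, carrying where necessary."""
    digits_a = [int(d) for d in reversed(str(a))]
    digits_b = [int(d) for d in reversed(str(b))]
    result_digits, carry = [], 0
    for i in range(max(len(digits_a), len(digits_b))):
        da = digits_a[i] if i < len(digits_a) else 0
        db = digits_b[i] if i < len(digits_b) else 0
        total = da + db + carry
        result_digits.append(total % 10)
        carry = total // 10
    if carry:
        result_digits.append(carry)
    return int("".join(str(d) for d in reversed(result_digits)))

# Precalculated lookup table: only works for numbers that have entries.
ADDITION_TABLE = {(a, b): a + b for a in range(10) for b in range(10)}

def add_by_lookup(a, b):
    """Look the answer up in a precalculated table."""
    return ADDITION_TABLE[(a, b)]

# All three methods agree wherever they are all defined.
assert add_by_counting(7, 5) == add_by_digits(7, 5) == add_by_lookup(7, 5) == 12
```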

With respect to the calculation described in the paper, we can note the following:

The first problem with this model is that there would appear to be no time-domain representation of sound anywhere in the human brain, other than the initial physical vibrations that enter the ear. Sounds that come into the ear are immediately converted into:

However, a cross-correlation is essentially a convolution (of one signal with a time-reversed copy of the other), and a convolution can be transformed via Fourier analysis from the time domain to a frequency/phase domain. This transformation is specified by the Cross-Correlation Theorem.
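
To spell this out (my own restatement, using one common convention, where F and G are the Fourier transforms of the real-valued signals f and g): cross-correlating f with g is the same as convolving a time-reversed copy of f with g,

\[(f \star g)(t) = \int^\infty_{-\infty} f(\tau)\,g(t+\tau)\,d\tau = \int^\infty_{-\infty} \tilde{f}(t-\tau)\,g(\tau)\,d\tau, \qquad \tilde{f}(\tau) = f(-\tau),\]

which is why the Fourier machinery that applies to convolutions applies to cross-correlations as well.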

Looking at the derivation of the theorem, of particular interest is the second last line (of the first set of equations):

\[\int^\infty_{-\infty} \overline{F}(\nu)G(\nu)e^{-2\pi i \nu t}d\nu\]

The integrand \(\overline{F}(\nu)G(\nu)e^{-2\pi i \nu t}\) can be interpreted as the amplitude of F at frequency \(\nu\) (i.e. \(|F(\nu)|\)), multiplied by the amplitude of G at that frequency (\(|G(\nu)|\)), multiplied by a factor \(\frac{\overline{F(\nu)}}{|F(\nu)|}\frac{G(\nu)}{|G(\nu)|}e^{-2\pi i \nu t}\) representing the relative phase between F and G at that frequency for a given time offset t. (Remember that both F and G are complex functions of \(\nu\), even though their time-domain equivalents f and g are real functions.)
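
As a quick numerical sanity check of this frequency-domain formulation (a discrete, circular analogue of the continuous integral above; the array sizes and random test signals are arbitrary choices of mine), the same cross-correlation values come out whether we work in the time domain or multiply \(\overline{F}\) by \(G\) and transform back:

```python
import numpy as np

rng = np.random.default_rng(0)
f = rng.standard_normal(256)
g = rng.standard_normal(256)

# Circular cross-correlation computed directly in the time domain:
# r[k] = sum_n f[n] * g[n + k]  (indices taken mod N)
direct = np.array([np.sum(f * np.roll(g, -k)) for k in range(len(f))])

# The same values via the cross-correlation theorem:
# take conj(F) * G in the frequency domain, then inverse-transform.
F = np.fft.fft(f)
G = np.fft.fft(g)
via_fft = np.fft.ifft(np.conj(F) * G).real

print(np.allclose(direct, via_fft))  # True
```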

The formulation of the cross-correlation in the frequency domain in terms of amplitudes and relative phase factors highlights the question of what effect relative phase has on the human perception of sound and of pitch in particular.

With regard to Schwartz and Purves' model of pitch perception, this raises the following questions:

The expression "phase deafness" refers to the observation that most aspects of human hearing do not depend on relative phases of harmonics within a given sound. We might therefore want to consider altering the model of pitch perception to ignore relative phase. This is easy enough to do; we can alter the frequency domain correlation formula to be:

\[\int^\infty_{-\infty} |F(\nu)| |G(\nu)| d\nu\]

This value is an upper bound for the original cross-correlation formula, whatever the offset t, and it is easier to calculate because it does not depend on t. It would be interesting to apply this simplified formula to Schwartz and Purves' data to see if it also successfully reproduced the various observed features of pitch perception modelled by the original formula. A more refined model might attempt to take into account the more limited representation of phase that does exist in the cortex, i.e. neurons phase-locked to vibrations at frequencies up to 4000 Hz.
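
A quick check of the upper-bound claim (again a discrete, circular analogue, and again my own illustration rather than anything in the paper): the phase-blind quantity bounds the cross-correlation at every offset, and it is unchanged if the phases of either signal are scrambled.

```python
import numpy as np

rng = np.random.default_rng(1)
f = rng.standard_normal(512)
g = rng.standard_normal(512)

F = np.fft.fft(f)
G = np.fft.fft(g)

# Circular cross-correlation at every offset t, via the cross-correlation theorem.
cross_corr = np.fft.ifft(np.conj(F) * G).real

# Phase-blind version: keep only the amplitudes.  (The 1/N factor matches
# numpy's ifft normalisation of the cross-correlation above.)
phase_blind = np.sum(np.abs(F) * np.abs(G)) / len(f)

# The phase-blind value bounds the cross-correlation at every offset...
print(np.all(np.abs(cross_corr) <= phase_blind))  # True

# ...and it is unchanged if the phases of one signal are scrambled.
scrambled_G = G * np.exp(2j * np.pi * rng.random(len(G)))
print(np.isclose(np.sum(np.abs(F) * np.abs(scrambled_G)) / len(f), phase_blind))  # True
```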

Calculation: Evolutionary or "In-Brain"?

Purves' general theory of cognitive expectation does not make any assumption about whether the empirical data used to generate the expectations are accumulated by evolution (via natural selection), or whether they are accumulated in the brain over the individual organism's lifetime. Every time the authors of "Pitch is determined ..." make some reference to how such expectations are calculated, they carefully allow for it to be done either by evolution or by the individual person's (or animal's) brain. However, there are very good reasons for supposing that, for any significant accumulation of empirical data, almost all the data is accumulated over the individual's lifetime. In fact there are two major reasons:

Of course evolution is able to record some data in the genome, especially when that is the only way that experience can be processed for the benefit of a species, but the brain, and in particular the human brain, is enormously faster and bigger than natural selection acting on the genome. To give specific comparisons:

Assuming that most of the data accumulation does occur in-brain, and that evolution's role is restricted to defining how the accumulation occurs, we are left with the question of how to represent a database of speech fragments in some part of the human brain, in a form which can be used to perform some equivalent of the multiple cross-correlation calculations required by the pitch perception model.

One problem with the basic model is that the number of data items accumulated in the database and used for each pitch calculation is proportional to the number of speech samples observed. In practice we would expect the brain to devote a finite portion of itself to such a purpose. Furthermore, whereas a computerised database with finite storage would just "fill up" until it "ran out of room", we would expect the neural network equivalent to use all its neurons to store all the information received, in some fuzzy distributed manner, and in such a way that new information always improved the performance of the network. And we would expect that loss of information due to "overflow" would occur in a gradual manner and in such a way that the information lost came from all the accumulated data, and not just from (for instance) the most or least recent data.

It would be a useful exercise to construct a mathematical model which had these properties, including bounded memory usage, and which retained the predictive properties of Schwartz and Purves' current model of pitch perception.
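
As a toy sketch of what such a model might look like (entirely my own construction, not anything proposed in the paper: I assume incoming sounds are reduced to fixed-length amplitude spectra, and that the "database" is a single running summary of those spectra with exponential forgetting), memory usage is bounded, every new observation adjusts the whole summary, and old information fades gradually from all of it rather than being dropped wholesale:

```python
import numpy as np

class SpectralAccumulator:
    """Fixed-size summary of observed amplitude spectra.

    Memory use is constant (one vector of n_bins numbers) however many
    spectra are observed, and forgetting is gradual: every update slightly
    down-weights *all* previously accumulated information.
    """

    def __init__(self, n_bins: int, forgetting: float = 0.001):
        self.summary = np.zeros(n_bins)
        self.forgetting = forgetting

    def observe(self, amplitude_spectrum: np.ndarray) -> None:
        # Exponential moving average: old data decays, new data is blended in.
        self.summary *= (1.0 - self.forgetting)
        self.summary += self.forgetting * amplitude_spectrum

    def similarity(self, amplitude_spectrum: np.ndarray) -> float:
        # Phase-blind comparison of a new sound against the stored summary,
        # in the spirit of the simplified |F||G| formula above.
        return float(np.sum(self.summary * amplitude_spectrum))


# Example: feed in spectra of simulated periodic sounds with a few harmonics.
rng = np.random.default_rng(2)
acc = SpectralAccumulator(n_bins=1024)
for _ in range(5000):
    spectrum = np.zeros(1024)
    f0 = int(rng.integers(80, 300))      # fundamental frequency bin
    for h in range(1, 1024 // f0):       # harmonics with decaying amplitude
        spectrum[h * f0] = 1.0 / h
    acc.observe(spectrum)
```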

Pitch Perception vs. "Consonance Perception"

There is one significant difference between the "Pitch is determined ..." paper and the "Statistical Structure ..." paper, which has to do with the relationship, in each case, between the empirical data that is accumulated and the perception that the data is used for.

In the case of pitch perception, the empirical data consists of speech samples and estimates of the fundamental frequencies of those speech samples, which are then used to calculate an estimated fundamental frequency (i.e. the pitch) of a given sound (where in many cases the sound in question is itself a speech sound).

In the case of consonance perception, empirical data about the co-occurrence of harmonic frequencies within individual sounds is used to calculate the estimated "consonance" of the interval between the pitch values of two different sounds, where the two different sounds may or may not be occurring simultaneously. So the thing being measured is quite distinct from the measurements contained in the accumulation of empirical data being used to measure it. Comparing this to Purves' work on visual perception, the model of pitch perception fits exactly into the general model, whereas the model of consonance perception is something different: the calibration of one perception by means of historical accumulation of information from a different perception.

One consequence of this distinction is that it is not at all clear what perceived consonance means, since it is not really an expected value of anything. My own theory solves this problem by showing that consonance is not really a property of intervals, but is actually a property of pairs of pitch values, which the brain then uses to help it determine which intervals are to be deemed equal to which other intervals (by means of continuity: if (1) the interval \(X_1\) to \(Y_1\) is consonant, (2) the interval \(X_2\) to \(Y_2\) is consonant, (3) \(X_1\) is close to \(X_2\), and (4) \(Y_1\) is close to \(Y_2\), then the two intervals are deemed to be equal).
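
For what it's worth, the continuity rule in that parenthesis can be written down very directly. Here is a toy encoding of it (the is_consonant predicate and the closeness tolerance are placeholders of mine, not part of the theory as stated):

```python
def deemed_equal(x1, y1, x2, y2, is_consonant, tolerance=0.5):
    """Toy version of the continuity rule: the intervals x1->y1 and x2->y2
    are deemed equal if both pitch pairs are consonant and corresponding
    endpoints are close (here, within `tolerance` of each other)."""
    return (is_consonant(x1, y1) and is_consonant(x2, y2)
            and abs(x1 - x2) <= tolerance
            and abs(y1 - y2) <= tolerance)
```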