Statistical Structure of Human Speech Sounds and Calibration of Interval Perception

In The Statistical Structure of Human Speech Sounds Predicts Musical Universals (published 6 August 2003 in The Journal of Neuroscience), David Schwartz, Catherine Howe and Dale Purves present an analysis showing that the human perception of consonant and dissonant intervals can be related to the relationships observed between the harmonics of speech sounds. Their work is based on an approach that they had previously applied to understanding how the brain performs visual perception. (There is also a "popular version" of the paper available.)

The ideas in the paper are similar to the theory of calibration described in chapter 12 of my book. The central shared concept is that internal percepts are calibrated against empirically observed statistical distributions built up from past experience.

Of course the paper was published before I even started writing the book, so I cannot claim any priority for the basic idea. But if you read the paper and then read chapter 12 of my book, you will notice some similarities, and some crucial differences.

Because my theoretical development comes from a different direction, I think I can claim that my work and that of Schwartz et al. are complementary. In fact their work largely ignores the question of pitch translation invariance: they factor it out of their analysis by calculating "normalised" frequencies Fn = F/Fm, where F is an observed frequency and Fm is the frequency with maximum amplitude (both frequencies being functions of time). Crucially they state:

This method of normalization avoids any assumptions about the structure of human speech sounds, e.g. that such sounds should be conceptualized in terms of ideal harmonic series.
But in doing this they have made an assumption about how the brain normalises frequency information. They choose a particular way of normalising frequency (and amplitude), they do the statistical analysis, and this gives the right sort of answer, in that it picks out the intervals that we all regard as consonant. Getting the right answer is taken as sufficient justification for the way the analysis is done, and there is no discussion in the paper of whether there are variations in the analysis method that might more accurately model what the brain actually does.
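
To make the normalisation concrete, here is a rough sketch in Python of the kind of calculation involved. The frame representation and the function name are my own illustrative choices, not taken from the paper's actual processing pipeline.

```python
import numpy as np

def normalise_frame(freqs, amps):
    """Normalise one analysis frame of a speech sound.

    freqs, amps: spectral peak frequencies (Hz) and their amplitudes.
    Returns Fn = F / Fm, where Fm is the frequency of maximum amplitude,
    as in Schwartz et al.'s normalisation.
    """
    freqs = np.asarray(freqs, dtype=float)
    amps = np.asarray(amps, dtype=float)
    fm = freqs[np.argmax(amps)]   # frequency with maximum amplitude
    return freqs / fm             # normalised frequencies Fn

# Example: an idealised harmonic series in which the 2nd harmonic is loudest.
print(normalise_frame([100, 200, 300, 400], [0.5, 1.0, 0.7, 0.3]))
# -> [0.5  1.   1.5  2. ]
```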

In my analysis, I start off with the assumption that the calculation of frequency ratios is the very problem requiring solution, and I come to the conclusion that the brain uses empirical observation of harmonic frequencies in speech to calibrate this calculation. So if we combine my analysis with that of Schwartz et al., we end up with two statistical calculations: the first to calibrate the calculation of frequency ratios, and the second to determine which ratios count as consonant. But maybe there aren't really two different processes of observation and calculation; maybe they are really just one process.

Implicit in the analysis that Schwartz et al. perform is the assumption that it corresponds to the analysis that the brain itself performs. So we are assuming that the brain calculates a density function for Fn = F/Fm (based on normalised amplitudes). But if the brain has no pre-existing way to perform the crucial division of F by Fm, then what can it do? One possibility is that it can avoid the division altogether, by simply leaving F and Fm as separate values and accumulating a 2-dimensional distribution, where density is a function of (F, Fm), instead of a 1-dimensional distribution where density is a function of F/Fm. In fact, once we treat F and Fm as separate values, there is no particular reason to restrict the second value to the frequency of maximum amplitude: the brain can observe all co-occurring pairs of frequencies F1 and F2, with no need to identify which frequency has maximum amplitude at each point in time. (There still remains the question of how amplitude normalisation is then applied, but I will leave this as an unresolved detail.)
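
To make this alternative concrete, here is a rough sketch of the accumulation as I imagine it: a 2-dimensional histogram over co-occurring pairs of frequencies, with no division anywhere. The logarithmic binning, the frequency range and the amplitude-product weighting are all my own assumptions (the last being just one possible stand-in for the unresolved question of amplitude normalisation).

```python
import numpy as np

# Logarithmic frequency bins over an assumed input range (my own choices).
F_MIN, F_MAX, N_BINS = 50.0, 3200.0, 120
EDGES = np.logspace(np.log10(F_MIN), np.log10(F_MAX), N_BINS + 1)

def accumulate_pairs(frames, density=None):
    """Accumulate a 2-dimensional co-occurrence distribution over frequencies.

    frames: iterable of (freqs, amps) pairs, one per analysis frame.
    Every ordered pair (F1, F2) of frequencies present in the same frame is
    added to the histogram, weighted by the product of the two amplitudes.
    """
    if density is None:
        density = np.zeros((N_BINS, N_BINS))
    for freqs, amps in frames:
        idx = np.digitize(freqs, EDGES) - 1
        ok = (idx >= 0) & (idx < N_BINS)
        idx, w = idx[ok], np.asarray(amps, dtype=float)[ok]
        density[np.ix_(idx, idx)] += np.outer(w, w)   # no division by Fm
    return density
```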

What will this 2-dimensional distribution look like? Ideally I should draw a picture, but for the moment I will describe it. If we imagine F1 as the X-axis and F2 as the Y-axis, with density represented by gray levels from white (low) to black (high), then the expected density distribution would consist of a series of dark and light diagonal stripes running from bottom left to top right. Each dark stripe corresponds to an observed consonant interval. In fact the dark stripes correspond directly to the peaks in the 1-dimensional distributions shown in Figure 2 and Figure 4 of Schwartz et al.'s paper.
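
The connection between the stripes and the paper's 1-dimensional peaks can be made explicit: if both axes use the same logarithmic binning (as in the sketch above), each stripe is a diagonal of the 2-dimensional array, and collapsing the array along its diagonals recovers a 1-dimensional distribution over log(F2/F1). A small sketch of that collapse:

```python
import numpy as np

def collapse_to_ratio_distribution(density):
    """Collapse a square (log F1, log F2) density onto its diagonals.

    Cells with a fixed offset d = (column - row) all correspond to the same
    frequency ratio, so summing each diagonal turns the 2-dimensional stripes
    into peaks of a 1-dimensional distribution over log(F2/F1).
    """
    n = density.shape[0]
    return np.array([np.trace(density, offset=d) for d in range(-(n - 1), n)])
```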

So now we have one empirically observed distribution, which represents both the perception of consonance and our pitch translation invariant perception of frequency ratios. It seems quite plausible that this distribution density function could correspond to a single 2-dimensional cortical map. How would the neurons in a given "stripe" identify themselves as representing a single frequency ratio? We could suppose that neurons representing consonant intervals can distinguish, by some means, those close neighbours that also represent consonant intervals from those that represent dissonant intervals. It would then be sufficient for each consonant neuron to connect itself with those immediate neighbours that are also consonant (because they must be in the same stripe), and for those neurons in turn to connect themselves with their own consonant neighbours, so that, transitively, all the consonant neurons in one stripe become connected to each other and form a single perceptual response.
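
Here is a toy model of that transitive wiring-up, assuming we already have a boolean map marking which cells of the 2-dimensional map count as consonant. The flood-fill grouping is just one way of modelling the outcome; the actual neural mechanism is left open above.

```python
from collections import deque

def group_stripes(consonant):
    """Group neighbouring consonant cells into connected stripes.

    consonant: 2D list of booleans (the consonance map).
    Returns a 2D list in which each consonant cell is labelled with the stripe
    it belongs to; each transitively connected group is the model's candidate
    for a single perceptual response.
    """
    rows, cols = len(consonant), len(consonant[0])
    label = [[None] * cols for _ in range(rows)]
    next_label = 0
    for r in range(rows):
        for c in range(cols):
            if consonant[r][c] and label[r][c] is None:
                # flood fill: connect the cell to its consonant neighbours,
                # then theirs to theirs, until the whole stripe is covered
                label[r][c] = next_label
                queue = deque([(r, c)])
                while queue:
                    y, x = queue.popleft()
                    for dy in (-1, 0, 1):
                        for dx in (-1, 0, 1):
                            ny, nx = y + dy, x + dx
                            if (0 <= ny < rows and 0 <= nx < cols
                                    and consonant[ny][nx]
                                    and label[ny][nx] is None):
                                label[ny][nx] = next_label
                                queue.append((ny, nx))
                next_label += 1
    return label
```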

I will call this proposed cortical map the consonance cortical map. But be aware that this proposed map does more than just perceive consonance and dissonance: it is actually also the "subtraction table" (or "division table", if we don't take logarithms) that provides the pitch translation invariant calculation of pitch intervals. We must also allow for the possibility that the brain requires the calculation of frequency ratios in more than one place, and that there may therefore exist different consonance maps, which may exhibit subtle differences in how they accumulate statistical information from experience of speech, and corresponding differences in how they respond to perceived frequency ratios.
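
As a sketch of the "subtraction table" reading of the map (the grid, its range and its resolution are my own illustrative choices): because the stripes run diagonally, looking up a pair of pitches in the table returns something that depends only on the difference of their log frequencies, i.e. only on their ratio, which is exactly pitch translation invariance.

```python
import math

# An illustrative log2-frequency grid covering a few octaves (my own choice).
LO, HI, N = math.log2(50.0), math.log2(3200.0), 120
STEP = (HI - LO) / N

# The "subtraction table": entry (i, j) holds the interval implied by a pair
# of pitch bins, here just the bin difference. In the proposed model this
# table is not computed arithmetically but embodied in the stripe structure.
TABLE = [[j - i for j in range(N)] for i in range(N)]

def pitch_bin(f):
    return int((math.log2(f) - LO) / STEP)

def interval(f1, f2):
    """Read off the interval between two pitches from the table."""
    return TABLE[pitch_bin(f1)][pitch_bin(f2)]

# Pitch translation invariance: transposing both notes gives the same entry.
print(interval(220, 330), interval(440, 660))   # same value for both pairs
```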

Octave Translation Invariance

One of the secondary issues of pitch translation invariant interval perception (which is raised in my book) is that all aspects of music which are pitch translation invariant are also octave translation invariant. We must take this into account when explaining the functionality of the consonance cortical map. Octave translation invariance refers to aspects of music which are independent of octave. We consider such aspects as being a function of pitch modulo octaves. If the consonance cortical map is to process input data in an octave translation invariant fashion, there are two different points where reduction modulo octaves could occur. The first is when accumulating statistics about harmonics of speech sounds. The second is when receiving input pitch values to be subtracted from each other.
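
In numerical terms, reducing pitch modulo octaves just means taking the fractional part of the logarithm (base 2) of the frequency; a minimal example:

```python
import math

def pitch_class(f):
    """Pitch value modulo octaves: the fractional part of log2(frequency)."""
    return math.log2(f) % 1.0

# Frequencies an octave apart reduce to the same value modulo octaves:
print(pitch_class(220.0), pitch_class(440.0), pitch_class(880.0))  # all equal
```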

Now the frequencies of perceived harmonics are unambiguous frequencies, and reducing them modulo octaves would require some means of choosing a canonical frequency modulo octaves for each actual frequency. But if we have a consonance map constructed from a limited range of perceived actual harmonic frequencies, then an easy way to reduce input pitch values modulo octaves is simply to take the full set of harmonics for each pitch value and discard those harmonics which are not powers of 2 times the base frequency. The consonance map will then respond to whichever of the remaining power-of-two harmonics fall within the right range for input into the map. (However, this explanation does have one problem: the processing of pitch values that are devoid of harmonics and whose fundamental frequencies lie outside the input range of the consonance map. We might assume supplementation of the harmonics by generated values that are powers of 2 times the input frequency, such supplementation itself necessarily being calibrated by past empirical observation of where those harmonics normally occur in relation to the base frequency.)
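
A rough sketch of this filtering, together with the supplementation fallback mentioned in parentheses. The assumed input range of the map, the number of harmonics, and the inclusion of power-of-two submultiples as well as multiples in the fallback are my own illustrative choices.

```python
def is_power_of_two(n):
    return n >= 1 and (n & (n - 1)) == 0

def octave_reduced_harmonics(f0, n_harmonics=16, f_min=50.0, f_max=3200.0):
    """Keep only those harmonics of f0 that are powers of 2 times f0,
    restricted to an assumed input range of the consonance map.

    If no such harmonic lands in range (e.g. a harmonic-poor tone whose
    fundamental lies outside the map's range), fall back to generated values
    that are powers of 2 times f0 (including, as a further assumption,
    submultiples 1/2, 1/4, ... of f0).
    """
    kept = [n * f0 for n in range(1, n_harmonics + 1)
            if is_power_of_two(n) and f_min <= n * f0 <= f_max]
    if kept:
        return kept
    # supplementation by generated power-of-two relatives of f0
    return [f0 * 2.0 ** k for k in range(-10, 11)
            if f_min <= f0 * 2.0 ** k <= f_max]
```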

One of the best-known phenomena associated with octave translation invariance is the tritone paradox. This phenomenon, discovered by Diana Deutsch, relates to the perception of the direction of intervals between Shepard tones, which are tones that have an unambiguous pitch value modulo octaves but a maximally ambiguous absolute pitch value. The effect occurs when subjects are asked to judge which of two Shepard tones separated by exactly half an octave (the "tritone") is the higher tone. It turns out that the perception of which pitch value is higher is a function of its location on the modulo-octave scale; this function varies between subjects, and the variation appears to depend on the speech environment that each subject has been exposed to. This of course suggests a relationship to the hypothesis that our perception of consonance and dissonance is a function of the speech data that we are exposed to. (The tritone happens to be dissonant, but I don't think that in itself is relevant; it is chosen just because it is exactly half an octave.)
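
For readers unfamiliar with Shepard tones, here is a rough sketch of how such a tone can be synthesised; the particular amplitude envelope, frequency range and sample rate are conventional illustrative choices, not Deutsch's actual stimuli. The components are spaced exactly one octave apart, so the pitch class is unambiguous, while the fixed bell-shaped envelope over log frequency leaves the absolute octave maximally ambiguous.

```python
import numpy as np

def shepard_tone(pitch_class, duration=0.5, sr=44100, f_low=27.5, n_octaves=8):
    """Synthesise a Shepard tone for a pitch class given in [0, 1) octaves.

    Components sit at octave-spaced frequencies f_low * 2**(pitch_class + k),
    with amplitudes following a raised-cosine envelope over log frequency,
    so the pitch class is well defined but the octave is ambiguous.
    """
    t = np.arange(int(duration * sr)) / sr
    centre = np.log2(f_low) + n_octaves / 2.0
    signal = np.zeros_like(t)
    for k in range(n_octaves):
        f = f_low * 2.0 ** (pitch_class + k)
        # raised-cosine weight, peaking in the middle of the log-frequency range
        w = 0.5 * (1.0 + np.cos(2.0 * np.pi * (np.log2(f) - centre) / n_octaves))
        signal += w * np.sin(2.0 * np.pi * f * t)
    return signal / np.max(np.abs(signal))

# Two Shepard tones exactly half an octave apart: the "tritone" of the paradox.
tone_a = shepard_tone(0.0)
tone_b = shepard_tone(0.5)
```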

The tritone effect can be regarded as a subtle deviation from octave translation invariance, in as much as it is an aspect of pitch perception which is not octave translation invariant. Special contrivances, such as Shepard tones, are required to make it appear.

Mathematically, the set of pitch values modulo octaves exists in a circular space. The tritone effect is a bit like looking at a circle, which you know has been constructed from a straight line bent round and glued to itself, and seeing if you can find the "join". Relating this to the consonance cortical map, the join occurs when a frequency moves off one edge and reappears on the opposite edge: it either moves off the right edge and reappears on the left edge at the same height, or it moves off the top edge and reappears at the bottom edge at the same horizontal position. To make this work as a full explanation of the tritone effect, we need to make some assumptions:

Given these first two assumptions, the maximally ambiguous tritone intervals will occur when one of the frequencies is found at the "edge" of the consonance map (and the other is necessarily at the centre), in which case there will be activity in two locations, on opposite edges, one saying "up" and the other saying "down". To explain why the locations of the edges of the consonance map are a function of the speech environment (and not just arbitrary), we must further assume that the location of these edges, and therefore the assignment of neurons to pitch values modulo octaves, is determined by exposure to pitch value data. Exactly how this would occur is uncertain, although it can be categorised as yet another example of cortical plasticity, to be explained when cortical plasticity in general has been explained. It is somewhat easier to give a reason why such cortical plasticity would occur: the consonance map may prefer to map itself onto frequency according to whichever set of frequency values gives it the best data, i.e. the clearest set of black and white diagonal stripes from which it can construct a system for perceiving intervals between pitch values in a pitch translation invariant manner.
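
To make the wrap-around argument concrete, here is one toy way of modelling it; the linear "height" read-out and the learned edge parameter are my own illustrative assumptions, not something taken from Deutsch's work. The map lays pitch classes out linearly starting from an edge whose location has been learned from the speech environment; a pitch class sitting exactly at the edge appears at both opposite edges of the map, giving the ambiguous case, and for other tritone pairs the direction judgement depends on where the pair lies relative to the edge, which differs between listeners with different learned edges.

```python
def heights_on_map(pitch_class, edge, tol=1e-9):
    """Positions of a pitch class on a map whose 'join' is at `edge`.

    Both arguments are in [0, 1) octaves. A pitch class sitting exactly at
    the join appears at both opposite edges of the map (positions 0 and 1);
    every other pitch class has a single position.
    """
    h = (pitch_class - edge) % 1.0
    return [0.0, 1.0] if h < tol else [h]

def judged_direction(pc_from, pc_to, edge):
    """Direction judgement for an interval between two Shepard tones: the tone
    sitting higher on the map is heard as higher; if the two edges of the map
    give conflicting answers, the judgement is ambiguous."""
    votes = {"up" if b > a else "down"
             for a in heights_on_map(pc_from, edge)
             for b in heights_on_map(pc_to, edge)}
    return votes.pop() if len(votes) == 1 else "ambiguous"

# The same tritone pair judged by listeners with different learned edges:
print(judged_direction(0.2, 0.7, edge=0.0))   # "up" for this listener
print(judged_direction(0.2, 0.7, edge=0.5))   # "down" for this one
print(judged_direction(0.0, 0.5, edge=0.0))   # one tone at the join: ambiguous
```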