What is Music?
Solving a Scientific Mystery

Statistical Structure of Human Speech Sounds and Calibration of Interval Perception

In "The Statistical Structure of Human Speech Sounds Predicts Musical Universals" (published 6 August 2003 in *The Journal of Neuroscience*),
David Schwartz, Catherine Howe and Dale Purves present
an analysis showing that the human perception of consonant and dissonant intervals can be related to
the relationships observed between harmonics of speech sounds. Their work is based on
an approach that they had previously applied to understanding how the brain performs visual perception. (There
is also a "popular version" of the
paper available.)

The ideas in the paper are similar to the theory of calibration described in chapter 12 of my book. The central shared concept is that internal percepts are formed by reference to empirically observed statistical distributions built up from past experience.

Of course the paper was published before I even started writing the book, so I cannot claim any priority for the basic idea. But if you read the paper, and read chapter 12 of my book, you will notice some similarities, and some crucial differences:

- Both Schwartz *et al.* and I come to the conclusion that the human perception of consonance and dissonance is learned from exposure to actual human speech. (Actually they never quite state this, because they allow for the possibility that the "learning" has happened by means of evolution, but in practice I think evolution would be both too slow and too inflexible.) But we come to this conclusion by different routes.
- Schwartz *et al.* are attempting to explain why we perceive some intervals as consonant and others as dissonant. They develop the hypothesis that the distinction between consonant and dissonant may be based on the statistics of empirically observed ratios between simultaneously occurring frequencies in human speech.
- I start off with a totally different question, which is *How does the human brain calibrate the perception of musical intervals as frequency ratios?* The pitch translation invariance of interval perception, and of music perception generally, implies a calibrated four-way relationship between pitch values, in particular the ability to recognise the equality of frequency ratios W/X and Y/Z for quadruples of frequencies W, X, Y and Z. (The terminology of pitch "translation" arises from treating pitch as log frequency, which converts division into subtraction.) I then determine that a natural model is required, and the intervals between harmonic frequencies of sounds like the human voice are the most likely candidate. In a pre-industrial society, and not counting musical instruments (which are a consequence of pre-existing pitch translation invariant perceptions, and therefore less likely to be an original cause of said perceptions), the human voice is the major source of sounds with harmonic frequencies that have fixed ratios between them independently of base frequency.
- Schwartz *et al.* have done statistical analyses on various speech corpora. I have not done any such analyses myself.
- In my book I suggest that calibration from natural models is likely to be ongoing, and therefore it may be possible to observe mis-calibration by deliberately exposing subjects to sounds (and in particular human speech sounds) that have harmonics in the "wrong" places.
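
The remark above that treating pitch as log frequency converts division into subtraction can be made concrete with a small sketch. The function names `pitch` and `same_interval`, the tolerance, and the example frequencies are illustrative assumptions, not anything taken from the paper or the book.

```python
import math

def pitch(frequency_hz: float) -> float:
    """Pitch measured in octaves, i.e. log base 2 of the frequency."""
    return math.log2(frequency_hz)

def same_interval(w: float, x: float, y: float, z: float,
                  tolerance: float = 1e-9) -> bool:
    """Test whether W/X equals Y/Z by comparing pitch differences.

    In log frequency the ratio W/X becomes pitch(W) - pitch(X), so the
    four-way comparison reduces to comparing two subtractions.
    """
    return abs((pitch(w) - pitch(x)) - (pitch(y) - pitch(z))) < tolerance

# Hypothetical frequencies: 300/200 and 660/440 are both the ratio 3/2,
# whereas 500/400 is 5/4.
fifth_matches = same_interval(300.0, 200.0, 660.0, 440.0)
third_matches = same_interval(300.0, 200.0, 500.0, 400.0)
```

Here `fifth_matches` is true and `third_matches` is false. The sketch takes the logarithm for granted, which is exactly the calibrated operation whose origin is in question.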

Because my theoretical development comes from a different direction, I think I can claim that
my work and that of Schwartz *et al.* are complementary. In fact their work largely ignores
the question of pitch translation invariance, and they factor it out of their analysis.
They do this by calculating "normalised" frequencies of F_{n} = F/F_{m}, where
F is the observed frequency and F_{m} is the frequency with maximum amplitude (both frequencies
a function of time). Crucially they state:

"This method of normalization avoids any assumptions about the structure of human speech sounds, e.g. that such sounds should be conceptualized in terms of ideal harmonic series."

But in doing this they have made an assumption about how the brain normalises frequency information. They choose a particular way of normalising frequency (and amplitude), they do the statistical analysis, and this gives the right sort of answer, in that it picks out the intervals that we all regard as being consonant. Getting the right answer is sufficient justification for the way the analysis is done, and there is no discussion in the paper as to whether there are variations in the analysis method that might more accurately model what the brain actually does.
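
As a rough illustration of the normalisation F_{n} = F/F_{m} described above, the following sketch processes one time slice of a spectrum. The data layout (a list of frequency and amplitude pairs), the function name, and the choice to divide amplitudes by the maximum amplitude are all assumptions made for the example. Only the frequency normalisation is stated in the text above.

```python
def normalise_spectrum(spectrum):
    """Normalise one time slice of a spectrum.

    spectrum: list of (frequency_hz, amplitude) pairs.
    Returns (F / F_m, A / A_m) pairs, where F_m is the frequency with
    maximum amplitude and A_m is that maximum amplitude.
    Dividing amplitude by A_m is an assumption made for this sketch.
    """
    f_m, a_m = max(spectrum, key=lambda pair: pair[1])
    return [(f / f_m, a / a_m) for f, a in spectrum]

# Hypothetical slice in which the second harmonic is the strongest.
slice_ = [(100.0, 0.2), (200.0, 1.0), (300.0, 0.5), (400.0, 0.25)]
normalised = normalise_spectrum(slice_)
```

For this slice the normalised frequencies are 0.5, 1.0, 1.5 and 2.0. Note that the division by `f_m` is performed directly, which is the step the brain is not assumed to be able to do without calibration.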

In my analysis, I start off with the assumption that the calculation of frequency
ratios is the very problem requiring solution, and I come to the conclusion that the brain uses
empirical observation of harmonic frequencies in speech to calibrate this calculation. So if we
combine my analysis with that of Schwartz *et al.*, we end up with *two* statistical
calculations: the first to calibrate calculation of frequency ratios, and the
second to determine which ratios count as consonant. But maybe there aren't really two different
processes of observation and calculation. Maybe these two processes are really just
one process.

Implicit in the analysis that Schwartz *et al.* perform is the assumption that
this corresponds to the analysis that the brain itself performs. So we are assuming that the
brain calculates a density function for F_{n} = F/F_{m} (based
on normalised amplitudes). But if the brain has no pre-existing way to do the crucial division
of F by F_{m}, then what can it do? One possibility is that it can avoid the
division by simply leaving F and F_{m} as separate values, and accumulate a
2-dimensional distribution, where density is a function of (F, F_{m}), instead of
a 1-dimensional distribution where density is a function of F/F_{m}. In fact once
we treat F and F_{m} as separate values, there is no particular reason to restrict
the second value to be only the frequency of maximum amplitude: the brain can observe
all mutual occurrences of pairs of frequencies F_{1} and F_{2}, and there
is no need to identify which frequency has maximum amplitude at each point in time. (There
still remains the question of how amplitude normalisation is then applied, but I will leave
this as an unresolved detail.)

What will this 2-dimensional distribution look like? Ideally I should draw a picture,
but for the moment I will describe it. If we imagine F_{1} on the X-axis
and F_{2} on the Y-axis, with density represented by gray levels from
white (low) to black (high), then the expected density distribution would consist of a series of dark and
light diagonal stripes, with the stripes running from bottom left to top right.
Each dark stripe corresponds to an observed consonant interval. In fact the dark stripes
correspond directly to the peaks in the 1-dimensional distributions shown
in Figure 2 and
Figure 4 in
Schwartz *et al.*'s paper.
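
The accumulation of a 2-dimensional distribution over pairs of co-occurring frequencies can be sketched as follows. Everything concrete here is an assumption for illustration: idealised harmonic series stand in for speech, amplitudes are ignored, the axes are log frequency quantised to 12 bins per octave, and the names `log_bin`, `accumulate` and `stripe_profile` are hypothetical. With log-frequency axes, a fixed ratio corresponds to a fixed offset between the two coordinates, so each stripe is a 45 degree diagonal.

```python
import math
from collections import Counter
from itertools import permutations

BINS_PER_OCTAVE = 12  # assumed resolution of the map

def log_bin(frequency_hz, reference_hz=55.0):
    """Quantised log-frequency coordinate of a frequency."""
    return round(BINS_PER_OCTAVE * math.log2(frequency_hz / reference_hz))

def accumulate(fundamentals, n_harmonics=6):
    """Count every ordered pair (F1, F2) of harmonics that occur together."""
    density = Counter()
    for fundamental in fundamentals:
        harmonics = [fundamental * k for k in range(1, n_harmonics + 1)]
        for f1, f2 in permutations(harmonics, 2):
            density[(log_bin(f1), log_bin(f2))] += 1
    return density

def stripe_profile(density):
    """Collapse the map along its diagonals: total density per offset y - x."""
    profile = Counter()
    for (x, y), count in density.items():
        profile[y - x] += count
    return profile

# Hypothetical "voiced sounds": 13 fundamentals from 110 Hz to 220 Hz.
fundamentals = [110.0 * 2 ** (i / 12) for i in range(13)]
density = accumulate(fundamentals)
profile = stripe_profile(density)
```

In this toy data the offsets of 12 bins (the octave) and 7 bins (the fifth) collect the most counts, while an offset of 6 bins (the tritone) collects none. Collapsing along the diagonals recovers a 1-dimensional profile over ratios, which is the sense in which the stripes correspond to peaks in a 1-dimensional distribution.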

So now we have *one* empirically observed distribution, which represents both
perception of consonance *and* our pitch translation invariant perception of
frequency ratios. It seems quite plausible that this distribution density function could correspond to a single
2-dimensional cortical map. How would neurons in a given "stripe" identify themselves as
representing a single frequency ratio? We could suppose that neurons representing consonant intervals
could distinguish by some means between close neighbours representing consonant intervals
and those representing dissonant intervals. Then it would be sufficient for each consonant neuron to
connect itself in a certain way with those immediate neighbours that were also consonant (because they must be
in the same stripe), and then for these neurons to connect themselves with the neurons connected
to the neurons that they were connected with, thus transitively causing all the consonant neurons
in one stripe to be connected to each other, so that they formed a single perceptual response.
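
The transitive linking described above amounts to finding connected groups of consonant cells. A minimal sketch follows, assuming the map is a grid of integer cells, that "immediate neighbours" means the eight surrounding cells, and that consonant cells have already been identified by some other means. The function name `stripes` and the example offsets are hypothetical.

```python
def stripes(consonant_cells):
    """Group consonant cells into stripes by transitively linking neighbours.

    consonant_cells: iterable of (x, y) grid cells already marked consonant.
    Returns a list of sets, one per connected group.
    """
    remaining = set(consonant_cells)
    groups = []
    while remaining:
        start = remaining.pop()
        group = {start}
        frontier = [start]
        while frontier:
            x, y = frontier.pop()
            # Visit the eight surrounding cells (the cell itself has
            # already been removed from `remaining`).
            for dx in (-1, 0, 1):
                for dy in (-1, 0, 1):
                    neighbour = (x + dx, y + dy)
                    if neighbour in remaining:
                        remaining.remove(neighbour)
                        group.add(neighbour)
                        frontier.append(neighbour)
        groups.append(group)
    return groups

# Hypothetical map with two dark stripes, at offsets of 7 and 12 cells.
size = 24
cells = {(x, y) for x in range(size) for y in range(size) if y - x in (7, 12)}
groups = stripes(cells)
```

With these two well-separated stripes the cells fall into exactly two groups, each containing a single offset. Stripes that lie closer together than one cell would merge under this rule, so the sketch depends on the dissonant gaps between stripes being wide enough.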

I will call this proposed cortical map the **consonance cortical map**. But be aware
that this proposed map does more than just perceive consonance and dissonance, as it is actually also
the "subtraction table" (or "division table", if we don't take logarithms) that provides
pitch translation invariant calculation of pitch intervals.
We must also allow for the possibility that the brain requires the calculation of frequency ratios
in more than one place, and that therefore there may exist different consonance maps, and these
maps may exhibit subtle differences in how they accumulate statistical information from
experience of speech, and corresponding differences in how they respond to perceived frequency ratios.

One of the secondary issues of pitch translation invariant interval perception (which is raised in my
book) is that all aspects of music which are pitch translation invariant are also octave translation
invariant. We must take this into account when explaining the functionality of the consonance
cortical map. Octave translation invariance refers to aspects of music which are independent of
octave. We consider such aspects as being a function of pitch *modulo octaves*. If
the consonance cortical map is to process input data in an octave translation invariant fashion,
there are two different points where reduction modulo octaves could occur. The first is when
accumulating statistics about harmonics of speech sounds. The second is when receiving input
pitch values to be subtracted from each other.

Now the frequencies of perceived harmonics are unambiguous frequencies, and to reduce these frequencies modulo octaves would require some means of choosing a canonical frequency modulo octaves for each actual frequency. But if we have a consonance map constructed from a limited range of perceived actual harmonic frequencies, then an easy way to reduce input pitch values modulo octaves is simply to take the full set of harmonics for each pitch value, and remove those harmonics which are not powers of 2 times the base frequency. The consonance map will then respond to whichever of the remaining power-of-two harmonics are in the right range for input into the map. (However this explanation does have one problem, which is the processing of pitch values which are devoid of harmonics, and whose fundamental frequencies lie outside the input range of the consonance map. We might assume supplementation of harmonics by generated values that are powers of 2 times the input frequency, such supplementation itself necessarily calibrated by past empirical observation of where those harmonics normally occur in relation to the base frequency.)
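
The reduction described above, keeping only those harmonics that are powers of 2 times the base frequency and that fall within the map's input range, can be sketched as follows. The function name and the 200 Hz to 800 Hz range are assumptions chosen purely for illustration.

```python
def power_of_two_harmonics(base_hz, map_low_hz, map_high_hz):
    """Return the harmonics base * 2**k (k >= 0) inside the map's input range."""
    if base_hz <= 0:
        raise ValueError("base_hz must be positive")
    result = []
    harmonic = base_hz
    while harmonic <= map_high_hz:
        if harmonic >= map_low_hz:
            result.append(harmonic)
        harmonic *= 2
    return result

# Hypothetical input range for the consonance map.
LOW_HZ, HIGH_HZ = 200.0, 800.0

inputs_55 = power_of_two_harmonics(55.0, LOW_HZ, HIGH_HZ)
inputs_110 = power_of_two_harmonics(110.0, LOW_HZ, HIGH_HZ)
inputs_220 = power_of_two_harmonics(220.0, LOW_HZ, HIGH_HZ)
inputs_1000 = power_of_two_harmonics(1000.0, LOW_HZ, HIGH_HZ)
```

The three base frequencies an octave apart all present the same values (220 Hz and 440 Hz) to the map, which is the octave translation invariance being sought. The base frequency of 1000 Hz yields nothing, which is the problem case noted in the parenthesis above.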

One of the most well-known phenomena associated with octave translation invariance is
the **tritone paradox**. This phenomenon, discovered by Diana Deutsch, relates to the perception
of the direction of intervals between **Shepard tones**, which are tones that have an
unambiguous pitch value modulo octaves, but a maximally ambiguous absolute value. The effect
occurs when subjects are asked to judge which of two Shepard tones separated by exactly half
an octave (= the "tritone") is the higher tone. It turns out that the perception of which pitch
value is higher is a function of its location on the modulo octave scale, and this function
varies between subjects, and this variation appears to depend on the speech environment that the
subject has been exposed to. Of course this suggests a relationship to the hypothesis
that our perception of consonance and dissonance is a function of the speech data that we are
exposed to. (The tritone happens to be dissonant, but I don't think that in itself is relevant,
as it is just chosen because it is exactly half an octave.)

The tritone effect can be regarded as a subtle deviation from octave translation invariance,
in as much as it is an aspect of pitch perception which is *not* octave translation invariant. Special contrivances are required to make it appear:

- The Shepard tones are carefully constructed to have ambiguous absolute pitch, in that the only harmonics are powers of 2 times the fundamental frequency, and the fundamental frequency is itself very weak compared to the second harmonic.
- The interval chosen is exactly half an octave, so the brain cannot fall back on the default assumption that the interval is the smallest possible one. (For example, given a choice between an interval of 4 semitones and one of 8 semitones, 4 semitones would be preferred, and this would determine which of the two notes was considered the higher.)
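
The default assumption in the second point, and the way the tritone defeats it, can be sketched as follows. Pitch values modulo octaves are represented here as semitone numbers from 0 to 11, and the function name and return labels are assumptions for illustration.

```python
def preferred_direction(first_semitone, second_semitone):
    """Direction of the smallest interval between two pitch classes.

    Both arguments are pitch values modulo octaves, in semitones.
    """
    up = (second_semitone - first_semitone) % 12
    down = (first_semitone - second_semitone) % 12
    if up == 0:
        return "same"
    if up < down:
        return "up"
    if down < up:
        return "down"
    return "ambiguous"  # up == down == 6, i.e. the tritone

four_up = preferred_direction(0, 4)   # 4 semitones up versus 8 down
four_down = preferred_direction(4, 0) # 8 semitones up versus 4 down
tritone = preferred_direction(0, 6)   # 6 semitones either way
```

Only the half-octave interval produces the "ambiguous" result, so it is the one case where the smallest-interval rule gives no answer and some other factor must decide which tone is heard as higher.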

Mathematically, the set of pitch values modulo octaves exists in a circular space. The tritone effect is a bit like looking at a circle, which you know has been constructed from a straight line bent round and glued to itself, and seeing if you can find the "join". Relating this to the consonance cortical map, the join occurs when a frequency moves off one edge and then reappears on the opposite edge. It either moves off the right edge, and reappears on the left edge at the same height, or it moves off the top and reappears at the bottom edge, at the same horizontal position. To make this work as a full explanation of the tritone effect, we need to make some assumptions:

- that the consonance map is capable of assigning one axis to "first" frequency, and another axis to "second" frequency, even though the map was initially constructed and wired based entirely on observations of intervals between pairs of frequencies occurring simultaneously
- that the map assigns a direction to an interval; for example, if the first frequency is assigned the X-axis and the second frequency is assigned the Y-axis, the top-left half represents "up" and the bottom-right half represents "down".