4. Measuring Voice Pitch & Intonation
Learning Objectives
- to learn about the characteristics of distributions of fundamental frequency
- to make measurements of the centre, range and regularity of voice pitch
- to experience instrumental methods of studying intonation: measurement, modelling and manipulation
- to learn about different quantitative models of intonation
Topics
- Why study voice pitch?
The distribution of the voice fundamental frequencies used by a speaker is an important aspect of individuality. Measurements of the centre and breadth of the Fx distribution are seen to shift with speaking style, emotional state and physiological condition. Global measures of voice regularity in connected speech are useful in voice quality assessment, as are analyses of how regularity varies with Fx. Pitch changes have important linguistic functions: marking prominence, marking lexical tone and marking sentence function. Intonational changes are important in helping the listener choose between alternative interpretations of the word string, not only in disambiguating parses but also in the exploitation of irony or sarcasm. Continuity of fundamental frequency aids the integration of speech pattern elements in perception, helping distinguish them from a noisy background. Text-to-speech and concept-to-speech conversion systems need to generate intonational contours that reflect the desired information structure of communicated messages.
- Fundamental Frequency (Fx) Contours
A graph of fundamental frequency against time is called an Fx contour (otherwise called a pitch track); it shows how the pitch of the voice changes through an utterance, which is a key aspect of its intonation. When we look at an Fx contour we can see many features: (i) changes in fundamental frequency that are associated with pitch accents; (ii) the range of Fx used by the speaker; (iii) voiced and voiceless regions; and (iv) regular and irregular phonation.
- Linguistic functions of fundamental frequency changes
Changes in fundamental frequency have linguistic function through prominence, lexical tone and intonation. Prominence and lexical tone operate at the level of syllables, while intonation operates over the prosodic foot, the prosodic phrase and the dialogue turn.
Prominence: Pitch can be used to make words in utterances more salient, by giving the stressed syllables either a higher pitch or a changing pitch compared to the surrounding syllables. Pitch-prominent syllables are called 'accented' syllables and serve to indicate emphasis or focus.
Lexical tone: Pitch changes within syllables can also provide information to disambiguate lexical items having the same segmental string in some languages. These lexical tones may co-occur with changes in duration and voice quality which also aid identification.
Intonation: Pitch movements that occur over the domain of a whole prosodic phrase and which are related to the function or meaning of the whole phrase are called intonation. The intonation of a phrase provides additional information to the listener about its intended meaning: for example, whether the speaker is certain about the facts expressed, or is requesting a response from the listener.
Note: there is no single widely-accepted phonological model of intonation, i.e. no agreement on the basic contrastive elements of intonation or on rules for their combination. Some models treat intonation as a sequence of contrasting high and low tones, others as a superposition of pitch accents and phrasal accents. A simple model was proposed by O'Connor and Arnold (1973) in which the intonation of a prosodic phrase is divided into up to four parts:
- The pre-head - all the initial unaccented syllables.
- The head - between the pre-head and the nucleus.
- The nucleus - the main accented syllable.
- The tail - all the syllables after the nucleus.
O'Connor and Arnold then identified 10 different intonational patterns with different meanings.
The primary intonational distinction in English is between falling and rising pitch patterns expressed on the last lexical stress in the phrase. O'Connor & Arnold call this the "nuclear accent" or "nuclear tone". A falling nuclear tone indicates to the listener that the phrase is complete or definite.
Differences in interpretation can also be found to depend on the size of the fall or rise. A high-falling tone is more definite than a low-falling tone.
It is important to acknowledge individual variation in the realisation of intonation. In particular, different speakers have different Fx ranges, which leads to the idea that listeners normalise or standardise intonation prior to phonological categorisation.
- Distributions of Fundamental frequency
Using pitch epoch detection from the speech or Laryngograph signal we can establish the duration of each individual vocal fold cycle in a phrase or passage; this data is called fundamental period data, or Tx for short. From this data we can calculate the instantaneous fundamental frequency value for each period: this is the frequency the period would have if that cycle were repeated for one second. A stream of such Fx values from an utterance plotted against the time at which they occur gives us an Fx contour. We can also use this stream of Fx values (from a 2 min passage, say) to calculate a fundamental frequency distribution or histogram, called Dx for short. From this distribution we can take measurements of central tendency (median or mode) and also measurements of range (percentiles). Typically Fx distributions are plotted on a logarithmic frequency scale, with the vertical axis indicating the amount of time spent at each frequency.
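The steps from Tx data to a Dx distribution can be sketched in a few lines of Python. This is only an illustrative sketch, not the method used by any particular analysis program: the function names, the 50-800 Hz analysis range, the 12 bins per octave and the linear-interpolation percentile are all assumptions made for the example. Each cycle adds its own duration to its bin, so that the histogram reflects the time spent at each frequency.

```python
import math
import statistics

def fx_from_tx(periods_s):
    """Convert fundamental periods (seconds) to instantaneous Fx values (Hz)."""
    return [1.0 / t for t in periods_s if t > 0]

def dx_histogram(periods_s, fmin=50.0, fmax=800.0, bins_per_octave=12):
    """Time-weighted Fx histogram on a logarithmic frequency axis.

    fmin, fmax and bins_per_octave are illustrative choices only.
    """
    n_bins = math.ceil(math.log2(fmax / fmin) * bins_per_octave)
    hist = [0.0] * n_bins
    for t in periods_s:
        if t <= 0:
            continue
        fx = 1.0 / t
        if fmin <= fx < fmax:
            idx = min(int(math.log2(fx / fmin) * bins_per_octave), n_bins - 1)
            hist[idx] += t  # each cycle contributes its own duration
    return hist

def percentile(values, p):
    """Percentile (0-100) by linear interpolation between sorted values."""
    s = sorted(values)
    k = (len(s) - 1) * p / 100.0
    lo, hi = math.floor(k), math.ceil(k)
    return s[lo] + (s[hi] - s[lo]) * (k - lo)

if __name__ == "__main__":
    periods = [0.010, 0.010, 0.008, 0.005]  # made-up Tx values in seconds
    fx = fx_from_tx(periods)
    print("median Fx:", statistics.median(fx))
    print("5th-95th percentile range:", percentile(fx, 5), percentile(fx, 95))
```

The median and a pair of percentiles (for example the 5th and 95th) give a centre and a range that are less affected by a few stray period estimates than a mean and a minimum-to-maximum range would be.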
- Measurement of Fx Regularity
We can also use the stream of instantaneous Fx values to make measurements of regularity. Individual Fx periods can be considered part of ‘regular’ voicing if they have a duration similar to their neighbours. If individual periods are dissimilar to their neighbours, then they must be part of irregular voicing. Thus we can compare a Dx plot of all periods (Dx1) with the Dx plot of regular periods only (Dx2). The difference is a measure of irregularity: if the speaker has a regular voice quality, there will be little difference between the two plots; conversely a large difference shows the use of an irregular voice quality. Irregularity can also be shown on a two-period scatterplot (Cx) in which adjacent periods are plotted against one another. In regular voicing, adjacent periods will have similar values and points are plotted along the diagonal of the scatterplot; in irregular voicing, adjacent periods will have different values and points are plotted off the diagonal. The percentage of period pairs plotted off the diagonal can also be used as a measure of irregularity in the voice.
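As a rough illustration of these regularity measures, the sketch below treats two adjacent periods as "similar" when they differ by less than a fractional tolerance. The 10% tolerance, the function names and the rule of comparing each period only with the one before it are assumptions for the example; the text above does not specify the criterion. The list of periods is assumed to come from a single continuous voiced stretch, so that every adjacent pair is a genuine pair of neighbouring cycles.

```python
def similar(t_a, t_b, tolerance=0.10):
    """True if two periods differ by less than `tolerance` as a fraction of the longer one."""
    return abs(t_a - t_b) / max(t_a, t_b) < tolerance

def regular_periods(periods_s, tolerance=0.10):
    """Periods that are similar to the preceding period (candidates for a Dx2 plot)."""
    return [cur for prev, cur in zip(periods_s, periods_s[1:])
            if similar(prev, cur, tolerance)]

def irregularity_percent(periods_s, tolerance=0.10):
    """Percentage of adjacent period pairs that would fall off the Cx diagonal."""
    pairs = list(zip(periods_s, periods_s[1:]))
    if not pairs:
        return 0.0
    off_diagonal = sum(1 for a, b in pairs if not similar(a, b, tolerance))
    return 100.0 * off_diagonal / len(pairs)

if __name__ == "__main__":
    periods = [0.010, 0.010, 0.0101, 0.020, 0.010]  # made-up values, one long cycle
    print(regular_periods(periods))
    print(irregularity_percent(periods))
```

Plotting each pair in `zip(periods_s, periods_s[1:])` as a point gives the Cx scatterplot itself; the percentage is then simply the share of points lying outside the tolerance band around the diagonal.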
- Estimation of Fx from the speech signal
Methods such as the cepstrum and the autocorrelation are popular means to estimate vocal fold repetition frequency from the speech signal. However, these need to be supplemented with heuristics to produce a relatively clean trace of Fx over time. First, it is necessary to have a means for deciding when the vocal folds are vibrating, so that Fx measures are only provided in voiced regions. Second, constraints need to be applied to sequences of Fx frame estimates to limit the rate of change of Fx and ensure continuity of Fx.
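A minimal sketch of the autocorrelation approach for a single analysis frame is given below, together with the simplest possible voicing decision: the frame is called voiced only if the normalised autocorrelation peak exceeds a threshold. The function name, the 60-400 Hz search range and the 0.5 threshold are assumptions chosen for illustration, and the continuity constraints across frames mentioned above are not shown.

```python
import math

def estimate_fx(frame, sample_rate, fmin=60.0, fmax=400.0, voicing_threshold=0.5):
    """Estimate Fx (Hz) for one frame by autocorrelation; return None if judged unvoiced.

    fmin, fmax and voicing_threshold are illustrative values only.
    """
    n = len(frame)
    mean = sum(frame) / n
    x = [s - mean for s in frame]            # remove any DC offset
    energy = sum(s * s for s in x)
    if energy == 0.0:
        return None                          # silent frame
    min_lag = max(1, int(sample_rate / fmax))
    max_lag = min(int(sample_rate / fmin), n - 1)
    best_lag, best_r = 0, 0.0
    for lag in range(min_lag, max_lag + 1):
        # autocorrelation at this lag, normalised by the frame energy
        r = sum(x[i] * x[i + lag] for i in range(n - lag)) / energy
        if r > best_r:
            best_lag, best_r = lag, r
    if best_lag == 0 or best_r < voicing_threshold:
        return None                          # no clear periodicity: treat as unvoiced
    return sample_rate / best_lag

if __name__ == "__main__":
    rate = 8000
    # synthetic 100 Hz sine wave, 40 ms long, standing in for a voiced frame
    frame = [math.sin(2 * math.pi * 100 * i / rate) for i in range(320)]
    print(estimate_fx(frame, rate))
```

The frame needs to span at least two periods of the lowest frequency searched, otherwise there are too few overlapping samples at the longest lags for the peak to be reliable.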
- Modelling of intonation contours
While statistics about average pitch and pitch range can be estimated from averages of short-time fundamental frequency measures, the measurement of intonation requires the modelling of Fx contours over much larger time frames corresponding to syllables, prosodic feet or intonational phrases. The general approach is to devise a model which describes the time course of Fx over a particular stretch of the signal, then to estimate the parameters of that model from the acoustic signal. There are two main types of Fx contour model: those that model the contour shape regardless of its alignment to the segmental content of the speech, and those that are anchored to the segmental content through annotations. The advantage of the former is that Fx contours can be modelled without introducing assumptions about the domains over which pitch accents operate. The advantage of the latter is that alignment makes it easier to connect properties of pitch accents to the properties of linguistic units found in the utterance.
Contour Stylisation Models
Models that seek only to stylise the pitch contour include the IPO system (’t Hart, Collier, & Cohen, 1990) and the MOMEL system (Hirst & Espesser, 1993). The IPO system models the contour as a sequence of piecewise linear segments chosen such that the stylised result should be perceptually indistinguishable from the original. The outcome of analysis is an undefined number of line segments, each with an Fx height and slope.
The MOMEL system is similar but uses a quadratic spline function to model a smoothed version of the contour. The outcome of analysis is the set of times and heights of the "knots" which define a contour that matches the original to some degree of precision.
Syllable Stylisation Models
Models of defined regions of the contour simply seek to represent pitch movements within a given domain in a small number of parameters. The domain is typically the syllable, although larger domains are clearly possible. Most models used in speech synthesis simply measure a small number of F0 values at fixed proportions through the domain; for example, Black and Hunt (1996) use three F0 values per syllable: at the start, at the end and at mid-vowel.
One useful technique is to fit a straight line over a given stretch of contour. The method of least squares may be used to obtain the height (a) and slope (b) of the best-fitting line:
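A least-squares line fit over a stretch of contour might be sketched as follows. The sketch assumes the line is written as Fx = a + b·t, so that the height a is the value of the line at time zero and the slope b is in Hz per second; the function name and the choice of time origin are assumptions for the example.

```python
def fit_line(times, fx_values):
    """Least-squares fit of fx = a + b * t over paired samples; returns (a, b)."""
    n = len(times)
    if n < 2 or n != len(fx_values):
        raise ValueError("need at least two paired samples")
    t_mean = sum(times) / n
    f_mean = sum(fx_values) / n
    sxx = sum((t - t_mean) ** 2 for t in times)
    if sxx == 0.0:
        raise ValueError("all time values are identical")
    sxy = sum((t - t_mean) * (f - f_mean) for t, f in zip(times, fx_values))
    b = sxy / sxx              # slope
    a = f_mean - b * t_mean    # height of the line at t = 0
    return a, b

if __name__ == "__main__":
    # made-up contour rising steadily over 0.3 s
    a, b = fit_line([0.0, 0.1, 0.2, 0.3], [100.0, 110.0, 120.0, 130.0])
    print(a, b)
```

If times are measured from the start of the syllable, a is the fitted height at syllable onset; measuring from the syllable midpoint instead would make a the fitted mid-syllable height.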
In the TILT model (Taylor, 1998), each accented syllable is labelled with a "tilt event" which describes a rising-falling shape of particular degree using three parameters: amplitude, duration and shape. Parameters for each tilt event can be calculated from the F0 contour through an analysis-by-synthesis procedure, while the contour over unaccented syllables is assumed to follow a quadratic interpolation between the tilt events.
In the qTA model (Prom-on, Xu, & Thipakorn, 2009) the contour in each syllable is described by a pitch "target", to which the contour is said to approximate over time. The pitch target can itself be rising or falling, and the analysis procedure extracts two parameters per syllable from the F0 contour on the assumption that the contour approximates the sequence of pitch targets using a third-order critically-damped system.
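The idea of target approximation can be illustrated by generating the contour for one syllable. The sketch below assumes one common way of writing a third-order critically-damped response: a linear target m·t + b plus a transient (c1 + c2·t + c3·t²)·exp(-λt), whose coefficients are fixed by the pitch, velocity and acceleration at the start of the syllable. The parameter names, and the treatment of the rate λ as a given constant, are assumptions for this example rather than details taken from the text above.

```python
import math

def qta_contour(times, m, b, lam, f0_start, v_start=0.0, acc_start=0.0):
    """Contour approaching the linear target m*t + b within one syllable.

    times     : times in seconds from syllable onset
    m, b      : slope and height of the pitch target (illustrative names)
    lam       : rate of approximation, assumed to be given
    f0_start, v_start, acc_start : pitch, velocity and acceleration at onset
    """
    # transient coefficients chosen so that the contour and its first two
    # derivatives match the onset state at t = 0
    c1 = f0_start - b
    c2 = v_start + c1 * lam - m
    c3 = (acc_start + 2.0 * c2 * lam - c1 * lam * lam) / 2.0
    return [(m * t + b) + (c1 + c2 * t + c3 * t * t) * math.exp(-lam * t)
            for t in times]

if __name__ == "__main__":
    times = [i * 0.02 for i in range(11)]  # 0 to 0.2 s
    # made-up values: start at 0, approach a static target of height 5
    print(qta_contour(times, m=0.0, b=5.0, lam=40.0, f0_start=0.0))
```

Carrying the final pitch, velocity and acceleration of one syllable over as the onset state of the next is what would make a sequence of such syllables continuous.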
- Manipulation of pitch
A number of techniques exist to modify the fundamental frequency contour for an utterance. The Pitch-Synchronous Overlap-Add (PSOLA) method is widely used. PSOLA operates by breaking apart the speech signal into pitch-period-sized sections, then re-assembling the signal with the sections pushed closer together or further apart, repeating or deleting sections as required to maintain overall timing.
- General Linear Model
A common situation in inferential statistics is one where some measurement or dependent variable (DV) is influenced by a number of factors or covariates called the independent variables (IV). The goal is then to uncover which IVs have some significant effect on the measured DV and the size of their effect. Such a situation is addressed by a number of statistical approaches, such as t-tests, analysis of variance, regression and other parametric models. All these may be accommodated within a single framework called the General Linear Model (GLM).
At its heart, a GLM is a model of the data consisting of an algebraic equation that relates the IVs to the DV according to the values of a number of coefficients:
$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_n X_n + \varepsilon$$
where Y is the dependent variable, the {X} represent the independent variables, the {β} represent coefficients that are learned from the data, and ε is the residual error not explained by the model. Note that the X's can be continuous covariates (like age, say) or factors (like gender).
After the GLM is fitted to the data, the result of the analysis provides both the estimated values of the {β} coefficients and a confidence interval for these values ("parameter estimates"). If the confidence interval does not include zero, then we can assume that the relevant factor or covariate has a significant effect on the DV.
GLM is useful in experimental work since we are interested not only in whether an IV has a significant effect on the DV but also in obtaining an estimate of the size of its influence.
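The simplest case, with a single independent variable, can be worked through by hand. The sketch below fits Y = β0 + β1·X by least squares and reports an approximate confidence interval for β1. The function name and the data are made up for illustration, and the interval uses a normal approximation, whereas statistics packages normally use the t distribution, which gives wider intervals for small samples. With a two-level factor coded as 0 and 1, β1 is simply the difference between the two group means.

```python
import math
from statistics import NormalDist

def fit_simple_glm(x, y, confidence=0.95):
    """Fit y = b0 + b1 * x; return (b0, b1, (ci_low, ci_high)) for b1.

    The interval is a large-sample normal approximation (an assumption of
    this sketch), not the exact t-based interval.
    """
    n = len(x)
    if n < 3 or n != len(y):
        raise ValueError("need at least three paired observations")
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    if sxx == 0.0:
        raise ValueError("x does not vary")
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_mean - b1 * x_mean
    residual_ss = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    se_b1 = math.sqrt(residual_ss / (n - 2) / sxx)   # standard error of b1
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    return b0, b1, (b1 - z * se_b1, b1 + z * se_b1)

if __name__ == "__main__":
    group = [0, 0, 0, 1, 1, 1]                 # hypothetical two-level factor
    dv = [1.0, 1.2, 0.8, 3.0, 3.2, 2.8]        # made-up measurements
    b0, b1, ci = fit_simple_glm(group, dv)
    print(b0, b1, ci)
    print("interval excludes zero:", not (ci[0] <= 0.0 <= ci[1]))
```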
References
- ’t Hart, J., Collier, R., & Cohen, A. (1990). A Perceptual Study of Intonation. Cambridge, UK: Cambridge University Press.
- Black, A., & Hunt, A. (1996). Generating F0 contours from ToBI labels using linear regression. Proceedings of ICSLP 96, Philadelphia, vol 3:1385-1388.
- Hirst, D., & Espesser, R. (1993). Automatic modelling of fundamental frequency using a quadratic spline function. Travaux de l'Institut de Phonétique d'Aix, 15, 71-85.
- Taylor, P. (1998). The Tilt intonation model. International Conference on Spoken Language Processing, (pp. 1383-138). Sydney, Australia.
- Prom-on, S., Xu, Y., & Thipakorn, B. (2009). Modeling tone and intonation in Mandarin and English as a process of target approximation. Journal of the Acoustical Society of America, 405-424.
- Fujisaki, H., & Hirose, K. (1984). Analysis of voice fundamental frequency contours for declarative sentences of Japanese. Journal of the Acoustical Society of Japan (E), 5(4), 233-241.
Readings
- Abberton, Howard and Fourcin, Laryngographic assessment of normal voice: a tutorial. Clinical Linguistics and Phonetics, 3 (1989), 281-296.
- Prom-on, S., Xu, Y. (2010). The qTA Toolkit for Prosody: Learning Underlying Parameters of Communicative Functions through Modeling. Speech Prosody 2010, Chicago.
Reading Passage
Arthur the Rat
There was once a young rat named Arthur who would never take the trouble to make up his mind. Whenever his friends asked him if he would like to go out with them he would only answer, "I don't know." He wouldn't say "Yes" and he wouldn't say "No" either. He could never learn to make a choice.
His aunt Helen said to him "No-one will ever care for you if you carry on like this. You have no more mind than a blade of grass." Arthur looked wise but said nothing.
One rainy day the rats heard a great noise in the loft where they lived. The pine rafters were all rotten, and at last one of the joists had given way and fallen to the ground. The walls shook and the rats' hair stood on end with fear and horror. "This won't do," said the old rat who was chief. "I'll send out scouts to search for a new home."
Three hours later the seven scouts came back and said, "We've found a stone house which is just what we wanted. There's room and good food for us all. There's a kindly horse named Nelly, a cow, a calf and a garden with an elm tree." Just then the old rat caught sight of young Arthur. "Are you coming with us?" he asked. "I don't know," Arthur sighed, "The roof may not come down just yet." "Well," said the old rat angrily, "We can't wait all day for you to make up your mind. Right about face! March!" And they went off.
Arthur stood and watched the other rats hurry away. The idea of an immediate decision was too much for him. "I'll go back to my hole for a bit," he said to himself, "just to make up my mind."
That night there was a great crash that shook the earth, and down came the whole roof. Next day some men rode up and looked at the ruins. One of them moved a board, and under it they saw a young rat lying on his side, quite dead, half in and half out of his hole.
Laboratory Exercises
- F0 Distributions
- Analyse a speech + Lx recording of a 2-minute read passage and print first and second order Dx histograms, a cross-plot and a table of statistics using EFxHist.
- Intonational Contrast
- Use the EFxHist waveform display to print speech and Fx displays from recordings of this intonation contrast:
- They saw twenty \Snowmen.
- They saw twenty /Snowmen?
- Annotate the print with the words in the sentences, aligned to the contour. What are the major changes in contour between the two versions? What differences do you observe between accented and unaccented syllables?
- Compare the fundamental frequencies used in the sentences with the distribution of fundamental frequency plotted for your reading of the passage. What parts of your range did you use in what parts of your sentences? How might you use knowledge of your range to normalise your fundamental frequency contour?
- Intonation manipulation
- Record the snowmen statement (or another statement of your choice) into Praat, then display and stylise the contour. You will need the Praat commands:
- New | Record Mono Sound
- "To Manipulation"
- "Edit"
- Pitch | Stylise Pitch
- Use Praat to modify the pitch contour for the statement into a question using the information you obtained from your earlier analysis. How well does this work? What else is different between the statement and question apart from fundamental frequency?
- Variation in intonation with accent
- The CSV file y:/EP/ivienorm+slope.csv contains some analyses of pitch contours from the IVIE corpus of intonational varieties of English. Specifically there are contours for five types of sentence which have been normalised to each speaker in semitones relative to their mean F0, and normalised to duration by dividing each sentence into 25 equal sections of time. Recordings have been chosen for two accent areas of the British Isles: Belfast (bs) and London (ls). The sentence types are Coordinations (coo), Simple Statements (dec), Questions without morphosyntactic markers (dqu), Wh- Questions (whq), Inversion Questions (yno). Samples of the recordings may be found in y:/EP/ivie.
- Using Graphs | Legacy Dialogs | Error bar and "summaries of separate variables", plot some graphs of the F0 values showing the average pitch contour for each type of sentence (rows) for each accent (columns). Can you see differences in the contour according to sentence and/or accent?
- One of the calculated features of each sentence is the slope of the pitch contour at time point 20. Using Graphs | Legacy Dialogs | Boxplot, with "Clustered summaries for groups of cases" plot a graph showing how the SLOPE variable varies with STYPE (category) and ACCENT (cluster). Does SLOPE vary across accents?
- Using Analyze | General Linear Model | Univariate, test whether SLOPE varies significantly across accents and sentence types. Set the Dependent variable to SLOPE and the fixed factors to ACCENT and STYPE. Under Plots, set ACCENT as horizontal axis and STYPE to join lines. Under Options, select Parameter Estimates. Interpret the GLM output and the interaction plot.
For the intonation manipulation exercise, first stylise the contour with a 0.5 semitone resolution, then with 1.0, 2.0, 4.0 and 8.0 semitones. At each stage examine the stylised contour. At which point can you hear a difference between the original and the stylised version?
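The kind of normalisation applied to the IVIE contours in the accent exercise (semitones relative to a speaker's mean F0, and 25 equal sections of time per sentence) could be sketched as below. How the corpus file was actually prepared is not stated here, so the function names, the use of an arithmetic mean in Hz as the reference and the averaging of samples within each time section are all assumptions.

```python
import math

def to_semitones(fx_values, reference_hz):
    """Express Fx values (Hz) in semitones relative to a reference frequency."""
    return [12.0 * math.log2(f / reference_hz) for f in fx_values]

def time_normalise(times, values, n_sections=25):
    """Average the values falling in each of n_sections equal spans of time.

    `times` must be in ascending order. Sections containing no samples
    (for example voiceless stretches) are returned as None.
    """
    t0, t1 = times[0], times[-1]
    width = (t1 - t0) / n_sections
    if width <= 0.0:
        raise ValueError("contour has no duration")
    sums = [0.0] * n_sections
    counts = [0] * n_sections
    for t, v in zip(times, values):
        idx = min(int((t - t0) / width), n_sections - 1)
        sums[idx] += v
        counts[idx] += 1
    return [s / c if c else None for s, c in zip(sums, counts)]

if __name__ == "__main__":
    times = [0.00, 0.01, 0.02, 0.03]          # made-up voiced frames
    fx = [100.0, 110.0, 120.0, 130.0]
    speaker_mean = sum(fx) / len(fx)          # stands in for a speaker's mean F0
    st = to_semitones(fx, speaker_mean)
    print(time_normalise(times, st, n_sections=2))
```

Once contours are on a common semitone and time scale in this way, they can be averaged and compared across speakers whose raw Fx ranges differ.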
Reflections
- What differences are there between a stressed syllable and an accented syllable?
- For hypothesis testing, a measured parameter must be both sensitive to wanted variation yet insensitive to unwanted variation. Discuss this with respect to measures of central tendency of an Fx distribution, and with respect to measures of range of an Fx distribution.
- How might the choice of recording material (spoken text) affect the shape of the fundamental frequency distribution?
- What is meant by perceptual 'normalisation' in general? Think of some examples of normalisation occurring with other human senses.
Last modified: 14:06 30-Jan-2018.