SFS Manual for Users

4. SFS Data Sets

In the following sections we shall look in a little more detail at the different item types, describing what information is stored for each type, how types are typically converted to other types, and how instances of a data type may be automatically given a text label that can be more meaningful than the processing history.

4.1 Types in detail

In the table below, the standard data format for each item type is described. Component parts of data records named in parentheses, e.g. (posn), refer to labels used by sdump when listing the contents of a data set. More technical descriptions of the data records are given in section 4 of the Programmer's Manual.

SPEECH, LX:: 1 frame is 1 waveform sample. By default all waveforms are assumed to be 16-bit data, so data of 12 or 8 bits should be shifted left in the word before use. For replay of 12-bit waveforms that are right shifted, replay and Es support a -12 switch, which changes the default to 12 bits.
TX:: 1 frame is the duration of 1 pitch period (in units of FrameDuration as found in the item header). Unvoiced periods are pitch periods longer than 25ms (i.e. lower than 40Hz). Values are not cumulative, so to find out when a pitch period took place it is necessary to sum the durations of all the preceding frames, counting from the beginning of the item.
FX:: 1 frame is a fundamental frequency value in Hz. Value is zero for unvoiced.
ANNOT:: 1 frame is an annotation comprising a position in time (posn), a duration (size) and a text label (label).
PHONETIC:: 1 frame holds information about the physical realisation of a phonetic segment, including duration, pitch and acoustic parameters. The data record comprises a symbolic name (sname), two segment durations (length1 & length2) (first synthetic, second natural), two fundamental frequency values (pitch1 & pitch2) (first synthetic, second natural) and a list of feature values (alist). The mapping of the values in the list to features (VOICE, PLACE, MANNER, etc) is performed through the phonetic dictionary (phon.dic). PHONETIC items are used in synthesis-by-rule.
SYNTH:: 1 frame is a set of parameters for one control cycle of a formant synthesizer. The values in the frame are stored in Hz for fundamental frequency and formant frequencies and db/10 for amplitudes. Other control values are scaled in arbitrary ways (e.g. voicing in range 0..248) depending on what is expected by the synthesizer. Most work has been done with the JSRU parallel formant synthesizer (in hardware and software). The standard frame for this synthesizer has 19 values: Frequency, Bandwidth and Amplitude of five formants, plus Fx, Voicing, Amplitude of Noise Excitation and Mark-Space Ratio of Voiced Excitation.
WORD:: 1 frame is an arc from a chart used in syntactic analysis of a passage of speech or for a word lattice in recognition. Each frame holds details of one word or one syntactic constituent. Each frame comprises the word or constituent name (label), the start and end nodes of the word in the chart (start & end), a floating point value (score), and the attribute list for the word or constituent (alist).
DISPLAY:: 1 frame is an encoded spectrum. Each record comprises a position in time (posn), a duration (size) and a vector of grey-level values (data).
COEFF:: 1 frame describes a spectrum. Each record comprises a position in time (posn), a duration (size), some flags (flag), a voicing mixture (mix), a frame gain (gain) and a vector of energies (data).
FORMANT:: 1 frame encodes a set of spectral peaks for a period of time. Each record comprises a position in time (posn), a duration (size), some flags (flag), a frame gain (gain), a count of the number of peaks (npeak), and a series of peak descriptors containing the frequency, bandwidth and amplitude of each peak.
LPC:: 1 frame encodes the LPC polynomial coefficients and the LPC gain in a record as for COEFF.
MARKOV:: The whole data set encodes a Hidden Markov model.
TRACK:: 1 frame encodes an acoustic parameter as a floating point number.
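
Because TX values are not cumulative, the time at which each pitch period starts has to be recovered by summing the preceding durations. The following Python sketch illustrates the idea on a hypothetical list of TX values. The function name tx_to_fx, its arguments and the sample data are assumptions made for illustration and are not part of SFS; the sketch also applies the 25ms rule to report an Fx of zero for unvoiced periods, in the manner of an FX item.

```python
def tx_to_fx(periods, frame_duration):
    """Return a list of (start_time_seconds, fx_hz), one entry per TX frame.

    periods        -- pitch period durations, in units of frame_duration
    frame_duration -- seconds per unit (FrameDuration in the item header)
    """
    result = []
    elapsed = 0  # total of all earlier periods, in units of frame_duration
    for p in periods:
        if p <= 0:
            raise ValueError("pitch periods must be positive")
        seconds = p * frame_duration
        # Periods longer than 25ms (below 40Hz) are treated as unvoiced.
        fx = 0.0 if seconds > 0.025 else 1.0 / seconds
        result.append((elapsed * frame_duration, fx))
        elapsed += p
    return result


# Hypothetical data: 100us units, with periods of 10ms, 8ms and 30ms.
for start, fx in tx_to_fx([100, 80, 300], 0.0001):
    print(f"{start:.4f}s  {fx:.1f}Hz")
```

The third period in the sample data is longer than 25ms, so it is reported as unvoiced.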

4.2 Interconversion

This section gives a brief guide to the types of data set interconversion possible using the SFS data types. For most of the major types, the table below gives names to some potential interconversion processes. Example program names are given, but these do not form a definitive list, nor are they a set of recommendations.

SPEECH:

Items	Description	Programs
-> SP	import data into a speech item	slink
-> SP	record speech data	record
-> SP	generate test signal	testsig
SP -> SP	filtering, etc	genfilt
SP -> SP	inverting, AGC, pre-emphasis	prep
SP -> SP	concatenating, editing	spancat, spaned
SP -> SP	change sampling rate, speaking rate, pitch	resamp, respeed, repitch, repros
SP -> TX	time-domain fundamental frequency estimation	pp
SP -> FX	frequency-domain fundamental frequency estimation	fxanal, fxac, fxcep
SP -> AN	manual annotation	Es
SP -> SY	formant analysis (usually SP -> FM -> SY)	fmanal & fmtrack
SP -> DI	spectrographic analysis (usually SP -> CO -> DI)	spectran & dicode
SP -> CO	spectral/filterbank analysis	spectran, voc8, voc19, voc26, filtbank
SP -> CO	Mel-scale cepstral analysis	mfcc
SP -> FM	formant estimation	fmanal
SP -> TR	waveform envelope	envelope
SP -> TR	periodicity track	noisanal
SP ->	export speech signal to text/binary files	splist, sfs2wav, wave
SP ->	replay speech signal	replay

LX:

Items	Description	Programs
-> LX	import Laryngograph data	slink
-> LX	record Lx data	record
LX -> LX	filtering, resampling	genfilt, resamp
LX -> TX	instantaneous fundamental frequency estimation	vtx, txgen
LX ->	export Lx signal to text/binary file	splist, sfs2wav, wave

TX:

Items	Description	Programs
-> TX	import Tx data	slink
-> TX	record Tx data	intx
TX -> FX	time-domain/frequency-domain conversion	fx
TX -> AN	pitch period annotations	txan
TX ->	export Tx data	wave

FX:

Items	Description	Programs
-> FX	import Fx data	fxload
FX ->	export Fx data	fxlist, wave

ANNOT:

Items	Description	Programs
-> AN	import annotations	anload
AN -> TX	hand drawn pitch periods	antx
AN -> AN	annotation mapping	anmap
AN -> TR	annotation to track	antr
AN ->	list/export annotations	anlist

SYNTH:

Items	Description	Programs
-> SY	import synthesizer control data	syload
SY -> SP	software formant synthesis	soft
SY ->	export synthesizer control data	sylist
SY ->	replay synthesizer control data	srusyn11

COEFF:

Items	Description	Programs
-> CO	import coefficients	coload
CO -> SP	filterbank synthesis (usually CO & FX -> SP)	vocsyn
CO -> AN	automatic annotation	andict, annotate, vcalign
CO -> DI	grey-level display	dicode
CO ->	export coefficients	colist

FORMANT:

Items	Description	Programs
-> FM	import formant estimates	fmload
FM -> SY	formant tracking	fmtrack
FM ->	export formant estimates	fmlist

TRACK:

Items	Description	Programs
-> TR	import track data	trload
TR ->	export track data	wave

There are, of course, many other possible interconversions and other existing programs for the interconversions shown.

4.3 Text labels & 'slook'

Introduction

Each data set is produced by a processing program, and each processing program generates a processing history for each data set it produces. The expanded history of a data set is constructed from the processing histories of all the data sets on its processing path. Thus an FX item may have the history:

Item 4.01  fx(3.01)

That is, it was generated by the program fx operating upon item number 3.01. Item 3.01 is a TX item and it might have the history:

Item 3.01  HQtx(2.01)

That is, it was generated by the program HQtx operating upon item 2.01. This was an LX item which, let us say, had the history:

Item 2.01  inwd(freq=12800)

The expanded histories are formed by replacing each item number in these histories with the history string of the item referred to. Thus for the items above the expanded histories are:

Item 3.01  HQtx(inwd(freq=12800)) 
Item 4.01  fx(HQtx(inwd(freq=12800)))
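
The substitution can be sketched in Python as follows. This is only an illustration of the idea, not the SFS implementation: the table HISTORIES, the function expand and the assumed shape of an item number (digits, a dot, then two digits) are all hypothetical, and the histories are assumed to contain no circular references.

```python
import re

# Hypothetical per-item histories, keyed by item number.
HISTORIES = {
    "2.01": "inwd(freq=12800)",
    "3.01": "HQtx(2.01)",
    "4.01": "fx(3.01)",
}

# Assumption: an item number is digits, a dot, then exactly two digits.
ITEM_NUMBER = re.compile(r"\b\d+\.\d{2}\b")


def expand(item, histories):
    """Replace each item number in a history by that item's expanded history."""
    def substitute(m):
        ref = m.group(0)
        # Leave anything that is not a known item number untouched.
        return expand(ref, histories) if ref in histories else ref
    return ITEM_NUMBER.sub(substitute, histories[item])


print(expand("3.01", HISTORIES))
print(expand("4.01", HISTORIES))
```

Run on the three histories above, the sketch prints the two expanded histories listed for items 3.01 and 4.01.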

To examine the expanded histories for items in a given file, use the '-l' switch (for 'long') on the program summary:

% summary -l testfile
1. SPEECH (1.01) 16640 frames from inwd/SP(freq=12820,linked) 
2. LX (2.01) 16640 frames from inwd/LX(freq=12820,linked) 
3. TX (3.01) 75 frames from tx(inwd/LX(freq=12820,linked);thresh=1,height=4)

The expanded histories give a full account of the processing performed on each data set independently of other items in the file, but they are difficult to read. SFS contains a mechanism called text labelling for generating text descriptions of data sets from the expanded history. These text labels are useful in providing titles for graphs as well as for keeping track of the data sets in a particularly complex file. Text labels are used by Ds and pick to provide simple descriptions of data sets. To list the text labels for the items in a file, use the program slook:

% slook testfile 
1. SPEECH (SP.01) 16640 frames of natural speech 
2. LX (LX.01) 16640 frames of natural lx
3. TX (TX.01) 75 frames of tx from lx

The text labelling mechanism can be customised by the user to incorporate new programs or to give different amounts of technical information in the label. For example, the text label for item SP.01 in the above example could include the sampling frequency. The rest of this section deals with changing the default set of text labels.

Text Label Customisation

The mapping from expanded history to text descriptions is performed using one or more files of pattern matching information; these are called label files. There is a system label file, $SFSBASE/data/labels, which provides simple text descriptions for the most common speech processing programs used with SFS. This file can be supplemented with label files of your own or of your work group, which can take precedence over the system label file. These files are searched for text descriptions of expanded histories by SFS routines built into programs such as Ds and slook.

The format of each line in a label file is a pattern-match string followed by a text-replacement string, optionally followed by a number of item history codes. The fields in the line are separated by ':'. The item history codes are short-hand tags for the pattern-match string that can be used to locate items in the file using the standard '-i item' convention in command lines (see itspec).
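
As a rough illustration of this line format, the following Python sketch splits one label file line into its fields. The function name parse_label_line is hypothetical, and the sketch assumes that ':' never occurs inside the pattern or the text, and that several item history codes, where present, are also separated by ':'.

```python
def parse_label_line(line):
    """Split 'pattern:text[:code...]' into (pattern, text, codes)."""
    fields = line.strip().split(":")
    if len(fields) < 2:
        raise ValueError("expected 'pattern:text[:code...]'")
    pattern, text = fields[0], fields[1]
    # Any further non-empty fields are taken to be item history codes.
    codes = [f for f in fields[2:] if f]
    return pattern, text, codes


print(parse_label_line("tx(*):Tx from Lx:"))
```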

The pattern-match strings are constructed in the format described by histmatch, which allows the use of '*' to match zero or more characters, and '?' to match a single character. The patterns '%%', '%0' and '%1' match any substring that contains matching parentheses; the last two constructions allow the matching substrings to be returned for use in the text description (see below). To create a text description from an expanded history, the patterns in the label file are tested against the history string in turn until a match is found. If the matching pattern contains no '%' type matches, the text description of the history is simply the text-replacement string on that line. If '%' type pattern matches are used, the substrings located may themselves be searched for in the label file to generate text description sub-strings. These operations may be seen in the examples below.

Given the expanded history:

tx(inwd(freq=12800);thresh=4)

the label file line

tx(*):Tx from Lx:

would generate the text label

Tx from Lx

While the label file line:

tx(*;thresh=%0):Tx with threshold %s:

would generate the text label:

Tx with threshold 4

That is, the matching sub-string of the '%' type match, namely '4', is substituted in the text description at the point indicated by the marker '%s'. A maximum of two substrings may be extracted from the expanded history using '%0' and '%1'. The use of '%%' is illustrated by the following example; given the expanded history:

proc1(proc2(inwd(freq=12800);thresh=1,limit=1);thresh=2,limit=3)

the label file line:

proc1(*;thresh=%0,*):proc1 with threshold %s:

would not be guaranteed to locate the correct match to the pattern '%0' since there are two possible matches. Since the pattern '%%' is guaranteed to match a substring containing matching parentheses, the label file line

proc1(%%;thresh=%0,*):proc1 with threshold %s:

would produce the required text label

proc1 with threshold 2

The recursive matching of substrings is demonstrated in the following example. Given the expanded history:

scopy(inwd/LX(freq=12800))

and the label file lines

inwd/LX(freq=%0):%sHz Lx:
scopy(%0):copy of %s:

the text label would be

copy of 12800Hz Lx

The first match is to the second line, matching '%0' with 'inwd/LX(freq=12800)'. This is then matched again to derive the label '12800Hz Lx' which is substituted into the text label for the first match.
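
The matching rules illustrated above can be sketched in a few lines of Python. This is an illustration only, not the histmatch implementation. The function names (balanced, match, label) are invented here; '*' and the '%' patterns are assumed to try the shortest candidate first; '%s' markers are assumed to be filled in the order '%0' then '%1'; and a history that matches no line is assumed to be returned unchanged.

```python
def balanced(s):
    """True if every parenthesis in s is matched within s."""
    depth = 0
    for ch in s:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
    return depth == 0


def match(pattern, text, caps=None):
    """Return the '%0'/'%1' captures if pattern matches all of text, else None."""
    caps = {} if caps is None else caps
    if not pattern:
        return caps if not text else None
    if pattern[0] == "%" and len(pattern) > 1 and pattern[1] in "%01":
        key = pattern[1]
        for end in range(len(text) + 1):  # shortest candidate first
            if not balanced(text[:end]):
                continue
            trial = dict(caps)
            if key != "%":  # '%%' matches but is not returned
                trial[key] = text[:end]
            found = match(pattern[2:], text[end:], trial)
            if found is not None:
                return found
        return None
    if pattern[0] == "*":
        for end in range(len(text) + 1):
            found = match(pattern[1:], text[end:], caps)
            if found is not None:
                return found
        return None
    if text and (pattern[0] == "?" or pattern[0] == text[0]):
        return match(pattern[1:], text[1:], caps)
    return None


def label(history, rules):
    """Build a text label from (pattern, text) pairs, tested in order."""
    for pattern, template in rules:
        caps = match(pattern, history)
        if caps is None:
            continue
        for key in sorted(caps):  # '0' before '1'
            sub = caps[key]
            # Look the substring up in turn, unless that would never end.
            text = label(sub, rules) if sub != history else sub
            template = template.replace("%s", text, 1)
        return template
    return history


rules = [("inwd/LX(freq=%0)", "%sHz Lx"), ("scopy(%0)", "copy of %s")]
print(label("scopy(inwd/LX(freq=12800))", rules))

nested = "proc1(proc2(inwd(freq=12800);thresh=1,limit=1);thresh=2,limit=3)"
print(match("proc1(*;thresh=%0,*)", nested))
print(match("proc1(%%;thresh=%0,*)", nested))
```

Under the shortest-first assumption, the '*' form of the proc1 pattern picks up the threshold of proc2 rather than that of proc1, which is the ambiguity that '%%' removes.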

The main body of a label file consists of pattern match lines as described above, and detailed in LABELS. However, to speed up access to these files, indexing information must be added to them. This indexing of label files must be performed after every change to the file. If it is not, the label searching routines print a warning such as 'label file out of date'. The indexing program is called prolab and is run simply as: prolab filename.

The environment variable SFSLABEL controls the selection of label files used in the matching process. If SFSLABEL is not defined, only the system label file is used. If SFSLABEL is defined, it is taken to be a list of label files separated by ':' in the same manner as PATH and SFSPATH. The label files are searched in the order in which they appear in the variable, and normally the system label file will appear last in the list. In the following example, a user's own label file is created and used to pre-empt the text label attached by the system label file.

% echo $SFSLABEL 
'SFSLABEL' undefined 
% summary file 
1. SPEECH (1.01) 16640 frames from inwd/SP(freq=12820,linked) 
2. LX (2.01) 16640 frames from inwd/LX(freq=12820,linked) 
% slook file 
1. SPEECH (SP.01) 16640 frames of natural speech
2. LX (LX.01) 16640 frames of natural lx
% cat /usr/mark/.labels 
inwd/SP(freq=%0,*):natural speech at %sHz: 
% prolab /usr/mark/.labels 
% setenv SFSLABEL /usr/mark/.labels:/usr/sfs/data/labels 
% slook file 
1. SPEECH (SP.01) 16640 frames of natural speech at 12820Hz
2. LX (LX.01) 16640 frames of natural lx
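
The way SFSLABEL selects the label files to search can be sketched in Python as below. The function name label_search_path is hypothetical, and the sketch assumes that an empty SFSLABEL is treated in the same way as an undefined one; it only builds the ordered list of file names and does not read the files.

```python
import os


def label_search_path(environ=None):
    """Return the label files to search, in search order."""
    environ = os.environ if environ is None else environ
    value = environ.get("SFSLABEL")
    if not value:
        # SFSLABEL undefined: only the system label file is used.
        return [environ.get("SFSBASE", "") + "/data/labels"]
    # Otherwise a ':'-separated list, searched in the order given.
    return [name for name in value.split(":") if name]


print(label_search_path({"SFSBASE": "/usr/sfs"}))
print(label_search_path(
    {"SFSBASE": "/usr/sfs",
     "SFSLABEL": "/usr/mark/.labels:/usr/sfs/data/labels"}))
```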
