IEEE TRANSACTIONS ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, VOL. ASSP-26, NO. 1, FEBRUARY 1978
A Speaker-Independent Speech-Recognition System
Based on Linear Prediction
Abstract-This paper describes a speaker-independent speech-recognition system using autoregression (linear prediction) on speech samples. Isolated words from a standard 40-word reading test vocabulary are spoken by 25 different speakers. A reference pattern for each word is stored as coefficients of the Yule-Walker equations for 50 consecutive overlapped time windows. Various distance measures are then proposed and evaluated in terms of accuracy of recognition and speed of computation. The best measure gives 90.3 percent rate of recognition. Both the nearest-neighbor and K-nearest-neighbor algorithms are used in the decision scheme implemented. The computation is minimized by making sequential decisions after a fixed number of iterations. It is observed that computationally this distance measure coupled with a nonlinear time-warped function for matching of windows gives optimal results. The number of speakers was then increased to 105 to show the statistical significance of the results obtained in this project. The recognition rate obtained with the best procedure for 105 speakers was 89.2 percent. The recognition time for this procedure was 9.8 seconds per utterance.

INTRODUCTION

LINEAR prediction is becoming increasingly important in speech analysis because of the accuracy with which it forecasts time series data and the speed of computation of its coefficients. The linear predictor coefficients and the autocorrelation values can be used to find the formant frequencies, the spectral envelope, etc., of the speech samples [1], [9]. A significant amount of previous work has been done in this area. Real-time linear predictive coding of speech has been implemented on the SPS-41 [7]. However, the use of linear predictor coefficients as feature parameters for speech recognition has not been very successful [3]. Recently, Itakura [3] proposed a measure of distance based on a maximum likelihood estimate which works well for a single speaker.

Speaker-independent recognition of isolated words poses a number of problems. Each speaker has his own peculiar speech characteristics. Speakers enunciate at different rates, emphasize different parts of the utterance, and have different regional accents. All these facts combine to make the task of speaker-independent recognition of spoken words rather difficult. This has evidently led most researchers working in the speech-recognition area to concentrate on one speaker. The recognition process is further complicated by the fact that the feature parameters exhibit complicated patterns. This fact has led to the belief in this project that it would be impossible to obtain very high recognition rates with only one reference pattern for each word in the vocabulary. It seems more plausible that a number of reference patterns obtained from speakers with varying speech characteristics will more likely yield acceptable recognition rates. This project is an attempt to explore the validity of this idea. The experimental runs made in this project completely justify the above ideas.

In this paper a new distance measure is described for speaker-independent recognition of isolated words. This measure is tested together with the nearest-neighbor decision rule to recognize isolated utterances spoken by different speakers. Investigations show that a combination of the nearest-neighbor rule and the K-nearest-neighbor decision rule gives the best results. This measure is also compared with the one proposed by Itakura [3]. The two distance measures give comparable results. It is also shown that approximately 10 to 14 reference utterances for each word in the vocabulary are necessary for good speaker-independent recognition results.

Manuscript received December 23, 1975; revised August 10, 1976 and May 31, 1977. This paper is taken in part from a thesis submitted by V. N. Gupta to the Department of Electrical and Computer Engineering, Clemson University, Clemson, SC, in partial fulfillment of the requirements for the Ph.D. degree. This work was supported in part by the National Science Foundation under Grant GK-42109.
The authors are with the Department of Electrical and Computer Engineering, Clemson University, Clemson, SC 29631.

0096-3518/78/0200-0027$00.75 © 1978 IEEE
SOME PROPOSED DISTANCE MEASURES

By dividing the utterance into a sufficient number of windows, each window can be assumed to be stationary [1]. This is equivalent to approximating vocal tract shape with a succession of stationary shapes. Under these circumstances, each window can be modeled as an autoregressive process of order $p$, or an AR($p$) process [12]; that is, the present sample can be modeled as the weighted sum of the past $p$ values:

$$ Z_t = \phi_1 Z_{t-1} + \phi_2 Z_{t-2} + \cdots + \phi_p Z_{t-p} + a_t $$

where $a_t$ is a "white noise" process with zero mean and variance $\sigma^2$. The least squares estimate of the parameters $\phi_1, \phi_2, \cdots, \phi_p$ can be obtained by solving the Yule-Walker equations

$$ \rho_k = \sum_{i=1}^{p} \phi_i \, \rho_{|i-k|}, \qquad k = 1, 2, \cdots, p \tag{1} $$

by replacing the theoretical autocorrelations $\rho_k$ by the estimated autocorrelations $r_k$. Since the least squares estimate $\hat{\phi}$ of $\phi = (\phi_1, \phi_2, \cdots, \phi_p)$ is a maximum log likelihood estimate of the parameter $\phi$ [5], [12], a distance measure of the form

[equation (2) not recoverable from the source]

was considered. The $\hat{\phi}_i$ are the estimated linear predictor coefficients of the reference speech sample while $r_k$ are the autocorrelation coefficients of the unknown speech sample. It should be clear that the distance measure is independent of energy in each window since $r_k$ are normalized autocorrelation coefficients ($r_0 = 1$). Distance measures of the form

[equations (3), (4), and (5) not recoverable from the source]

should all give excellent results. (None of these expressions represents a true metric, and thus they are actually similarity measures instead of distance measures. The use of the term "distance" in this paper should be interpreted in this light.) It has been found that only distance measures (3), (4), and (5) give excellent results. This seems intuitively obvious, since some summations in distance measure (2) may have errors which cancel because of opposite signs. This is not possible for (3), (4), and (5). Equation (5) also is expected to perform worse than (3) and (4), since there is higher probability of the $r_k$ fitting exactly one of the $p$ equations (1) and, hence, giving zero residual and showing a very high degree of match. Experimentally, it has been confirmed that distance measures (3) and (4) give the best results. Also (3) and (4) require about the same computation time. Distance measure (5) gives about 2 percent lower rate of recognition and requires slightly more time for computation, since it requires the logarithm to be computed $p$ times.

COMPUTATION OF DISTANCE BETWEEN TWO UTTERANCES

The distance is calculated between each labeled or reference utterance and the input utterance. For each labeled utterance, the parameter $\hat{\phi}$ is stored for each of the $W$ fixed windows. The reference utterance is thus stored as a time-varying pattern of the estimated parameter $\hat{\phi}$, that is,

$$ \hat{\Phi} = (\hat{\phi}(1), \hat{\phi}(2), \cdots, \hat{\phi}(W)), $$

where

$$ \hat{\phi}(w) = (\hat{\phi}_1(w), \hat{\phi}_2(w), \cdots, \hat{\phi}_p(w)), \qquad w = 1, 2, \cdots, W. $$

The autocorrelation coefficients of the $W$ windows of the unknown utterance can be expressed as the sequence

$$ (r(1), r(2), \cdots, r(W)), $$

where

$$ r(w) = (r_0(w), r_1(w), \cdots, r_p(w)), \qquad w = 1, 2, \cdots, W $$

and

$$ r_0(w) = 1, \qquad w = 1, 2, \cdots, W. $$

The distance between the $n$th window of the labeled utterance and the $m$th window of the unknown is given by either (3), (4), or (5). The distance measure for (3) can be written as

$$ d(n, m) = \begin{cases} \log C, & \text{if } C \neq 0 \\ -10, & \text{if } C = 0, \end{cases} $$

where

[definition of $C$ not recoverable from the source]

Note the above procedure is followed to avoid taking the log of zero. The distance $d(n, m) = -10$ corresponds to setting $C$ equal to $e^{-10}$, a small positive value instead of zero. Experimentally, $C$ was never observed to be 0.

The total distance between the labeled and unknown utterance is given by

$$ D = \sum_{m=1}^{W} d(n(m), m), $$

where $n$ is a function of $m$.

In order to specify $n$ as a function of $m$, that is,

$$ n = f(m), $$

two functions have been considered. The first, $f_1$, is specified below, along with its continuity conditions [3]. It should be clear from the definition of $f_1$ that it is a monotone, increasing, integer-valued function. The function never has the same value for more than 2 consecutive integers (windows).
$$ f_1(W) = W, $$

$$ f_1(m) = f_1(m-1) + K, $$

where

$$ K = 0, 1, 2 \quad \text{for all } f_1(m-1) \neq f_1(m-2), $$
$$ K = 1, 2 \quad \text{for all } f_1(m-1) = f_1(m-2), $$
$$ K = 0 \quad \text{if } f_1(m-1) = W. $$

Fig. 1. Recognition using functions f1 and f2 and nearest-neighbor and K-nearest-neighbor algorithms. (Horizontal axis: number of reference samples per utterance. Curves: plots using f1 or f2 with the nearest-neighbor or the K-nearest-neighbor algorithm.)

TABLE I
RANDOM VARIABLE x WITH ITS OBSERVED FREQUENCIES AND THEORETICAL PROBABILITIES USING NEAREST-NEIGHBOR DECISION RULE

x        Observed    Probability   Expected     Observed cell        (O_i - e_i)^2 / e_i
         frequency                 value e_i    frequency O_i
0          0         .00705         0.74025     4 (x = 0, 1 merged)   0.04096
1          4         .0351          3.6855
2          8         .0856          8.988        8                    0.1086
3         19         .1432         15.036       19                    1.0450
4         19         .1772         18.606       19                    0.0083
5         21         .1754         18.417       21                    0.3623
6          5         .1447         15.1935       5                    6.8389
7         12         .1023         10.7415      12                    0.1474
8          7         .06335         6.65175      7                    0.01823
9          8         .03485         3.65925     10 (x >= 9 merged)    1.3487
10         2         .01725         1.81125
>10        0         .0140          1.470
TOTALS   105         1.0          105          105                    9.91839
Another function that has been considered is

$$ f_2(m) = m. $$

There is a special reason for considering the function $f_2$. This function is 2.5 times faster to compute than $f_1$ and gives results close to those obtained when using $f_1$. The recognition procedure consists of recognizing the unknown utterance as the labeled utterance which gives the minimum distance to the unknown utterance.
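The two alignment functions can be illustrated with a short Python sketch. It is not the authors' Fortran program; the function names are hypothetical, and the matrix `d[n][m]` of window distances is taken as given, since any of the measures (3), (4), or (5) could supply it. For $f_1$ the sketch assumes the start condition $f_1(1) = 1$ (not stated explicitly in the text) and reads the constraints as: steps of 0, 1, or 2 reference windows, no two consecutive zero steps, and the path stays at the last reference window once it reaches it.

```python
import math

def total_distance_f2(d):
    """D for the linear alignment f2(m) = m (diagonal of the distance matrix)."""
    return sum(d[m][m] for m in range(len(d)))

def total_distance_f1(d):
    """Minimum D over warping paths n = f1(m), found by dynamic programming.

    d[n][m] is the window distance between reference window n and unknown
    window m (0-based), for two utterances of W windows each.
    """
    W = len(d)
    INF = math.inf
    # best[n][z]: cheapest path ending at reference window n for the current m;
    # z = 1 if the last step was K = 0 (value repeated), else z = 0.
    best = [[INF, INF] for _ in range(W)]
    best[0][0] = d[0][0]  # assumed start condition f1(1) = 1
    for m in range(1, W):
        new = [[INF, INF] for _ in range(W)]
        for n in range(W):
            # K = 0: allowed only if the previous step was not also 0,
            # except at the last reference window, where the path must stay.
            stay = min(best[n]) if n == W - 1 else best[n][0]
            if stay < INF:
                new[n][1] = stay + d[n][m]
            # K = 1 or 2: always allowed when the earlier window exists.
            move = min((min(best[n - k]) for k in (1, 2) if n - k >= 0),
                       default=INF)
            if move < INF:
                new[n][0] = move + d[n][m]
        best = new
    return min(best[W - 1])  # end condition f1(W) = W
```

The linear alignment needs only $W$ window distances, while the warped alignment examines a band of the full matrix, which is consistent with the speed difference reported for $f_2$ over $f_1$.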
APPLICATION OF NEAREST-NEIGHBOR AND K-NEAREST-NEIGHBOR RULES TO SPEECH RECOGNITION

A combination of the nearest-neighbor decision rule and the K-nearest-neighbor decision rule is used to recognize the unknown utterance. If $X_1, X_2, \cdots, X_n$ are $n$ labeled samples and if the unknown sample $X$ is closest to the labeled sample $X_i$, the nearest-neighbor rule assigns $X$ the label associated with $X_i$. The nearest-neighbor rule is a suboptimal procedure; however, with an unlimited number of samples, the error rate is never worse than twice the Bayes rate [13]. This implies that if enough samples are available for comparison, good results can be obtained. Thus, a large number of reference patterns of each utterance must be stored for comparison. The recognition rate has been tested with $n = 1, 2, \cdots, 14$ reference samples of each utterance.
Graphs 1 and 2 in Fig. 1 show results for $n = 1, 2, \cdots, 14$ for distance measure (3) and functions $f_1$ and $f_2$, respectively. The order $p$ of the linear predictive process was 14 for this figure, and 40 words spoken by 25 speakers were used for recognition purposes. This figure shows that the best achieved recognition rate was approximately 88.6 percent for $f_1$ and 84.8 percent for $f_2$, using the nearest-neighbor algorithm. In calculating the recognition rates there were no reference patterns corresponding to the unknown speaker. The question that now arises is how much confidence can be associated with the recognition rate obtained for the speaker-independent isolated word recognition experiment. To answer this question, it becomes necessary to give some statistical arguments for confidence in the above percentages. The following statistical analysis was performed using $n = 14$, $p = 14$, and the nearest-neighbor decision rule.

Suppose that $x$ represents the number of incorrectly recognized words for a speaker for the 40-word vocabulary. After observing the value of $x$ for different speakers, it was hypothesized that the random variable $x$ has a Poisson distribution. To test this hypothesis, a sample consisting of 105 speakers was studied. Table I shows that the observed $x$ ranged from 1 to 10. The expected value for $x$ was estimated from the data to be 4.95. Two statistical tests of the hypothesis that $x$ has a Poisson distribution were performed. The chi-square test [16] was first performed. The chi-square test statistic is given by

$$ \chi^2 = \sum_{i=1}^{k} \frac{(O_i - e_i)^2}{e_i}, $$

where $O_i$ is the observed frequency in cell $i$ and $e_i$ is the expected frequency in cell $i$ under the null hypothesis, that is, $e_i = 105 \Pr\{x \text{ falls in cell } i\}$, for $i = 1, 2, \cdots, k$. The data in Table I are used for the calculation of the chi-square value. Cells are chosen so that $e_i > 4$ in accordance with standard statistical practice [16]. The chi-square value obtained was 9.92. The 5 percent significance level critical statistic for this test would be the 95th percentile of the chi-square distribution with 7 degrees of freedom. This value is found to be 14.1 from standard tables for the chi-square distribution function. Since 9.92 < 14.1, we fail to reject (accept) at the 5 percent level of significance the null hypothesis that $x$ has the Poisson distribution with mean 4.95.
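The chi-square computation can be reproduced in a few lines of Python. This is an illustrative sketch, not part of the original work: the cell counts and expected values are those of Table I, with the cells for $x = 0, 1$ and for $x \geq 9$ merged, which is an inference from the stated requirement $e_i > 4$ and from the 7 degrees of freedom. The variable and function names are hypothetical.

```python
# Observed and expected cell frequencies from Table I (105 speakers).
# Cells: x in {0, 1}, then x = 2..8 individually, then x >= 9.
observed = [4, 8, 19, 19, 21, 5, 12, 7, 10]
expected = [4.42575, 8.988, 15.036, 18.606, 18.417,
            15.1935, 10.7415, 6.65175, 6.9405]

def chi_square(obs, exp):
    """Pearson chi-square statistic: sum over cells of (O_i - e_i)^2 / e_i."""
    return sum((o - e) ** 2 / e for o, e in zip(obs, exp))

chi2 = chi_square(observed, expected)

# Degrees of freedom: cells - 1, minus 1 more for the Poisson mean
# estimated from the data.
dof = len(observed) - 1 - 1

# 95th percentile of chi-square with 7 degrees of freedom, as quoted in the text.
CRITICAL_5_PERCENT = 14.1
reject_poisson = chi2 > CRITICAL_5_PERCENT

print(f"chi-square = {chi2:.2f}, dof = {dof}, reject = {reject_poisson}")
```

Most of the statistic comes from the single cell $x = 6$, where only 5 speakers were observed against an expected 15.19.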
The Kolmogorov-Smirnov One Sample test [16] was also performed to test the hypothesis that $x$ has a Poisson distribution. The Kolmogorov-Smirnov test statistic obtained was 0.0528. The 20 percent significance level critical statistic for this test is $1.07/\sqrt{105} = 0.1044$. Thus, the null hypothesis is accepted here also. The sample variance of $x$ was 5.047. This is close to the theoretical variance of 4.95.

These tests support the hypothesis that $x$ has a Poisson distribution. The 95 percent confidence limits for the mean of the Poisson distribution were calculated to be 5.44 and 4.53. A formula given by Wilks [17, p. 391] was used for this computation. The upper confidence limit for the mean of the errors implies a lower confidence limit of 86.5 percent for an observed recognition rate of 87.6 percent.

The K-nearest-neighbor rule classifies the unknown utterance $X$ by assigning it the label most frequently represented among the $K$ nearest samples, that is, by examining the labels on the $K$ nearest neighbors and taking a vote. In case of a tie, the nearest-neighbor rule was applied. As the number of labeled samples increases, the K-nearest-neighbor rule is expected to give better results than the nearest-neighbor rule [13]. In the experiments performed, the K-nearest-neighbor rule showed a slightly lower percentage of recognition than the nearest-neighbor rule, as is evident from Graphs 3 and 4 in Fig. 1. $K$ was chosen to be 7 for this graph. The result is not surprising [14], since it is difficult to specify precisely what a "large" sample size is and what the distribution function of the predictor coefficients looks like.

Fig. 2. Flowchart for the combined decision procedure.

Even though the K-nearest-neighbor rule does not give better results, it can be used to advantage. A careful evaluation was made of the unknown samples which were recognized incorrectly by one method and correctly by the other method. $K$ again was set equal to 7 for this evaluation. This led to a combined decision procedure which improved the rate of recognition. This combined decision procedure is only used when the nearest-neighbor and the K-nearest-neighbor rules disagree. A flow chart for this combined decision procedure is given in Fig. 2. The term K-NN in the flow chart specifies the K-nearest-neighbor rule and NN specifies the nearest-neighbor rule. Let $K_1$ denote the number of samples of the largest class and $K_2$ denote the number of samples of the second largest class when the K-nearest-neighbor rule is implemented. The term class is used to identify utterances which have the same label. Define $\delta = K_1 - K_2$. Every time $\delta$ was greater than 2, the K-nearest-neighbor rule was used to recognize the unknown utterance. For $\delta$ greater than zero and less than three, the following procedure was used. If the difference between the distance of the nearest neighbor and that of the nearest neighbor of a different label was more than 2 percent of the distance of the nearest neighbor, then the nearest-neighbor rule was chosen over the K-nearest-neighbor rule. If the above difference was less than 2 percent, then the K-nearest-neighbor rule was chosen over the nearest-neighbor rule. This combined decision procedure gave, on the average, a 2 percent increase in recognition rate. A ratio of over 2 percent shows a very good match with the nearest reference pattern, and hence there is a high probability of the new sample having the label associated with the reference sample showing a good match with it. When several reference patterns show a close match with the unknown sample, the reference patterns in the majority are expected to give the correct result, that is, the K-nearest-neighbor rule is expected to perform better. The ratio of 2 percent was determined experimentally while working with a vocabulary size of 20 words. The same combined decision procedure also works well with the vocabulary size of 40 words. This combined decision procedure is used when comparing the unknown utterance with 10 or more reference utterances.

TABLE II
ERROR RATE IN COMBINED DECISION RULE

Reference     LPC    Decisions    Number           Increase in
samples (n)          necessary    misclassified    recog. rate
10            14     101          26               2.6%
12            14      85          21               1.5%
14            14      70          19               [illegible]
10            10     105          30               1.6%
12            10      83          23               2.2%
14            10      75          19               1.7%
10             6     118          35               2.1%
12             6     110          30               2.3%
14             6     104          33               2.4%

Table II shows the error rate of this combined decision procedure. This table was compiled using a 40-word vocabulary spoken by 25 speakers. This table indicates, for example, that for 14 linear predictor coefficients per window and 10 reference samples per utterance, this combined decision procedure was necessary for 101 out of 1000 samples. Seventy-five of these samples were classified correctly and 26 were misclassified. This increased the number of samples recognized by 26 as compared to the nearest-neighbor algorithm. This gave a 2.6 percent increase in recognition rate. The recognition rate using the nearest-neighbor rule was 87.3 percent. Hence, the combined recognition rate becomes 89.9 percent. When the number of speakers was increased to 105, the increase in recognition rate was 1.6 percent. The value of $p$ was 14 and 14 reference samples per word were used for this example. Thus, the recognition rate increased from 87.6 percent to 89.2 percent.

Table III shows a similar analysis for the combined decision rule as was done in the case of the nearest-neighbor decision rule. This analysis uses 40 words spoken by 105 different
speakers. The sample mean for the random variable $x$ was 4.33. The Kolmogorov-Smirnov test statistic obtained was 0.0366 and the chi-square value obtained was 6.62 with 6 degrees of freedom. Thus, here also the hypothesis that the distribution observed is Poisson with mean 4.33 is accepted. The sample variance of $x$ was 4.18, which is close to the theoretical value of 4.33. From Tables I and III it can be found that this combined decision procedure gives an increase in recognition rate of 1.6 percent for 105 speakers.

TABLE III
RANDOM VARIABLE x WITH ITS OBSERVED FREQUENCIES AND THEORETICAL PROBABILITIES USING COMBINED DECISION RULE

x        Observed    Probability   Expected     Observed cell        (O_i - e_i)^2 / e_i
         frequency                 value e_i    frequency O_i
0          0         .0136          1.428       8 (x = 0, 1 merged)   0.02688
1          8         .0583          6.1215
2         10         .1254         13.167       10                    0.7617
3         25         .1798         18.879       25                    1.9846
4         19         .1933         20.2966      19                    0.0828
5         13         .1662         17.451       13                    1.1353
6         11         .1191         12.5055      11                    0.1812
7         12         .0732          7.686       12                    2.4214
8          3         .0393          4.1265       7 (x >= 8 merged)    0.0290
9          4         .0188          1.974
>9         0         .0130          1.365
TOTALS   105         1.0          105          105                    6.62288

Fig. 3. Plots using Itakura's distance measure.
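The combined decision procedure described in the text (nearest-neighbor versus K-nearest-neighbor with $K = 7$, the $\delta = K_1 - K_2$ test, and the 2 percent margin) can be sketched in Python as follows. This is one reading of the prose description, not a transcription of Fig. 2. The function name and the input format are hypothetical, and the use of the absolute value of the nearest distance in the margin test is an assumption made here because the total distance may be negative when window distances are logarithms.

```python
from collections import Counter

def combined_decision(neighbors, K=7, margin=0.02):
    """Pick a label from (distance, label) pairs for all reference utterances."""
    ranked = sorted(neighbors, key=lambda pair: pair[0])
    nn_dist, nn_label = ranked[0]

    # Vote among the K nearest references.
    votes = Counter(label for _, label in ranked[:K]).most_common()
    k1 = votes[0][1]                             # size of the largest class
    k2 = votes[1][1] if len(votes) > 1 else 0    # size of the second largest
    delta = k1 - k2

    # A tie in the vote falls back to the nearest-neighbor rule.
    knn_label = votes[0][0] if delta > 0 else nn_label
    if knn_label == nn_label:
        return nn_label          # the two rules agree; nothing to arbitrate
    if delta > 2:
        return knn_label         # clear majority: trust the vote

    # 0 < delta < 3: compare the nearest neighbor with the nearest
    # neighbor that carries a different label.
    other_dist = next(dist for dist, label in ranked if label != nn_label)
    if other_dist - nn_dist > margin * abs(nn_dist):
        return nn_label          # nearest neighbor is a clearly better match
    return knn_label             # near tie in distance: trust the vote
```

For example, with seven references ranked `a, b, b, b, b, a, a` the vote favors `b` by a margin of one, so the outcome depends on whether the closest `b` lies within 2 percent of the closest `a`.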
To test the hypothesis that the recognition rate using the combined decision procedure is significantly greater than the recognition rate obtained by using the nearest-neighbor rule alone, a sign test [18] was performed. For $n = 14$, $p = 14$, and function $f_1$, it was observed that for 59 out of 105 speakers the combined decision procedure gave a higher recognition rate than the nearest-neighbor method. The two decision procedures gave identical recognition rates for 26 speakers. The nearest-neighbor decision rule gave a better recognition rate than the combined decision procedure for 20 speakers. According to standard statistical procedure, half of the 26 speakers giving identical results were added to each group for the sign test. Thus, the combined decision procedure was assumed to give a better recognition rate for 72 speakers and the nearest-neighbor decision rule was assumed to give better results for 33 speakers. The 5 percent significance level critical value for the one-tailed test is found to be 43 from standard mathematical tables. Since 33 < 43, we reject at the 5 percent level of significance the null hypothesis that the recognition rate using the combined decision procedure is equal to the recognition rate obtained by using the nearest-neighbor decision rule. Thus, the alternate hypothesis that the recognition rate using the combined decision procedure is greater than the recognition rate obtained by using the nearest-neighbor rule is accepted here. The above conclusion, coupled with the fact that the combined decision procedure adds very little computational overhead, warrants its use in place of the nearest-neighbor decision rule in the isolated word-recognition system described here.
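The sign test arithmetic above can be checked with an exact binomial computation. The Python sketch below is illustrative only; the function names are hypothetical, and the critical value is computed directly from the Binomial(105, 1/2) distribution rather than read from tables.

```python
from math import comb

def binom_cdf(c, n):
    """P(X <= c) for X ~ Binomial(n, 1/2), computed exactly."""
    return sum(comb(n, k) for k in range(c + 1)) / 2 ** n

def sign_test(wins, losses, ties, alpha=0.05):
    """One-tailed sign test with ties split evenly between the two groups.

    Returns (count in the smaller group, critical value, p-value).
    Assumes an even number of ties, as in the text (26).
    """
    n = wins + losses + ties
    minus = losses + ties // 2
    # Largest count c whose lower-tail probability does not exceed alpha.
    critical = max(c for c in range(n + 1) if binom_cdf(c, n) <= alpha)
    return minus, critical, binom_cdf(minus, n)

# 59 speakers favor the combined rule, 20 favor nearest-neighbor, 26 ties.
minus, critical, p_value = sign_test(wins=59, losses=20, ties=26)
print(minus, critical, p_value)
```

The observed count of 33 falls well below the critical value, so the conclusion does not depend on how the 26 ties are split.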
The combined decision rule was varied slightly to see its effect on the rate of recognition for the first 25 speakers. The combined decision rule was tested for a decision boundary of 1.5 percent instead of the 2 percent used above. The corresponding increases in recognition rates for 14 linear predictor coefficients and values of $n$ equal to 10, 12, and 14 were 2.1 percent, 1.8 percent, and 1.6 percent, respectively. For the decision boundary using a value of 2.5 percent, the corresponding increases in recognition rate were 2.5 percent, 1.8 percent, and 1.4 percent, respectively. Similar results were obtained using different numbers of linear predictor coefficients. Clearly, there is no appreciable change in the recognition rate for small changes in the decision rule.

(Legend of Fig. 3: one curve uses f1 and the nearest-neighbor algorithm, one uses f1 and the K-nearest-neighbor algorithm, one uses f2 and the nearest-neighbor algorithm, and one uses f2 and the K-nearest-neighbor algorithm.)
Figs. 1 and 3 compare the recognition rate using the distance measure proposed here with the one proposed by Itakura. For Itakura's method, the parameters $c(m; k)$ and $b(m; k)$ for the $m$th segment of the $k$th reference pattern are calculated and stored on the disk. In the recognition process the distance between the $n$th segment of the input and the $m$th segment of a reference pattern is calculated as

$$ d(n, m; k) = c(m; k) + \log\left[\frac{b(m; k)\, r(n)}{\hat{a}(n)\, r(n)}\right]. $$

The notation used above is that of Itakura [3]. From the graph it is clear that the recognition rate for Itakura's distance measure also increases with an increase in the number of reference patterns for each word in the vocabulary. The recognition rate reaches a saturation level after about 10 reference patterns for each word in the vocabulary. From Fig. 3 it is clear that the highest recognition rate achieved for Itakura's method is approximately 88.4 percent. Fig. 3 was drawn using 14 linear predictor coefficients. Itakura's method also shows that approximately 10 to 14 reference patterns for each word in the vocabulary are necessary for achieving good speaker-independent recognition results. The combined decision procedure works well for Itakura's method also. For example, for $p = 14$, $n = 10$, the combined decision procedure gives an increase of 1.6 percent in recognition rate as compared to the nearest-neighbor decision rule.
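As a small illustration of the segment distance used for Itakura's method, the expression above can be evaluated as follows. This is a sketch under assumptions: $b(m; k)$, $\hat{a}(n)$, and $r(n)$ are treated as equal-length coefficient vectors whose products are inner products, following the notation of [3], and the function name is hypothetical.

```python
import math

def itakura_segment_distance(c_mk, b_mk, a_hat_n, r_n):
    """Evaluate c(m;k) + log[(b(m;k) . r(n)) / (a_hat(n) . r(n))].

    c_mk    : stored constant for segment m of reference pattern k
    b_mk    : stored vector for segment m of reference pattern k
    a_hat_n : vector estimated from input segment n
    r_n     : autocorrelation vector of input segment n
    """
    numerator = sum(b * r for b, r in zip(b_mk, r_n))
    denominator = sum(a * r for a, r in zip(a_hat_n, r_n))
    return c_mk + math.log(numerator / denominator)
```

Storing $c(m; k)$ and $b(m; k)$ on disk, as the text describes, means that only the two inner products and one logarithm are needed per segment pair at recognition time.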
EXPERIMENTAL PROCEDURE AND RESULTS

This section describes the entire setup and the significant results obtained in this project. The samples were taken on a PDP-15 computer interfaced to an EAI-680 analog computer. Fig. 4 shows the general sampling setup, which includes a cardioid microphone. The samples were taken in the computing room with the inherent high noise level. The noise level was measured to be 71 dB-SPL. The sampling terminates when 2000 consecutive samples stay below a prespecified level. False triggering is avoided by rejecting any sample utterance having less than 0.1-second duration. The samples were verified by playing them back through an amplifier connected to a digital-to-analog converter to make sure that the beginning and the end of the word were not missed. The sampling frequency is 10 kHz. The samples are stored on a magtape, and later classified using an IBM 370/165. The vocabulary consists of a standard 40-word reading test vocabulary (see Appendix) [19], spoken by 105 different speakers. The speakers were all male, with a wide range of regional accents. The number of samples for the same utterance varies from 1500 to 8000 depending on the speaker. A total of 4200 isolated utterances were used in the recognition scheme. The first 1000 samples were used to plot the various graphs. All the 4200 samples were used for statistical justification of the results obtained with the smaller sample set.

Fig. 4. Procedure for generating reference parameters. (Block labels legible in the source: 10 kHz clock; 5 kHz cutoff; set-up for taking speech sample; flowchart to calculate autocorrelation (r_k) and linear predictor (LPC) coefficients.)

Fig. 5. Flow chart for recognition procedure.

Fig. 5 shows a flow chart for the recognition process. It was found experimentally that differencing once in the time domain gave the best results. Each utterance is divided into 50 windows, which are 50 percent overlapped. (Note that this implies windows of varying length from utterance to utterance.) A Hanning window is applied to each of these windows, and then autocorrelation coefficients are calculated for each window. The linear predictor coefficients are calculated for each window. The linear predictor coefficients are found by solving the Yule-Walker equations (1), using an iterative method [1].

The autocorrelation coefficients of the unknown sample and the linear predictor coefficients of the known sample are used to calculate the total distance $D$. As is evident from the flow chart, a sequential decision procedure is used to reduce computation time. After calculating distances for the first 4 windows, one half of the reference samples are rejected. These are the reference samples which give higher distances for the first 4 windows. A similar decision is taken after the 8th, 16th, and 32nd windows. This reduces the computation to about one fifth while it has practically no effect on the recognition rate.

The number of linear predictor coefficients $p$ is varied to find its ideal value. Fig. 6 shows the recognition rate using three different values for $p$. This figure is drawn with the help of 40 words spoken by 25 speakers. It is evident that $p = 14$ gives the best results. The entire recognition program is written in Fortran and compiled using an H-level Fortran compiler with a second level of optimization. From Fig. 6 it is clear that for $n = 14$, $p = 14$ and $p = 10$ give the same rate of recognition. Thus, distance measure (3) in conjunction with function $f_1$, $p = 10$, and $n = 14$, together with the combined decision procedure, appears to be a good choice to be used in the recognition system. This procedure takes about 7.5 seconds to recognize each utterance. The recognition rate using the nearest-neighbor rule in this case was 88.6 percent. The increase in recognition rate using the combined decision procedure was 1.7 percent. Hence, the total recognition rate was 90.3 percent. When the number of speakers was increased to 105, the recognition rate obtained was 89.2 percent for $p = 14$, $n = 14$, and function $f_1$. The recognition time in this case was 9.8 seconds for each utterance. The function $f_2$ together with $p = 10$ and $n = 14$ takes 3.4 seconds to recognize each utterance. Function $f_2$ is about twice as fast as $f_1$ but gives approximately 5 percent lower rate of recognition. Table IV shows major misclassifications using 14 linear predictor coefficients, $n = 14$, function $f_1$, together with the combined decision procedure. The samples of all the 105 speakers are used in compiling this table. The majority of the misclassifications are due to the same vowel sound. Some misclassifications are due to similarities in the beginning or ending of the utterance.

CONCLUSION

A new distance measure for speaker-independent recognition of isolated words is given here. It is clear from the results that the recognition rate increases with $p$ (the number of predictor coefficients) and with $n$ (the number of reference samples per utterance). The recognition rate reaches the saturation level after a certain number of reference samples per utterance. A value of $p$ between 10 and 14 is shown to
be optimal. A sequential decision procedure is used to reduce computation time. For a 40-word vocabulary, the recognition rate achieved was 89.2 percent with computation time 9.8 seconds per word. This recognition rate has been shown to be statistically significant. It has been shown that the number of words misrecognized for each speaker follows a Poisson distribution. The mean obtained for this distribution is 4.33. This implies that 95 percent of the time the recognition rate for a new speaker will lie between 78.8 percent and 99.6 percent. Also, it has been shown that the 95 percent confidence interval for the average recognition rate of 89.2 percent is between 88.1 and 90.15 percent.

APPENDIX

The 40-word reading test vocabulary used is as follows: here, run, play, is, down, help, ball, mother, see, look, come, can, little, one, up, baby, make, want, three, jump, with, friends, came, horse, ride, under, was, what, bump, live, very, puppy, dark, first, wish, basket, food, road, hill, along.

ACKNOWLEDGMENT

The authors would like to thank the reviewers for their constructive comments.

Fig. 6. Comparison of recognition rate using different number of linear predictor coefficients. (Horizontal axis: number of reference samples per utterance.)

TABLE IV
MAJOR MISCLASSIFICATIONS USING FUNCTION f1, LPC = 14, AND n = 14

Columns: Input Utterance; Number of Errors; Recognized As. (The row alignment of this table did not survive extraction; the legible entries are listed below in source order, with "?" marking unreadable counts.)

Number of errors: 41, 41, 24, 22, 19, 19, 13, 17, 16, 15, 14, 15, 13, 12, 11, 11.

Entries, first part: (?), what (?), (3), horse; want; one (20), was; along; what; was (19), want (11), one (8), run (2); up; bump (9), puppy (4), help (2), come (2), dark (2), ball, want, one, what, road; little; live (8), under (4), hill (3), ball, with, along; bump; along (2), up (3), ball (3), mother (?), come (2), what, down, one, look, help, little; bump; road; puppy; (3), run; run.

Entries, second part: live; with (7), hill (2), little (2), up, very, jump, make, one, road, can; along; come (5), down (4), ball (4), help; jump; food; road (4), baby (3), look, see, can, three, along; look; live (3), little (2), up (2), was (2), run, one (2), down; (3), bump (2), here (2); live (2); what, bump, puppy, hill, ride; (15); ride; run; very; can (4), with (4), three; dark; ride; (2), here (2), live (2); (10), one (2), road (2); hill (5); live (4), what (2), mother, little; hill (6), live (3), can (2), help; was (6), want (2), with, what, play; came (3), food (3), make (2), here, come, live.
REFERENCES

[1] B. S. Atal and S. L. Hanauer, "Speech analysis and synthesis by linear prediction of the speech wave," J. Acoust. Soc. Amer., vol. 50, no. 2, pp. 637-655, 1971.
[2] B. Gold and C. M. Rader, Digital Processing of Signals. New York: McGraw-Hill, 1969.
[3] F. Itakura, "Minimum prediction residual principle applied to speech recognition," IEEE Trans. Acoust., Speech, Signal Processing, vol. ASSP-23, pp. 67-72, Feb. 1975.
[4] F. Itakura and S. Saito, "Analysis synthesis telephony based on the maximum likelihood method," in Rep. 6th Int. Cong. Acoust., Y. Kohasi, Ed., pp. C17-C20, paper C-5-5, Aug. 1968.
[5] F. Itakura and S. Saito, "A statistical method for estimation of speech spectral density and formant frequencies," Electron. Commun. (Japan), vol. 53-A, no. 1, pp. 36-43, 1970.
[6] F. Itakura and S. Saito, "On the optimum quantization of feature parameters in the Parcor speech synthesizer," in IEEE Conf. Rec., 1972.
[7] M. J. Knudsen, "Real-time linear predictive coding of speech on the SPS-41 triple-microprocessor machine," IEEE Trans. Acoust., Speech, Signal Processing, vol. ASSP-23, pp. 140-145, Feb. 1975.
[8] J. Makhoul, "Spectral analysis of speech by linear prediction," IEEE Trans. Audio Electroacoust., vol. AU-21, pp. 140-148, June 1973.
[9] J. Makhoul, "Linear prediction: A tutorial review," Proc. of the IEEE, vol. 63, pp. 561-580, Apr. 1975.
[10] J. D. Markel, "Digital inverse filtering-A new tool for formant trajectory estimation," IEEE Trans. Audio Electroacoust., vol. AU-20, pp. 129-137, June 1972.
[11] R. W. Schafer and L. R. Rabiner, "Digital representations of speech signals," Proc. of the IEEE, vol. 63, pp. 662-677, Apr. 1975.
[12] G. E. Box and G. M. Jenkins, Time Series Analysis Forecasting and Control. San Francisco, CA: Holden-Day, 1970.
[13] R. O. Duda and P. E. Hart, Pattern Classification and Scene Analysis. New York: Wiley, 1973.
[14] G. M. White and P. J. Fong, "K-nearest-neighbor decision rule performance in a speech recognition system," IEEE Trans. Systems, Man, Cybernet., vol. SMC-5, p. 389, May 1975.
[15] E. Parzen, "Some recent advances in time series analysis," in Statistical Models and Turbulence. New York: Springer-Verlag, 1972.
[16] B. W. Lindgren, Statistical Theory. New York: Macmillan Company, 1960.
[17] S. S. Wilks, Mathematical Statistics. New York: Wiley, 1962.
[18] J. D. Gibbons, Nonparametric Statistical Inference. New York: McGraw-Hill, 1970.
[19] R. L. Slosson, Slosson Oral Reading Test. New York: Slosson Educational Publication, 1963.