Speaker independent speech recognition system and method

Khaled Assaleh

Research in automatic speech and speaker recognition has now spanned five decades. This paper surveys the major themes and advances made in the past fifty years of research so as to provide a technological perspective and an appreciation of the fundamental progress that has been accomplished in this important area of speech communication. Although many techniques have been developed, many challenges have yet to be overcome before we can achieve the ultimate goal of creating machines that can communicate naturally with people. Such a machine needs to be able to deliver a satisfactory performance under a broad range of operating conditions. A much greater understanding of the human speech process is required before automatic speech and speaker recognition systems can approach human performance.

The features underlying speech data, including speaker features, are useful in speech recognition, speech processing, speech coding, and speech clustering. We give a brief description of the area of speaker recognition, speech applications, and their underlying techniques. The review of automatic speech recognition (ASR) discusses some of the positive and negative aspects of speaker recognition technologies and outlines potential trends in research, development, and applications.

Speech recognition is about what is being said, irrespective of who is saying it. Speech recognition is a growing field, and major progress is taking place in the technology of automatic speech recognition (ASR). Still, there are many barriers in this field in terms of recognition rate, background noise, speaker variability, speaking rate, accent, etc. The recognition rate depends mainly on the selection of features and feature extraction methods. This paper outlines the feature extraction techniques for speaker dependent speech recognition for isolated words. A brief survey of different feature extraction techniques, namely Mel-Frequency Cepstral Coefficients (MFCC), Linear Predictive Coding Coefficients (LPCC), Perceptual Linear Prediction (PLP), and Relative Spectra Perceptual Linear Predictive (RASTA-PLP) analysis, is presented and the techniques are evaluated. Speech recognition has various applications, from daily use to commercial use. We have made a speaker dependent system and this system can ...

IEEE TRANSACTIONS ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, VOL. ASSP-26, NO. 1, FEBRUARY 1978

A Speaker-Independent Speech-Recognition System Based on Linear Prediction

Abstract-This paper describes a speaker-independent speech-recognition system using autoregression (linear prediction) on speech samples. Isolated words from a standard 40-word reading test vocabulary are spoken by 25 different speakers. A reference pattern for each word is stored as coefficients of the Yule-Walker equations for 50 consecutive overlapped time windows. Various distance measures are then proposed and evaluated in terms of accuracy of recognition and speed of computation. The best measure gives 90.3 percent rate of recognition. Both the nearest-neighbor and K-nearest-neighbor algorithms are used in the decision scheme implemented. The computation is minimized by making sequential decisions after a fixed number of iterations. It is observed that computationally this distance measure coupled with a nonlinear time-warped function for matching of windows gives optimal results. The number of speakers was then increased to 105 to show the statistical significance of the results obtained in this project. The recognition rate obtained with the best procedure for 105 speakers was 89.2 percent. The recognition time for this procedure was 9.8 seconds per utterance.

INTRODUCTION

Linear prediction is becoming increasingly important in speech analysis because of the accuracy with which it forecasts time series data and the speed of computation of its coefficients. The linear predictor coefficients and the autocorrelation values can be used to find the formant frequencies, the spectral envelope, etc., of the speech samples [1], [9]. A significant amount of previous work has been done in this area. Real-time linear predictive coding of speech has been implemented on the SPS-41 [7]. However, the use of linear predictor coefficients as feature parameters for speech recognition has not been very successful [3]. Recently, Itakura [3] proposed a measure of distance based on a maximum likelihood estimate which works well for a single speaker.

Speaker-independent recognition of isolated words poses a number of problems. Each speaker has his own peculiar speech characteristics. Speakers enunciate at different rates, emphasize different parts of the utterance, and have different regional accents. All these facts combine to make the task of speaker-independent recognition of spoken words rather difficult. This has evidently led most researchers working in the speech-recognition area to concentrate on one speaker. The recognition process is further complicated by the fact that the feature parameters exhibit complicated patterns. This fact has led to the belief in this project that it would be impossible to obtain very high recognition rates with only one reference pattern for each word in the vocabulary. It seems more plausible that a number of reference patterns obtained from speakers with varying speech characteristics will more likely yield acceptable recognition rates. This project is an attempt to explore the validity of this idea. The experimental runs made in this project completely justify the above ideas.

In this paper a new distance measure is described for speaker-independent recognition of isolated words. This measure is tested together with the nearest-neighbor decision rule to recognize isolated utterances spoken by different speakers. Investigations show that a combination of the nearest-neighbor rule and the K-nearest-neighbor decision rule gives the best results. This measure is also compared with the one proposed by Itakura [3]. The two distance measures give comparable results. It is also shown that approximately 10 to 14 reference utterances for each word in the vocabulary are necessary for good speaker-independent recognition results.

Manuscript received December 23, 1975; revised August 10, 1976 and May 31, 1977. This paper is taken in part from a thesis submitted by V. N. Gupta to the Department of Electrical and Computer Engineering, Clemson University, Clemson, SC, in partial fulfillment of the requirements for the Ph.D. degree. This work was supported in part by the National Science Foundation under Grant GK-42109. The authors are with the Department of Electrical and Computer Engineering, Clemson University, Clemson, SC 29631.

0096-3518/78/0200-0027$00.75 © 1978 IEEE

SOME PROPOSED DISTANCE MEASURES

By dividing the utterance into a sufficient number of windows, each window can be assumed to be stationary [1]. This is equivalent to approximating vocal tract shape with a succession of stationary shapes.
Under these circumstances, each window can be modeled as an autoregressive process of order $p$, or an AR($p$) process [12]; that is, the present sample can be modeled as the weighted sum of the past $p$ values:

$$z_t = \phi_1 z_{t-1} + \phi_2 z_{t-2} + \cdots + \phi_p z_{t-p} + a_t$$

where $a_t$ is a "white noise" process with zero mean and variance $\sigma^2$. The least squares estimate of the parameters $\phi_1, \phi_2, \ldots, \phi_p$ can be obtained by solving the Yule-Walker equations

$$\rho_k = \sum_{i=1}^{p} \phi_i \rho_{|i-k|}, \qquad k = 1, 2, \ldots, p \tag{1}$$

by replacing the theoretical autocorrelations $\rho_k$ by the estimated autocorrelations $r_k$. Since the least squares estimate $\hat{\phi}$ of $\phi = (\phi_1, \phi_2, \ldots, \phi_p)$ is a maximum log likelihood estimate of the parameter $\phi$ [5], [12], a distance measure of the form (2) was considered. The $\hat{\phi}_i$ are the estimated linear predictor coefficients of the reference speech sample while $r_k$ are the autocorrelation coefficients of the unknown speech sample. It should be clear that the distance measure is independent of energy in each window since $r_k$ are normalized autocorrelation coefficients ($r_0 = 1$).

Distance measures of the forms (3), (4), and (5) should all give excellent results. (None of these expressions represents a true metric, and thus they are actually similarity measures instead of distance measures. The use of the term "distance" in this paper should be interpreted in this light.) It has been found that only distance measures (3), (4), and (5) give excellent results. This seems intuitively obvious, since some summations in distance measure (2) may have errors which cancel because of opposite signs. This is not possible for (3), (4), and (5). Equation (5) also is expected to perform worse than (3) and (4), since there is higher probability of the $r_k$ fitting exactly one of the $p$ equations (1) and, hence, giving zero residual and showing a very high degree of match. Experimentally, it has been confirmed that distance measures (3) and (4) give the best results. Also (3) and (4) require about the same computation time. Distance measure (5) gives about 2 percent lower rate of recognition and requires slightly more time for computation, since it requires the logarithm to be computed $p$ times.

COMPUTATION OF DISTANCE BETWEEN TWO UTTERANCES

The distance is calculated between each labeled or reference utterance and the input utterance. For each labeled utterance, the parameter $\hat{\phi}$ is stored for each of the $W$ fixed windows. The reference utterance is thus stored as a time-varying pattern of the estimated parameter $\hat{\phi}$, that is,

$$\hat{\Phi} = (\hat{\phi}(1), \hat{\phi}(2), \ldots, \hat{\phi}(W)),$$

where

$$\hat{\phi}(w) = (\hat{\phi}_1(w), \hat{\phi}_2(w), \ldots, \hat{\phi}_p(w)), \qquad w = 1, 2, \ldots, W.$$

The autocorrelation coefficients of the $W$ windows of the unknown utterance can be expressed as $(r(1), r(2), \ldots, r(W))$, where

$$r(w) = (r_0(w), r_1(w), \ldots, r_p(w)), \qquad w = 1, 2, \ldots, W,$$

and $r_0(w) = 1$, $w = 1, 2, \ldots, W$.

The distance between the $n$th window of the labeled utterance and the $m$th window of the unknown is given by either (3), (4), or (5). The distance measure for (3) can be written as

$$d(n, m) = \begin{cases} \log C, & \text{if } C \neq 0 \\ -10, & \text{if } C = 0, \end{cases}$$

where $C$ denotes the argument of the logarithm in (3). Note the above procedure is followed to avoid taking the log of zero. The distance $d(n, m) = -10$ corresponds to setting $C$ equal to $e^{-10}$, a small positive value instead of zero. Experimentally, $C$ was never observed to be 0.

The total distance between the labeled and unknown utterance is given by

$$D = \sum_{m=1}^{W} d(n(m), m),$$

where $n$ is a function of $m$. In order to specify $n$ as a function of $m$, that is, $n = f(m)$, two functions have been considered. The first, $f_1$, is specified below, along with its continuity conditions [3]. It should be clear from the definition of $f_1$ that it is a monotone, increasing, integer-valued function. The function never has the same value for more than 2 consecutive integers (windows).

$$f_1(m) = f_1(m-1) + K, \qquad m \geq 2,$$

with $K = 0, 1, 2$ for all $f_1(m-1) \neq f_1(m-2)$, $K = 1, 2$ for all $f_1(m-1) = f_1(m-2)$, and $f_1(m) = W$ if $f_1(m-1) = W$.

Another function that has been considered is $f_2(m) = m$. There is a special reason for considering the function $f_2$. This function is 2.5 times faster to compute than $f_1$ and gives results close to those obtained when using $f_1$.

The recognition procedure consists of recognizing the unknown utterance as the labeled utterance which gives the minimum distance to the unknown utterance.

GUPTA et al.: SPEAKER-INDEPENDENT SPEECH-RECOGNITION SYSTEM

Fig. 1. Recognition using functions $f_1$ and $f_2$ and nearest-neighbor and K-nearest-neighbor algorithms. (Horizontal axis: number of reference samples per utterance. Plots 1 and 2 use $f_1$ and $f_2$ with the nearest-neighbor algorithm; plots 3 and 4 use $f_1$ and $f_2$ with the K-nearest-neighbor algorithm.)

TABLE I
RANDOM VARIABLE x WITH ITS OBSERVED FREQUENCIES AND THEORETICAL PROBABILITIES USING NEAREST-NEIGHBOR DECISION RULE

x        Observed     Probability   Expected     Observed Cell   (O_i - e_i)^2 / e_i
         Frequency                  Value e_i    Frequency O_i
0        0            .00705        0.74025
1        4            .0351         3.6855       4               0.04096
2        8            .0856         8.988        8               0.1086
3        19           .1432         15.036       19              1.0450
4        19           .1772         18.606       19              0.0083
5        21           .1754         18.417       21              0.3623
6        5            .1447         15.1935      5               6.8389
7        12           .1023         10.7415      12              0.1474
8        7            .06335        6.65175      7               0.01823
9        8            .03485        3.65925
10       2            .01725        1.81125
>10      0            .0140         1.470        10              1.3487
TOTALS   105          1.0           105          105             9.91839

(Cells x = 0 and 1 are pooled, as are cells x = 9, 10, and >10; the pooled cell frequency and its contribution are listed on the last row of each group.)

APPLICATION OF NEAREST-NEIGHBOR AND K-NEAREST-NEIGHBOR RULES TO SPEECH RECOGNITION

A combination of the nearest-neighbor decision rule and the K-nearest-neighbor decision rule is used to recognize the unknown utterance. If $X_1, X_2, \ldots, X_n$ are $n$ labeled samples and if the unknown sample $X$ is closest to the labeled sample $X_i$, the nearest-neighbor rule assigns $X$ the label associated with $X_i$.
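As a rough illustration of the matching step, the sketch below computes the total distance D with the linear alignment f2(m) = m and then applies the nearest-neighbor rule. The per-window distance used here, a squared residual of the Yule-Walker equations (1), is only a stand-in chosen for the example and is not one of the distance measures (2)-(5); the function names and the toy reference patterns are likewise assumptions made for illustration.

```python
def yule_walker_residual(phi, r):
    """Hypothetical per-window distance (NOT a measure from the paper).

    phi: reference predictor coefficients (phi_1 .. phi_p)
    r:   normalized autocorrelations of the unknown (r_0 .. r_p), r_0 = 1
    Returns the sum over k of (r_k - sum_i phi_i * r_|i-k|)^2, i.e. how
    badly r fits the p equations (1) of the reference window.
    """
    p = len(phi)
    total = 0.0
    for k in range(1, p + 1):
        predicted = sum(phi[i - 1] * r[abs(i - k)] for i in range(1, p + 1))
        total += (r[k] - predicted) ** 2
    return total


def total_distance(ref_windows, unk_windows, window_distance):
    """D = sum over m of d(f2(m), m) with the linear alignment f2(m) = m."""
    return sum(window_distance(phi, r)
               for phi, r in zip(ref_windows, unk_windows))


def nearest_neighbor(references, unk_windows,
                     window_distance=yule_walker_residual):
    """Return the label of the reference pattern with minimum total distance."""
    best_label, best_distance = None, float("inf")
    for label, ref_windows in references:
        d = total_distance(ref_windows, unk_windows, window_distance)
        if d < best_distance:
            best_label, best_distance = label, d
    return best_label


# Toy data (assumed): W = 3 windows, p = 2 coefficients per window.
W = 3
references = [
    ("run", [(0.5, 0.2)] * W),
    ("play", [(-0.3, 0.1)] * W),
]
# Autocorrelations that satisfy equations (1) exactly for phi = (0.5, 0.2).
unknown = [(1.0, 0.625, 0.5125)] * W

print(nearest_neighbor(references, unknown))  # prints: run
```

Swapping in the nonlinear alignment f1 would only change `total_distance`; the nearest-neighbor step is unaffected by the choice of alignment.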
The nearest-neighbor rule is a suboptimal procedure; however, with an unlimited number of samples, the error rate is never worse than twice the Bayes rate [13]. This implies that if enough samples are available for comparison, good results can be obtained. Thus, a large number of reference patterns of each utterance must be stored for comparison. The recognition rate has been tested with $n = 1, 2, \ldots, 14$ reference samples of each utterance. Graphs 1 and 2 in Fig. 1 show results for $n = 1, 2, \ldots, 14$ for distance measure (3) and functions $f_1$ and $f_2$, respectively. The order $p$ of the linear predictive process was 14 for this figure, and 40 words spoken by 25 speakers were used for recognition purposes. This figure shows that the best achieved recognition rate was approximately 88.6 percent for $f_1$ and 84.8 percent for $f_2$, using the nearest-neighbor algorithm. In calculating the recognition rates there were no reference patterns corresponding to the unknown speaker.

The question that now arises is how much confidence can be associated with the recognition rate obtained for the speaker-independent isolated word recognition experiment. To answer this question, it becomes necessary to give some statistical arguments for confidence in the above percentages. The following statistical analysis was performed using $n = 14$, $p = 14$, and the nearest-neighbor decision rule.

Suppose that $x$ represents the number of incorrectly recognized words for a speaker for the 40-word vocabulary. After observing the value of $x$ for different speakers, it was hypothesized that the random variable $x$ has a Poisson distribution. To test this hypothesis, a sample consisting of 105 speakers was studied. Table I shows that the observed $x$ ranged from 1 to 10. The expected value for $x$ was estimated from the data to be 4.95. Two statistical tests of the hypothesis that $x$ has a Poisson distribution were performed. The chi-square test [16] was first performed.
The chi-square test statistic is given by

$$\chi^2 = \sum_{i=1}^{k} \frac{(O_i - e_i)^2}{e_i},$$

where $O_i$ is the observed frequency in cell $i$ and $e_i$ is the expected frequency in cell $i$ under the null hypothesis, that is, $e_i = 105 \Pr\{x \text{ falls in cell } i\}$, for $i = 1, 2, \ldots, k$. The data in Table I are used for the calculation of the chi-square value. Cells are chosen so that $e_i > 4$ in accordance with standard statistical practice [16]. The chi-square value obtained was 9.92. The 5 percent significance level critical statistic for this test would be the 95th percentile of the chi-square distribution with 7 degrees of freedom. This value is found to be 14.1 from standard tables for the chi-square distribution function. Since 9.92 < 14.1, we fail to reject (accept) at the 5 percent level of significance the null hypothesis that $x$ has the Poisson distribution with mean 4.95.

The Kolmogorov-Smirnov one-sample test [16] was also performed to test the hypothesis that $x$ has a Poisson distribution. The Kolmogorov-Smirnov test statistic obtained was 0.0528. The 20 percent significance level critical statistic for this test is $1.07/\sqrt{105} = 0.1044$. Thus, the null hypothesis is accepted here also. The sample variance of $x$ was 5.047. This is close to the theoretical variance of 4.95. These tests support the hypothesis that $x$ has a Poisson distribution. The 95 percent confidence limits for the mean of the Poisson distribution were calculated to be 5.44 and 4.53. A formula given by Wilks [17, p. 391] was used for this computation. The upper confidence limit for the mean of the errors implies a lower confidence level of 86.5 percent for an observed recognition rate of 87.6 percent.

The K-nearest-neighbor rule classifies the unknown utterance $X$ by assigning it the label most frequently represented among the $K$ nearest samples, that is, by examining the labels on the $K$ nearest neighbors and taking a vote. In case of a tie, the nearest-neighbor rule was applied. As the number of labeled samples increases, the K-nearest-neighbor rule is expected to give better results than the nearest-neighbor rule [13]. In the experiments performed, the K-nearest-neighbor rule showed a slightly lower percentage of recognition than the nearest-neighbor rule, as is evident from Graphs 3 and 4 in Fig. 1. $K$ was chosen to be 7 for this graph. The result is not surprising [14], since it is difficult to specify precisely what a "large" sample size is and what the distribution function of the predictor coefficients looks like.

Even though the K-nearest-neighbor rule does not give better results, it can be used to advantage. A careful evaluation was made of the unknown samples which were recognized incorrectly by one method and correctly by the other method. $K$ again was set equal to 7 for this evaluation. This led to a combined decision procedure which improved the rate of recognition. This combined decision procedure is only used when the nearest-neighbor and the K-nearest-neighbor rules disagree. A flow chart for this combined decision procedure is given in Fig. 2. The term K-NN in the flow chart specifies the K-nearest-neighbor rule and NN specifies the nearest-neighbor rule.

Fig. 2. Flowchart for the combined decision procedure.

Let $K_1$ denote the number of samples of the largest class and $K_2$ denote the number of samples of the second largest class when the K-nearest-neighbor rule is implemented. The term class is used to identify utterances which have the same label. Define $\delta = K_1 - K_2$. Every time $\delta$ was greater than 2, the K-nearest-neighbor rule was used to recognize the unknown utterance. For $\delta$ greater than zero and less than three, the following procedure was used. If the difference between the distance of the nearest neighbor and that of the nearest neighbor of a different label was more than 2 percent of the distance of the nearest neighbor, then the nearest-neighbor rule was chosen over the K-nearest-neighbor rule. If the above difference was less than 2 percent, then the K-nearest-neighbor rule was chosen over the nearest-neighbor rule. This combined decision procedure gave, on the average, a 2 percent increase in recognition rate. The ratio of over 2 percent shows a very good match with the nearest reference pattern and hence there is a high probability of the new sample having the label associated with the reference sample showing a good match with it. When several reference patterns show a close match with the unknown sample, the reference patterns in majority are expected to give the correct result, that is, the K-nearest-neighbor rule is expected to perform better. The ratio of 2 percent was determined experimentally while working with a vocabulary size of 20 words. The same combined decision procedure also works well with the vocabulary size of 40 words. This combined decision procedure is used when comparing the unknown utterance with 10 or more reference utterances.

Table II shows the error rate of this combined decision procedure. This table was compiled using a 40-word vocabulary spoken by 25 speakers. This table indicates, for example, that for 14 linear predictor coefficients per window and 10 reference samples per utterance, this combined decision procedure was necessary for 101 out of 1000 samples. Seventy-five of these samples were classified correctly and 26 were misclassified. This increased the number of samples recognized by 26 as compared to the nearest-neighbor algorithm. This gave a 2.6 percent increase in recognition rate. The recognition rate using the nearest-neighbor rule was 87.3 percent. Hence, the combined recognition rate becomes 89.9 percent.

TABLE II
ERROR RATE IN COMBINED DECISION RULE

Number of           Number    Decisions    Number          Increase in
Reference Samples   of LPC    Necessary    Misclassified   Recog. Rate
10                  14        101          26              2.6%
12                  14        85           21              2%
14                  14        70           19              1.5%
10                  10        105          30              1.6%
12                  10        83           23              2.2%
14                  10        75           19              1.7%
10                  6         118          35              2.1%
12                  6         110          30              2.3%
14                  6         104          33              2.4%

When the number of speakers was increased to 105, the increase in recognition rate was 1.6 percent. The value of $p$ was 14 and 14 reference samples per word were used for this example. Thus, the recognition rate increased from 87.6 percent to 89.2 percent. Table III shows a similar analysis for the combined decision rule as was done in the case of the nearest-neighbor decision rule. This analysis uses 40 words spoken by 105 different speakers.

TABLE III
RANDOM VARIABLE x WITH ITS OBSERVED FREQUENCIES AND THEORETICAL PROBABILITIES USING COMBINED DECISION RULE

x        Observed     Probability   Expected     Observed Cell   (O_i - e_i)^2 / e_i
         Frequency                  Value e_i    Frequency O_i
0        0            .0136         1.428
1        8            .0583         6.1215       8               0.02688
2        10           .1254         13.167       10              0.7617
3        25           .1798         18.879       25              1.9846
4        19           .1933         20.2966      19              0.0828
5        13           .1662         17.451       13              1.1353
6        11           .1191         12.5055      11              0.1812
7        12           .0732         7.686        12              2.4214
8        3            .0393         4.1265
9        4            .0188         1.974
>9       0            .0130         1.365        7               0.0290
TOTALS   105          1.0           105          105             6.62288

(Cells x = 0 and 1 are pooled, as are cells x = 8, 9, and >9; the pooled cell frequency and its contribution are listed on the last row of each group.)

The sample mean for the random variable $x$ was 4.33. The Kolmogorov-Smirnov test statistic obtained was 0.0366 and the chi-square value obtained was 6.62 with 6 degrees of freedom. Thus, here also the hypothesis that the distribution observed is Poisson with mean 4.33 is accepted.
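The chi-square value quoted for the combined decision rule can be checked with a few lines of arithmetic. The sketch below is only an illustration: it uses the pooled cells of Table III (x at most 1, then x = 2 through 7, then x of 8 or more), takes each cell probability so that 105 times the probability equals the listed expected value, and compares the statistic against 12.59, the 95th percentile of the chi-square distribution with 6 degrees of freedom taken from standard tables (that critical value is not stated in the text and is supplied here as an assumption of the example).

```python
# Pooled cells from Table III: x <= 1, x = 2..7, x >= 8.
observed = [8, 10, 25, 19, 13, 11, 12, 7]
probabilities = [
    0.0136 + 0.0583,           # x = 0 and x = 1 pooled
    0.1254, 0.1798, 0.1933,    # x = 2, 3, 4
    0.1662, 0.1191, 0.0732,    # x = 5, 6, 7
    0.0393 + 0.0188 + 0.0130,  # x = 8, 9, and above pooled
]
n_speakers = 105


def chi_square(observed_counts, expected_counts):
    """Sum over cells of (O_i - e_i)^2 / e_i."""
    return sum((o - e) ** 2 / e
               for o, e in zip(observed_counts, expected_counts))


expected = [n_speakers * p for p in probabilities]
statistic = chi_square(observed, expected)

# One degree of freedom is lost for the total count and one for the
# mean estimated from the data.
degrees_of_freedom = len(observed) - 1 - 1

# Assumed value from standard chi-square tables (95th percentile, 6 d.f.).
CRITICAL_95_DF6 = 12.59

print(f"chi-square = {statistic:.2f} with {degrees_of_freedom} d.f.")
print("Poisson hypothesis retained:", statistic < CRITICAL_95_DF6)
```

The same function applied to the pooled cells of Table I should reproduce the value 9.92 quoted for the nearest-neighbor rule.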
The sample variance of $x$ was 4.18, which is close to the theoretical value of 4.33. From Tables I and III it can be found that this combined decision procedure gives an increase in recognition rate of 1.6 percent for 105 speakers.

To test the hypothesis that the recognition rate using the combined decision procedure is significantly greater than the recognition rate obtained by using the nearest-neighbor rule alone, a sign test [18] was performed. For $n = 14$, $p = 14$, and function $f_1$, it was observed that for 59 out of 105 speakers the combined decision procedure gave a higher recognition rate than the nearest-neighbor method. The two decision procedures gave identical recognition rates for 26 speakers. The nearest-neighbor decision rule gave a better recognition rate than the combined decision procedure for 20 speakers. According to standard statistical procedure, half of the 26 speakers giving identical results were added to each group for the sign test. Thus, the combined decision procedure was assumed to give a better recognition rate for 72 speakers and the nearest-neighbor decision rule was assumed to give better results for 33 speakers. The 5 percent significance level critical value for the one-tailed test is found to be 43 from standard mathematical tables. Since 33 < 43, we reject at the 5 percent level of significance the null hypothesis that the recognition rate using the combined decision procedure is equal to the recognition rate obtained by using the nearest-neighbor decision rule. Thus, the alternate hypothesis that the recognition rate using the combined decision procedure is greater than the recognition rate obtained by using the nearest-neighbor rule is accepted here. The above conclusion, coupled with the fact that the combined decision procedure adds very little computational overhead, warrants its use in place of the nearest-neighbor decision rule in the isolated word-recognition system described here.
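The sign test above can be reproduced from the exact binomial distribution. The following sketch is illustrative only: the function names are assumptions, and the critical value is computed here as the largest count c for which the one-tailed probability P(X <= c) under a fair-coin null hypothesis does not exceed 5 percent, which is one common reading of how such tables are built.

```python
from math import comb


def binomial_cdf_half(c, n):
    """P(X <= c) for X ~ Binomial(n, 1/2), computed exactly."""
    return sum(comb(n, k) for k in range(c + 1)) / 2 ** n


def sign_test_critical_value(n, alpha=0.05):
    """Largest c with P(X <= c) <= alpha under the null hypothesis."""
    c = -1
    while binomial_cdf_half(c + 1, n) <= alpha:
        c += 1
    return c


# Counts reported in the text; ties are split evenly between the groups.
combined_better, nn_better, ties = 59, 20, 26
better = combined_better + ties // 2   # 72 speakers
worse = nn_better + ties // 2          # 33 speakers
n = better + worse                     # 105 speakers

critical = sign_test_critical_value(n)
p_value = binomial_cdf_half(worse, n)

print("critical value:", critical)
print("reject null hypothesis:", worse < critical)
print(f"one-tailed p-value: {p_value:.2e}")
```

Under these assumptions the computed critical value agrees with the tabulated 43, and the observed count of 33 falls well inside the rejection region.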
The combined decision rule was varied slightly to see its effect on the rate of recognition for the first 25 speakers. The combined decision rule was tested for a decision boundary of 1.5 percent instead of the 2 percent as used above. The corresponding increases in recognition rates for 14 linear predictor coefficients and values of $n$ equal to 10, 12, and 14 were 2.1 percent, 1.8 percent, and 1.6 percent, respectively. For the decision boundary using a value of 2.5 percent, the corresponding increases in recognition rate were 2.5 percent, 1.8 percent, and 1.4 percent, respectively. Similar results were obtained using different numbers of linear predictor coefficients. Clearly, there is no appreciable change in the recognition rate for small changes in the decision rule.

Fig. 3. Plots using Itakura's distance measure. (Legend: $f_1$ and nearest-neighbor algorithm; $f_1$ and K-nearest-neighbor algorithm; $f_2$ and nearest-neighbor algorithm; $f_2$ and K-nearest-neighbor algorithm.)

Figs. 1 and 3 compare the recognition rate using the distance measure proposed here with the one proposed by Itakura. For Itakura's method, the parameters $c(m; k)$ and $b(m; k)$ for the $m$th segment of the $k$th reference pattern are calculated and stored on the disk. In the recognition process the distance between the $n$th segment of the input and the $m$th segment of a reference pattern is calculated as

$$d(n, m; k) = c(m; k) + \log \left[ \frac{b(m; k)\, r(n)}{\hat{a}(n)\, r(n)} \right].$$

The notation used above is that of Itakura [3]. From the graph it is clear that the recognition rate for Itakura's distance measure also increases with an increase in the number of reference patterns for each word in the vocabulary. The recognition rate reaches a saturation level after about 10 reference patterns for each word in the vocabulary. From Fig. 3 it is clear that the highest recognition rate achieved for Itakura's method is approximately 88.4 percent. Fig. 3 was drawn using 14 linear predictor coefficients.
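The combined decision procedure referred to throughout this section, including its adjustable decision boundary (2 percent as first described, 1.5 and 2.5 percent in the variations above), can be sketched roughly as follows. This is an interpretation of the prose description rather than a transcription of the Fig. 2 flow chart: the function name, the input format of (distance, label) pairs, and the use of the absolute value of the nearest distance (the distances may be negative logarithms) are assumptions made for the example.

```python
from collections import Counter


def combined_decision(neighbors, k=7, boundary=0.02):
    """Choose between the NN and K-NN labels.

    neighbors: iterable of (total_distance, label) pairs, one per
               reference utterance (hypothetical input format).
    k:         number of neighbors used in the vote (7 in the text).
    boundary:  relative decision boundary (0.02 for 2 percent).
    """
    ranked = sorted(neighbors, key=lambda pair: pair[0])
    nn_distance, nn_label = ranked[0]

    # Vote among the K nearest samples.
    counts = Counter(label for _, label in ranked[:k]).most_common()
    k1 = counts[0][1]                              # largest class
    k2 = counts[1][1] if len(counts) > 1 else 0    # second largest class
    delta = k1 - k2

    # In case of a tie the K-NN rule falls back to the NN rule.
    knn_label = counts[0][0] if delta > 0 else nn_label

    # The combined procedure matters only when the two rules disagree.
    if knn_label == nn_label:
        return nn_label

    # A clear majority: trust the vote.
    if delta > 2:
        return knn_label

    # delta is 1 or 2: compare the nearest neighbor with the nearest
    # neighbor carrying a different label.
    other_distance = next(d for d, label in ranked if label != nn_label)
    if other_distance - nn_distance > boundary * abs(nn_distance):
        return nn_label    # nearest pattern is a distinctly better match
    return knn_label       # several patterns match closely: take the vote


clear_vote = [(1.00, "was"), (1.10, "what"), (1.11, "what"), (1.12, "what"),
              (1.13, "what"), (1.14, "what"), (1.15, "want")]
wide_gap = [(1.0, "was"), (1.5, "what"), (1.6, "what"), (1.7, "what"),
            (1.8, "what"), (1.9, "was"), (2.0, "was")]
narrow_gap = [(1.0, "was"), (1.01, "what"), (1.6, "what"), (1.7, "what"),
              (1.8, "what"), (1.9, "was"), (2.0, "was")]

print(combined_decision(clear_vote))   # prints: what
print(combined_decision(wide_gap))     # prints: was
print(combined_decision(narrow_gap))   # prints: what
```

Passing `boundary=0.015` or `boundary=0.025` corresponds to the two variations of the decision boundary discussed above.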
Itakura’s method also shows that approximately 10 to 14 referencepatterns for each word in the vocabulary are necessary for achieving good speaker-independent recognition results. The combineddecisionprocedureworks well for Itakura’s method also. For example, for p = 14, n = 10, the combined decision procedure gives an increase of 1.6 percent in recognition rate as compared t o the nearest-neighbor decisionrule. EXPERIMENTAL PROCEDURE AND RESULTS Thissectiondescribestheentiresetupandthe significant results obtained in this project. The samples were taken on a PDP-15 computer interfacedto an EAI-680 analog computer. Fig. 4 showsthe generalsampling setup, whichincludesa cardioidmicrophone.The sampleswere taken in thecomputingroomwiththeinherent highnoiselevel. Thenoise 32 zyxwvutsrqponml zy z zyxwvutsrqponml zyxwvutsr IEEE TRANSACTIONS O N ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, VOL. ASSP-26, NO. 1, FEBRUARY 1978 zyxwvutsrqponm 0 10 KHZ CLOCK NEXT UNKNO 5 KHZ CUTOFF SET-UP FOR TAKING SPEECH SAMPLE WINDOWS FOR N SAMPLE SPEAKERS W=50? I SAMPLE ' i STORE r AND LFk R E J E CUTP P E R H A L F LABELLED SAMPLES J YES] ON DISK FLOWCHART TO CALCULATE AUTOCORRELATION (rk 1 AND LINEAR PREDICTOR (LPC)COEFFICIENTS. Fig. 4. Procedure for generating reference parameters. FLOW-CHART FOR RECOGNITION PROCEDURE Fig. 5. Flow chart for recognition procedure. level was measured to be 71 dB-SPL. The sampling terminates The number of linear predictor coefficients p isvaried to when 2000 consecutive samples stay below a prespecified level.Falsetriggeringisavoided by rejecting any sample findits ideal value.Fig. 6 shows the recognition rate using utterance having less than 0.1-second duration. The samples three different values for p . This figure is drawn with the help were verified by playing them back through an amplifier con- of 40 words spoken by 25 speakers. 
It is evident that p = 14 recognition programis nected to a digital-to-analog converter to make sure that the gives the best results. Theentire compiled using an H-level Fortran beginning and theend of theword were not missed. The writteninFortranand Fig. 6 sampling frequency is 10 kHz. The samples are stored on a compiler with a second level ofoptimization.From it is clear that for n = 14, p = 14 and p = 10 give the same magtape, and later classifiedusing an IBM 370/165.The vocabulary consists of a standard40-word reading test vo- rateof recognition. Thus, distance measure (3) in conjunction with function fi , ' p = 10 and FZ = 14 together with the cabulary (see Appendix) [ 191, spoken by 105different combined decision procedureappear to be good choices to speakers. The speakers wereall male, with a widerangeof regional accents. The number of samples for the same utter- be used inthe recognition system. This proceduretakes ance vary from1500 to 8000 depending on the speaker. A about 7.5 seconds to recognize eachutterance. The recogtotal of 4200 isolated utterances were used in the recognition nitionrate using the nearest-neighbor rule in this casewas scheme. The first 1000 samples were used to plot the various 88.6 percent. The increase in recognition rate using the graphs. All the 4200 samples wereused for statistical justi- combined decision procedure was1.7 percent. Hence, the total recognition rate was 90.3 percent. When the number fication of the results obtained with the smaller sample set. Fig. 5 shows a flow chart for the recognition process. It was of speakers were increased to 105, the recognition rateobtained was 89.2 percent for p = 14, n = 14, andfunction foundexperimentally that differencing oncein thetime domain gave the best results. Each utterance isdivided into f i . The recognition timeinthis casewas 9.8 seconds for 50 windows,whichare 50 percent overlapped. 
(Note that eachutterance.Thefunction fi togetherwith p = 10 and this implies windows of varying lengthfromutterance to n = 14 takes 3.4 seconds to recognize each utterance. Funcutterance.) AHanningwindowis applied to each of these tion f2 is about twice as fast as fl but gives approximately windows, and then autocorrelation coefficients are calculated 5 percentlowerrateofrecognition. Table IV showsmajor for each window. The linear predictor coefficients are calcu- misclassifications using 14 linear predictor coefficients, n = 14, lated for each window. The linear predictor coefficients are function f l , together with the combined decision procedure. foundby solving the Yule-Walker equations (l), using an The samples of all the105 speakers are used in compiling this table. Themajority of the misclassifications are due to iterative method [ 11. due to The autocorrelationcoefficientsof theunknown sample the samevowel sound. Somemisclassificationsare and the linear predictor coefficients of the known sample are similarities in thebeginning or ending of the utterance. used to calculate the total distance D. As is evident from the CONCLUSION flow chart, a sequential decision procedure is used to reduce computationtime.After calculating distances forthe first Anew distance measure forspeaker-independent recog4 windows, one half of the reference samplesare rejected. nitionof isolated words isgiven here. It isclear fromthe These are the reference samples which give higher distances results that the recognition rate increases with p (the number for the first 4 windows. A similar decision is taken after the of predictor coefficients) and with n (the number of reference 8th, 16th, and 32nd windows. This reduces the computation samples per utterance).The recognition rate reaches the to about one fifth while it has practically no effect on the saturation level after a certain number of reference samples recognition rate. per utterance. 
A value of p between 10 and 14 is shown to be optimal. A sequential decision procedure is used to reduce computation time. For a 40-word vocabulary, the recognition rate achieved was 89.2 percent with computation time 9.8 seconds per word. This recognition rate has been shown to be statistically significant. It has been shown that the number of words misrecognized for each speaker follows a Poisson distribution. The mean obtained for this distribution is 4.33. This implies that 95 percent of the time the recognition rate for a new speaker will lie between 78.8 percent and 99.6 percent. Also, it has been shown that the 95 percent confidence interval for the average recognition rate of 89.2 percent is between 88.1 and 90.15 percent.

GUPTA et al.: SPEAKER-INDEPENDENT SPEECH-RECOGNITION SYSTEM

TABLE IV. MAJOR MISCLASSIFICATIONS USING FUNCTION f1, LPC = 14, AND n = 14 (columns: Input Utterance, Number of Errors, Recognized As)

Fig. 6. Comparison of recognition rate using different number of linear predictor coefficients (horizontal axis: number of reference samples per utterance, 1 to 15).

APPENDIX

The 40-word reading test vocabulary used is as follows: here, run, play, is, down, help, ball, mother, see, look, come, can, little, one, up, baby, make, want, three, jump, with, friends, came, horse, ride, under, was, what, bump, live, very, puppy, dark, first, wish, basket, food, road, hill, along.
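As an illustrative check (not part of the original Fortran program), the sequential decision procedure and the Poisson-based intervals quoted in the Conclusion can be sketched in Python. The sketch rests on several assumptions: a reference set of 40 x 14 = 560 samples (n = 14 per word); per-window distances supplied as precomputed input, because distance measure (3) is not reproduced here; and the quoted 95 percent limits being derived from a two-standard-deviation normal approximation to the Poisson distribution, which is inferred from the figures and not stated in the text. The function names are hypothetical.

```python
import math

N_WINDOWS = 50                  # windows per utterance (from the text)
CHECKPOINTS = (4, 8, 16, 32)    # halve the reference set after these windows


def sequential_prune(dist, checkpoints=CHECKPOINTS):
    """Return (best reference index, number of window distances evaluated).

    dist[r][w] is a hypothetical precomputed distance between the unknown
    utterance and reference r at window w; it stands in for measure (3).
    """
    alive = list(range(len(dist)))
    totals = [0.0] * len(dist)
    evaluated = 0
    for w in range(len(dist[0])):
        for r in alive:
            totals[r] += dist[r][w]
            evaluated += 1
        if (w + 1) in checkpoints and len(alive) > 1:
            # Keep the half with the smaller accumulated distance.
            alive.sort(key=lambda r: totals[r])
            alive = alive[: max(1, len(alive) // 2)]
    best = min(alive, key=lambda r: totals[r])
    return best, evaluated


def two_sigma_band(mean_errors, vocab_size=40, n_speakers=1):
    """Recognition-rate band in percent, assuming mean +/- 2 standard deviations.

    For a Poisson count the variance equals the mean; averaging over
    n_speakers divides the variance by n_speakers.
    """
    half_width = 2.0 * math.sqrt(mean_errors / n_speakers)
    low = 100.0 * (1.0 - (mean_errors + half_width) / vocab_size)
    high = 100.0 * (1.0 - (mean_errors - half_width) / vocab_size)
    return low, high


if __name__ == "__main__":
    n_refs = 40 * 14  # assumed size of the reference set
    # Synthetic distances: reference r is uniformly worse as r grows.
    dist = [[1.0 + r] * N_WINDOWS for r in range(n_refs)]
    best, evaluated = sequential_prune(dist)
    print(best, evaluated, evaluated / (n_refs * N_WINDOWS))
    print(two_sigma_band(4.33))                  # band for a new speaker
    print(two_sigma_band(4.33, n_speakers=105))  # band for the 105-speaker mean
```

Under these assumptions the pruning evaluates 6230 of the 28 000 window distances, about 22 percent, in line with the "about one fifth" quoted in the text, and the two-standard-deviation band gives 78.8 and 99.6 percent for a single speaker. The band for the 105-speaker average comes out near 88.2 and 90.2 percent, close to but not identical with the published 88.1 and 90.15 percent, so the authors' exact method may differ.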
ACKNOWLEDGMENT

The authors would like to thank the reviewers for their constructive comments.

REFERENCES

[1] B. S. Atal and S. L. Hanauer, “Speech analysis and synthesis by linear prediction of the speech wave,” J. Acoust. Soc. Amer., vol. 50, no. 2, pp. 637-655, 1971.
[2] B. Gold and C. M. Rader, Digital Processing of Signals. New York: McGraw-Hill, 1969.
[3] F. Itakura, “Minimum prediction residual principle applied to speech recognition,” IEEE Trans. Acoust., Speech, Signal Processing, vol. ASSP-23, pp. 67-72, Feb. 1975.
[4] F. Itakura and S. Saito, “Analysis synthesis telephony based on the maximum likelihood method,” in Rep. 6th Int. Cong. Acoust., Y. Kohasi, Ed., pp. C17-C20, paper C-5-5, Aug. 1968.
[5] -, “A statistical method for estimation of speech spectral density and formant frequencies,” Electron. Commun. (Japan), vol. 53-A, no. 1, pp. 36-43, 1970.
[6] -, “On the optimum quantization of feature parameters in the Parcor speech synthesizer,” in IEEE Conf. Rec., 1972.
[7] M. J. Knudsen, “Real-time linear predictive coding of speech on the SPS-41 triple-microprocessor machine,” IEEE Trans. Acoust., Speech, Signal Processing, vol. ASSP-23, pp. 140-145, Feb. 1975.
[8] J. Makhoul, “Spectral analysis of speech by linear prediction,” IEEE Trans. Audio Electroacoust., vol. AU-21, pp. 140-148, June 1973.
[9] -, “Linear prediction: A tutorial review,” Proc. of the IEEE, vol. 63, pp. 561-580, Apr. 1975.
[10] J. D. Markel, “Digital inverse filtering-A new tool for formant trajectory estimation,” IEEE Trans. Audio Electroacoust., vol. AU-20, pp. 129-137, June 1972.
[11] R. W. Schafer and L. R. Rabiner, “Digital representations of speech signals,” Proc. of the IEEE, vol. 63, pp. 662-677, Apr. 1975.
[12] G. E. Box and G. M. Jenkins, Time Series Analysis: Forecasting and Control. San Francisco, CA: Holden-Day, 1970.
[13] R. O. Duda and P. E. Hart, Pattern Classification and Scene Analysis. New York: Wiley, 1973.
[14] G. M. White and P. J. Fong, “K-nearest-neighbor decision rule performance in a speech recognition system,” IEEE Trans. Systems, Man, Cybernet., vol. SMC-5, p. 389, May 1975.
[15] E. Parzen, “Some recent advances in time series analysis,” in Statistical Models and Turbulence. New York: Springer-Verlag, 1972.
[16] B. W. Lindgren, Statistical Theory. New York: MacMillan Company, 1960.
[17] S. S. Wilks, Mathematical Statistics. New York: Wiley, 1962.
[18] J. D. Gibbons, Nonparametric Statistical Inference. New York: McGraw-Hill, 1970.
[19] R. L. Slosson, Slosson Oral Reading Test. New York: Slosson Educational Publication, 1963.
