Signal Analysis Research (SAR) Group - RNet - Ryerson University

Signal Analysis Research (SAR) Group

Refereed Conference Papers

May 1996 - December 2007


Table of Contents

2007.12  Construction of Discriminative Positive Time-Frequency Distributions (pp. 1-5)
         K. Umapathy and S. Krishnan

2007.10  Combining Vocal Source and MFCC Features for Enhanced Speaker Recognition Performance Using GMMs (pp. 6-9)
         D. Hosseinzadeh and S. Krishnan

2007.08  Multiresolution Analysis and Classification of Small Bowel Medical Images (pp. 10-13)
         A. Khademi and S. Krishnan

2007.06  Interference Detection in Spread Spectrum Communication Using Polynomial Phase Transform (pp. 14-19)
         R. Zarifeh, S. Krishnan, A. Anpalagan and N. Alinier

2007.05  Emotion Recognition Using Novel Speech Signal Features (pp. 20-23)
         T.S. Tabatabaei, S. Krishnan and A. Guergachi

2007.04  A Watermarking Method for Speech Signals Based on the Time-Warping Signal Processing Concept (pp. 24-27)
         C. Ioana, A. Jarrot, A. Quinquis and S. Krishnan

2006.12  Chirp-Based Image Watermarking as Error-Control Coding (pp. 28-31)
         B. Ghoraani and S. Krishnan

2006.07  Automatic Content-Based Image Retrieval Using Hierarchical Clustering Algorithms (pp. 32-37)
         K. Jarrah, S. Krishnan and L. Guan

2006.07  Computational Intelligence Techniques and their Applications in Content-Based Image Retrieval (pp. 38-41)
         K. Jarrah, M. Kyan, S. Krishnan and L. Guan

2006.07  Discrete Polynomial Transform for Digital Image Watermarking Application (pp. 42-45)
         L. Le, S. Krishnan and B. Ghoraani

2006.05  Improving Position Estimates From a Stationary GNSS Receiver Using Wavelets and Clustering (pp. 46-50)
         M. Aram, B. Li, S. Krishnan and A. Anpalagan

2006.05  Keystroke Identification Based on Gaussian Mixture Models (pp. 51-54)
         D. Hosseinzadeh, S. Krishnan and A. Khademi

2006.05  Soccer Video Retrieval Using Adaptive Time-Frequency Methods (pp. 55-58)
         J. Marchal, C. Ioana, E. Radoi, A. Quinquis and S. Krishnan


2006.05  Support Vector Machines Based Approach for Chemical Phosphorus Removal Process in Wastewater Treatment Plant (pp. 59-63)
         T.S. Tabatabaei, T. Farooq, A. Guergachi and S. Krishnan

2005.11  Data Embedding in Miu-Law Speech with Spread Spectrum Techniques (pp. 64-67)
         L. Zhang, H. Ding and S. Krishnan

2005.09  Comparison of JPEG 2000 and Other Lossless Compression Schemes for Digital Mammograms (pp. 68-71)
         A. Khademi and S. Krishnan

2005.07  Gaussian Mixture Modeling Using Short Time Fourier Transform Features for Audio Fingerprinting (pp. 72-75)
         A. Ramalingam and S. Krishnan

2005.05  Multipath Mitigation of GNSS Carrier Phase Signals for an On-Board Unit for Mobility Pricing (pp. 76-79)
         R. Puri, A. El Kaffas, A. Anpalagan, S. Krishnan and B. Grush

2005.03  A Signal Classification Approach Using Time-Width VS Frequency Band Sub-Energy Distributions (pp. 80-83)
         K. Umapathy and S. Krishnan

2005.03  Indexing of NFL Video Using MPEG-7 Descriptors and MFCC Features (pp. 84-87)
         S.G. Quadri, S. Krishnan and L. Guan

2004.12  Audio Signal Feature Extraction and Classification Using Local Discriminant Bases (pp. 88-92)
         K. Umapathy, S. Krishnan and R.K. Rao

2004.05  A Novel Robust Image Watermarking Using a Chirp Based Technique (pp. 93-96)
         A. Ramalingam and S. Krishnan

2004.05  A Novel Way of Lossless Compression of Digital Mammograms Using Grammar Codes (pp. 97-100)
         X. Li, S. Krishnan and N.-W. Ma

2004.05  Content Based Audio Classification and Retrieval Using Joint Time-Frequency Analysis (pp. 101-104)
         S. Esmaili, S. Krishnan and K. Raahemifar

2004.05  Modified Local Discriminant Bases and its Applications in Signal Classification (pp. 105-108)
         K. Umapathy and S. Krishnan

2004.05  Radio Over Multimode Fiber for Wireless Access (pp. 109-112)
         R. Yuen, X.N. Fernando and S. Krishnan


2004.05  Sub-Dictionary Selection Using Local Discriminant Bases Algorithm for Signal Classification (pp. 113-116)
         K. Umapathy, S. Krishnan and A. Das

2003.09  Ultrasound Backscatter Signal Characterization and Classification Using Autoregressive Modeling and Machine Learning Algorithms (pp. 117-120)
         N.R. Farnoud, M. Kolios and S. Krishnan

2003.07  Robust Audio Watermarking Using a Chirp Based Technique (pp. 121-124)
         S. Erkucuk, S. Krishnan and M. Zeytinoglu

2003.07  Time-Frequency Filtering of Interferences in Spread Spectrum Communications (pp. 125-128)
         S. Erkucuk and S. Krishnan

2003.05  A General Perceptual Tool for Evaluation of Audio Codecs (pp. 129-132)
         K. Umapathy, S. Krishnan and G. Sinanian

2003.05  Non-Stationary Noise Cancellation in Infrared Wireless Receivers (pp. 133-137)
         S. Krishnan, X. Fernando and H. Sun

2003.04  Adaptive Denoising at Infrared Wireless Receivers (pp. 138-146)
         X.N. Fernando, S. Krishnan, H. Sun and K. Kazemi-Moud

2003.04  Audio Watermarking Using Time-Frequency Characteristics (pp. 147-151)
         S. Esmaili, S. Krishnan and K. Raahemifar

2002.10  Time-Frequency Modeling and Classification of Pathological Voices (pp. 152-153)
         K. Umapathy, S. Krishnan, V. Parsa and D. Jamieson

2002.08  Audio Signal Classification Using Time-Frequency Parameters (pp. 154-157)
         K. Umapathy, S. Krishnan and S. Jimaa

2002.05  Detection of Linear Chirp and Non-Linear Chirp Interferences in a Spread Spectrum Signal by Using Hough-Radon Transform (p. 158)
         S. Thayilchira and S. Krishnan

2002.05  Discrimination of Pathological Voices Using an Adaptive Time-Frequency Approach (pp. 159-162)
         K. Umapathy, S. Krishnan, V. Parsa and D. Jamieson

2002.05  Interference Excision in Spread Spectrum Communications Using Adaptive Positive Time-Frequency Distributions (p. 163)
         S. Erkucuk and S. Krishnan

2001.05  Fixed Block-Based Lossless Compression of Digital Mammograms (pp. 164-169)
         M.Y. Al-Saiegh and S. Krishnan


2001.05  Instantaneous Mean Frequency Estimation Using Adaptive Time-Frequency Distributions (pp. 170-175)
         S. Krishnan

2000.07  Sonification of Knee-joint Vibration Signals (pp. 176-179)
         S. Krishnan, R.M. Rangayyan, G.D. Bell and C.B. Frank

1999.05  Denoising Knee Joint Vibration Signals Using Adaptive Time-Frequency Representations (pp. 180-185)
         S. Krishnan and R.M. Rangayyan

1998.10  Comparative Analysis of the Performance of the Time-Frequency Distributions with Knee Joint Vibroarthrographic Signals (pp. 186-189)
         R.M. Rangayyan and S. Krishnan

1998.10  Detection of Nonlinear Frequency-Modulated Components in the Time-Frequency Plane Using an Array of Accumulators (pp. 190-193)
         S. Krishnan and R.M. Rangayyan

1997.10  Time-Frequency Signal Feature Extraction and Screening of Knee Joint Vibroarthrographic Signals Using the Matching Pursuit Method (pp. 194-197)
         S. Krishnan, R.M. Rangayyan, G.D. Bell and C.B. Frank

1997.08  Detection of Chirp and Other Components in the Time-Frequency Plane Using the Hough and Radon Transform (pp. 198-201)
         S. Krishnan and R.M. Rangayyan

1997.08  Spatial-Temporal Decorrelating Decision-Feedback Multiuser Detector for Synchronous Code-Division Multiple-Access Channels (pp. 202-207)
         S. Krishnan and B.R. Petersen

1996.10  Screening of Knee Joint Vibroarthrographic Signals by Statistical Pattern Analysis of Dominant Poles (pp. 208-209)
         S. Krishnan, R.M. Rangayyan, G.D. Bell, C.B. Frank and K.O. Ladly

1996.05  Recursive Least-Squares Lattice-Based Adaptive Segmentation and Autoregressive Modeling of Knee Joint Vibroarthrographic Signals (pp. 210-213)
         S. Krishnan, R.M. Rangayyan, G.D. Bell, C.B. Frank and K.O. Ladly

Other Refereed Conference Papers (pp. 214-215)


Construction of Discriminative Positive Time-frequency Distributions

Karthikeyan Umapathy and Sridhar Krishnan
Dept. of Electrical and Computer Engineering, Ryerson University, Toronto, Canada
Email: (karthi)(krishnan)@ee.ryerson.ca

Abstract— Positive time-frequency energy distributions (PTFD) are suitable for studying the non-stationary dynamics of a signal. Instantaneous features extracted from the PTFD are often used in classification applications where the discriminatory clue lies in the non-stationary behavior of the signal. From a classification point of view, it would be desirable to identify and extract instantaneous features that correspond to only the discriminative portion of the signal. By doing so we get an added advantage of eliminating the overlap from the non-discriminatory portion of the signal in the instantaneous feature extraction process. In this paper, we propose a front-end processing using a novel time-width versus frequency band mapping that facilitates the construction of PTFDs corresponding to only the discriminatory portion of the signal. Instantaneous features extracted from these PTFDs are readily discriminative and attractive for classification and characterization applications. The proposed method is demonstrated with a speech classification example.

Keywords: Time-frequency, Positive time-frequency distributions, Instantaneous features, Matching pursuits, Time-width versus frequency band mapping.

I. INTRODUCTION

Joint time-frequency (TF) analysis has been widely employed for extracting TF features from non-stationary signals. While parametric TF decomposition approaches are highly suitable for objective feature extraction, the non-parametric Cohen's class TF energy distributions (TFDs) are usually used to extract instantaneous TF features. In order to extract meaningful instantaneous features from a TFD, the TFD has to satisfy the following properties: (i) positivity and (ii) the time and frequency marginals [1]. Positivity means that the energy values of the TFD are >= 0. The marginal property states that integrating a TFD over the frequency and time directions should yield, respectively, the instantaneous energy and the energy spectral density of a signal.

In reality, the presence of cross-terms with multicomponent signals affects the positivity of a TFD. Although in some cases positive TFDs can be achieved by compromising TF resolution, these do not satisfy the marginals. Various methods and conditions have been proposed over the years to construct PTFDs that satisfy both positivity and the marginals [2], [3]. One known way of constructing PTFDs with high TF resolution and free from cross-terms is the adaptive TF transformation (ATFT, i.e., Matching Pursuits with TF dictionaries) approach [4]. While this approach produces cross-term free, high resolution PTFDs, it does not satisfy the marginal properties. A correction to the marginals using minimum cross entropy (MCE) optimization has been proposed in [5], which makes the ATFT based PTFDs suitable for instantaneous feature extraction. An added advantage of the ATFT based PTFD approach is that the building blocks of the PTFD are TF functions (represented by a set of decomposition parameters) whose parameters can be cleverly manipulated to achieve desirable effects in the PTFDs.

In the authors' previous works [6], [7], [8], a novel time-width versus frequency band (TWFB) mapping based on ATFT was used to identify the discriminative decomposition subspaces between classes of signals. Objective TF features were extracted from these subspaces and successfully applied in real world applications to achieve high classification accuracies. The TWFB mapping utilizes the parametric benefits of ATFT in identifying the discriminative decomposition subspaces that are suitable for objective feature extraction. To extend this approach to extracting instantaneous TF features from the discriminatory portion of the signal, the information provided by the TWFB mapping has to be combined with the ability of ATFT to construct PTFDs. This possibility is explored in this paper, and a methodology to construct discriminative PTFDs using TWFB mappings is presented. These discriminative PTFDs are constructed using only the discriminative portion of a signal, which ensures that the instantaneous features extracted from them truly reflect the discriminating dynamics between different classes of signals.

The block diagram of the proposed methodology is shown in Fig. 1; the solid lines show the main components of the proposed work. The paper is organized as follows: Section 2 covers the methodology, with subsections on TWFB mappings, discriminative subspace selection, and the construction of discriminative PTFDs; Section 3 presents the results and discussion; and conclusions are given in Section 4.

II. METHODOLOGY

A. Time-width versus Frequency Band Mappings

TWFB mappings are constructed using the decomposition parameters of an ATFT [6], [7], [8]. In ATFT, any given real signal is modeled as a sum of L TF functions selected from a redundant dictionary of TF functions. The TF functions used to model a real signal can be represented by five model (decomposition) parameters (a_i, s_i, p_i, f_i, and phi_i): a_i is the expansion coefficient of the TF function, s_i its time-width or scale, p_i its time position, f_i its center frequency, and phi_i its phase. The index i is the iteration number. In our study, Gaussian TF functions were used due to their excellent TF resolution properties [1].

1-4244-0983-7/07/$25.00 (c) 2007 IEEE. ICICS 2007
Authorized licensed use limited to: Ryerson University Library. Downloaded on July 7, 2009 at 11:34 from IEEE Xplore. Restrictions apply.

Fig. 1. Block diagram of the proposed methodology. (Class A and Class B training signals feed the TWFB Mapping, which selects the discriminative TF functions; the Wigner-Ville distributions of these functions, corrected against the signal marginals through MCE optimization, yield the discriminative PTFD. TWFB: Time-width vs Frequency Band Mapping; MCE: Minimum Cross Entropy; PTFD: Positive Time-frequency Distribution.)

In order to effectively utilize the ATFT decompositions for discriminant subspace selection, the decomposition parameters need to be rearranged in a pseudo-dictionary format. Of the five decomposition parameters above, only the energy parameter a_i, the time-width s_i, and the frequency f_i are relevant from a TF decomposition subspace point of view, because the main features of a TF function, and thereby of the decomposition itself, lie in these three parameters. The phase parameter phi_i describes how the TF functions combine to form the signal, which is of more relevance in a signal reconstruction scenario. The information provided by the time parameter p_i is not important for identifying global signal patterns, since most pattern recognition applications look only for global patterns irrespective of their occurrence in time (translation invariance). This time (p_i) independence is also the key to bringing generality and organization to the TWFB mapping. Hence only a_i, s_i, and f_i are used in the construction of the TWFB mapping. However, without the time and time-varying information, neither instantaneous features nor features related to time-triggered events can be extracted. So, after locating the discriminant subspaces using the TWFB mapping, all five ATFT decomposition parameters (including p_i and phi_i) of the TF functions that correspond to the discriminatory subspaces will be used to construct the PTFD.

Let us redefine the subscript of s_i to s_w, f_i to f_b, and the energy parameter a_i^2 to a^2_{(j, s_w, f_b)}. The index w in s_w represents the possible time-width values of the TF function; s_w then represents all the TF functions with a particular time-width w. Similarly, the index b in f_b represents the possible frequency band values; f_b then represents all the TF functions that occur within a particular frequency band b. The range of values for w and b is determined by the discrete implementation of the ATFT algorithm and the choice of frequency band resolution. Depending on the application requirements, a finer-resolution TWFB can be constructed by controlling the step sizes of the time-width and frequency parameters. The subscript (j, s_w, f_b) of the energy parameter corresponds to the j-th TF function that has a time-width value of w and occurs within the b-th frequency band. For every combination of (s_w, f_b), there may be j = 1, ..., M TF functions; M varies for each combination of (s_w, f_b) and is signal dependent.

The TWFB mapping can then be defined as the cumulative energy distribution of the TF functions for all the possible combinations of the time-widths (s_w) and frequency bands (f_b) and is given by

TWFB(s_w, f_b) = \sum_{j=1}^{M(s_w, f_b)} a^2_{(j, s_w, f_b)},    (1)

In the implementation used in this study, the index w takes values from 1 to 14 (which translates into a length of 2^w time samples) and b takes values from 1 to 4 (each covering 1/4 of the normalized frequency spectrum). This gives us 56 possible combinations of s_w and f_b. Let us call each of these combinations a cell (or tile) in the TWFB mapping. Each cell corresponds to the cumulative energy of all the TF functions with a particular s_w and f_b combination. Fig. 2 shows a sample signal, its spectrogram, and its TWFB mapping. The X axis of the TWFB map corresponds to the time-width or scale parameter s_w and the Y axis corresponds to the frequency bands f_b. Here we would like to stress that the TWFB mapping is independent of the time occurrences of a signal pattern, which is a desirable property (translation invariance) for pattern recognition. The Z axis (color) indicates the magnitude of the cumulative energy of the TF functions in each cell. The next subsection explains how the TWFB mapping can be used to identify the discriminative TF decomposition subspace.
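The cell-wise energy accumulation of Eq. 1 over the 14 time-width bins and 4 frequency bands can be sketched in a few lines of Python. The function name, tuple layout, and array orientation below are our own illustrative choices; only the grid sizes and the a^2 accumulation come from the text.

```python
import numpy as np

def twfb_map(atoms, n_widths=14, n_bands=4):
    """Cumulative-energy TWFB map (Eq. 1) from ATFT atom parameters.

    atoms: list of (a, s_w, f_b) tuples, where a is the expansion
    coefficient, s_w the time-width index (1..n_widths, i.e. a width of
    2**s_w samples) and f_b the frequency-band index (1..n_bands).
    """
    twfb = np.zeros((n_bands, n_widths))  # rows: f_b, cols: s_w (as in Fig. 2)
    for a, sw, fb in atoms:
        twfb[fb - 1, sw - 1] += a ** 2    # accumulate atom energy in its cell
    return twfb

# Toy decomposition: three atoms, two of which share the same cell.
atoms = [(2.0, 5, 3), (1.0, 5, 3), (0.5, 12, 1)]
m = twfb_map(atoms)
print(m[2, 4])  # cell (s_w=5, f_b=3): 2.0**2 + 1.0**2 = 5.0
```

Because only (a, s_w, f_b) enter the map, translated copies of the same pattern produce the same cells, which is the translation-invariance property stressed above.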

B. Discriminative Subspace Selection

TWFB maps of different classes of signals can be compared to arrive at the TWFB cells (a difference mapping) that exhibit high discrimination between the signal classes [6], [7], [8]. As an example, for a 2-class problem (class A and class B, as shown in Fig. 1), we compute the average TWFB


Fig. 2. A sample TWFB mapping of a synthetic signal. From top to bottom: time domain signal, spectrogram, and TWFB map.

mapping for each class using a set of training signals. The corresponding cells of the average TWFB mappings are then compared by applying a dissimilarity measure D to obtain a difference mapping. The choice of the dissimilarity measure D can be made depending on the application; in the proposed method, the simple absolute energy difference between the cells was used. The difference mapping is then given by

TWFB_diff(s_w, f_b) = |TWFB_A(s_w, f_b) - TWFB_B(s_w, f_b)|    (2)

After arranging the cells of the difference mapping in decreasing order of their absolute difference values, the top P cells are chosen as the discriminatory cells. The restricted span of s_w and f_b, or the discriminant TF decomposition subspace, corresponding to these P cells then represents the highly discriminatory clues of a signal. Once the span of s_w and f_b is identified using the training signals, the TF functions corresponding to this span can be used to construct the discriminative PTFD.
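A minimal sketch of the difference mapping (Eq. 2) and the top-P cell selection; the function name and the toy 2x2 maps are illustrative, not the paper's data.

```python
import numpy as np

def discriminative_cells(twfb_a, twfb_b, P=3):
    """Absolute-difference mapping (Eq. 2) and selection of the top-P cells."""
    diff = np.abs(twfb_a - twfb_b)                 # dissimilarity measure D
    flat = np.argsort(diff, axis=None)[::-1][:P]   # largest differences first
    return [tuple(int(k) for k in np.unravel_index(i, diff.shape))
            for i in flat]                         # (band, width) cell indices

a = np.array([[4.0, 0.1], [0.2, 0.3]])  # average TWFB map, class A
b = np.array([[1.0, 0.1], [0.2, 2.3]])  # average TWFB map, class B
print(discriminative_cells(a, b, P=2))  # -> [(0, 0), (1, 1)]
```

The returned cells define the restricted (s_w, f_b) span; all TF functions whose parameters fall in that span are then passed on to the PTFD construction step.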

C. Construction of Discriminative PTFD

The ATFT based PTFD is constructed by adding the Wigner-Ville distributions (WVDs) of the individual TF functions [4]. The WVD is a quadratic classical TFD and is known to have the best TF resolution [1]; however, it suffers from cross-term artifacts when dealing with multicomponent signals due to its bilinear nature. For notational convenience, from this point forward let us denote the ATFT based PTFD as AP. To explain the cross-term free nature of the AP, let us express a signal x(t) in terms of TF functions g_{gamma_i} after L ATFT iterations as

x(t) = \sum_{i=0}^{L-1} \langle R^i x, g_{\gamma_i} \rangle g_{\gamma_i}(t) + R^L x(t)

[4]. The first part of the preceding equation represents the signal modeled using the L TF functions and the second part represents the residue of the signal x(t). As explained in Section II-A, each TF function g_{gamma_i} is represented by a set of decomposition parameters (a_i, s_i, p_i, f_i, and phi_i). Now, assuming that the signal x(t) is completely modeled with L TF functions (i.e., the residue is zero after L iterations), we can express the WVD of x(t) in terms of the TF functions g_{gamma_i} as

WVD_x(t, f) = \sum_{i=0}^{L-1} |\langle R^i x, g_{\gamma_i} \rangle|^2 W_{g_{\gamma_i}}(t, f) + \sum_{i=0}^{L-1} \sum_{h \neq i} \langle R^i x, g_{\gamma_i} \rangle \langle R^h x, g_{\gamma_h} \rangle^* W_{[g_{\gamma_i}, g_{\gamma_h}]}(t, f),    (3)

where W_{g_gamma}(t, f) is the WVD of the TF function. It should be noted that the TF functions g_{gamma_i} used in this work are Gaussian, hence the WVDs of the individual TF functions W_{g_gamma}(t, f) are positive [4]. The second (double-sum) term corresponds to the cross-term artifacts that occur if x(t) is a multicomponent signal. These cross terms can be easily removed by retaining only the first term of Eq. 3, which yields the AP:

AP(t, f) = \sum_{i=0}^{L-1} |\langle R^i x, g_{\gamma_i} \rangle|^2 W_{g_{\gamma_i}}(t, f)    (4)

The AP generated this way is positive, free from cross terms, and has sufficiently high TF resolution for analyzing non-stationary signals. However, the AP does not satisfy the marginal properties; this will be addressed in the later part of this section.
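Eq. 4 can be sketched numerically using the fact that the WVD of a Gaussian atom is itself a 2D Gaussian centered at the atom's time position and center frequency, so the AP is simply a sum of positive blobs. The grid sizes, normalization, and frequency-spread constant below are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def ap_distribution(atoms, n_time=256, n_freq=128):
    """ATFT-based positive TFD (Eq. 4): sum of the atoms' own WVDs.

    atoms: list of (coeff, scale, position, freq) tuples; for a Gaussian
    atom the WVD is a 2D Gaussian concentrated over `scale` samples in
    time and roughly 1/scale in normalized frequency.
    """
    t = np.arange(n_time)[None, :]
    f = np.linspace(0.0, 0.5, n_freq)[:, None]
    ap = np.zeros((n_freq, n_time))
    for c, s, p, f0 in atoms:
        sigma_f = 1.0 / (4.0 * np.pi * s)  # illustrative frequency spread
        blob = np.exp(-((t - p) ** 2) / (2 * s ** 2)
                      - ((f - f0) ** 2) / (2 * sigma_f ** 2))
        ap += (c ** 2) * blob              # |<R^i x, g>|^2 * W_g, all >= 0
    return ap

ap = ap_distribution([(1.0, 16.0, 100, 0.1), (0.5, 8.0, 180, 0.3)])
print(ap.min() >= 0.0)  # positive by construction, no cross terms
```

Dropping the double-sum term of Eq. 3 is what removes the cross terms here: each blob is non-negative, so their sum can never oscillate the way interacting components do.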

Similar to the above case of constructing the AP for the complete signal x(t), we can compute the AP for the discriminatory portion of x(t) identified using the TWFB mappings. Let us denote the discriminatory portion of the signal as x^(t), corresponding to the Q TF functions that occurred within the restricted span of s_w and f_b (as explained in Section II-B). The x^(t) can then be written as

\hat{x}(t) = \sum_{i=0}^{Q-1} \langle R^i x, g_{\gamma_i} \rangle g_{\gamma_i}(t),

where Q < L, and, following Eq. 4, its AP is

AP(t, f)_{\hat{x}} = \sum_{i=0}^{Q-1} |\langle R^i x, g_{\gamma_i} \rangle|^2 W_{g_{\gamma_i}}(t, f).    (5)


As an added advantage of this approach, the AP is automatically denoised of the non-discriminant and overlapping signal structures. Here the term "noise" means signal information that is irrelevant for a particular application.

As mentioned earlier, the AP(t, f)_x^ still needs to be corrected for its marginals. The works of [2] and [3] have demonstrated successful ways of correcting the marginals of an improper TFD using minimum cross entropy (MCE) optimization and TF copulas, respectively. Although the TF copula based approach [3] is recent and computationally attractive, we chose the MCE approach since it has already been successfully applied to correct ATFT based TFDs in [5]. Moreover, the choice between these two approaches does not affect the main focus of this paper, the construction of the discriminative PTFD. MCE is an iterative process in which the correction factor for the marginals is computed as the ratio of the true marginals extracted from the signal to the marginals of the improper PTFD. This procedure is applied alternately in the time and frequency directions until the marginals of the PTFD match those of the actual signal. Let AP(t, f)'_x^ be the corrected ATFT based TFD that satisfies the marginals, u(t) the true time marginal, which can be extracted from the time domain signal, and u'(t) the actual time marginal of AP(t, f)_x^. After simplification, the first iteration is

AP^1(t, f)'_{\hat{x}} = AP(t, f)_{\hat{x}} \, \frac{u(t)}{u'(t)}    (6)

This operation scales AP(t, f)_x^ by the time marginal ratio; after it, AP^1(t, f)'_x^ is corrected in the time marginal. Now, to satisfy the frequency marginal, the operation is repeated with AP^1(t, f)'_x^ as the prior estimate, computing the correction factor using u(f) and u'(f). This process is repeated alternately to correct the time and frequency marginals. The only difference in the above procedure compared to the previous works is that the true time marginal u(t) and frequency marginal u(f) are computed from the discriminatory portion of the signal x^(t): x^(t) is reconstructed using the discriminant TF functions, and the true marginals are computed from it before being used with MCE. The block diagram in Fig. 1 shows this operation of extracting the marginals from the discriminant TF functions. After n iterations, AP^n(t, f)'_x^ becomes the corrected discriminative AP satisfying the marginal conditions, suitable for extracting meaningful instantaneous features.
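The alternating time/frequency ratio update reads like a Sinkhorn-style rescaling; the sketch below follows that reading (function name, grid sizes, and the random demo data are ours, and the actual MCE derivation in [2], [5] is what justifies the ratio update of Eq. 6).

```python
import numpy as np

def correct_marginals(tfd, u_t, u_f, n_iter=50, eps=1e-12):
    """Alternately rescale a positive TFD (freq rows x time cols) so its
    time and frequency marginals match the true marginals u_t and u_f."""
    ap = tfd.copy()
    for _ in range(n_iter):
        ap *= (u_t / (ap.sum(axis=0) + eps))[None, :]  # fix time marginal
        ap *= (u_f / (ap.sum(axis=1) + eps))[:, None]  # fix freq marginal
    return ap

rng = np.random.default_rng(0)
ap0 = rng.random((32, 64))              # improper PTFD: 32 freqs x 64 times
u_t = rng.random(64)
u_f = rng.random(32)
u_t *= u_f.sum() / u_t.sum()            # marginals must share the same total energy
ap = correct_marginals(ap0, u_t, u_f)
print(np.allclose(ap.sum(axis=0), u_t, atol=1e-4))
```

Each pass leaves the most recently corrected marginal exact and perturbs the other slightly; for a strictly positive distribution with consistent totals the alternation converges, matching the paper's "repeated alternately ... until the marginals match" description.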

III. RESULTS AND DISCUSSION

Fig. 3. Example of constructing the discriminatory PTFD: pathological speech classification application.

To demonstrate the proposed construction of the AP, we present a pathological speech classification (characterization) application. Fig. 3 has 5 rows and 2 columns containing 10 images. The top row shows normal and pathological speech segments of length 16384 samples in the time domain. The second row shows the spectrograms of the speech segments, which give an idea of their time-frequency

energy distribution. The TWFB mappings of the normal and pathological speech segments are shown in the third row of the figure; these were constructed using 1000 TF functions each. Please note that the X axis of the TWFB mappings is in time-width, not time. It is evident from the TWFB images that there is a significant difference in the cumulative energy distribution of the corresponding cells, especially between s_w of 5 to 11 and frequency bands f_b of 3 and 4. The difference mapping was computed, and the cells that exhibited a high difference between the normal and pathological speech segments were chosen; in the third row of the figure, the chosen cells are shown bounded by a rectangular box. All the Q TF functions that fell within these boxes were used to reconstruct the discriminatory portion of the normal and pathological speech segments (x^(t)); during the reconstruction, all five ATFT decomposition parameters were used. The discriminatory portions of the reconstructed signals are shown in the fourth row of the figure. By closely comparing the first row (the original signals x(t)) and the fourth row (x^(t)), it can be observed that the discriminatory cells of the TWFB mapping did capture the signal components that differ between the two speech segments. The discriminatory AP^n(t, f)'_x^ was then computed in 5 MCE iterations using the Q TF functions and the marginals extracted from x^(t). The AP^n(t, f)'_x^ of the normal and pathological speech segments are shown in the fifth


Fig. 4. Instantaneous mean frequency (IMF) extracted from the discriminatory AP^n(t, f)'_x^'s. Top: IMF of the normal speech segment; bottom: IMF of the pathological speech segment (normalized frequency versus time instants).

row of the figure. Extracting instantaneous features from these<br />

AP_n(t, f)'_x̂'s readily demonstrates the discrimination between<br />

the normal and pathological speech segments. As an example,<br />

we extracted the instantaneous mean frequency (IMF) from the<br />

AP_n(t, f)'_x̂'s of the normal and pathological speech segments;<br />

the results are shown in Fig. 4. The difference between the<br />

IMF patterns is evident from the figure. It should be noted<br />

that we achieved the above result using only the PTFD that<br />

was constructed using the discriminatory portion of the signal.<br />
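The IMF used in Fig. 4 is the first conditional moment in frequency of a positive TFD. The following is a minimal sketch, not the authors' implementation; the function name and the representation of the TFD as a time-by-frequency array of nonnegative energies are illustrative assumptions:

```python
def instantaneous_mean_frequency(P, freqs):
    """IMF(t): energy-weighted mean frequency of each time slice of a
    positive TFD, where P[t][k] is the energy at time t and frequency freqs[k]."""
    imf = []
    for row in P:
        total = sum(row)  # total energy of this time slice
        imf.append(sum(f * p for f, p in zip(freqs, row)) / total)
    return imf
```

For a slice whose energy is concentrated at a single frequency, the IMF equals that frequency; energy split between two bins yields their energy-weighted average.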

IV. CONCLUSIONS<br />

A novel approach to construct discriminatory PTFD for<br />

instantaneous feature extraction was presented. The proposed<br />

method used the TWFB mappings to identify the discriminatory<br />

decomposition subspaces and translated them into<br />

corresponding PTFD. The instantaneous features extracted<br />

from discriminatory PTFD are expected to be free of overlaps<br />

from irrelevant signal structures and ideally suitable for<br />

classification applications. Since PTFDs offer a principled way<br />

to extract meaningful instantaneous features from a signal, the<br />

proposed approach is well suited to classification tasks that<br />

demand instantaneous features. A pathological<br />

speech classification example was used to demonstrate<br />

the proposed technique and the results were discussed.<br />

Future work involves applying the proposed method to real-world<br />

applications and comparing its performance with the conventional<br />

ways of extracting instantaneous features. A TF-copula-based<br />

approach to constructing discriminative PTFDs will also be<br />

investigated.<br />

REFERENCES<br />

[1] L. Cohen, “Time-frequency distributions - a review,” Proceedings of the IEEE, vol. 77, no. 7, pp. 941–981, 1989.<br />

[2] P. Loughlin, J. Pitton, and L. Atlas, “Construction of positive time-frequency distributions,” IEEE Trans. on Signal Processing, vol. 42, pp. 2697–2705, 1994.<br />

[3] M. Davy and A. Doucet, “Copulas: a new insight into positive time-frequency distributions,” IEEE Signal Processing Letters, vol. 10, no. 7, pp. 215–218, 2003.<br />

[4] S. G. Mallat and Z. Zhang, “Matching pursuit with time-frequency dictionaries,” IEEE Trans. Signal Processing, vol. 41, no. 12, pp. 3397–3415, 1993.<br />

[5] S. Krishnan, R. M. Rangayyan, G. D. Bell, and C. B. Frank, “Adaptive time-frequency analysis of knee joint vibroarthrographic signals for noninvasive screening of articular cartilage pathology,” IEEE Trans. on Biomedical Engineering, vol. 47, no. 6, pp. 773–783, June 2000.<br />

[6] K. Umapathy and S. Krishnan, “Time-width versus frequency band mapping of energy distributions,” IEEE Transactions on Signal Processing, vol. 55, no. 3, pp. 978–989, Mar 2007.<br />

[7] K. Umapathy, S. Krishnan, and A. Das, “Sub-dictionary selection using local discriminant bases algorithm for signal classification,” in Proceedings of the IEEE Canadian Conference on Electrical and Computer Engineering, Niagara Falls, Canada, May 2004, pp. 2001–2004.<br />

[8] K. Umapathy and S. Krishnan, “A signal classification approach using time-width vs frequency band sub-energy distributions,” in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '05), Philadelphia, USA, Mar 2005, pp. V-477–480.<br />

Authorized licensed use limited to: Ryerson University Library. Downloaded on July 7, 2009 at 11:34 from IEEE Xplore. Restrictions apply.<br />


Combining Vocal Source and MFCC Features for Enhanced<br />

Speaker Recognition Performance Using GMMs<br />

Danoush Hosseinzadeh and Sridhar Krishnan<br />

Department of Electrical and Computer Engineering<br />

Ryerson University, Toronto, ON - M5B 2K3 Canada<br />

Email: (danoushh@hotmail.com) (krishnan@ee.ryerson.ca)<br />

Abstract— This work presents seven novel spectral features for speaker<br />

recognition. These features are the spectral centroid (SC), spectral<br />

bandwidth (SBW), spectral band energy (SBE), spectral crest factor<br />

(SCF), spectral flatness measure (SFM), Shannon entropy (SE) and Renyi<br />

entropy (RE). The proposed spectral features can quantify some of the<br />

characteristics of the vocal source or the excitation component of speech.<br />

This is useful for speaker recognition since vocal source information<br />

is known to be complementary to the vocal tract transfer function,<br />

which is usually obtained using the Mel frequency cepstral coefficients<br />

(MFCC) or linear prediction cepstral coefficients (LPCC). To evaluate<br />

the performance of the spectral features, experiments were performed<br />

using a text-independent cohort Gaussian mixture model (GMM) speaker<br />

identification system. Based on 623 users from the TIMIT database, the<br />

spectral features achieved an identification accuracy of 99.33% when<br />

combined with the MFCC based features and when using undistorted<br />

speech. This represents a 4.03% improvement over the baseline system<br />

trained with only MFCC and ΔMFCC features.<br />

I. INTRODUCTION<br />

Speaker recognition has many potential applications as a biometric<br />

tool for resources that can be accessed via the telephone or internet.<br />

In these applications, the identity of users cannot be verified because<br />

there is no direct contact between the user and the service provider.<br />

Hence, speaker recognition is a cost effective and practical technology<br />

that can be used for enhanced security.<br />

Often in the literature, the entire speech system is modeled with a<br />

time-varying excitation and a short-time-varying filter [1]. Using this<br />

model, the source and filter are assumed independent and hence the<br />

speech signal s(t) is modeled by the linear convolution:<br />

s(t) = x(t) ∗ h(t) (1)<br />

where x(t) is a periodic excitation (for voiced speech) or white<br />

noise (for unvoiced speech) and h(t) is a time-varying filter which<br />

constantly changes to produce different sounds. Although h(t) is<br />

time varying, it can be considered stable over a period of a few<br />

milliseconds (ms); frame lengths of 10-30 ms are commonly used in the<br />

literature [1]. This convenient short-time stationary behavior is exploited<br />

by many speaker recognition systems in order to characterize<br />

the vocal tract configuration given by h(t), which is known to be<br />

a unique speaker-dependent characteristic for a given sound. While<br />

assuming a linear model, this information can be easily extracted<br />

from speech signals using well established deconvolution techniques<br />

such as homomorphic filtering or linear prediction methods.<br />
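Equation (1) can be illustrated with a toy synthesis: a periodic impulse train (a voiced excitation) convolved with a short filter impulse response. This is only a sketch of the source-filter idea; the excitation period and filter taps below are arbitrary illustrative values, not from the paper:

```python
def convolve(x, h):
    """Linear convolution s = x * h, as in Eq. (1)."""
    y = [0.0] * (len(x) + len(h) - 1)
    for i, xi in enumerate(x):
        for j, hj in enumerate(h):
            y[i + j] += xi * hj
    return y

# Voiced-speech toy: an impulse train x(t) driven through a filter h(t).
# The filter's response repeats at each pitch pulse in the output s(t).
excitation = [1.0 if n % 4 == 0 else 0.0 for n in range(12)]
h = [1.0, 0.5, 0.25]
s = convolve(excitation, h)
```

Because the pulses are spaced farther apart than the filter length, each period of `s` is simply a copy of `h` followed by silence, which is the behavior the source-filter model captures for voiced speech.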

To date, the most effective features for speaker recognition have<br />

been the Mel frequency cepstral coefficient (MFCC) and the linear<br />

prediction cepstral coefficients (LPCC) [2][1][3]. These features can<br />

accurately characterize the vocal tract configuration of a speaker and<br />

can achieve good performance. Part of the success of these features<br />

is that they provide a compact representation of the vocal tract which<br />

can be modeled effectively. The first several MFCCs can characterize<br />

the speaker’s vocal tract configuration and LPCCs generally define<br />

lower order polynomials [1]. Additionally, the first derivative of the<br />

MFCC feature (ΔMFCC) is largely uncorrelated with the MFCC<br />

feature and has been shown to enhance recognition performance.<br />

Although the MFCC and LPCC based features have proven to be<br />

effective for speaker recognition, they do not provide a complete<br />

description of the speaker’s speech system. Hence, vocal source<br />

information can complement these traditional features by quantifying<br />

some speaker-dependent characteristics such as pitch, harmonic<br />

structure and spectral energy distribution [4][5].<br />

This work proposes seven novel spectral features for speaker<br />

recognition that can quantify the vocal source. These features are<br />

the spectral centroid (SC), spectral bandwidth (SBW), spectral band<br />

energy (SBE), spectral crest factor (SCF), spectral flatness measure<br />

(SFM), Shannon entropy (SE), and Renyi entropy (RE). These<br />

spectral features can be used to complement the MFCC or LPCC<br />

features since they can quantify characteristics of the vocal source.<br />

It is also known that there is some degree of coupling between<br />

the vocal source and vocal tract [6][4] - i.e. the linear model<br />

assumed when calculating MFCC and LPCC is not entirely accurate.<br />

Therefore, the vocal source signal is to some extent predictable<br />

for a given vocal tract configuration. Given these factors, features<br />

that characterize the vocal source can be expected to improve the<br />

performance of existing speaker recognition systems. In this work,<br />

the seven proposed spectral features are extracted from the speech<br />

spectrum and used to enhance the performance of MFCC-based<br />

features in order to illustrate their effectiveness.<br />

Others have attempted to use the vocal source for improving<br />

performance of speaker recognition systems. Attempts have been<br />

made to develop features from the LPCC residual [7][8] with some<br />

success. In these cases, the authors have noted improved performance<br />

by complementing vocal tract features with vocal source information.<br />

The paper is organized as follows. Section II defines the baseline<br />

system used for testing and presents the spectral features. Section<br />

III presents the results as well as the experimental conditions and<br />

Section IV concludes the paper.<br />

II. PROPOSED TESTING METHOD<br />

GMM based speaker recognition systems have become the most<br />

popular method to date. This is because GMMs can capture the<br />

acoustic phenomena or acoustic classes that are present in speech<br />

[2]. In fact, some of the GMM clusters have been found to be<br />

highly correlated with particular phonemes [9]. As a result, good<br />

recognition performance can be achieved with GMM based systems.<br />

The performance of the proposed spectral features will be compared<br />

to the baseline system, which is a cohort text-independent<br />

GMM classifier trained with 14-dimensional MFCC vectors and 14-dimensional<br />

ΔMFCC vectors extracted from 30ms speech frames.<br />

The log-likelihood function is used to find the user model that best<br />

matches a given utterance.<br />

1-4244-1274-9/07/$25.00 ©2007 IEEE<br />

MMSP 2007<br />


TABLE I<br />

SUBBAND ALLOCATION USED TO CALCULATE SPECTRAL FEATURES.<br />

Subband Lower Edge (Hz) Upper Edge (Hz)<br />

1 300 627<br />

2 628 1060<br />

3 1061 1633<br />

4 1634 2393<br />

5 2394 3400<br />

A. Training and GMM Estimation<br />

The expectation maximization (EM) algorithm was used to estimate<br />

the parameters of the GMM models. In the past, model orders<br />

of 8-32 have commonly been used in the literature; however, good results<br />

have been obtained with cohort GMM systems using as few as<br />

16 components [2][10]. A model order of 24 was used in this work to<br />

account for the additional features being used in the system; also,<br />

preliminary experimental results indicated that this model order was<br />

optimal for the proposed feature set among models of order<br />

16, 20, 24, 28 and 32. The k-means algorithm was used to obtain<br />

the initial estimate for each cluster since it has been shown that the<br />

initial grouping of data does not significantly affect the performance<br />

of GMM based recognition systems [2].<br />

A diagonal covariance matrix was used to estimate the variances<br />

of each cluster in the models since it is well known that diagonal<br />

covariance matrices are much more computationally efficient than full<br />

covariance matrices. Furthermore, diagonal covariance matrices can<br />

provide the same level of performance as full covariance matrices<br />

because they can capture the correlation between the features if<br />

a larger model order is used [11]. For these reasons, diagonal<br />

covariance matrices have been used almost exclusively in previous<br />

speaker recognition works. Each element of these matrices is limited<br />

to a minimum value of 0.01 during the EM estimation process to<br />

prevent singularities in the matrix, as recommended by [2].<br />
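The scoring step described above (Section II's log-likelihood matching against each cohort model) can be sketched as follows. This is a minimal illustration of diagonal-covariance GMM scoring only; EM training, k-means initialization, and the 0.01 variance floor are omitted, and the function and speaker names are hypothetical:

```python
import math

def gmm_loglik(frames, weights, means, variances):
    """Total log-likelihood of feature frames under a diagonal-covariance GMM.
    weights: K mixture weights; means/variances: K lists of D per-dimension values."""
    total = 0.0
    for x in frames:
        comp = []
        for w, mu, var in zip(weights, means, variances):
            ll = math.log(w)
            for xi, m, v in zip(x, mu, var):
                # per-dimension Gaussian log-density (diagonal covariance)
                ll += -0.5 * ((xi - m) ** 2 / v + math.log(2 * math.pi * v))
            comp.append(ll)
        mx = max(comp)  # log-sum-exp over the K components, numerically stable
        total += mx + math.log(sum(math.exp(c - mx) for c in comp))
    return total

def identify(frames, models):
    """Cohort identification: pick the speaker model maximizing log-likelihood."""
    return max(models, key=lambda name: gmm_loglik(frames, *models[name]))
```

For example, an utterance whose frames lie near one model's mean is attributed to that speaker; in a full system the models would be 24-component GMMs over the 33-dimensional feature vectors of Eq. (9).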

B. Spectral Features<br />

The proposed spectral features can be expected to improve the<br />

performance of MFCC or LPCC features because they can capture<br />

complementary information related to the vocal source such as pitch,<br />

harmonic structure, energy distribution, bandwidth of the speech<br />

spectrum and even voiced or unvoiced excitation. To illustrate the<br />

effectiveness of these features, they are extracted from the speech<br />

spectrum and used to enhance the performance of MFCC and<br />

ΔMFCC features.<br />

Spectral features should be extracted from multiple subbands,<br />

as shown in Table I. This extraction method will provide better<br />

discrimination between different speakers because the trend for a<br />

given feature can be captured from the spectrum. This is better than<br />

obtaining one global value from the spectrum, which is not likely to<br />

show speaker-dependent characteristics.<br />

The proposed subbands are linearly spaced on the Mel scale and<br />

span the range of a practical telephone channel (300 Hz-3.4 kHz).<br />

This allocation scheme reflects the fact that most of the energy<br />

of the speech signal is located in the lower frequency regions and<br />

therefore, narrowly defined subbands are used in the lower frequency<br />

regions in order to capture more detail. This is also consistent with<br />

the non-linearities of human auditory perception, which shows<br />

more sensitivity to lower frequencies than higher frequencies. This<br />

non-linearity has been shown to be important for cepstral based<br />

features such as the MFCC feature [3].<br />
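The Table I edges can be reproduced by spacing the five bands linearly on the Mel scale between 300 Hz and 3.4 kHz. A sketch assuming the common 2595·log10(1 + f/700) Mel mapping (the paper does not state which Mel formula it used, so agreement to within a few Hz is the most one should expect):

```python
import math

def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_subband_edges(f_lo=300.0, f_hi=3400.0, n_bands=5):
    """Edges of n_bands subbands linearly spaced on the Mel scale."""
    m_lo, m_hi = hz_to_mel(f_lo), hz_to_mel(f_hi)
    step = (m_hi - m_lo) / n_bands
    return [mel_to_hz(m_lo + k * step) for k in range(n_bands + 1)]
```

Running `mel_subband_edges()` gives edges near 300, 626, 1059, 1632, 2392 and 3400 Hz, matching Table I's narrower low-frequency bands.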

Spectral features are extracted from 30 ms speech frames as<br />

follows. Let s_i[n], for n ∈ [0, N], represent the i-th speech frame<br />

and S_i[f] the spectrum of this frame. Then S_i[f] can<br />

be divided into M non-overlapping subbands, where each subband<br />

b is defined by a lower frequency edge l_b and an upper frequency<br />

edge u_b. Each of the seven spectral features can now be calculated<br />

from S_i[f] as shown below.<br />

Spectral Centroid (SC) - SC as given below is the weighted<br />

average frequency for a given subband, where the weights are the<br />

normalized energy of each frequency component in that subband.<br />

Since this measure captures the center of gravity of each subband it<br />

can locate large peaks in subbands. These peaks correspond to the<br />

approximate location of formants [12] or pitch frequencies.<br />

SC_{i,b} = ( Σ_{f=l_b}^{u_b} f |S_i[f]|² ) / ( Σ_{f=l_b}^{u_b} |S_i[f]|² )   (2)<br />

Spectral Bandwidth (SBW) - SBW as given below is the weighted<br />

average distance from each frequency component in a subband to<br />

the spectral centroid of that subband. Here, the weights are the<br />

normalized energy of each frequency component in that subband.<br />

This measure quantifies the relative spread of each subband for<br />

a given sound and therefore, it might characterize some speaker-<br />

dependent information.<br />

SBW_{i,b} = ( Σ_{f=l_b}^{u_b} (f − SC_{i,b})² |S_i[f]|² ) / ( Σ_{f=l_b}^{u_b} |S_i[f]|² )   (3)<br />

Spectral Band Energy (SBE) - SBE as given below is the energy of<br />

each subband normalized with the combined energy of the spectrum.<br />

The SBE gives the trend of energy distribution for a given sound and<br />

therefore, it contains some speaker-dependent information.<br />

SBE_{i,b} = ( Σ_{f=l_b}^{u_b} |S_i[f]|² ) / ( Σ_{f,b} |S_i[f]|² )   (4)<br />

Spectral Flatness Measure (SFM) - SFM as given below is a<br />

measure of the flatness of the spectrum, where white noise has a<br />

perfectly flat spectrum. This measure is useful for discriminating<br />

between voiced and un-voiced components of speech [13].<br />

SFM_{i,b} = ( Π_{f=l_b}^{u_b} |S_i[f]|² )^{1/(u_b−l_b+1)} / ( (1/(u_b−l_b+1)) Σ_{f=l_b}^{u_b} |S_i[f]|² )   (5)<br />

Spectral Crest Factor (SCF) - SCF as given below provides a<br />

measure for quantifying the tonality of the signal. This measure is<br />

useful for discriminating between wideband and narrowband signals<br />

by indicating the relative peak of a subband. These peaks correspond<br />

to the most dominant pitch frequency in each subband.<br />

SCF_{i,b} = max_f |S_i[f]|² / ( (1/(u_b−l_b+1)) Σ_{f=l_b}^{u_b} |S_i[f]|² )   (6)<br />

Renyi Entropy (RE) - RE as given below is an information theoretic<br />

measure that quantifies the randomness of the subband. Here, the<br />

normalized energy of the subband can be treated as a probability<br />

distribution for calculating entropy and α is set to 3, as commonly<br />

found in literature [14]. This RE trend is useful for detecting the<br />

voiced and unvoiced components of speech.<br />

RE_{i,b} = (1/(1−α)) log ( Σ_{f=l_b}^{u_b} ( |S_i[f]|² / Σ_{f=l_b}^{u_b} |S_i[f]|² )^α )   (7)<br />

Shannon Entropy (SE) - SE as given below is also an information<br />

theoretic measure that quantifies the randomness of the subband.<br />

Here, the normalized energy of the subband can be treated as a<br />



probability distribution for calculating entropy. Similar to the RE<br />

trend, the SE trend is also useful for detecting the voiced and unvoiced<br />

components of speech.<br />

SE_{i,b} = − Σ_{f=l_b}^{u_b} ( |S_i[f]|² / Σ_{f=l_b}^{u_b} |S_i[f]|² ) log ( |S_i[f]|² / Σ_{f=l_b}^{u_b} |S_i[f]|² )   (8)<br />
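The seven per-subband features of Eqs. (2)-(8) can be sketched directly from a frame's power spectrum. This is an illustrative implementation under stated assumptions, not the authors' code: the input is a list of power values |S_i[f]|² indexed by FFT bin, subbands are (lo, hi) bin ranges, α = 3 as in Eq. (7), and the SFM's geometric mean assumes strictly positive bins:

```python
import math

def spectral_features(S2, bands, alpha=3):
    """Per-subband SC, SBW, SBE, SFM, SCF, RE and SE (Eqs. (2)-(8)).
    S2: power spectrum |S[f]|^2 per bin; bands: inclusive (lo, hi) bin ranges."""
    total = sum(S2)
    out = {k: [] for k in ("SC", "SBW", "SBE", "SFM", "SCF", "RE", "SE")}
    for lo, hi in bands:
        band = S2[lo:hi + 1]
        n = hi - lo + 1
        e = sum(band)                                   # subband energy
        sc = sum(f * S2[f] for f in range(lo, hi + 1)) / e
        sbw = sum((f - sc) ** 2 * S2[f] for f in range(lo, hi + 1)) / e
        geo = math.exp(sum(math.log(x) for x in band) / n)  # geometric mean
        p = [x / e for x in band]                       # normalized band energy
        out["SC"].append(sc)
        out["SBW"].append(sbw)
        out["SBE"].append(e / total)
        out["SFM"].append(geo / (e / n))                # geometric / arithmetic mean
        out["SCF"].append(max(band) / (e / n))          # peak / arithmetic mean
        out["RE"].append(math.log(sum(pi ** alpha for pi in p)) / (1 - alpha))
        out["SE"].append(-sum(pi * math.log(pi) for pi in p))
    return out
```

A perfectly flat (white-noise-like) subband gives SFM = SCF = 1 and maximal entropies, while a tonal subband drives SFM toward 0, SCF upward and the entropies downward, which is the voiced/unvoiced behavior described above.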

To the best of our knowledge, these features are being used for<br />

the first time in speaker recognition although they have previously<br />

been used in other areas [15]. These spectral features along with the<br />

MFCC and ΔMFCC features will be extracted from each speech<br />

frame and appended together to form a combined feature matrix for<br />

the speech signal. These vectors can then be modeled and used for<br />

speaker recognition. Equation 9 shows the feature matrix that can<br />

be extracted based on only one spectral feature, say the SC feature,<br />

from i frames; where the bracketed number is the length of the<br />

feature. It should be noted that any other spectral feature can be<br />

substituted in for the SC feature in the feature matrix.<br />

F = [ MFCC_1(14)   ΔMFCC_1(14)   SC_1(5)<br />

      ⋮            ⋮             ⋮<br />

      MFCC_i(14)   ΔMFCC_i(14)   SC_i(5) ]   (9)<br />
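The per-frame concatenation of Eq. (9) amounts to appending the feature vectors row-wise. A trivial sketch (the function name is an illustrative assumption):

```python
def build_feature_matrix(mfcc, d_mfcc, spectral):
    """Row i = MFCC_i ++ dMFCC_i ++ spectral_i, as in Eq. (9).
    Each argument is a list of per-frame feature vectors."""
    return [list(m) + list(d) + list(s) for m, d, s in zip(mfcc, d_mfcc, spectral)]
```

With 14 MFCCs, 14 ΔMFCCs and one 5-dimensional spectral feature per frame, each row has 33 dimensions.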

The spectral features are expected to be largely uncorrelated with<br />

the MFCC based features because the spectral features can capture<br />

some information about the vocal source, whereas the MFCC features<br />

tend to capture information about the vocal tract. Among the spectral<br />

features, there may be some correlation between the SC and the<br />

SCF features because they both quantify information about the peaks<br />

(locations of energy concentration) of each subband. The difference is<br />

that the SCF feature describes the normalized strength of the largest<br />

peak in each subband while the SC feature describes the center of<br />

gravity of each subband. Therefore, these features will perform well<br />

if the largest peak in a given subband is much larger than all other<br />

peaks in that subband. The RE and SE features are also correlated<br />

since they are both entropy measures. However, the RE feature is<br />

much more sensitive to small changes in the spectrum because of<br />

the exponent term α. Therefore, although these features quantify the<br />

same type of information, their performance may be different for<br />

speech signals.<br />

III. EXPERIMENTAL RESULTS<br />

All speech samples used in these experiments were obtained from<br />

623 speakers of the TIMIT speech corpus. Since the TIMIT database<br />

has a sampling frequency of 16kHz, the signals were down sampled<br />

to 8kHz which is well suited for telephone applications. Features<br />

were extracted from 30ms long frames with 15ms of overlap with the<br />

previous frames and a Hamming window was applied to each frame<br />

to ensure a smooth frequency transition between frames. Twenty<br />

seconds of undistorted speech from each speaker was used to train the<br />

system and the remaining samples were used for testing. Although the<br />

tests were performed with undistorted audio, it is expected that some<br />

of these features will remain robust to different linear and non-linear<br />

distortions [15].<br />
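The framing described above (30 ms frames, 15 ms overlap, Hamming window) can be sketched as follows; a minimal illustration, assuming the 8 kHz post-downsampling rate:

```python
import math

def frame_signal(x, fs=8000, frame_ms=30, hop_ms=15):
    """Split a signal into Hamming-windowed frames (30 ms, 50% overlap)."""
    n = int(fs * frame_ms / 1000)    # 240 samples at 8 kHz
    hop = int(fs * hop_ms / 1000)    # 120-sample hop -> 15 ms overlap
    w = [0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]
    frames = []
    for start in range(0, len(x) - n + 1, hop):
        frames.append([x[start + i] * w[i] for i in range(n)])
    return frames
```

Each returned frame would then be transformed to obtain S_i[f] before computing the MFCC and spectral features.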

A. Results and Discussions<br />

MFCC based features are very effective for characterizing the<br />

vocal tract configuration. Although this is the main reason for the<br />

success of the MFCC based features, they do not provide a complete<br />

description of the speaker’s speech system. The proposed spectral<br />

features are expected to increase identification accuracy of MFCC<br />


TABLE II<br />

EXPERIMENTAL RESULTS USING 7S TEST UTTERANCES (298 TESTS)<br />

Feature Accuracy(%)<br />

MFCC & ΔMFCC (Baseline system) 95.30<br />

MFCC & ΔMFCC & SC 97.32<br />

MFCC & ΔMFCC & SBE 97.32<br />

MFCC & ΔMFCC & SBW 96.98<br />

MFCC & ΔMFCC & SCF 96.31<br />

MFCC & ΔMFCC & SFM 81.55<br />

MFCC & ΔMFCC & SE 90.27<br />

MFCC & ΔMFCC & RE 98.32<br />

MFCC & ΔMFCC & SBE & SC 96.98<br />

MFCC & ΔMFCC & SBE & RE 96.98<br />

MFCC & ΔMFCC & SC & RE 99.33<br />

based systems because they provide some information about the<br />

vocal source.<br />

Table II demonstrates the identification accuracy of the system<br />

when using spectral features in addition to the MFCC based features<br />

with undistorted speech. The table also shows several combinations<br />

of the best performing features. The accuracy rate represents the<br />

percentage of test samples that were correctly identified by the<br />

system, as shown below.<br />

Accuracy = Samples Correctly Identified / Total Number of Samples   (10)<br />

It is evident from these results that there is some speaker-dependent<br />

information captured by most of the proposed features since they improved<br />

identification rates when combined with the standard MFCC<br />

based features. In fact, when two of the best performing spectral<br />

features (SC and RE) were simultaneously combined with the MFCC<br />

based features, an identification accuracy of 99.33% was achieved,<br />

which represents a 4.03% improvement over the baseline system.<br />

These results suggest that the proposed spectral features provide<br />

complementary and discriminatory information about the speaker’s<br />

vocal source and system, which leads to enhanced identification<br />

accuracies.<br />

The best performing feature was the RE feature. This feature is<br />

very effective at quantifying voiced speech which is quasi-periodic<br />

(relatively low entropy) and un-voiced which is often represented<br />

by AWGN (relatively high entropy). However, we suspect that the<br />

RE feature may also be characterizing a phenomenon other<br />

than voiced and unvoiced speech. This is likely since the SE feature<br />

did not show any performance benefits and it too is an entropy<br />

measure capable of discriminating between voiced and unvoiced<br />

speech. One possibility is that the exponential term α in the RE<br />

definition is contributing to this performance improvement. Since the<br />

spectrum is normalized to the range of [0,1] before calculating<br />

these features, the exponent term α has the effect of significantly<br />

reducing the contributions of the low energy components relative to<br />

the high energy components. Therefore, the RE feature is likely to<br />

produce a more reliable measure since it heavily relies on the high<br />

energy components of each subband. However, the entropy features<br />

in general are susceptible to random noise and will not perform well<br />

in all conditions.<br />

Figure 1(a) shows that the SC feature can capture the center of<br />

gravity of each subband. Since the subband’s center of gravity is<br />

related to the spectral shape of the speech signal, it implies that the SC<br />

feature can also detect changes in pitch and harmonic structure since<br />

they fundamentally affect the spectrum. Pitch and harmonic structure<br />

convey some speaker-dependent information and are complementary<br />



[Figure 1 appears here: four panels plotting magnitude against frequency (0-4000 Hz), showing (a) the location of the SC, (b) the location of the SCF, (c) the SBW per subband (8%, 18%, 2%, 33%, 38%), and (d) the SBE per subband (46%, 5%, 3%, 2%, 2%).]<br />

Fig. 1. Plot of the spectral features. Subband boundaries are indicated<br />

with dark solid lines and feature location is indicated with dashed lines. (a)<br />

Location of the SC (b) Location of the SCF (c) SBW as a percentage of the<br />

five subbands. (d) SBE as a percentage of the whole spectrum.<br />

to the vocal tract transfer function for speaker recognition. In addition,<br />

the SC feature can also locate the approximate location of the<br />

dominant formant in each of the subbands since formants will tend<br />

towards the subband’s center of gravity. These properties of the SC<br />

feature provide complementary information and led to the improved<br />

performance of the MFCC based classifier.<br />

The SCF feature shown in Figure 1(b) quantifies the normalized<br />

strength of the dominant peak in each subband. Given that the<br />

dominant peak in each subband corresponds to a particular pitch<br />

frequency harmonic, it shows that the SCF feature is pitch dependent<br />

and therefore, it is also speaker-dependent for a given sound. This<br />

dependence on pitch frequency is useful when the vocal tract configuration<br />

(i.e. MFCC) is known as seen by the enhanced performance.<br />

Moreover, the SCF feature is a normalized measure and should not<br />

be significantly affected by the intensity of speech from different<br />

sessions.<br />

The SBE feature, shown in Figure 1(d), also performed well in<br />

the experiments. This feature provides the distribution of energy in<br />

each subband as a percentage of the entire spectrum, which is another<br />

measure that can quantify the harmonic structure of the signal. The<br />

SBE feature is also a normalized energy measure and should not<br />

be significantly affected by the intensity (or relative loudness) of<br />

speech from different sessions. The results in Table II suggest that<br />

for a given vocal tract configuration the SBE trend is predictable and<br />

complementary for speaker recognition.<br />

The SBW feature is largely dependent on the SC feature and the<br />

energy distribution of each subband; therefore, it has also performed<br />

well for the reasons mentioned above. Figure 1(c) shows the SBW<br />

for each subband as a percentage of all subbands.<br />

The SFM feature did not perform well because it quantifies characteristics<br />

that are not well defined in speech signals. For example,<br />

the SFM feature measures the tonality of the subband, a characteristic<br />

that is difficult to define in the speech spectrum since its energy is<br />

distributed across many frequencies.<br />

IV. CONCLUSION<br />

Features such as the SC, SCF and SBE provide vocal source<br />

information as it relates to harmonic structure, pitch frequency and<br />

spectral energy distribution, while the entropy features quantify the<br />


spectrum in terms of voiced and unvoiced speech. The proposed<br />

features were shown to be complementary in nature and enhanced<br />

performance when used with the vocal tract transfer function (i.e.<br />

MFCC). This is mainly because the vocal tract transfer function is<br />

the most discriminating feature for speaker recognition and it greatly<br />

influences the spectral shape and harmonic structures of speech.<br />

Experimental results show that the proposed spectral features<br />

improve the performance of MFCC based features. Based on 623<br />

users from the TIMIT database, the combined feature set of MFCC,<br />

ΔMFCC, SC and RE achieved an identification accuracy of 99.33%<br />

(for clean speech) by incorporating information about the vocal<br />

source. This represents a 4.03% improvement over the baseline<br />

system, which only used the MFCC based features.<br />

The good performance of spectral features for speaker recognition<br />

in this speaker identification system is very promising. These features<br />

should also produce good results if used with more sophisticated<br />

speaker recognition techniques, such as universal background model<br />

(UBM) based approaches.<br />


Authorized licensed use limited to: Ryerson University Library. Downloaded on July 7, 2009 at 11:54 from IEEE Xplore. Restrictions apply.


Proceedings of the 29th Annual International<br />

Conference of the IEEE EMBS<br />

Cité Internationale, Lyon, France<br />

August 23-26, 2007.<br />

Multiresolution Analysis and Classification of Small Bowel Medical<br />

Images<br />

Abstract— This is the first reported work in the area of small<br />

bowel image classification; a novel analysis system was<br />

developed. Principles of human texture perception were used to<br />

design features which can discriminate between abnormal and<br />

normal images. The proposed method extracts statistical features<br />

from the wavelet domain, which describe the homogeneity<br />

of localized areas within the small bowel images. To ensure that<br />

robust features were extracted, a shift-invariant discrete wavelet<br />

transform (SIDWT) was explored. LDA classification was used<br />

with the leave one out method to improve classification under<br />

the small database scenario. A total of 75 abnormal and normal<br />

bowel images were used for experimentation resulting in high<br />

classification rates: 85% specificity and 85% sensitivity. The<br />

success of the system can be attributed to the discriminatory<br />

and robust feature set (translation, scale and semi-rotational<br />

invariant), which successfully classified various sizes and types<br />

of pathologies at multiple viewing angles.<br />

Index Terms— Biomedical image processing, feature extraction,<br />

computer-aided diagnosis, content-based image retrieval<br />

I. INTRODUCTION<br />

The PillCam™ SB is a tiny capsule endoscope (10 mm ×<br />

27 mm [1]), which is swallowed by the patient. As natural<br />

peristalsis moves the capsule through the gastrointestinal<br />

tract, it captures video and wirelessly transmits it to a data<br />

recorder the patient is wearing around his or her waist.<br />

This video provides visualization of the 21 foot long small<br />

bowel, which was originally seen as a “black box” to<br />

doctors [2]. Video is recorded for approximately 8 hours<br />

and then the capsule is excreted naturally. Clinical results<br />

for the PillCam™ show that it provides superior diagnostic<br />

capabilities for diseases of the small intestine [2].<br />

In the small intestine, there are four main types of cancers,<br />

which are named after the cell they originate from:<br />

adenocarcinoma, sarcoma, carcinoid and lymphoma. These<br />

types of cancers can occur in various sizes and shapes, and<br />

may be found anywhere along the small bowel tract. Since an<br />

internal view of the small bowel was previously not available,<br />

the PillCam™ offers gastroenterologists a new method of<br />

detecting disease. The drawback of this technology is that the<br />

doctor has to watch and diagnose approximately 8 hours of<br />

footage. Viewing this footage is a very laborious task for<br />

physicians, which could cause them to miss important clues<br />

due to fatigue, boredom or the repetitive nature of<br />

the task. Therefore, to aid doctors with this laborious<br />

task, a computer-aided diagnosis (CAD) system may be<br />

used to offer a secondary opinion of the images. Such a<br />

system would automatically isolate suspicious video instants<br />

April Khademi and Sridhar Krishnan<br />

Dept. of Electrical and Computer Engineering<br />

Ryerson University, Toronto, ON, Canada<br />

akhademi@ieee.org, krishnan@ee.ryerson.ca<br />

1-4244-0788-5/07/$20.00 ©2007 IEEE<br />


(images) for the doctor. The extracted features may also<br />

be used for content-based image retrieval (CBIR), where<br />

physicians can locate abnormal image(s) based on their<br />

semantic content, not based on text annotations.<br />

There are several challenges associated with the development<br />

of an automated classification scheme for small bowel<br />

imagery: the camera angle can be expected to be different<br />

from patient to patient, suspicious regions may occur in<br />

several different places along the gastrointestinal tract and<br />

pathologies come in various forms, sizes and shapes. This<br />

work aims to develop a unified feature extraction algorithm<br />

which can account for all these scenarios. This is the first<br />

reported work in the area of small bowel image classification<br />

and the system aims to detect both malignant and benign<br />

pathologies with a high classification rate. The small bowel<br />

images (video instants) are stored as lossy JPEG images, so<br />

feature extraction is completed in the compressed domain.<br />

II. METHODOLOGY<br />

To extract highly discriminatory features, image processing<br />

techniques are needed to analyze and understand the biomedical<br />

images. Since biomedical signals (including small<br />

bowel images) contain a combination of information which is<br />

localized spatially (i.e. transients, edges) as well as structures<br />

which are more diffuse (i.e. small oscillations, texture) [3], a<br />

technique which can exploit both these characteristics (which<br />

may be related to the diagnosis) is required. To perform this<br />

task, the discrete wavelet transform (DWT) will be utilized<br />

due to its excellent space-localization properties [4] [6].<br />

A. DWT Properties for Feature Extraction<br />

The DWT is scale-invariant since a complete decomposition<br />

will contain all the basis functions needed to decompose<br />

various scaled versions of the input image. Since pathological<br />

features do not come in a predefined size, scale-invariance<br />

will help to capture pathologies of different sizes.<br />

Although the DWT offers good localization and scale-invariance<br />

properties, it is well known that the DWT is shift-variant<br />

[4]. Different translations of an input image result<br />

in a different set of DWT coefficients. Therefore, extracting<br />

robust features from the wavelet domain is a challenging<br />

task.<br />

B. Shift-Invariant DWT<br />


To extract a consistent feature set, the 2-D version of<br />

Beylkin’s shift-invariant DWT (SIDWT) is utilized [5]. This<br />



algorithm computes the DWT for all circular translates of<br />

the image, in a computationally efficient manner. Coifman<br />

and Wickerhauser’s best basis algorithm [6] is employed to<br />

ensure the same set of coefficients are chosen, regardless<br />

of the input shift. This permits the selection of a<br />

consistent set of DWT coefficients, therefore allowing for<br />

the extraction of robust, shift-invariant features.<br />
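The idea behind the SIDWT can be sketched in 1-D with a Haar filter: compute the DWT of every circular translate of the input and keep the minimum-cost one. This is only a toy stand-in for the 2-D algorithm of [5], and the l1 cost below replaces the entropy cost of the best basis algorithm:<br />

```python
import numpy as np

def haar_dwt(x):
    # single-level orthonormal Haar DWT (length of x must be even)
    a = (x[0::2] + x[1::2]) / np.sqrt(2)   # approximation coefficients
    d = (x[0::2] - x[1::2]) / np.sqrt(2)   # detail coefficients
    return a, d

def sidwt(x):
    """Evaluate the DWT of every circular translate of x and keep the
    translate whose detail coefficients have minimal l1 cost, so that any
    shifted version of x maps to the same coefficient set."""
    best = min(range(len(x)),
               key=lambda s: np.abs(haar_dwt(np.roll(x, s))[1]).sum())
    return haar_dwt(np.roll(x, best))
```

Because a shifted input generates the same family of candidate decompositions, the minimum-cost coefficient set (up to ordering) does not change with the shift, which is what makes the downstream features consistent.<br />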

C. Image Texture<br />

When a textured surface is viewed, the human visual system<br />

can discriminate between textured regions quite easily.<br />

To understand how the human visual system can easily differentiate<br />

between textures, Julesz defined textons, which are<br />

elementary units of texture [7]. Various textured regions can<br />

be decomposed using these textons, which include elongated<br />

blobs, lines, terminators and more. It was also found that the<br />

frequency content, scale, orientation and periodicity of these<br />

textons can provide important clues on how to differentiate<br />

between two or more textured areas [7] [4]. Therefore,<br />

a robust texture analysis scheme would take into account<br />

the localized spatial properties of images to understand the<br />

orientation, periodicity, scale or frequency content of the<br />

primitive texture elements. Consequently, to differentiate<br />

between normal and pathological cases of the small intestine,<br />

the proposed work aims to develop an automated system<br />

which mimics the human visual system to understand the<br />

texture content of the small bowel images.<br />

Small Bowel Texture: Normal small bowel images<br />

contain smooth, homogeneous texture elements with very<br />

little disruption in uniformity except for folds and crevices.<br />

This is shown in Figure 1(d)-(f). Abnormal small bowel<br />

images (benign and malignant) can contain various pathologies<br />

(polyp, Kaposi’s sarcoma, carcinoma, etc.). These diseases<br />

may occur in various sizes, shapes, orientations and<br />

locations within the gastrointestinal tract. Although there<br />

are many types of diseases, small bowel pathologies have<br />

some common textural characteristics: (1) diseased regions<br />

contain a variety of textured areas simultaneously and (2)<br />

diseased areas are mostly composed of heterogeneous texture<br />

components. Typical abnormal cases are shown in Figure<br />

1(a)-(c). Another important factor which must be considered<br />

is that the camera angle will vary from image to image.<br />

Therefore, textural characteristics may appear in several<br />

orientations.<br />

D. Features<br />

To extract texture-based features, normalized gray-level co-occurrence<br />

matrices (GCMs) are used. Let each entry of<br />

the normalized GCM be represented as p(l1, l2, d, θ), where<br />

l1 and l2 are two graylevels at a distance d and angle θ.<br />

Normalized GCMs allow statistical quantities to be computed<br />

which reflect the textural properties of the region of interest.<br />

To exploit the textural characteristics of the small bowel<br />

images, texture features which describe the relative homogeneity<br />

or non-uniformity of the images are used since these<br />

texture properties differentiate between the normal and the<br />

abnormal images. The features used are homogeneity (h),<br />


which describes how uniform the texture is and entropy (e),<br />

which is a measure of nonuniformity or the complexity of<br />

the texture.<br />

h(θ) = Σ_{l1=0}^{L−1} Σ_{l2=0}^{L−1} p²(l1, l2, d, θ),  (1)<br />

e(θ) = −Σ_{l1=0}^{L−1} Σ_{l2=0}^{L−1} p(l1, l2, d, θ) log₂ p(l1, l2, d, θ).  (2)<br />
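As a concrete sketch, a normalized GCM and the two statistics of Equations 1 and 2 can be computed as follows; the code is illustrative (hypothetical names, 8 gray levels assumed), and in the proposed method p(l1, l2, d, θ) is estimated inside each wavelet subband rather than on the raw image:<br />

```python
import numpy as np

OFFSETS = {0: (0, 1), 45: (-1, 1), 90: (-1, 0), 135: (-1, -1)}  # (row, col)

def glcm(img, d=1, theta=0, levels=8):
    """Normalized gray-level co-occurrence matrix at distance d, angle theta."""
    img = np.asarray(img)
    # quantize to a few gray levels so the co-occurrence counts are dense
    q = (img.astype(float) * levels // (img.max() + 1)).astype(int)
    dr, dc = (d * o for o in OFFSETS[theta])
    rows, cols = q.shape
    p = np.zeros((levels, levels))
    for r in range(max(0, -dr), min(rows, rows - dr)):
        for c in range(max(0, -dc), min(cols, cols - dc)):
            p[q[r, c], q[r + dr, c + dc]] += 1
    return p / p.sum()   # joint probability of gray-level pairs

def homogeneity(p):
    # Eq. (1): sum of squared joint probabilities; uniform texture -> large h
    return float((p ** 2).sum())

def entropy(p):
    # Eq. (2): entropy of the co-occurrence distribution; complex texture -> large e
    nz = p[p > 0]
    return float(-(nz * np.log2(nz)).sum())
```

A perfectly flat region concentrates all co-occurrence mass in one cell (h = 1, e = 0), while heterogeneous texture spreads it out, which is exactly the normal/abnormal contrast the features exploit.<br />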

To gain a highly descriptive representation, textural features<br />

are computed from the wavelet domain. Extracting<br />

features from the wavelet domain will result in a localized<br />

texture description, since the DWT has excellent space-localization<br />

properties. To account for oriented texture, the<br />

GCMs are computed at various angles in the wavelet domain<br />

at d = 1 to account for fine texture. Typically, the DWT is not<br />

used for texture analysis due to its shift-variant property.<br />

However, using the SIDWT algorithm previously described<br />

will permit the extraction of a consistent feature set,<br />

thus allowing for multiscale texture analysis. This scheme is<br />

devised to be in accordance with human texture perception.<br />

1) Multiresolutional Features: To examine texture features<br />

at various scales, GCMs p (l1, l2, d, θ) are computed<br />

from the wavelet domain for each scale j at several angles<br />

θ. Each subband isolates different frequency components -<br />

the HL band isolates horizontal edge components, the LH<br />

subband isolates horizontal edges, the HH band captures the<br />

diagonal high frequency components and LL band contains<br />

the lowpass filtered version of the original. Consequently,<br />

to capture these oriented texture components, the GCM is<br />

computed at 0 ◦ in the HL band, 90 ◦ in the LH subband,<br />

45 ◦ and 135 ◦ in the HH band and 0 ◦ , 45 ◦ , 90 ◦ and 135 ◦ in<br />

the LL band to account for any directional elements which<br />

may still be present in the low frequency subband.<br />

From these GCMs, homogeneity h(θ) and entropy e(θ) are<br />

computed for each decomposition level using Equations 1 and<br />

2. For each decomposition level j, more than one directional<br />

feature is generated for the HH and LL subbands. The<br />

features in these subbands are averaged so that: features<br />

are not biased to a particular orientation of texture and<br />

the representation will offer some rotational invariance. The<br />

features generated in these subbands (HH and LL) are<br />

shown below. Note that the quantity in parenthesis is the<br />

angle at which the GCM was computed.<br />

h^j_HH = (1/2) [h^j_HH(45°) + h^j_HH(135°)],  (3)<br />

e^j_HH = (1/2) [e^j_HH(45°) + e^j_HH(135°)],  (4)<br />

h^j_LL = (1/4) [h^j_LL(0°) + h^j_LL(45°) + h^j_LL(90°) + h^j_LL(135°)],  (5)<br />

e^j_LL = (1/4) [e^j_LL(0°) + e^j_LL(45°) + e^j_LL(90°) + e^j_LL(135°)].  (6)<br />


Fig. 1. Typical images of the small bowel captured by the PillCam™ SB, which exhibit textural characteristics. (a) Small bowel lymphoma, (b) GIST<br />

tumour, (c) polypoid mass, (d) healthy small bowel, (e) normal small bowel, (f) normal colonic mucosa.<br />

As a result, for each decomposition level j, two feature sets<br />

are generated:<br />

F^j_h = {h^j_HL(0°), h^j_LH(90°), h^j_HH, h^j_LL},  (7)<br />

F^j_e = {e^j_HL(0°), e^j_LH(90°), e^j_HH, e^j_LL},  (8)<br />

where h^j_HH, h^j_LL, e^j_HH and e^j_LL are the averaged texture<br />

descriptions from the HH and LL bands previously described<br />

and h^j_HL(0°), e^j_HL(0°), h^j_LH(90°) and e^j_LH(90°) are homogeneity<br />

and entropy texture measures extracted from the HL<br />

and LH bands. Since directional GCMs are used to compute<br />

the features in each subband, the final feature representation<br />

is not biased for a particular orientation of texture and may<br />

provide a semi-rotational invariant representation.<br />
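The orientation selectivity that motivates the angle-per-subband scheme can be checked with a one-level separable Haar decomposition (a sketch; the paper does not state which wavelet filter was used, and the subband naming follows its HL/LH/HH/LL convention):<br />

```python
import numpy as np

def haar_dwt2(img):
    """One-level separable Haar DWT returning the LL, HL, LH, HH subbands
    (HL: high-pass along rows -> responds to vertical edges;
     LH: high-pass along columns -> responds to horizontal edges)."""
    lo = (img[:, 0::2] + img[:, 1::2]) / 2     # low-pass across columns
    hi = (img[:, 0::2] - img[:, 1::2]) / 2     # high-pass across columns
    ll = (lo[0::2, :] + lo[1::2, :]) / 2
    lh = (lo[0::2, :] - lo[1::2, :]) / 2       # horizontal edges
    hl = (hi[0::2, :] + hi[1::2, :]) / 2       # vertical edges
    hh = (hi[0::2, :] - hi[1::2, :]) / 2       # diagonal detail
    return ll, hl, lh, hh
```

Vertical stripes excite only the HL subband and horizontal stripes only the LH subband, which is why the paper evaluates the GCM at 0° in HL and 90° in LH.<br />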

E. Classifier<br />

A large number of test samples are required to evaluate a<br />

classifier with low error (misclassification) rates since a small<br />

database will cause the parameters of the classifier to be estimated<br />

with low accuracy. This requires the biomedical image<br />

database to be large, which may not always be the case since<br />

acquiring the images is not always easy and the number<br />

of pathologies may be limited. If the extracted features are<br />

strong (i.e. the features are mapped into nonoverlapping<br />

clusters in the feature space) the use of a simple classification<br />

scheme will be sufficient in discriminating between classes.<br />

Therefore, linear discriminant analysis (LDA) will be the<br />

classification scheme used in conjunction with the Leave One<br />

Out Method (LOOM). LOOM combats the small sample size<br />

scenario by removing one sample from the whole set and<br />

generating the discriminant functions from the remaining<br />

N − 1 data samples. Using these discriminant scores, the<br />

left out sample is classified. This procedure is completed<br />

for all N samples. LOOM allows classifier parameters to be<br />

estimated with least bias [8].<br />
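A minimal numeric sketch of LDA evaluated with the leave one out method, using synthetic two-class feature vectors in place of the wavelet texture features (all names and the regularization constant are illustrative):<br />

```python
import numpy as np

def lda_train(X, y):
    """Two-class LDA: pooled within-class covariance and class means give
    a linear discriminant w.x + b > 0 -> class 1."""
    m = [X[y == c].mean(axis=0) for c in (0, 1)]
    S = sum(np.cov(X[y == c].T, bias=True) * (y == c).sum() for c in (0, 1))
    S = S / len(y) + 1e-6 * np.eye(X.shape[1])   # regularize for small samples
    w = np.linalg.solve(S, m[1] - m[0])
    b = -0.5 * (m[1] + m[0]) @ w + np.log((y == 1).mean() / (y == 0).mean())
    return w, b

def loom_accuracy(X, y):
    # leave one out: train on N-1 samples, classify the held-out one
    hits = 0
    for i in range(len(y)):
        keep = np.arange(len(y)) != i
        w, b = lda_train(X[keep], y[keep])
        hits += int((X[i] @ w + b > 0) == (y[i] == 1))
    return hits / len(y)
```

Each of the N samples is held out exactly once, so every classification decision is made by a classifier that never saw that sample, which is what reduces the bias of the estimated error rate.<br />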

III. RESULTS AND DISCUSSION<br />

The objective of the proposed system is to automatically<br />

classify various pathologies from normal regions throughout<br />

the small bowel tract. The small intestine images used<br />

are 256×256, 24bpp and lossy (.jpeg). Forty-one normal<br />

and 34 abnormal (including: various sized lesions such as<br />

submucosal masses, lymphomas, jejunal carcinomas, polypoid<br />

masses, Kaposi’s sarcomas, multifocal carcinomas, etc.)<br />

images were used for experimentation (ground truth is<br />

supplied with the database and images are acquired from<br />


TABLE I<br />

CONFUSION MATRIX CONTAINING THE NUMBER OF CORRECTLY<br />

CLASSIFIED SMALL BOWEL IMAGES AS EITHER NORMAL OR ABNORMAL<br />

(ROWS: TRUE CLASS; COLUMNS: ASSIGNED CLASS).<br />

Normal Abnormal<br />

Normal 35 (85%) 6 (15%)<br />

Abnormal 5 (15%) 29 (85%)<br />

various patients). The images were converted to grayscale<br />

prior to any processing to examine the feature set in this<br />

domain. Features were extracted for the first five levels of<br />

decomposition. Further decomposition levels will result in<br />

subbands of 8×8 or smaller, which will result in skewed<br />

probability distribution (GCM) estimates and thus were not<br />

included in the analysis. Therefore, the extracted features are<br />

F^j_h and F^j_e for j = {1, 2, 3, 4, 5}. The block diagram of the<br />

proposed system is shown in Figure 2.<br />

In order to find the optimal sub-feature set, an exhaustive<br />

search was performed (i.e. all possible feature combinations<br />

were tested using the proposed classification scheme). The<br />

optimal classification performance was achieved by combining<br />

homogeneity features from the first and third decomposition<br />

levels with entropy from the first decomposition level.<br />

These three feature sets are shown below:<br />

F^1_h = {h^1_HL(0°), h^1_LH(90°), h^1_HH, h^1_LL},  (9)<br />

F^3_h = {h^3_HL(0°), h^3_LH(90°), h^3_HH, h^3_LL},  (10)<br />

F^1_e = {e^1_HL(0°), e^1_LH(90°), e^1_HH, e^1_LL}.  (11)<br />

Using the above features in conjunction with LOOM and<br />

LDA, the classification results for the small bowel images<br />

are shown as a confusion matrix in Table I. A total of 75<br />

abnormal and normal bowel images were classified with<br />

85% specificity and 85% sensitivity. The classification<br />

rates are high even though: (1) the angle of the<br />

camera (or viewing angle) is different from image to image,<br />

(2) the images came from various patients and different<br />

regions within the gastrointestinal tract, (3) the pathologies<br />

were not restricted to a specific type, but in fact included<br />

many diseases and (4) the masses and lesions were of<br />

various sizes and shapes. The success of the system can be<br />

attributed to several factors. Firstly, the utilization of the<br />

DWT was important to gain a space-localized representation<br />

of the images’ nonstationary properties. Secondly, the choice<br />

of wavelet-based statistical texture measures (entropy and<br />

homogeneity) was critical in differentiating between the<br />

localized texture properties of the images, since abnormal<br />



images contain localized heterogeneous texture elements,<br />

whereas normal images are smooth (uniform). Utilization of<br />

the SIDWT allowed for the extraction of consistent (i.e. shiftinvariant)<br />

features. Furthermore, due to the scale-invariant<br />

basis functions of the DWT, pathologies of varying sizes<br />

were captured within one transformation (i.e. the features<br />

were scale-invariant).<br />

The system is relatively robust to the different camera<br />

angles by design. Since the viewing angle is different from<br />

image to image, features were collected at various angles (0 ◦ ,<br />

45 ◦ , 90 ◦ , 135 ◦ ) in the respective subbands in order to account<br />

for the texture properties, regardless of the orientation. The<br />

feature set thus offered a semi-rotational invariant representation<br />

which could account for oriented textural properties at<br />

various angles within the gastrointestinal tract.<br />

Since this is the first work in the area of small bowel<br />

image classification, the results are promising and show<br />

great potential for applications such as CAD and CBIR. This<br />

is especially true since all features were extracted in a fullyautomated<br />

manner without any intervention or assistance<br />

from a gastroenterologist. This means that such a system<br />

could in fact be used as a tool which could either (1) sort<br />

the 8 hours of film and highlight suspicious regions or (2)<br />

automatically retrieve a specific region or mass, without<br />

having to use text annotations.<br />

Although the classification results are high, any misclassification<br />

can be attributed to cases where there is a lack<br />

of statistical differentiation between the texture uniformity<br />

of the abnormal and normal small bowel images. Additionally,<br />

normal tissue can sometimes assume the properties<br />

of abnormal regions; for example, consider a normal small<br />

bowel image which has more than the average amount of<br />

folds. This may be characterized as non-uniform texture and<br />

consequently would be misclassified.<br />

Another important consideration arises from the size of<br />

the database. As was stated in Section II-E, the number of<br />

images used for classification can determine the accuracy<br />

of the estimated classifier parameters. Since only a modest<br />

number of images were used, misclassification could result<br />

due to the lack of proper estimation of the classifier’s parameters<br />

(although the scheme tried to combat this with LOOM).<br />

Additionally, finding the right trade-off between the number of<br />

features and database size is an ongoing research topic and<br />

has yet to be perfectly defined [8].<br />

A last point for discussion is the fact that features were<br />

successfully extracted from the compressed domain. Since<br />

many forms of multi-media are being stored in lossy formats,<br />

it is important that classification systems may also be<br />

successful when utilized in the compressed domain.<br />

Fig. 2. System block diagram for the classification of small bowel images.<br />


IV. CONCLUSIONS<br />

A unified feature extraction and classification scheme was<br />

developed using the DWT for small bowel images and this<br />

is the first reported work in the area. Textural features<br />

were extracted from the wavelet domain in order to obtain<br />

localized numerical descriptors of the relative homogeneity<br />

of the small bowel images. To ensure the DWT representation<br />

was suitable for the consistent extraction of features, a shiftinvariant<br />

discrete wavelet transform (SIDWT) was computed.<br />

To combat the small database size, a small number of<br />

features and LDA classification were used in conjunction<br />

with the LOOM to gain a more accurate approximation of<br />

the classifier’s parameters.<br />

Seventy-five abnormal and normal bowel images were<br />

classified with 85% specificity and 85%<br />

sensitivity. The success of the system can be attributed<br />

to the semi-rotational invariant, scale-invariant and shiftinvariant<br />

features, which permitted the extraction of discriminating<br />

features for multiple camera angles and various sized<br />

pathologies. Due to the success of the proposed work, it may<br />

be used in a CAD scheme or a CBIR application, to assist the<br />

gastroenterologists in diagnosing and sorting 8 hours of footage.<br />

REFERENCES<br />

[1] B. Kim, S. Park, C. Jee, and S. Yoon, “An earthworm-like locomotive<br />

mechanism for capsule endoscopes,” in Proc. International Conference on<br />

Intelligent Robots and Systems, Aug. 2005, pp. 2997–3002.<br />

[2] Given Imaging Ltd., PillCam™ SB Capsule Endoscopy. [Online],<br />

2006, http://www.givenimaging.com/.<br />

[3] M. Unser and A. Aldroubi, “A review of wavelets in biomedical<br />

applications,” Proceedings of the IEEE, vol. 84, no. 4, pp. 626–638,<br />

Apr. 1996.<br />

[4] S. Mallat, A Wavelet Tour of Signal Processing. USA: Academic Press,<br />

1998.<br />

[5] J. Liang and T. Parks, “Image coding using translation invariant wavelet<br />

transforms with symmetric extensions,” IEEE Transactions on Image<br />

Processing, vol. 7, pp. 762–769, May 1998.<br />

[6] A. Khademi, “Multiresolutional analysis for classification and compression<br />

of medical images,” Master’s thesis, Ryerson University, Canada,<br />

2006.<br />

[7] B. Julesz, “Textons, the elements of texture perception, and their<br />

interactions,” Nature, vol. 290, no. 5802, pp. 91–97, Mar. 1981.<br />

[8] K. Fukunaga and R. Hayes, “Effects of sample size in classifier design,”<br />

IEEE Transactions on Pattern Analysis and Machine Intelligence,<br />

vol. 11, no. 8, pp. 873–885, Aug. 1989.<br />



This full text paper was peer reviewed at the direction of IEEE Communications Society subject matter experts for publication in the ICC 2007 proceedings.<br />

Interference Detection in Spread Spectrum<br />

Communication Using Polynomial Phase Transform<br />

Randa Zarifeh and Nandini Alinier<br />

School of Electronics, Communication<br />

and Electrical Engineering<br />

University of Hertfordshire<br />

Hertfordshire, UK<br />

Email: rzarifeh@ee.ryerson.ca, n.d.alinier@herts.ac.uk<br />

Abstract—In this paper we propose an interference detection<br />

technique for detecting time varying jamming signals in spread<br />

spectrum communication systems. The technique is based on<br />

Discrete Polynomial Phase Transform (DPPT), where the jamming<br />

signal is synthesized from the modulated spread spectrum<br />

signal using the DPPT. The technique has shown good performance<br />

under low interference conditions with 2 dB SJR, where the<br />

correlation coefficient between the synthesized chirp signal and<br />

the reference chirp is 0.9. The computational complexity of the<br />

proposed technique is low compared to other techniques such as<br />

Hough-Radon Transform. This interference detection technique<br />

can be applied for different interference excision methods in<br />

military and wireless communication applications.<br />

I. INTRODUCTION<br />

The most commonly used type of spread spectrum signal is<br />

the direct sequence (DS/SS) spread spectrum signal, where a<br />

pseudorandom (PN) sequence is superimposed upon the data<br />

bits to achieve data spreading over a wider bandwidth. This<br />

increase in the bandwidth yields a processing gain, defined<br />

as the ratio of the bandwidth of the transmitted signal to the<br />

bandwidth of the message signal. The spread spectrum signal<br />

is not easily detected since it appears to be noise-like except<br />

at the intended receiver where the PN sequence is known.<br />
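Spreading and despreading can be sketched numerically; the parameters below (8 BPSK data bits, a length-31 random code standing in for a PN m-sequence) are illustrative:<br />

```python
import numpy as np

rng = np.random.default_rng(1)
bits = rng.choice([-1.0, 1.0], size=8)     # BPSK data bits
pn = rng.choice([-1.0, 1.0], size=31)      # stand-in for a PN m-sequence

# spreading: each bit is multiplied by the whole PN chip sequence,
# widening the bandwidth by the processing gain G = 31 (about 14.9 dB)
tx = np.repeat(bits, pn.size) * np.tile(pn, bits.size)

# channel adds noise; the transmitted chips look noise-like to others
rx = tx + 0.5 * rng.normal(size=tx.size)

# despreading at the intended receiver, which knows the PN code:
# correlate each chip block against the code and take the sign
decisions = np.sign(rx.reshape(bits.size, pn.size) @ pn)
```

The correlator output is 31 times the bit amplitude plus a small noise term, which is exactly the processing gain advantage: the wanted signal adds coherently over the chips while noise does not.<br />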

The DS spread spectrum signals are often used for their<br />

interference rejection capabilities in military and wireless communications.<br />

While SS systems can strongly reject narrowband<br />

interference, they fail in rejecting wideband interference. In<br />

practical systems, it is not possible to transmit a high-power<br />

wideband jamming signal due to power limitations. For this<br />

reason, most jamming signals are considered to<br />

be wideband signals with a narrowband instantaneous frequency,<br />

such as chirp signals or linear or nonlinear FM signals. The<br />

performance of the SS system can be further improved by<br />

detecting the interference/jamming and excising it prior to data<br />

despreading and detection.<br />
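For an order-2 (linear FM) jammer, DPPT-based detection can be sketched as follows: the conjugate-delay product of a chirp is a pure tone whose frequency gives the chirp rate, demodulation then exposes the linear term, and the chirp synthesized from the two estimates is correlated with the observation. The code and thresholds are illustrative, not the paper's implementation:<br />

```python
import numpy as np

def dppt2(x, tau):
    """Order-2 discrete polynomial phase transform: for x(n) with phase
    a2*n^2 + a1*n + a0, x(n)*conj(x(n-tau)) is a tone at frequency 2*a2*tau."""
    return x[tau:] * np.conj(x[:-tau])

def estimate_chirp(x, tau, nfft=8192):
    w = 2 * np.pi * np.fft.fftfreq(nfft)          # angular frequency grid
    a2 = w[np.argmax(np.abs(np.fft.fft(dppt2(x, tau), nfft)))] / (2 * tau)
    n = np.arange(x.size)
    demod = x * np.exp(-1j * a2 * n ** 2)         # peel off the quadratic phase
    a1 = w[np.argmax(np.abs(np.fft.fft(demod, nfft)))]
    return a2, a1

def chirp_correlation(x, a2, a1):
    # synthesize the reference chirp and measure the correlation coefficient
    n = np.arange(x.size)
    ref = np.exp(1j * (a2 * n ** 2 + a1 * n))
    return np.abs(np.vdot(ref, x)) / (np.linalg.norm(ref) * np.linalg.norm(x))
```

A correlation coefficient above a chosen threshold (the abstract reports 0.9 at 2 dB SJR) declares a chirp jammer present; only two FFTs are required, compared with a search over all lines in the TF plane for the Hough-Radon transform.<br />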

Several research efforts have addressed the area of<br />

interference (chirp) detection; some of the proposed methods<br />

use adaptive filters [1], evolutionary algorithms [2], and<br />

maximum likelihood estimation. The optimal method is the<br />

maximum likelihood technique which integrates along the<br />

Instantaneous Frequency (IF) ridge in the time frequency<br />

distribution. However, if the initial information on the position of<br />

1-4244-0353-7/07/$25.00 ©2007 IEEE<br />

Sridhar Krishnan and Alagan Anpalagan<br />

Department of Electrical<br />

and Computer Engineering<br />

Ryerson University<br />

Toronto, Canada<br />

Email: (krishnan)(alagan)@ee.ryerson.ca<br />

the IF is not available, the integration will be taken along all<br />

possible lines in the TF domain. The maximum likelihood can<br />

also be applied on the IF ridge which is the result of wavelet<br />

transforms as done in [3]. This method has a high computational<br />

complexity especially when the initial estimation of the<br />

IF is not available. Another technique was proposed by Amin<br />

et al. [12], who evaluated the Wigner-Ville Distribution<br />

(WVD) of the observed signal and estimated the IF parameters<br />

from the WVD. Once the parameters have been estimated,<br />

an adaptive time varying filter can be set up to suppress the<br />

interference. One of the problems related to this method is<br />

that, if the interference is low with respect to the SS signal or<br />

the noise, the estimation of the interference parameter can fail<br />

and the suppression filter can track the useful signal instead<br />

of the interference.<br />

Linear chirp signals have also been detected by applying the Hough-Radon Transform (HRT) to the WVD or the spectrogram of the signal [4][5]. The HRT is an optimal technique for detecting directional lines in an image, but it requires a high degree of computational complexity. Other chirp detection techniques are based on signal synthesis. Early work on signal synthesis from a bilinear Time-Frequency Distribution (TFD) was done by Boudreaux-Bartels and Parks [6], who synthesized the signal from the WVD using a least-squares approximation. Krattenthaler and Hlawatsch extended the work in [6] and synthesized the chirp signal from smoothed versions of the WVD [7]. These techniques are based on least-squares approximation and therefore also have high computational complexity.<br />

The objective of this work is to detect a jamming/interfering chirp signal in a spread spectrum communication system. The interfering signal could come from an intentional jammer (hostile source) or from multipath effects in the channel. In this paper we propose a chirp detection technique based on signal synthesis, in which a parametric signal analysis approach is used to represent the time-domain chirp signal. The proposed DPPT-based solution detects the chirp jammer/interferer even under low jamming power and with low computational complexity, and is hence better than the existing approaches. The proposed technique is a good interference detection tool that can be applied prior to interference<br />


Authorized licensed use limited to: <strong>Ryerson</strong> <strong>University</strong> Library. Downloaded on July 7, 2009 at 11:52 from IEEE Xplore. Restrictions apply.<br />



excision blocks in communication systems.<br />

The paper is organized as follows: Section II describes the signal and system model, the spread spectrum system, and chirp signals. Section III defines the discrete polynomial phase transform technique. Section IV outlines numerical and simulation results, and Section V concludes the paper.<br />

II. SIGNAL AND SYSTEM MODEL<br />

A. Spread Spectrum System<br />

Assuming Binary Phase Shift Keying (BPSK) modulation, the transmitted spread spectrum signal s(t) is the product of the message signal m(t) and the spreading signal p(t),<br />

s(t) = m(t)p(t), \qquad (1)<br />

where<br />

m(t) = \sum_k b_k \,\mathrm{rect}_{T_m}(t - kT_m), \qquad (2)<br />

b_k \in \{+1, -1\} are the message bits, \mathrm{rect}_{T_m} is a rectangular pulse of duration T_m, and<br />

p(t) = \sum_{n=0}^{L-1} c_n \,\mathrm{rect}_{T_p}(t - nT_p), \qquad (3)<br />

where c_n \in \{+1, -1\} is the nth chip of the L-element PN sequence, so that<br />

s(t) = \sum_k b_k \, p(t - kT_m). \qquad (4)<br />

During transmission, additive white Gaussian noise n(t) (with zero mean and variance \sigma^2) and interference i(t) are added to the signal in the channel, and the following signal is received:<br />

r(t) = s(t) + n(t) + i(t). \qquad (5)<br />

This full text paper was peer reviewed at the direction of IEEE Communications Society subject matter experts for publication in the ICC 2007 proceedings.<br />

At the receiver, the received signal r(t) is synchronized and correlated with the same PN sequence (known to the intended receiver), and an estimate \hat{m}_k of the message signal is made based on the polarity of the recovered message bits,<br />

\hat{m}_k = \langle r(t), p(t)\rangle = m_k \langle p(t), p(t)\rangle + \langle n(t), p(t)\rangle + \langle i(t), p(t)\rangle, \qquad (6)<br />

where \langle\cdot,\cdot\rangle is the correlation operator. From (6) it can be seen that correlating the received signal with the PN sequence p(t) recovers the message signal while spreading both the noise and the interference. If the interference-to-signal power ratio is so large that the processing gain cannot suppress the interference, the estimate of the message bit will be wrong. The SS system (shown in Fig. 1) is able to recover the correct data bits at low interference, but when the interference is strong and time varying the SS system fails.<br />
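The spreading and despreading chain of (1)-(6) can be sketched numerically. The paper gives no implementation, so the chip count, bit count, noise level, and the single-tone interferer below are illustrative assumptions of this sketch:

```python
import numpy as np

rng = np.random.default_rng(0)

L = 128                                   # chips per data bit (processing gain)
bits = rng.choice([-1, 1], size=100)      # BPSK message bits b_k
pn = rng.choice([-1, 1], size=L)          # PN chip sequence c_n

# Spread: each bit modulates one full period of the PN sequence, Eqs. (1)-(4)
s = np.repeat(bits, L) * np.tile(pn, bits.size)

# Channel: AWGN plus a weak narrowband (single-tone) interferer, Eq. (5)
noise = 0.5 * rng.standard_normal(s.size)
tone = 2.0 * np.cos(2 * np.pi * 0.05 * np.arange(s.size))
r = s + noise + tone

# Despread: correlate each bit interval with the PN sequence, Eq. (6)
corr = r.reshape(-1, L) @ pn
recovered = np.sign(corr)

print((recovered == bits).mean())         # fraction of correctly recovered bits
```

Because the correlation spreads the tone while coherently summing the 128 signal chips, the weak narrowband interferer leaves the bit decisions intact in this example.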

Fig. 1. Spread Spectrum System Block Diagram.<br />

B. Chirp <strong>Signal</strong><br />

Chirp signals arise in many areas of science and engineering, including natural signals such as animal calls and whistles. Because of their interference rejection ability, they are widely used in spread spectrum communications, military communications, and radar and sonar applications.<br />

Mathematically, chirp signals are modeled as nonstationary signals with polynomial phase parameters. A polynomial phase signal y(n) can be expressed as<br />

y(n) = b_0 \exp\{j\phi(n)\} = b_0 \exp\Big\{ j \sum_{m=0}^{M} a_m (n\Delta)^m \Big\}, \qquad (7)<br />

where \phi(n) is the phase of the signal, M is the polynomial order, N is the total signal length, \Delta is the sampling interval, and b_0 is the signal amplitude.<br />

In this paper we deal with linear and parabolic (nonlinear) chirp signals as interference, whose phases are second- and third-order polynomial functions (M = 2, 3).<br />
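Equation (7) is straightforward to realize directly. A minimal sketch; the coefficient values are illustrative assumptions, chosen so the instantaneous frequency stays below half the sampling rate:

```python
import numpy as np

def poly_phase_signal(a, b0=1.0, N=1024, delta=1.0):
    """Polynomial phase signal of Eq. (7): y(n) = b0 exp(j sum_m a_m (n*delta)^m)."""
    n = np.arange(N)
    phi = sum(am * (n * delta) ** m for m, am in enumerate(a))
    return b0 * np.exp(1j * phi)

# Linear chirp: second-order phase (M = 2); parabolic chirp: third-order (M = 3).
# Coefficients keep the IF inside [0, 0.5) cycles/sample (illustrative values).
linear = poly_phase_signal([0.0, 2 * np.pi * 0.05, 2 * np.pi * 0.2 / 2048])
parabolic = poly_phase_signal([0.0, 2 * np.pi * 0.1, 0.0,
                               2 * np.pi * 0.3 / (3 * 1024 ** 2)])
```

The instantaneous frequency is the derivative of the phase: linear in n for the M = 2 signal, quadratic in n for the M = 3 signal.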

Figures 2 and 3 show the time-frequency representations of the linear and parabolic chirp signals, respectively.<br />

Fig. 2. TF representation of Linear Chirp.<br />

III. DISCRETE POLYNOMIAL PHASE TRANSFORM (DPPT)<br />

The DPPT is a parametric signal analysis approach for estimating the phase parameters of a polynomial phase signal [10] [11] [14]. Normally, the phase parameters of a signal are determined by applying a least-squares approximation to fit a polynomial to the phase curve. This poses difficulty, especially when the phase is not available. The DPPT, on the other hand, is applied directly to the signal and works quite well even in the presence of noise.<br />

Fig. 3. TF representation of Parabolic Chirp.<br />

The principle of the DPPT is as follows: when a DPPT of order M is applied to a signal with a polynomial phase of order M, it produces a spectral peak. The position of this peak, at frequency \omega_0, provides an estimate \hat{a}_M of the leading coefficient. After \hat{a}_M is estimated, the order of the polynomial is reduced from M to M - 1 by multiplying the signal by the conjugate of the estimated phase term. The coefficient \hat{a}_{M-1} is then estimated in the same way by applying a DPPT of order M - 1 to the signal. The procedure is repeated until all of the coefficients are estimated.<br />

The DPPT of order M of a polynomial phase signal y(n) is the Fourier transform of the higher-order operator DP_M[y(n), \tau]:<br />

\mathrm{DPPT}_M[y(n), \omega, \tau] = \sum_{n=(M-1)\tau}^{N-1} \mathrm{DP}_M[y(n), \tau] \exp(-j\omega n\Delta), \qquad (8)<br />

where \tau is a positive number and<br />

\mathrm{DP}_1[y(n), \tau] := y(n), \qquad (9)<br />

\mathrm{DP}_2[y(n), \tau] := y(n)\, y^*(n - \tau), \qquad (10)<br />

\mathrm{DP}_M[y(n), \tau] := \mathrm{DP}_2[\mathrm{DP}_{M-1}[y(n), \tau], \tau]. \qquad (11)<br />

The coefficient a_M is estimated based on the following formula:<br />

\hat{a}_M = \frac{1}{M!\, (\tau\Delta)^{M-1}} \arg\max_\omega \{ |\mathrm{DPPT}_M[y(n), \omega, \tau]| \}, \qquad (12)<br />

where \mathrm{DPPT}_M[y(n), \omega, \tau] is calculated as in Equation (8).<br />

The formulas for the DPPT of order one to three are shown below:<br />

\mathrm{DPPT}_1[y(n), \omega, \tau] = \mathrm{fft}\{ y(n) \}, \qquad (13)<br />

\mathrm{DPPT}_2[y(n), \omega, \tau] = \mathrm{fft}\{ y(n)\, y^*(n - \tau) \}, \qquad (14)<br />

\mathrm{DPPT}_3[y(n), \omega, \tau] = \mathrm{fft}\{ y(n)\, [y^*(n - \tau)]^2\, y(n - 2\tau) \}. \qquad (15)<br />
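The operators (8)-(15) can be sketched with an FFT. When the DPPT order matches the polynomial phase order, the DP operator collapses to a single tone and the spectrum shows one sharp peak; the chirp parameters below are illustrative assumptions:

```python
import numpy as np

def dp(y, order, tau):
    """Higher-order DP operator of Eqs. (9)-(11), applied recursively."""
    out = y.copy()
    for _ in range(order - 1):
        out = out[tau:] * np.conj(out[:-tau])   # DP2 step, Eq. (10)
    return out

def dppt(y, order, tau, nfft=8192):
    """DPPT of Eq. (8): magnitude of the Fourier transform of DP_M."""
    return np.abs(np.fft.fft(dp(y, order, tau), nfft))

# Linear chirp (second-order polynomial phase); tau = N/M as recommended later.
N = 1024
n = np.arange(N)
y = np.exp(1j * 2 * np.pi * (0.05 * n + 0.1 * n ** 2 / N))
tau = N // 2

# DP2 of this chirp is a pure tone at 0.2*tau/N = 0.1 cycles/sample,
# so the matched-order DPPT concentrates into a single peak there.
spec = dppt(y, 2, tau)
f_peak = np.argmax(spec) / 8192
```

Applying a mismatched order (e.g. a plain FFT, order 1) leaves the energy smeared across the chirp's swept band instead of concentrated in one bin.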

After the estimation of a_M, the order of the signal phase is reduced by multiplying the signal y(n) by \exp\{-j\hat{a}_M (n\Delta)^M\}:<br />

y^{(M-1)}(n) = y(n) \exp\{-j\hat{a}_M (n\Delta)^M\}. \qquad (16)<br />

To determine a_{M-1}, the DPPT of order M - 1 is applied to the signal y^{(M-1)}(n) from Equation (16). The process is repeated until all the remaining coefficients are calculated. The coefficients a_0 and b_0 are determined by the following formulas:<br />

\hat{a}_0 = \mathrm{phase}\Big\{ \sum_{n=0}^{N-1} y(n) \exp\Big( -j \sum_{m=1}^{M} \hat{a}_m (n\Delta)^m \Big) \Big\}, \qquad (17)<br />

\hat{b}_0 = \frac{1}{N} \Big| \sum_{n=0}^{N-1} y(n) \exp\Big( -j \sum_{m=1}^{M} \hat{a}_m (n\Delta)^m \Big) \Big|. \qquad (18)<br />

The final synthesized signal is<br />

\hat{y}(n) = \hat{b}_0 \exp\Big\{ j \sum_{m=0}^{M} \hat{a}_m (n\Delta)^m \Big\}. \qquad (19)<br />

Figures 4 and 5 show the result of applying second-order and third-order DPPTs to a nonlinear (parabolic) chirp with a third-order polynomial phase. The spectral peak appears only for the third-order DPPT, corresponding to the third-order polynomial phase.<br />
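Putting (12)-(19) together gives the full iterative estimation loop. The sketch below is one possible rendering of that procedure (τ = N/M, peak picking on a zero-padded FFT); the test chirp and its coefficients are illustrative assumptions, not the authors' code:

```python
import math
import numpy as np

def dp(y, order, tau):
    """Higher-order DP operator of Eqs. (9)-(11)."""
    out = y.copy()
    for _ in range(order - 1):
        out = out[tau:] * np.conj(out[:-tau])
    return out

def dppt_estimate(y, M, delta=1.0, nfft=1 << 16):
    """Estimate all phase coefficients: peak picking per Eq. (12), order
    reduction per Eq. (16), then a0 and b0 per Eqs. (17)-(18)."""
    N = len(y)
    n = np.arange(N)
    tau = N // M                                  # tau = N/M, per [10][11]
    a = np.zeros(M + 1)
    z = y.copy()
    for m in range(M, 0, -1):
        k = np.argmax(np.abs(np.fft.fft(dp(z, m, tau), nfft)))
        omega = 2 * np.pi * k / (nfft * delta)
        a[m] = omega / (math.factorial(m) * (tau * delta) ** (m - 1))   # Eq. (12)
        z = z * np.exp(-1j * a[m] * (n * delta) ** m)                   # Eq. (16)
    a[0] = np.angle(z.sum())                      # Eq. (17)
    b0 = np.abs(z.sum()) / N                      # Eq. (18)
    return a, b0

# Recover an illustrative linear chirp (M = 2) and score the fit with the
# correlation coefficient used in Section IV.
N = 1024
n = np.arange(N)
a_true = np.array([0.7, 2 * np.pi * 0.05, 2 * np.pi * 0.1 / N])
y = np.exp(1j * (a_true[0] + a_true[1] * n + a_true[2] * n ** 2))
a_hat, b0_hat = dppt_estimate(y, M=2)
y_hat = b0_hat * np.exp(1j * np.polyval(a_hat[::-1], n))                # Eq. (19)
corr = np.abs(np.vdot(y_hat, y)) / (np.linalg.norm(y_hat) * np.linalg.norm(y))
```

The estimation error of each coefficient is bounded by the FFT bin spacing divided by the Eq. (12) scale factor, which is why the leading coefficient is recovered very precisely.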

Fig. 4. Second Order DPPT.<br />

For the DPPT algorithm to work, the location \omega_0 of the spectral peak obtained when taking the Fourier transform of the DP operator has to be smaller than half the sampling frequency \omega_s:<br />

|\omega_0| = M!\, (\tau\Delta)^{M-1} |a_M| \le \frac{\omega_s}{2}. \qquad (20)<br />


Fig. 5. Third Order DPPT.<br />

This condition translates into the following requirement on the range of the coefficient a_M:<br />

|a_M| \le \frac{\pi}{M!\, \tau^{M-1} \Delta^M}, \qquad (21)<br />

and when M = 1 we have |a_1| \le \omega_s/2, which is the Nyquist criterion.<br />

The accuracy of the DPPT method depends on many factors, such as the level and type of noise, the length of the chirp signal, and the chosen value of \tau [10] [11]. The best signal estimation is achieved when \tau = N/M, where N is the signal length and M is the order of the polynomial phase of the chirp (M = 2 for a linear chirp, M = 3 for a parabolic chirp).<br />

The SNR of the jamming signal at the spectral peak \omega_0 should be at least 14 dB for good detection. The number of points used in the Fourier transform also affects the accuracy of the estimated peak position. Increasing the polynomial order likewise affects the estimation error. For example, for a third-order polynomial, any error in the coefficient a_3 makes it impossible to remove this term completely from the polynomial during the phase unwrapping step; the estimate of the next coefficient a_2 therefore suffers error as well. Similarly, the error in a_2 affects the precision of a_1 and a_0.<br />

The computational complexity of the DPPT is determined by the number of multiplications needed to synthesize a chirp of length N. The DPPT process involves calculating the ambiguity function and then taking its Fourier transform. The complexity of the ambiguity function calculation is O(N) and the complexity of the fast Fourier transform is O(N \log_2 N), so the total complexity is only O(N \log_2 N).<br />

IV. NUMERICAL AND SIMULATION RESULTS<br />

We used 128 chips per data bit for spreading the message signal and assumed a Gaussian channel, with a constant-amplitude linear or parabolic chirp as the interference source. We first evaluated the bit error rate (BER) in the presence of a linear chirp at jamming ratios between [0, 60] dB, assuming an SNR (with Gaussian noise) of -10 dB in each case. As seen in Figure 6, the bit error rate increases as the jamming ratio increases. The spread spectrum system is able to recover the data bits at a low jamming ratio of 10 dB, but as the ratio increases the system fails to recover the correct data bits.<br />
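The BER-vs-JSR trend can be reproduced in a small Monte-Carlo sketch. It mirrors the parameters stated above (128 chips per bit, -10 dB SNR), while the chirp sweep, bit count, and power-scaling conventions are assumptions of this sketch:

```python
import numpy as np

rng = np.random.default_rng(1)

def ber(jsr_db, snr_db=-10.0, L=128, nbits=2000):
    """Monte-Carlo BER of a DSSS receiver facing a linear chirp jammer.
    The unit-power signal convention and chirp sweep are assumptions."""
    bits = rng.choice([-1, 1], size=nbits)
    pn = rng.choice([-1, 1], size=L)
    s = np.repeat(bits, L) * np.tile(pn, nbits)
    t = np.arange(s.size)
    # Noise and jammer amplitudes derived from the requested SNR and JSR (dB)
    noise = np.sqrt(10 ** (-snr_db / 10)) * rng.standard_normal(s.size)
    chirp = np.sqrt(2 * 10 ** (jsr_db / 10)) * np.cos(
        2 * np.pi * (0.05 * t + 0.2 * t ** 2 / (2 * s.size)))
    r = s + noise + chirp
    recovered = np.sign(r.reshape(-1, L) @ pn)
    return (recovered != bits).mean()
```

At 0 dB JSR the 21 dB processing gain keeps the BER near zero; at 60 dB JSR the despread jammer dominates every decision statistic and the BER approaches 0.5, matching the trend of Figure 6.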

Fig. 6. BER vs. JSR results for a self-excised SS system.<br />

Next we used the proposed DPPT technique to synthesize the jamming chirp in the same spread spectrum system. The chirp used was a linear chirp with normalized instantaneous frequency varying from 0 to 0.5 Hz. We again used 128 chips per bit and 100 data bits, for a total of 12800 chips. Table I shows the correlation coefficient between the reference linear chirp and the chirp synthesized using the second-order DPPT. The first simulation was for a -0.5 dB signal-to-noise ratio (SNR) with the signal-to-jamming ratio (SJR) ranging over [6, -20] dB, and the second simulation for a -5 dB SNR over the same SJR range. The DPPT showed good results, detecting the chirp even at relatively low jamming power (correlation coefficient 0.9879 at -2 dB SJR).<br />

In the next simulation we used a parabolic chirp signal as the interference source. Figure 7 shows the spectrogram of the received signal r(t) with interference and noise added.<br />

Table II shows the correlation coefficient between the reference chirp and the synthesized chirp, with a third-order DPPT applied to the signal. The first simulation was for a -0.5 dB signal-to-noise ratio with the signal-to-jamming ratio in the range [6, -20] dB, and the second simulation for -5 dB<br />


TABLE I<br />
RESULTS WITH LINEAR CHIRP<br />

SJR (dB)   Corr-Coeff (SNR = -0.5 dB)   Corr-Coeff (SNR = -5 dB)<br />
6          0.7916                       0.0469<br />
4          0.9878                       0.9480<br />
2          0.9879                       0.9874<br />
0          0.9878                       0.9879<br />
-2         0.9879                       0.9879<br />
-4         0.9881                       0.9880<br />
-6         0.9880                       0.9880<br />
-8         0.9880                       0.9880<br />
-10        0.9880                       0.9880<br />
-15        0.9880                       0.9880<br />
-20        0.9880                       0.9880<br />

Fig. 7. Received signal with chirp interference and Gaussian noise.<br />

signal-to-noise ratio with the jamming ratio in the same range, [6, -20] dB.<br />

In these simulations the DPPT performed better on the parabolic chirp than on the linear chirp because the frequency variation of the parabolic chirp ([0.6, 0.9] Hz) was smaller than that of the linear chirp ([0, 0.5] Hz). The experimental results show that the proposed technique can successfully detect chirp-like interference in spread spectrum systems.<br />

TABLE II<br />
RESULTS WITH PARABOLIC CHIRP<br />

SJR (dB)   Corr-Coeff (SNR = -0.5 dB)   Corr-Coeff (SNR = -5 dB)<br />
6          0.0943                       0.0482<br />
4          0.1117                       0.0895<br />
2          0.9969                       0.1672<br />
0          0.9981                       0.9670<br />
-2         0.9987                       0.9979<br />
-4         0.9991                       0.9986<br />
-6         0.9994                       0.9993<br />
-8         0.9995                       0.9993<br />
-10        0.9997                       0.9994<br />
-15        0.9997                       0.9994<br />
-20        0.9997                       0.9996<br />

Figure 8 shows the TF representation of the synthesized parabolic chirp (Fig. 3) under a 6 dB signal-to-jamming ratio (low correlation coefficient = 0.0015).<br />


Fig. 8. Detected parabolic chirp with 6dB SJR.<br />

Previous work [5] on chirp detection using the Hough-Radon Transform (HRT) showed good performance but was computationally expensive. In that work the received signal was decomposed into its TF functions using an adaptive signal decomposition algorithm; the TF functions were mapped onto the TF plane and treated as an image, and chirps present in the TF plane were detected using the HRT. The HRT is an optimal technique for detecting lines in an image, but it requires a high degree of computational complexity.<br />

The proposed technique also outperforms previous TF distribution techniques, which provide a distribution of the signal spectrum over a period of time but, unlike the DPPT, do not inherently provide chirp parameters. In addition, TF distribution functions always suffer from a tradeoff between resolution and interference terms, which can result in incorrect synthesis and detection of the interfering signal.<br />

V. CONCLUSION<br />

A new technique was introduced for modulated interference detection in spread spectrum systems. The simulation results show that the new method provides accurate detection and estimation of linear and parabolic chirp interference. Unlike previous techniques, it detects the chirp signals even at low jamming ratios and has low computational complexity. In future work, the method will be extended to include excision of the detected interference.<br />

REFERENCES<br />

[1] Genyuan Wang and Xiang-Gen Xia, &ldquo;An adaptive filtering approach to chirp estimation and ISAR imaging of maneuvering targets,&rdquo; IEEE 2000 International Radar Conference, pp. 481-486, May 2000.<br />

[2] J.S. Dhanoa, E.J. Hughes, and R.F. Ormondroyd, &ldquo;Simultaneous detection and parameter estimation of multiple linear chirps,&rdquo; Proc. IEEE Intl. Conference on Acoustics, Speech, and <strong>Signal</strong> Processing (ICASSP), vol. 6, pp. VI-129-32, Apr 2003.<br />

[3] M. Morvidone and B. Torresani, &ldquo;Time scale approach for chirp detection,&rdquo; International Journal of Wavelets, Multiresolution and Information Processing, vol. 1, no. 1, pp. 1949, 2003.<br />




[4] A. Ramalingam and S. Krishnan, “A novel robust image watermarking<br />

using a chirp based technique,” In Proc. Canadian Conference on<br />

Electrical and Computer Engineering, pp. 1889-1892, May 2004.<br />

[5] S. Erkucuk, S. Krishnan, and M. Zeytinoglu, “Robust audio watermarking<br />

using a chirp based technique,” Proc. International Conference on<br />

Multimedia and Expo, 2003. ICME 03, pp. II - 513-16, July 2003.<br />

[6] G.F. Boudreaux-Bartels and T. Parks, &ldquo;Time-varying filtering and signal estimation using Wigner distribution synthesis techniques,&rdquo; IEEE <strong>Signal</strong> Processing Magazine, vol. 9, no. 2, pp. 21-67, April 1992.<br />

[7] W. Krattenthaler and F. Hlawatsch, &ldquo;Bilinear signal synthesis,&rdquo; IEEE Transactions on Signal Processing, vol. 40, no. 2, pp. 352-363, Feb 1992.<br />

[8] W. Krattenthaler and F. Hlawatsch, &ldquo;General signal synthesis algorithms for smoothed versions of the Wigner distribution,&rdquo; Proc. IEEE International Conference on Acoustics, Speech, and <strong>Signal</strong> Processing (ICASSP), no. 3, pp. 1611-1614, Apr 1990.<br />

[9] A. Francos and M. Porat, &ldquo;<strong>Analysis</strong> and synthesis of multicomponent signals using positive time-frequency distributions,&rdquo; IEEE Transactions on <strong>Signal</strong> Processing, vol. 47, no. 2, pp. 493-504, Feb. 1999.<br />

[10] S. Peleg and B. Friedlander, “Multicomponent signal analysis using<br />

the polynomial-phase transform,” IEEE Transactions on Aerospace and<br />

Electronic Systems, vol. 32, no. 1, pp. 378-387, Jan 1996.<br />

[11] S. Peleg and B. Friedlander, “The discrete polynomial-phase transform,”<br />

IEEE Transactions on <strong>Signal</strong> Processing,vol. 43, no. 8, pp. 1901-1914,<br />

August 1995.<br />

[12] M.G. Amin, “Interference mitigation in spread spectrum communication<br />

systems using time-frequency distributions,” IEEE Trans. <strong>Signal</strong><br />

Processing, vol. 45, no. 1, pp. 90–101, Jan 1997.<br />

[13] J.D. Laster and J.H. Reed, “Interference rejection in digital wireless<br />

communication,” IEEE <strong>Signal</strong> Processing Mag., pp. 37–62, May 1997.<br />

[14] L. Lee, “Time-frequency signal synthesis and its application in multimedia<br />

watermark detection,” Master Thesis, <strong>Ryerson</strong> <strong>University</strong>, 2005.<br />



Emotion Recognition Using Novel Speech <strong>Signal</strong><br />

Features<br />

Talieh Seyed Tabatabaei, Sridhar Krishnan<br />

Department of Electrical and Computer Engineering<br />

<strong>Ryerson</strong> <strong>University</strong><br />

Toronto, Canada<br />

{tseyedta, krishnan}@ee.ryerson.ca<br />

Abstract&mdash;Automatic Emotion Recognition (AER) is a recent research topic in the Human-Computer Interaction (HCI) field that still has much room to grow. In this contribution, a set of novel acoustic features and Least Squares Support Vector Machines (LS-SVMs) are proposed to build a speaker-independent automatic human emotion recognition system. Six discrete emotional states are classified throughout this work: happiness, sadness, anger, surprise, fear, and disgust. Different multi-class SVM methods are implemented in order to obtain the best result. The result achieved by the LS-SVM is then compared with that of a linear classifier. We achieved an overall accuracy of 81.3%.<br />

I. INTRODUCTION<br />

<strong>Research</strong> on emotion has been conducted in psychology and physiology for a long time; more recently it has become a subject of interest to engineers. Its most important application is intelligent human-machine interaction. As computers have become an integral part of our lives, the need for a more natural communication interface between humans and machines has arisen. To accomplish this goal, a computer would have to be able to perceive its present situation and respond differently depending on that perception. To make Human-Computer Interaction (HCI) more natural and friendly, it would be beneficial to give computers the ability to recognize situations the same way a human does.<br />

In today&rsquo;s HCI systems, machines can recognize the speaker and the content of the speech using speaker identification and speech recognition techniques. If machines are also equipped with emotion recognition, they can know &ldquo;how it is said,&rdquo; react more appropriately, and make the interaction more natural. Other potential applications of Automatic Emotion Recognition (AER) include psychiatric diagnosis, intelligent toys, lie detection, learning environments, customer service, educational software, and detection of the emotional state in telephone call center conversations to provide feedback to an operator or supervisor for monitoring purposes.<br />

One of the most important human communication channels is the auditory channel, which carries speech and vocal<br />

1-4244-0921-7/07 $25.00 © 2007 IEEE.<br />


Aziz Guergachi<br />

Department of Information Technology Management<br />

<strong>Ryerson</strong> <strong>University</strong><br />

Toronto, Canada<br />

a2guerga@ryerson.ca<br />

intonation. In fact, people can perceive each other&rsquo;s emotional state by the way they talk. Therefore, in this work we analyze the speech signal in order to build an automatic system that recognizes the human emotional state. Different researchers have settled on different numbers and kinds of emotional states: 3 categories of positive (joy), negative (anger, irritation), and neutral in [7]; 4 categories of neutral, anger, Lombard, and loud in [9]; and 5 categories of neutral, happiness, sadness, anger, and fear in [8]. In this work we automatically categorize six different human emotional states: anger, happiness, fear, surprise, sadness, and disgust.<br />

Some researchers have developed speaker-dependent speech emotion recognition systems [7, 12]. We believe that speaker independence is one of the intrinsic requirements of an AER system: a person-dependent system is more accurate, but it must be retrained for each new person, which is a major drawback. Here we therefore aim for high accuracy with a person-independent system by choosing the right acoustic features and a powerful classifier. While some researchers have utilized both acoustic characteristics and the textual content of an emotional spoken utterance [10, 12], we conduct our work using commonly used and newly proposed acoustic features of the speech signal only.<br />

Various classifiers have been considered for categorizing emotional states. The most common are Hidden Markov Models (HMMs) [13] and Neural Networks (NNs) [15], whereas relatively few works use Support Vector Machines (SVMs) [12]. The SVM is a relatively new approach in machine learning with a number of advantages over conventional and popular classifiers such as NNs. In this contribution we use Least Squares Support Vector Machines (LS-SVMs), a reformulation of the original SVM.<br />

The paper is organized as follows: Section II explains the emotion database used in this research. Section III demonstrates the structure of the AER system proposed in this work and the corresponding steps. In Section IV the theory of the SVM is briefly discussed. In Section V the experimental results are presented, and Section VI is the conclusion.<br />

II. THE EMOTION DATABASE<br />

The database used in this research is the one created in [16]. We believe that the results obtained in emotion recognition experiments depend strongly on the database used, so the lack of a common, good-quality database makes it hard for researchers to compare the performance of their proposed systems.<br />

The audio-visual emotion database presented in [16] is a professional reference database for testing and evaluating video, audio, or joint audio-visual emotion recognition algorithms. The final version of the database contains 42 subjects from 14 different nationalities; 81% of the subjects are men and the remaining 19% are women. First, each subject is asked to listen carefully to a short story for each of the six emotions (happiness, sadness, surprise, disgust, fear, and anger) and to immerse themselves in the situation. Once the subject is ready, he or she may read, memorize, and pronounce the five proposed utterances (one at a time), which constitute five different reactions to the given situation. The subjects are asked to be as expressive as possible, producing a message that contains only the emotion to be elicited. All subjects speak English, though possibly with different accents. All utterances were validated by two experts to ensure they are genuine.<br />

III. SPEECH EMOTION RECOGNITION SYSTEM<br />

The structure of the speech emotion recognition system<br />

used in this paper is depicted in Fig. 1.<br />

A. Preprocessing<br />

In the preprocessing stage, each signal is first de-noised by soft-thresholding its wavelet coefficients. Since the silent parts of the signal carry no useful information, those parts, including the leading and trailing edges, are eliminated by thresholding the energy of the signal. The signals are then divided into frames using a Hamming window of length 23 ms.<br />
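The framing and silence-removal steps can be sketched as follows; the sampling rate and energy threshold are assumptions of this sketch, and the wavelet soft-thresholding step is omitted (it would typically rely on a wavelet library):

```python
import numpy as np

def frame_signal(x, fs, frame_ms=23.0, energy_ratio=0.01):
    """Split a signal into 23 ms frames, drop low-energy (silent) frames,
    and apply a Hamming window; the 1% energy threshold is an assumption."""
    n = int(fs * frame_ms / 1000)
    nframes = len(x) // n
    frames = x[:nframes * n].reshape(nframes, n)
    energy = (frames ** 2).sum(axis=1)
    keep = energy > energy_ratio * energy.max()       # silence removal
    return frames[keep] * np.hamming(n)               # Hamming windowing

# Synthetic utterance: a voiced tone with leading and trailing silence
fs = 16000
t = np.arange(fs) / fs
x = np.concatenate([np.zeros(fs // 4),                 # leading silence
                    0.5 * np.sin(2 * np.pi * 220 * t), # voiced segment
                    np.zeros(fs // 4)])                # trailing silence
frames = frame_signal(x, fs)
```

The fully silent leading and trailing frames are discarded, so only the voiced portion proceeds to feature extraction.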

B. Feature Extraction<br />

We propose a set of novel acoustic features in this experiment. Most researchers use prosodic features and their statistical characteristics to classify emotions [8, 11, 13, 14, 15]. In this contribution we use the set of features listed in Table I. Among them, only the Mel Frequency Cepstrum Coefficients (MFCC) and the Zero Crossing Rate (ZCR) have previously been used for speech emotion recognition [9, 10, 11]; the rest are used for the first time in this application. All features are extracted from each frame, and the mean and standard deviation of each feature are then taken to constitute the feature vector.<br />
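Of the listed features, the ZCR is easy to illustrate. The sketch below computes per-frame ZCR and short-time energy and reduces them to mean and standard deviation as described; the other features in Table I are omitted, and the random frames merely stand in for windowed speech:

```python
import numpy as np

def zcr(frame):
    """Zero crossing rate of one frame: fraction of adjacent sign changes."""
    return np.mean(np.sign(frame[1:]) != np.sign(frame[:-1]))

def feature_vector(frames):
    """Per-frame features (here: ZCR and short-time energy) reduced to their
    mean and standard deviation across frames, as described in the text."""
    feats = np.array([[zcr(f), np.sum(f ** 2)] for f in frames])
    return np.concatenate([feats.mean(axis=0), feats.std(axis=0)])

rng = np.random.default_rng(0)
frames = rng.standard_normal((50, 368))   # stand-in for windowed 23 ms frames
v = feature_vector(frames)                # [mean ZCR, mean E, std ZCR, std E]
```

White-noise frames have a ZCR near 0.5; voiced speech frames would sit much lower, which is what gives the feature its discriminative value.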

C. Feature Selection<br />

Figure 1. The structure of the speech emotion recognition system.<br />

The performance of a pattern recognition system depends<br />

strongly on the discriminative ability of its features. By selecting<br />

the most relevant subset of the original feature set, we can<br />

increase the performance of the classifier and, on the other<br />

hand, decrease the computational complexity. We are using the<br />

forward selection method for each single binary classifier in<br />

our system in order to select the most efficient subset of<br />

features. At each step, the feature that most increases the<br />

performance of the classifier is added to the feature<br />

subset. Fig. 3 illustrates the concept.<br />
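The forward selection loop can be sketched minimally as follows; the `score` function stands in for whatever classifier evaluation is used (e.g. validation accuracy of a binary LS-SVM) and is an assumption:

```python
def forward_selection(score, n_features, k):
    """Greedy forward selection: starting from an empty set, repeatedly
    add the feature whose inclusion most improves the classifier score."""
    selected = []
    for _ in range(k):
        remaining = [f for f in range(n_features) if f not in selected]
        best = max(remaining, key=lambda f: score(selected + [f]))
        selected.append(best)
    return selected

# Toy score: features 2 and 5 are the informative ones.
informative = {2, 5}
score = lambda subset: len(informative & set(subset)) - 0.01 * len(subset)
print(forward_selection(score, n_features=8, k=2))  # [2, 5]
```

The small penalty on subset size in the toy score mimics preferring compact feature sets.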

D. Classification<br />

The recognition of human emotion is essentially a pattern<br />

recognition problem. We are using LS-SVM (described in the next<br />

section) as a classifier in this research. Since we are dealing<br />

with a multi-class classification problem, we need a method to<br />

extend our two-class support vector classification<br />

methodology to a multi-class problem. Several schemes<br />

for multi-category SVM are described in the literature,<br />

among which one-against-all and one-against-one (pairwise)<br />

are the most popular. In this paper we compare the<br />

results achieved by one-against-all, fuzzy one-against-all,<br />

pairwise, and fuzzy pairwise [17].<br />
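The one-against-one (pairwise) scheme can be sketched by majority voting over all class pairs; `binary_decide` is a hypothetical stand-in for a trained binary classifier such as the paper's binary LS-SVMs:

```python
from itertools import combinations

def pairwise_vote(binary_decide, classes, x):
    """One-against-one multi-class decision: a binary classifier for
    every pair of classes casts a vote; the most-voted class wins."""
    votes = {c: 0 for c in classes}
    for a, b in combinations(classes, 2):
        votes[binary_decide(a, b, x)] += 1
    return max(votes, key=votes.get)

# Toy stand-in: the class whose index is closest to x wins each pairing.
decide = lambda a, b, x: a if abs(a - x) < abs(b - x) else b
print(pairwise_vote(decide, classes=[0, 1, 2, 3], x=2.2))  # 2
```

The fuzzy variants replace the hard votes with membership degrees to resolve ties near class boundaries.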

For the purpose of comparison, we also apply a linear<br />

classifier trained with a gradient descent optimization<br />

algorithm.<br />

IV. SUPPORT VECTOR MACHINES<br />

SVM was first introduced by Vapnik and co-workers, and<br />

it is such a powerful method that, in the few years since its<br />

introduction, it has outperformed most other systems in a wide<br />

variety of applications. SVM is used in applications of<br />

regression and classification; however, it is mostly used as a<br />

binary classifier. SVM is based on the principle of structural<br />

risk minimization. The optimal boundary is found in such a<br />

way that maximizes the margin between two classes of data-<br />

TABLE I. LIST OF ACOUSTIC FEATURES USED FOR SPEECH EMOTION<br />

RECOGNITION<br />

Spectral Features<br />

• Shannon entropy<br />

• Renyi entropy<br />

• Spectral bandwidth<br />

• Spectral centroid<br />

• Spectral flux<br />

• Spectral roll-off frequency<br />

Authorized licensed use limited to: <strong>Ryerson</strong> <strong>University</strong> Library. Downloaded on July 7, 2009 at 11:29 from IEEE Xplore. Restrictions apply.<br />



points [1, 2, 3] (Fig. 2). SVM is based on kernel functions,<br />

which are used to map data points to a higher dimensional<br />

feature space in order to be linearly separable. The<br />

optimization problem here is the dual optimization problem<br />

which is solved by the Lagrangian method, making use of the<br />

important Karush-Kuhn-Tucker (KKT) conditions. Equation<br />

(1) shows the optimization problem for SVM classifiers:<br />

\[ \min_{w,\,b} \ \ \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \xi_i \qquad (1) \]<br />

subject to<br />

\[ y_i\left(\langle w, x_i\rangle + b\right) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1, \ldots, n \]<br />

where C is a regularization parameter which sets a trade-off<br />

between the empirical risk (reflected by the second term in<br />

(1)) and model complexity (reflected by the first term in (1)),<br />

and ξi are slack variables which are introduced to relax the<br />

constraints and make the system more noise-tolerant.<br />

The corresponding dual representation is:<br />

\[ W(\alpha) = \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{n} y_i y_j \alpha_i \alpha_j K(x_i, x_j) \qquad (2) \]<br />

subject to the constraints<br />

\[ \sum_{i=1}^{n} \alpha_i y_i = 0, \quad \alpha_i \ge 0, \quad i = 1, \ldots, n \]<br />

where $\alpha_i \ge 0$ are the Lagrange multipliers and $K(x_i, x_j)$ is<br />

the kernel function. Note that we don’t need to know the<br />

underlying mapping function, however it is necessary to<br />

define the kernel function. Among the different kernel<br />

functions, the most common kernels are polynomial,<br />

Gaussian Radial Basis Function (RBF) and multi-layer<br />

perceptron (MLP).<br />

Our final decision rule can be expressed as<br />

\[ f(x, \alpha^{*}, b) = \sum_{i=1}^{N_{sv}} y_i \alpha_i^{*} K(x_i, x) + b \qquad (3) \]<br />

where $N_{sv}$ and $\alpha_i^{*}$ denote the number of support vectors and<br />

the non-zero Lagrange multipliers corresponding to the<br />

support vectors respectively. This result reveals the important<br />

fact that only support vectors contribute to the final boundary.<br />

In fact, this is a way to mitigate the curse of dimensionality, which<br />

is a major concern for most classifiers. The dimension of the<br />

input space can be as high as it needs to be, without having to<br />

worry about having too many free parameters which usually<br />

leads to overfitting.<br />
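The decision rule of Eq. (3) can be sketched with an RBF kernel; the support vectors and multipliers below are illustrative values, not the result of training:

```python
import numpy as np

def rbf_kernel(a, b, gamma=1.0):
    """Gaussian RBF kernel K(a, b) = exp(-gamma * ||a - b||^2)."""
    return np.exp(-gamma * np.sum((a - b) ** 2))

def svm_decision(x, support_vectors, y_sv, alpha_sv, b, kernel=rbf_kernel):
    """Decision rule of Eq. (3): only the support vectors (the points
    with non-zero Lagrange multipliers) contribute to the output."""
    s = sum(y * a * kernel(sv, x)
            for sv, y, a in zip(support_vectors, y_sv, alpha_sv))
    return np.sign(s + b)

svs = [np.array([0.0, 0.0]), np.array([2.0, 2.0])]
print(svm_decision(np.array([0.1, 0.1]), svs,
                   y_sv=[-1, 1], alpha_sv=[0.5, 0.5], b=0.0))  # -1.0
```

Note that the mapping function never appears: only kernel evaluations against the support vectors are needed, which is what keeps the input dimension from inflating the number of free parameters.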

In this paper for training SVM we use the LS-SVM (Least<br />

Squares Support Vector Machine) MATLAB toolbox. LS-<br />

SVMs are reformulations to the original SVMs which lead to<br />

solving linear KKT systems [6]. In LS-SVMs the inequality<br />

constraints in SVM are replaced with equality constraints. As<br />

a result, the solution follows from solving a set of linear<br />

equations instead of the quadratic programming problem<br />

found in Vapnik's original SVM formulation, which<br />

yields a faster algorithm.<br />

The primal problem of the LS-SVMs is defined as:<br />

\[ \min_{w,\,b,\,e} \ J_p(w, b, e) = \frac{1}{2}\|w\|^2 + \gamma\,\frac{1}{2} \sum_{i=1}^{d} e_i^2 \qquad (4) \]<br />

subject to<br />

\[ y_i\left[w^{T}\varphi(x_i) + b\right] = 1 - e_i, \quad i = 1, \ldots, d \]<br />

Figure 2. A linear SVM classifier. Support vectors are those<br />

elements of the training set which are on the boundary hyperplanes of the two<br />

classes.<br />

where γ is a parameter analogous to SVM’s regularization<br />

parameter (C).<br />

The main characteristic of LS-SVMs is the low<br />

computational complexity compared to SVMs, without<br />

quality loss in the solution.<br />
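The LS-SVM training step, i.e. solving a linear KKT system instead of a quadratic program, can be sketched as follows. This is a minimal sketch of the standard LS-SVM dual system; the parameter values and toy data are illustrative assumptions:

```python
import numpy as np

def lssvm_train(X, y, gamma=10.0, kgamma=0.5):
    """Train a binary LS-SVM by solving the linear system
    [[0, y^T], [y, Omega + I/gamma]] [b; alpha] = [0; 1],
    where Omega_ij = y_i * y_j * K(x_i, x_j)."""
    n = len(y)
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    K = np.exp(-kgamma * d2)                      # RBF kernel matrix
    omega = np.outer(y, y) * K
    A = np.zeros((n + 1, n + 1))
    A[0, 1:] = y
    A[1:, 0] = y
    A[1:, 1:] = omega + np.eye(n) / gamma
    sol = np.linalg.solve(A, np.concatenate([[0.0], np.ones(n)]))
    b, alpha = sol[0], sol[1:]
    def predict(x):
        k = np.exp(-kgamma * np.sum((X - x) ** 2, axis=1))
        return np.sign(np.sum(alpha * y * k) + b)
    return predict

X = np.array([[0.0, 0.0], [0.2, 0.1], [2.0, 2.0], [1.9, 2.1]])
y = np.array([-1.0, -1.0, 1.0, 1.0])
predict = lssvm_train(X, y)
print(predict(np.array([0.1, 0.0])), predict(np.array([2.0, 1.9])))  # -1.0 1.0
```

A single call to a dense linear solver replaces the iterative QP solve of the original SVM, which is where the speed advantage comes from.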

V. EXPERIMENTAL RESULTS<br />

Our database consists of 1260 utterances, 60%<br />

of which were used exclusively for the training phase and the<br />

remaining 40% for evaluating the trained classifiers (the<br />

division was done at random). All the binary LS-SVM<br />

classifiers are trained using RBF kernel function with different<br />

regularization and kernel parameters. The linear classifiers are<br />

trained using the gradient descent algorithm and perceptron<br />

Figure 3. The performance of one of the binary LS-SVMs as a<br />

new feature is added at each iteration of the Forward Selection algorithm.<br />



criterion function. The confusion matrix and the final<br />

recognition results are presented in Table II and Table III<br />

respectively. The abbreviations in Table II stand for the six<br />

different emotions: anger, fear, disgust, happiness, sadness,<br />

and surprise, and FS in Table III means Feature Selection.<br />

As shown in Table III, the best performance (81.3%)<br />

belongs to the fuzzy-pairwise LS-SVM using the features selected<br />

by the Forward Selection algorithm. Table II shows that the most<br />

difficult emotion to recognize in our experiment is surprise,<br />

and the easiest are sadness and happiness. Fear and<br />

sadness have the highest probability of being confused with each<br />

other.<br />

VI. CONCLUSION<br />

In this contribution, we introduced a set of new acoustic<br />

features which are used for the first time in the application of<br />

AER. For classification we used the LS-SVM, which is a recent<br />

and powerful classifier with many advantages over other<br />

conventional and popular classifiers such as Neural Networks.<br />

We also implemented different schemes to adapt our binary<br />

classifiers to a multi-category problem. The result of a Linear<br />

Classifier is compared with LS-SVM performance. We<br />

achieved an overall classification accuracy of 81.3% with<br />

fuzzy-pairwise LS-SVM.<br />

TABLE II. CONFUSION MATRIX OF THE LS-SVM CLASSIFIER<br />

(FUZZY PAIRWISE WITH FEATURE SELECTION)<br />

Recognized Emotions (%)<br />

Ang Fea Dis Hap Sad Sur<br />

Ang 83.3 0 2.7 6.4 2.7 4.6<br />

Fea 1.8 71.9 7.4 1.8 13 3.7<br />

Dis 4.6 5.5 79.6 0 3.7 6.4<br />

Hap 1.8 1.8 0 92.4 1.8 1.8<br />

Sad 0 6.1 0.9 0 90.5 2.3<br />

Sur 11.1 9.2 5.5 4.6 13.8 55.5<br />

TABLE III. FINAL RECOGNITION RESULTS<br />

Recognition Rate<br />

One-Vs-All SVM 44.9%<br />

fuzzy One-Vs-All SVM 53.6%<br />

Pairwise SVM 74.5%<br />

fuzzy pairwise SVM 78.4%<br />

fuzzy pairwise SVM, FS 81.3%<br />

fuzzy pairwise LDA 37.7%<br />


REFERENCES<br />

[1] N. Cristianini and J. Shawe-Taylor, An Introduction to Support Vector<br />

Machines and Other Kernel-based Methods. United Kingdom: Cambridge<br />

<strong>University</strong> Press, 2000.<br />

[2] C. J. Burges, “A tutorial on support vector machines for pattern<br />

recognition,” Data Mining and Knowledge Discovery, vol. 2, pp. 121-167,<br />

June 1998.<br />

[3] I. El-Naqa, Y. Yang, M. N. Wernick, N. P. Galatsanos, and R. M.<br />

Nishikawa, “A support vector machine approach for detection of<br />

microcalcifications,” IEEE Trans. Med. Imag., vol. 21, no. 12,<br />

December 2002.<br />

[4] P. H. Chen, C. J. Lin, and B. Schölkopf, “A tutorial on ν-support<br />

vector machines,” unpublished.<br />

[5] B. Schölkopf and A. J. Smola, Learning with Kernels: Support Vector<br />

Machines, Regularization, Optimization, and Beyond. Cambridge, MA:<br />

MIT Press, 2002.<br />

[6] J. A. K. Suykens, T. V. Gestel, J. D. Brabanter, B. D. Moor, and J.<br />

Vandewalle, Least Squares Support Vector Machines. Singapore: World<br />

Scientific Publishing Co. Pte. Ltd., 2002.<br />

[7] S. Hoch, F. Althoff, G. McGlaun, and G. Rigoll, “Bimodal fusion of<br />

emotional data in an automotive environment,” Proceedings of the IEEE<br />

International Conference on Acoustics, Speech, and <strong>Signal</strong> Processing,<br />

vol. 2, pp. 1085-1088, 18-23 March 2005.<br />

[8] C. A. Martinez and A. B. Cruz, “Emotion recognition in non-structured<br />

utterances for human-robot interaction,” IEEE International Workshop<br />

on Robot and Human Interactive Communication, pp. 19-23, 13-15 Aug.<br />

2005.<br />

[9] T. Nguyen and I. Bass, “Investigation of combining SVM and decision<br />

tree for emotion classification,” Seventh IEEE International<br />

Symposium on Multimedia, pp. 540-544, 2005.<br />

[10] Z. J. Chuang and C. H. Wu, “Emotion recognition using acoustic features<br />

and textual content,” IEEE International Conference on Multimedia and<br />

Expo, vol. 1, pp. 53-56, 27-30 June 2004.<br />

[11] Y. L. Lin and G. Wei, “Speech emotion recognition based on HMM and<br />

SVM,” Proceedings of the International Conference on Machine Learning<br />

and Cybernetics, vol. 8, pp. 4898-4901, 18-21 Aug. 2005.<br />

[12] B. Schuller, G. Rigoll, and M. Lang, “Speech emotion recognition<br />

combining acoustic features and linguistic information in a hybrid<br />

support vector machine-belief network architecture,” Proceedings of the<br />

IEEE International Conference on Acoustics, Speech, and <strong>Signal</strong><br />

Processing, vol. 1, pp. I-577-580, 17-21 May 2004.<br />

[13] B. Schuller, G. Rigoll, and M. Lang, “Hidden Markov model-based<br />

speech emotion recognition,” Proceedings of the IEEE International<br />

Conference on Acoustics, Speech, and <strong>Signal</strong> Processing, pp. I-401-404,<br />

6-10 April 2003.<br />

[14] J. Nicholson, K. Takahashi, and R. Nakatsu, “Emotion recognition in<br />

speech using Neural Networks,” Proceedings of the 6th International<br />

Conference on Neural Information Processing, vol. 2, pp. 495-501,<br />

1999.<br />

[15] V. A. Petrushin, “Creating emotion recognition agents for speech<br />

signal,” unpublished.<br />

[16] O. Martin, I. Kotsia, B. Macq, and I. Pitas, “The eNTERFACE’05<br />

audio-visual emotion database,” Proceedings of the 22nd International<br />

Conference on Data Engineering Workshops, 3-7 April 2006.<br />

[17] D. Tsujinishi, Y. Koshiba, and S. Abe, “Why pairwise is better than<br />

one-against-all or all-at-once,” Proceedings of the IEEE International<br />

Conference on Neural Networks, vol. 1, pp. 693-698, July 2004.<br />



A WATERMARKING METHOD FOR SPEECH SIGNALS BASED ON THE TIME–WARPING<br />

SIGNAL PROCESSING CONCEPT<br />

Cornel Ioana (1) , Arnaud Jarrot (2) , André Quinquis (2) , Sridhar Krishnan (3)<br />

(1) LIS Laboratory<br />

BP 46, 961 rue de la Houille Blanche<br />

38402 Saint-Martin d’Hères cedex, FRANCE<br />

phone: +33(0) 476 826 422<br />

email: cornel.ioana@lis.inpg.fr<br />

ABSTRACT<br />

This paper deals with the watermarking of audio speech signals,<br />

which consists of introducing an imperceptible mark into<br />

a signal. To this end, we suggest using an amplitude modulated<br />

signal that mimics a formantic structure present in the<br />

signal. This allows us to exploit the time-masking effect occurring<br />

when two signals are close in the time–frequency plane.<br />

From this embedding scheme, a watermark extraction method<br />

based on nonstationary linear filtering and matched filter detection<br />

is proposed in order to recover the information carried by<br />

the watermark. Numerical results obtained on a real speech<br />

signal show that the watermark is not likely to be audible and the information<br />

carried by the watermark is easily retrievable.<br />

Index Terms— Watermarking, Time–warping signal processing,<br />

Time–frequency analysis.<br />

1. INTRODUCTION<br />

Today’s digital media have opened the door to an information<br />

era where the true value of a product is generally dissociated<br />

from any physical medium. While it enables a high degree<br />

of flexibility in its distribution, the commerce of data without<br />

any physical media raises serious copyright issues. Data can<br />

be easily duplicated turning piracy into a simple data copy<br />

process.<br />

In order to secure the identity of the owner of a medium, one<br />

solution consists of hiding digital subcodes inside the data, since<br />

no physical medium can be used for this purpose. This problem<br />

is generally referred to as watermarking [1]. The main<br />

rules in the watermarking context are:<br />

• The watermark should not be discernible from the<br />

media in order to keep the integrity of the media.<br />

• The watermark should be easily retrievable. Given the<br />

appropriate a priori information, the inserted watermark should be recovered,<br />

as well as the digital subcodes carried by the<br />

watermark.<br />

(2) E 3 I 2 Laboratory (EA 3876) – ENSIETA,<br />

2 Rue François Verny, 29806, Brest,<br />

FRANCE<br />

phone: +33(0) 298 348 720<br />

emails: [jarrotar, quinquis]@ensieta.fr<br />

(3) Department of Electrical Engineering –<br />

<strong>Ryerson</strong> <strong>University</strong><br />

350 Victoria Street, Toronto, CANADA<br />

phone: 416.979.5000 x6086<br />

email: krishnan@ee.ryerson.ca<br />

• The watermark should be robust to attacks (e.g. compression<br />

or noise insertion), since these phenomena<br />

often occur in media transmissions.<br />

In this paper we propose a watermarking procedure that<br />

attempts to exploit the time–frequency region available between<br />

two formants. We suggest using, for the watermark, an<br />

amplitude modulated signal whose carrier frequency is modulated<br />

according to the modulation law of a formant. In this<br />

way, the time-frequency content of the watermark follows the<br />

time-frequency content of the formant. This allows us to put<br />

the watermark signal very close to the formant. As will be<br />

seen, this embedding strategy makes the watermark nearly<br />

imperceptible from an acoustical point of view. The recovery<br />

of the watermark is ensured by nonstationary linear filtering<br />

and a matched filtering method. Numerical results show that<br />

the watermark can be easily recovered, as well as the coded<br />

sequence carried by the watermark.<br />

The paper is organized as follows. Section 2 is devoted<br />

to a short presentation of the time–warping signal processing<br />

concept. Based on this concept, a new watermarking procedure<br />

is proposed in Section 3. Numerical results presented<br />

in Section 4 illustrate the benefits of the proposed technique.<br />

Concluding remarks are given in Section 5.<br />

2. TIME–WARPING SIGNAL PROCESSING<br />

CONCEPT<br />

2.1. Non-unitary Time–Warping Operators<br />

Let $x(t) \in L^2(\mathbb{R})$ be a square-integrable signal. The set of<br />

unitary time-warping operators $\{W : w(t) \in C^1,\ \dot w(t) \ge 0 :<br />

x(t) \to (Wx)(t)\}$ is defined in [2] by<br />

\[ (Wx)(t) = |\dot w(t)|^{1/2}\, x(w(t)), \qquad (1) \]<br />

where $\dot w(t)$ stands for the derivative of the warping function<br />

$w(t)$ with respect to $t$. Properties of this transformation<br />

include linearity and unitary equivalence, since the envelope<br />

$|\dot w(t)|^{1/2}$ preserves the energy of the signal at the output of $W$.<br />
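On sampled signals, the operator of Equ. 1 can be sketched as follows (the discretization by linear interpolation is an implementation assumption, not part of the paper):

```python
import numpy as np

def warp(x, t, w):
    """Unitary time-warping: (Wx)(t) = |w'(t)|^(1/2) * x(w(t)),
    with x(w(t)) obtained by linear interpolation of the samples."""
    dw = np.gradient(w(t), t)                     # numerical w'(t)
    return np.sqrt(np.abs(dw)) * np.interp(w(t), t, x)

fs = 1000
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 50 * t ** 2)               # a quadratic-phase chirp
w = lambda t: np.sqrt(t)                          # warping law t -> sqrt(t)
y = warp(x, t, w)
# x(w(t)) = sin(2*pi*50*t): the warp turns the chirp into a pure tone
print(y.shape)  # (1000,)
```

This "stationarization" of a chirp into a tone is exactly the effect the paper later exploits for time-varying filtering.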

1­4244­0728­1/07/$20.00 ©2007 IEEE II ­ 201<br />

24<br />

ICASSP 2007<br />

Authorized licensed use limited to: <strong>Ryerson</strong> <strong>University</strong> Library. Downloaded on July 7, 2009 at 11:18 from IEEE Xplore. Restrictions apply.


In what follows, we deal with a modified version of the time-warping<br />

operators that no longer fulfills the unitary equivalence<br />

property.<br />

We define the class of non-unitary time-warping operators<br />

by the set $\{\breve{W} : w(t) \in C^1,\ \dot w(t) \ge 0 : x(t) \to (\breve{W}x)(t)\}$,<br />

for which<br />

\[ (\breve{W}x)(t) = \int_{\mathbb{R}} x(t')\,\delta(w(t) - t')\, dt' \qquad (2) \]<br />

Because $\dot w(t) \ge 0$, $w^{-1}(t)$ exists, and we can define the inverse<br />

projector by<br />

\[ (\breve{W}^{-1}x)(t) = \int_{\mathbb{R}} x(t')\,\delta(w^{-1}(t) - t')\, dt' \qquad (3) \]<br />

2.2. Time–warping convolution operator<br />

The stationary convolution operator applied to $x(t), h(t) \in L^2(\mathbb{R})$<br />

is given by<br />

\[ x(t) * h(t) = \int_{\mathbb{R}} x(t')\,h(t - t')\, dt' \qquad (4) \]<br />

From this definition, it is natural to ask whether the convolution<br />

operator has an equivalent expression in the time-warped<br />

space. We define the time-warping convolution operator by<br />

\[ x(t) \overset{w(.)}{*} h(t) = \breve{W}^{-1}\!\left[ (\breve{W}x)(t) * h(t) \right] \qquad (5) \]<br />

where $\overset{w(.)}{*}$ stands for the time-warping convolution operator<br />

along the warping function $w(t)$. Using Equ. 2, Equ. 3 and<br />

Equ. 4, some straightforward algebraic manipulations lead to<br />

\[ x(t) \overset{w(.)}{*} h(t) = \int_{\mathbb{R}} x(t')\,\frac{dw^{-1}(t')}{dt'}\, h\!\left(w^{-1}(t) - w^{-1}(t')\right) dt' \qquad (6) \]<br />

2.3. Time–warping filter<br />

From Equ. 2, one can show that any signal $x(t)$ of the form<br />

$x(t) = \exp(2i\pi f_0 w^{-1}(t))$, $f_0 \in \mathbb{R}$, is transformed via the non-unitary<br />

time-warping operators into<br />

\[ \breve{W}x(t) = \exp\!\left(2i\pi f_0\, w^{-1}(w(t))\right) \qquad (7) \]<br />

\[ \phantom{\breve{W}x(t)} = \exp(2i\pi f_0 t) \qquad (8) \]<br />

which is a pure harmonic signal with frequency $f_0$. One can<br />

exploit this stationarization effect to design efficient time-varying<br />

filters. Let $h^H_{f_c}(t)$ be the impulse response of a linear<br />

time-invariant highpass filter, and $h^L_{f_c}(t)$ be the impulse<br />

response of a linear time-invariant lowpass filter. Both filters<br />

are designed to have a cutoff frequency equal to $f_c$. Using<br />

the time-warping convolution operator defined in Equ. 6, we<br />

define $x^H(t)$ and $x^L(t)$ by<br />

\[ x^H(t) = x(t) \overset{w(.)}{*} h^H_{f_c}(t), \qquad (9) \]<br />

\[ x^L(t) = x(t) \overset{w(.)}{*} h^L_{f_c}(t). \qquad (10) \]<br />


Then, Equ. 9 and Equ. 10 define a non-stationary filtering<br />

procedure for which<br />

\[ e(t) = f_c\,\dot{w}^{-1}(t) \qquad (11) \]<br />

is the time-varying cutoff frequency of the time-varying filter.<br />

3. TIME–WARPING–BASED<br />

AUDIO–WATERMARKING<br />

3.1. Watermark embedding<br />

Fig. 1. Watermark embedding procedure.<br />

The proposed watermark embedding scheme is depicted<br />

in Fig. 1. Roughly speaking, the embedding of the watermark<br />

is carried out in two steps. First, the watermark is<br />

matched to the specificity of the audio signal by means of an<br />

adapted warping operator. Then, the watermark is added to<br />

the original signal.<br />

The human ear is sensitive to frequency-spread signals,<br />

which are perceived as a shuffling noise [3]. For this reason we<br />

suggest using a watermark $m(t)$ that belongs to the class of<br />

frequency-coherent signals expressed by<br />

\[ m(t) = a(t)\, e^{j2\pi f_0 t}, \quad f_0 \in \mathbb{R}^{+} \qquad (12) \]<br />

where a(t) is assumed to be a positive slowly time–varying<br />

signal. This class of signals is concentrated around the carrier<br />

frequency f0.<br />

In the proposed method, the rule of insertion of the watermark<br />

is based on the fact that two close signals with similar<br />

instantaneous frequency laws sound very similar from an auditory<br />

point of view [3]. Therefore one can exploit this time-<br />

masking effect by choosing an area, in the time–frequency<br />

plane, where the watermark is designed to mimic some frequency-<br />

concentrated component which is present in the signal.<br />

In what follows, we refer to such a component as the “masking<br />

component”. In the case of speech signals, a natural<br />

choice for the masking component is to select a formant<br />

that has a long enough time–duration.<br />

Let $f(t)$ be the model of a formant described by<br />

\[ f(t) = a_f(t)\, e^{j2\pi\varphi_f(t)}, \quad t \in [t_i, t_f], \; t_i < t_f \qquad (13) \]<br />


In order to exploit the masking effect provided by the masking<br />

component f(t), the time–warped watermark ˘ Wm(t) should<br />

be as close as possible to the formant in the time-frequency<br />

plane. Therefore, we define the time-warped watermark by<br />

\[ \breve{W}m(t) = a(w(t))\, e^{j2\pi(\varphi_f(t) + \varepsilon t)}, \quad t \in [t_i, t_f], \qquad (14) \]<br />

where ε ∈ R is the frequency shift of the watermark. The<br />

choice of ε depends on a trade–off between the separability<br />

of the watermark and the performance of the masking effect. If<br />

ε is too large, the masking effect decreases. If ε is too small,<br />

the watermark cannot be retrieved because of the proximity<br />

of the formant.<br />

Beyond the stealthiness of the watermark, another goal<br />

of the watermarking concept is the coding of some specific<br />

information in the signal. To this end, we suggest using<br />

the amplitude of the watermark for information coding.<br />

Let the atom $g(t) \ge 0$, $t \in \left[-\frac{T}{2}, \frac{T}{2}\right]$, be a positive compactly<br />

supported function for which $T$ is small compared to<br />

the time-duration of the masking component $t_f - t_i$. Based<br />

on this definition, we suggest constructing the amplitude of<br />

the watermark $a(t)$ as a superposition of time-delayed versions<br />

of the atom $g(t)$.<br />

The choice of the $g(t)$ function can be guided by physiological<br />

aspects of the human ear. It is generally accepted<br />

that the ear is very sensitive to fast variations of signals, since<br />

they produce a large spread in the frequency domain [3]. For<br />

this reason, we force the atom $g(t)$ to be as smooth as possible,<br />

which can be translated into mathematical notation by<br />

requiring the atom $g(t)$ to be of class $C^{\infty}$, the class of<br />

infinitely differentiable functions. In the remainder of this paper we<br />

define $g(t)$ as a scaled version of the mother atom $g_m(t)$<br />

\[ g_m(t) = \begin{cases} \exp\!\left(\dfrac{-(t/a)^2}{1-(t/a)^2}\right), & t \in [-1, 1], \\ 0, & t \notin [-1, 1], \end{cases} \qquad (15) \]<br />
where $a \in \mathbb{R}^{+}$ is the scaling factor. From empirical evidence,<br />

we saw that, for detection reasons, the atoms $g(t)$ have<br />

to be separated from each other by at least $5\sigma_g$, where $\sigma_g^2$ is the<br />

variance of $g(t)$.<br />

Let $\tau$ be the digital information that has to be watermarked<br />

in the audio signal, expressed in binary as $(\tau)_2 = \tau_0\tau_1\ldots\tau_N$,<br />

where $\tau_i$ are the bits of $\tau$. Then, the amplitude<br />

$a(t)$ of the watermark is encoded as follows:<br />

\[ a(t) = \sum_{i=0}^{N} \tau_i\, g(t - 5i\sigma_g), \qquad (16) \]<br />

which is known as an amplitude modulation coding scheme.<br />
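The amplitude coding of Equ. 15 and Equ. 16 can be sketched as follows (the atom width and spacing values are illustrative assumptions):

```python
import numpy as np

def bump(t):
    """Smooth, compactly supported mother atom g_m(t) of Eq. (15)."""
    out = np.zeros_like(t)
    inside = np.abs(t) < 1
    out[inside] = np.exp(-t[inside] ** 2 / (1 - t[inside] ** 2))
    return out

def encode_amplitude(bits, t, sigma_g=0.05):
    """Amplitude coding of Eq. (16): superpose time-delayed atoms,
    one per '1' bit, spaced 5*sigma_g apart."""
    return sum(b * bump((t - 5 * i * sigma_g) / (2 * sigma_g))
               for i, b in enumerate(bits))

t = np.linspace(-0.2, 2.0, 2000)
a_t = encode_amplitude([0, 1, 0, 0, 1, 1], t)   # the paper's test data 010011
print(a_t.shape)  # (2000,)
```

The bump atoms keep `a(t)` of class $C^{\infty}$, which is what limits the frequency spread of the watermark.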

3.2. Watermark recovery<br />

Once a signal has been watermarked, the next step is to deal with<br />

the recovery of the watermark sequence. However, because<br />

of different aspects related to the transmission of the signal<br />

(compression, quantization, noise, ...) this recovery is generally<br />

performed on a modified version $\tilde{x}_m(t)$ of $x_m(t)$. In the<br />

proposed method, the watermark is said to be recovered if the<br />

digital information $\tau$ has been estimated from $\tilde{x}_m(t)$ without<br />

error. The recovery procedure is depicted in Fig. 2, where<br />

the symbol $(\hat{.})$ denotes an estimation of the quantity $(.)$. As<br />

can be seen, the watermark recovery is carried out in three steps.<br />

Fig. 2. Watermark extraction procedure: time-warped highpass and lowpass<br />

filtering (with warping functions $w_1(t) = [\varphi_f(t) + (\varepsilon - \Delta)t]^{-1}$ and<br />

$w_2(t) = [\varphi_f(t) + (\varepsilon + \Delta)t]^{-1}$), unwarping, and matched-filter<br />

estimation of the bits $\tau_i$.<br />

The first step corresponds to the extraction of the time-warped<br />

watermark $\breve{W}m(t)$ by means of time-warped filters (blocks<br />

➊ and ➋). Two time-varying filters are necessary to extract<br />

the watermark: one highpass (block ➊), and one lowpass<br />

(block ➋). This filtering stage defines a time-varying pass-band<br />

filter expressed by<br />

\[ \begin{cases} \dot{\varphi}_f(t) + \varepsilon + \Delta, & \text{the upper cutoff frequency,} \\ \dot{\varphi}_f(t) + \varepsilon - \Delta, & \text{the lower cutoff frequency.} \end{cases} \qquad (17) \]<br />

It is well known that the frequency spread of a time-varying<br />

signal around its instantaneous frequency law depends<br />

on the regularity of its amplitude. Because the amplitude of<br />

the watermark is of class $C^{\infty}$, the frequency decay is faster<br />

than any power of $f$. Therefore, only a small $\Delta$ value is necessary<br />

to extract the time-warped watermark.<br />

The second step corresponds to the unwarping of the estimated<br />

time-warped sequence (block ➌) in order to recover an estimation<br />

of the original sequence $m(t)$.<br />

The last step corresponds to the estimation of the bits $\tau_i$ with<br />

matched filtering (block ➍). The estimation is performed as<br />

follows:<br />

\[ \hat{\tau}_i = \int_{\mathbb{R}} \hat{m}(t)\, g(t - 5i\sigma_g)\, dt \;\underset{0}{\overset{1}{\gtrless}}\; \frac{\|g\|^2}{2}, \quad i = 1 \ldots N, \qquad (18) \]<br />

where $\|g\|$ is the norm of $g(t)$.<br />
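The matched-filter decision of Equ. 18 can be sketched as follows; a Gaussian atom stands in for the compactly supported bump, and the noise level is an illustrative assumption:

```python
import numpy as np

def atom(t, sigma=0.05):
    """Gaussian stand-in for the smooth atom g(t)."""
    return np.exp(-t ** 2 / (2 * sigma ** 2))

def estimate_bits(m_hat, t, n_bits, sigma=0.05):
    """Matched-filter bit estimation of Eq. (18): correlate the
    recovered amplitude with each delayed atom and threshold the
    result at half the atom energy, ||g||^2 / 2."""
    dt = t[1] - t[0]
    g_energy = np.sum(atom(t) ** 2) * dt
    return [int(np.sum(m_hat * atom(t - 5 * i * sigma)) * dt > g_energy / 2)
            for i in range(n_bits)]

t = np.linspace(-0.5, 2.0, 5000)
bits = [0, 1, 0, 0, 1, 1]                        # the paper's test data 010011
m_hat = sum(b * atom(t - 5 * i * 0.05) for i, b in enumerate(bits))
m_hat += 0.05 * np.random.default_rng(1).standard_normal(len(t))  # channel noise
print(estimate_bits(m_hat, t, 6))  # [0, 1, 0, 0, 1, 1]
```

The $5\sigma_g$ spacing keeps adjacent atoms essentially non-overlapping, so each correlation isolates one bit.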

4. NUMERICAL RESULTS<br />

The test signal is a male utterance of the word “bingo” sampled<br />

at 8 kHz. The Log–spectrogram of the test signal is depicted<br />

in Fig. 3. The selected masking component is the<br />



formant referenced by the black arrow.<br />

Fig. 3. Log–spectrogram of the test signal. Male utterance of<br />

the word “bingo” with a sampling rate of 8 kHz.<br />

The watermark is embedded as described in Sec. 3.1. First, the data $(\tau)_2 = 010011$<br />

is used to generate the amplitude of the watermark by means<br />

of Equ. 16. Then the insertion zone is manually chosen in<br />

order to define the warping operator used to generate<br />

the time–warped watermark. Finally, the time–warped watermark<br />

is added to the original signal. Result of the watermark<br />

embedding is shown in Fig. 4. As can be seen, the time-warped<br />

watermark is very close to the original formant. As<br />

expected, the frequency spread decreases very fast, thanks to<br />

the smoothness of the amplitude of the watermark sequence.<br />

Fig. 4. Zoomed log–spectrogram of the original and watermarked<br />

test signal.<br />

We found the embedding strategy satisfactory, since we<br />

were not able to guess whether the signal was watermarked or<br />

not during blind tests. In order to provide a more objective<br />

comparison criterion, we make use of the “Auditory Toolbox”<br />

[4] to generate auditory representations of the original and<br />

watermarked signals, a pseudo-time-frequency representation<br />

based on physiological aspects of the human ear.<br />

Auditory representations of the original and watermarked signals<br />

are depicted in Fig. 5. Both representations are very similar,<br />

which confirms the stealthiness of the watermark.<br />

The next step consists of the recovery of the watermark sequence.<br />

We tested the proposed approach on the true watermarked<br />

signal, and on two different deteriorated versions of<br />

the watermarked signal: the first is an MP3 compression attack,<br />

and the second is an additive Gaussian noise attack with<br />

a signal-to-noise ratio of 0 dB.<br />

Results of the matched filtering estimation are presented<br />

in Tab. 1. Results of the estimation step show that the watermark<br />

is perfectly extracted and has resisted the MP3 attack<br />

as well as the white-noise attack.<br />


Fig. 5. Auditory representation of the original signal and the<br />

watermarked signal.<br />

τ τ1 τ2 τ3 τ4 τ5 τ6<br />

True 0 1 0 0 1 1<br />

No attack 0 1 0 0 1 1<br />

MP3 attack 0 1 0 0 1 1<br />

Noise attack 0 1 0 0 1 1<br />

Table 1. Results of the estimation of the set {τi} by matched<br />

filtering.<br />

5. CONCLUSION<br />

In this paper we have proposed a new watermarking method<br />

for speech signals, based on the time–warping signal processing<br />

concept. We have shown that it is possible to exploit physiological<br />

aspects of the human ear in order to carry information<br />

while keeping the stealthiness of the inserted watermark.<br />

We have then developed a complete extraction method based<br />

on time-varying filters, time-warping operators and matched filtering<br />

to recover the watermark sequence. Numerical results<br />

show that the watermark is not likely to be audible and the numerical<br />

information carried by the watermark is retrievable. Future<br />

work will include a closer study of the robustness of the method<br />

against various attacks. For real applications, another topic<br />

is the unsupervised embedding of the watermark according to<br />

the position of the formants. This issue is left for future work.<br />

6. REFERENCES<br />

[1] M. Arnold, “Audio watermarking: Features, applications<br />

and algorithms,” in IEEE International Conference on<br />

Multimedia and Expo, New York, USA, July 2000.<br />

[2] R. Baraniuk, “Unitary equivalence: A new twist on signal<br />

processing,” IEEE Trans. on <strong>Signal</strong> Processing, vol. 43,<br />

no. 10, pp. 2269–2282, Oct. 1995.<br />

[3] M.C. Botte, G. Canevet, L. Demany, and C. Sorin, Psychoacoustique<br />

et perception auditive, Inserm, 1989.<br />

[4] M. Slaney, “Auditory toolbox, version 2.0,” Available at http://www.slaney.org/malcolm/pubs.html, 1994.<br />

Authorized licensed use limited to: <strong>Ryerson</strong> <strong>University</strong> Library. Downloaded on July 7, 2009 at 11:18 from IEEE Xplore. Restrictions apply.


Chirp-based image watermarking as error-control coding<br />

Behnaz Ghoraani and Sridhar Krishnan<br />

Dept. of Elec. & Comp. Eng., <strong>Ryerson</strong> <strong>University</strong>, Toronto, Canada<br />

E-mail: bghoraan@rnet.ryerson.ca, krishnan@ee.ryerson.ca<br />

Abstract<br />

In this paper, we use post-processing methods to compensate for the bit errors that occur in watermark embedding and extraction. Forward error correction (FEC)-based and chirp-based techniques are applied to encode and shape the embedded watermark message so that, even in the presence of bit errors in the extracted watermark, the watermarking algorithm is able to estimate the correct embedded watermark message. Repetition and Bose-Chaudhuri-Hocquenghem (BCH) codes are used as two well-known FEC schemes, and the discrete polynomial-phase transform (DPPT) and Hough-Radon transform (HRT) are utilized as two chirp detectors in chirp-based watermarking. The robustness of all the proposed post-processing methods is tested against the Checkmark benchmark attacks, and we found that chirp-based watermarking using the DPPT chirp detector offers the highest watermark extraction rate and the best bit error compensation, even at bit error rates (BERs) higher than 17%.<br />

1. Introduction<br />

The worldwide trend of using the Internet to distribute multimedia electronically offers many advantages, such as huge cost reductions and considerable time savings in transportation, to both owners and consumers. However, the available methods for distributing multimedia lack privacy and proof of ownership. One of the suggested solutions for protecting copyrights and preventing illegal use of multimedia is watermarking. Watermarking embeds a hidden signature into the multimedia signal that carries information for content authentication, access control and copy protection, and identification and traitor tracing in the case of illegal distribution. The embedded watermark signal should be imperceptible and should not affect the quality of the multimedia content. Also, since users normally apply many signal manipulations, such as lossy compression, to a multimedia signal, the watermark should be<br />

robust to these typical signal operations.<br />
<br />
Proceedings of the 2006 International Conference on Intelligent Information Hiding and Multimedia Signal Processing (IIH-MSP'06), 0-7695-2745-0/06 $20.00 © 2006<br />
<br />
However, even using the most robust embedding techniques, there will be some<br />

bit errors in the received watermark message. Therefore, the watermark detection process encounters difficulties in extracting the exact watermark message and retrieving the hidden information. Shaping the embedded signature such that the extractor can compensate for the bit error rate (BER) of the message, using prior knowledge about the watermark structure, can be useful in estimating the exact watermark message.<br />

In this study, we focus on utilizing chirps as watermark message structures [1][7] and on comparing the results with forward error correction (FEC) schemes. In our experiments, we use the spread-spectrum method to embed the watermark messages into the discrete cosine transform (DCT) coefficients of the image. After the watermark message is extracted from the watermarked image, there is a post-processing stage. We utilize BCH and repetition codes for FEC-based post-processing, and the discrete polynomial-phase transform (DPPT) and Hough-Radon transform (HRT) detectors in chirp-based watermarking. In this paper, we present the results of the chirp-based and FEC-based post-processing and show that chirp-based watermarking is comparable with the FEC schemes.<br />

2. Watermarking method<br />

In this study we use the spread-spectrum watermarking scheme, a correlation method that embeds a pseudo-random sequence and detects the watermark by calculating the correlation between the pseudo-random noise sequence and the watermarked signal. The spread-spectrum scheme is the most popular scheme and has been well studied in the literature [2]. The spread-spectrum method can be applied in the time domain or in a transformed (frequency) domain. We utilized the DCT coefficients, which are widely used in compression applications and make it easier to impose human visual system (HVS) constraints. Figure 1 shows the block diagram of the watermark embedding and extraction schemes [7].<br />
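The correlation embedding and detection just described can be sketched in a few lines (a minimal sketch of the spread-spectrum principle only: the array `coeffs` stands in for the image's mid-band DCT coefficients, and the gain `alpha`, chip counts and seed are illustrative assumptions, not the paper's settings):<br />

```python
import numpy as np

def ss_embed(coeffs, bits, alpha=2.0, seed=7):
    """Spread-spectrum embedding: add a bit-modulated pseudo-random
    (PN) sequence to the host transform coefficients."""
    rng = np.random.default_rng(seed)
    chips = len(coeffs) // len(bits)           # PN chips per message bit
    pn = rng.choice([-1.0, 1.0], size=chips * len(bits))
    symbols = np.repeat([1.0 if b else -1.0 for b in bits], chips)
    marked = coeffs.copy()
    marked[:chips * len(bits)] += alpha * symbols * pn
    return marked

def ss_extract(marked, n_bits, seed=7):
    """Correlate each chip segment with the same PN sequence; the sign
    of the correlation recovers the embedded bit."""
    rng = np.random.default_rng(seed)
    chips = len(marked) // n_bits
    pn = rng.choice([-1.0, 1.0], size=chips * n_bits)
    corr = (marked[:chips * n_bits] * pn).reshape(n_bits, chips).sum(axis=1)
    return [int(c > 0) for c in corr]
```

Because the host coefficients are uncorrelated with the PN sequence, each per-bit correlation is dominated by the term alpha times the chip count, so the bit sign is recovered reliably when enough chips are spread per bit.<br />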

As mentioned earlier, because of intentional and unintentional signal processing, there will be some bit errors in the received message.<br />
<br />
Figure 1. Detection and extraction block diagram of the watermarking method.<br />
<br />
In the next section we try to compensate for these bit errors by concentrating on the structure of the watermark message.<br />

3. Post processing in watermarking<br />

In the case of severe signal manipulations, the extracted watermark has some bit errors. One post-processing method that can be useful in correcting these bit errors is encoding the watermark signature with an FEC scheme before embedding it in the multimedia signal.<br />

3.1 Forward Error Correction schemes<br />

FEC schemes, or channel codes, protect digital communication by inserting redundant bits into the data. These additional bits help detect and correct errors that occur in the data. Due to the similarity between watermarking and communication systems, FEC methods have been commonly used to increase the bit error compensation capacity of watermarking techniques. BCH, turbo and repetition codes are the most commonly used FEC schemes in watermarking applications [4]. In this study, we utilized BCH and repetition codes as two well-known FEC schemes.<br />

3.1.1 Bose-Chaudhuri-Hocquenghem (BCH) coding<br />

BCH is a block coding scheme. A binary BCH code (n, k) segments the data into blocks of k bits and transforms each k-bit data block into an n-bit block. The (n − k) additional bits are called redundant bits, and the code rate is k/n. Since our target in this study is comparing different types of post-processing methods, all the watermark messages used in each method have almost the same number of bits and the same redundancy rates. The values of n and k for the BCH code that give a watermark message length closest to 180 bits with the highest redundancy rate are 63 and 7, respectively. BCH (63, 7) encodes a 21-bit watermark signature into a 189-bit embedded message with a 10.7/12 redundancy rate.<br />
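The arithmetic behind these numbers can be checked directly (a worked check of the stated rates only, not an implementation of BCH encoding):<br />

```python
n, k = 63, 7                  # BCH(63, 7): 7-bit blocks -> 63-bit blocks
signature_bits = 21
blocks = signature_bits // k  # 21-bit signature = 3 blocks of 7 bits
embedded_bits = blocks * n    # 3 * 63 = 189-bit embedded message
redundancy = (n - k) / n      # 56/63 ~ 0.889, i.e. about 10.7/12
print(embedded_bits, round(redundancy * 12, 1))
```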

3.1.2 Repetition coding<br />

Repetition coding is a very simple and well-known coding technique. Repetition coding with repetition number n repeats each bit n times, resulting in a redundancy rate of n/(n+1). We choose n = 11 to encode a 15-bit watermark into a 180-bit embedded watermark with a redundancy rate of 11/12.<br />
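A repetition code with majority-vote decoding can be sketched in a few lines (here n counts the total copies of each embedded bit; whether the paper counts the original bit among the 11 repetitions is not spelled out, so the numbers below are illustrative):<br />

```python
def rep_encode(bits, n=12):
    """Repeat each bit so that n copies are embedded."""
    return [b for b in bits for _ in range(n)]

def rep_decode(coded, n=12):
    """Majority vote per n-chip group; corrects up to floor((n-1)/2)
    flipped chips in each group."""
    return [int(sum(coded[i:i + n]) * 2 > n) for i in range(0, len(coded), n)]

bits = [1, 0, 1, 1, 0]
coded = rep_encode(bits)           # 5 * 12 = 60 embedded bits
for i in (0, 1, 13, 25, 26):       # flip a few chips (below the vote limit)
    coded[i] ^= 1
recovered = rep_decode(coded)
```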

Prior knowledge about the structure of the watermark message can also be useful in increasing the compensation of BERs in the extracted watermark. The structure that has been proposed is embedding chirp signals as the watermark message, in chirp-based watermarking.<br />

3.2. Chirp-based watermarking<br />

In chirp-based watermarking, the idea is to embed chirps in the multimedia signal as watermark signatures. Before the watermark signal is embedded into the image, the watermark message is encoded into the embedded chirp according to a predefined codebook. Chirps are signals with time-varying frequency and are best detected in the time-frequency (TF) plane; different chirp rates represent different watermark messages. Because the extracted watermark message should be in the form of a chirp, applying a post-processing step such as a chirp detector to the extracted watermark allows the embedded chirp to be estimated successfully even in the presence of some bit errors. The HRT and DPPT are the two chirp detection tools used in our experiments.<br />

3.2.1 Hough-Radon transform (HRT)<br />

The HRT is a parametric tool to detect the pixels that belong to a parametric constraint, either a line or a curve, in a gray-level image [8]. The HRT divides the Hough-Radon parameter space into cells and then calculates the accumulator value for each cell in the parameter space. The cell with the highest accumulator value gives the parameters of the HRT constraint. Since, in the application of post-processing for chirp-based watermarking, we are looking for the embedded chirp as a straight line in the TF plane, we can apply the HRT to detect the embedded chirp. First, the extracted watermark bits are transformed to the TF plane; then the HRT detects the line representing the chirp in the TFD. To achieve good detection performance, the Wigner-Ville transform (WV) is used as the TFD representation of the signal. In this study, the HRT space has 182 × 182 cells, which supports a 15-bit watermark message. HRT-based post-processing has a redundancy rate of 11/12.<br />
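The accumulator-voting principle can be sketched as a minimal straight-line Hough transform over a binary image (a toy illustration of the idea only, not the paper's 182 × 182 HRT over a Wigner-Ville distribution; the grid sizes are our own choices):<br />

```python
import numpy as np

def hough_line(img, n_theta=180):
    """Minimal Hough accumulator: each 'on' pixel votes, for every
    angle theta on a grid, for the cell (rho, theta) with
    rho = x*cos(theta) + y*sin(theta); the most-voted cell gives the
    dominant line's parameters."""
    ys, xs = np.nonzero(img)
    thetas = np.linspace(0, np.pi, n_theta, endpoint=False)
    diag = int(np.hypot(*img.shape)) + 1
    acc = np.zeros((2 * diag, n_theta), dtype=int)
    for x, y in zip(xs, ys):
        rhos = np.round(x * np.cos(thetas) + y * np.sin(thetas)).astype(int)
        acc[rhos + diag, np.arange(n_theta)] += 1   # one vote per angle
    r, t = np.unravel_index(acc.argmax(), acc.shape)
    return int(r - diag), thetas[t]
```

As the next paragraph notes, the finer this accumulator grid, the more easily noise shifts the peak into a neighbouring cell, which is the weakness observed for the HRT-based detector.<br />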



3.2.2 Discrete Polynomial Phase Transform (DPPT)<br />

The DPPT is a parametric signal analysis tool for estimating the phase parameters of constant-amplitude polynomial-phase signals, even in the presence of some bit errors in the signal [5]. The embedded watermark in chirp-based watermarking is of the form:<br />
<br />
chirp(n) = exp[j(a0 + a1·n + a2·n²)]   (1)<br />

The DPPT gives estimates of a0, a1 and a2, which enables us to synthesize the original chirp. The DPPT algorithm defines ambiguity functions; applying the second-order ambiguity function to a constant-amplitude chirp transforms the broadband signal into a single-tone signal with a frequency related to a2. The position of this spectral peak provides an estimate â2 of the coefficient. Multiplying the signal by exp(−j·â2·n²) reduces the order of the polynomial phase to 1, and repeating the procedure gives estimates of all the parameters. The judgment about successful watermark extraction is made using one of the following methods:<br />

Threshold-based (DPPT[T]) [3] - The decision on correct detection of the watermark is based on the correlation between the estimated chirp and the embedded watermark. The threshold for a 182-bit embedded chirp and a 15-bit watermark signature is experimentally set to 0.9.<br />

Correlation-based (DPPT[C]) - Searches for the chirp in the codebook that has the highest correlation coefficient with the estimated chirp. In this case, to obtain a better watermark extraction rate, the correlation between the chirps in the codebook is limited to a maximum of 0.93, which offers a redundancy rate of 11.08/12 for a chirp length of 182 bits.<br />

Initial and final frequency-based (DPPT[F]) - Finds the chirp in the codebook whose initial and final frequencies are closest to the estimated initial and final frequencies of the recovered chirp. Due to the BER of the received watermark signal, the DPPT estimates the initial and final frequencies with some variation from the original ones. To increase the watermark extraction rate, the minimum difference between the initial and final frequencies of chirps in the codebook is set to 4 Hz for 182-bit chirps. This setting gives a redundancy rate of 11.09/12.<br />
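The two-step estimation described above (an FFT peak of the second-order ambiguity function for â2, then demodulation and a second peak search for â1) can be sketched as follows; the lag choice, FFT size and test chirp parameters are illustrative assumptions, not the paper's settings:<br />

```python
import numpy as np

def dppt_order2(s, tau=None):
    """Estimate a1, a2 of s(n) = exp(j(a0 + a1*n + a2*n^2)) (a sketch).
    Requires |2*a2*tau| < pi so the ambiguity-function tone is unaliased."""
    n = np.arange(len(s))
    tau = len(s) // 2 if tau is None else tau
    N = 1 << 16                                # zero-padded FFT grid
    w = 2 * np.pi * np.fft.fftfreq(N)          # angular frequency axis (rad/sample)
    # 2nd-order ambiguity function: s(n+tau)*conj(s(n)) is a single
    # tone at angular frequency 2*a2*tau
    y = s[tau:] * np.conj(s[:-tau])
    a2 = w[np.argmax(np.abs(np.fft.fft(y, N)))] / (2 * tau)
    # demodulate the quadratic phase and repeat: a tone at a1 remains
    z = s * np.exp(-1j * a2 * n ** 2)
    a1 = w[np.argmax(np.abs(np.fft.fft(z, N)))]
    return a1, a2
```

On a clean 182-sample chirp this recovers the linear and quadratic phase coefficients to within the FFT grid spacing; a0 could then be estimated from the residual phase.<br />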

4. Results<br />

To measure the robustness of the post-processing algorithms, we applied the Checkmark benchmark attacks [6] to 10 different images of size 512 × 512. The PN sequence is 100,000 samples long, and the watermark sampling frequency is 1 kHz. For a fair comparison of all the post-processing techniques, the methods have been set up with almost equal message lengths and redundancy rates. Figure 2 shows the important features of the applied schemes.<br />
<br />
Figure 2. Features of each coding scheme used to code the watermark message.<br />

Figure 3 shows the results of all the post-processing techniques.<br />
<br />
Figure 3. Watermark detection results for the Checkmark benchmark attacks.<br />
<br />
The first column lists the different types of attacks applied to the watermarked image, and the number shows the<br />

number of attacks. The number under each column represents the percentage of successful watermark detections under each class of attack. Although the DPPT[T] method shows higher results than DPPT[C] and DPPT[F], it results in a 13% false positive error rate, which is too high to be applicable to a multi-user watermarking system. Therefore, DPPT[T] is useful in watermarking applications where we are interested in detecting the watermark rather than extracting the watermark message. The DPPT[F]-based method offers higher, or in some cases equal, results compared to the DPPT[C]-based method. In addition, the DPPT[F]-based method does not require the long process of correlating the estimated chirp with all chirps in the codebook and is faster than DPPT[C]. Thus, we conclude that the initial and final frequency comparison is the best DPPT-based method for finding the embedded message in the codebook.<br />

Although the HRT is an optimal line detection tool, having a large number of cells in the Hough-Radon space shifts<br />

the peak of the accumulator into neighbouring cells and consequently results in a wrong detection of the slope in the TFD. Therefore, we see in Figure 3 that the DPPT-based method outperforms the HRT-based algorithm for most attack types, with a total detection rate of 92% compared to 87%. Also, comparing the complexity order of the HRT-based and DPPT-based techniques, we conclude that the DPPT-based method is more practical for real-time applications. Figure 4 shows the complexity order and running time measured on a Pentium IV with a 2.66 GHz CPU and 512 MB of RAM. DPPT-based watermark extraction is about 55 times faster than the HRT-based algorithm.<br />

As we observe in Figure 3, the detection results for the BCH and repetition codes have almost the same rates, but the DPPT-based method offers better, or in some cases equal, results when compared to the REP and BCH codes. Figure 5 shows the detection results as a function of the BER of the received message. As we see in this figure, both DPPT and BCH detect 100% of the watermark messages successfully up to a BER of 17%. However, the maximum BER at which BCH detects a watermark correctly is 22%, with a 17% detection rate, while DPPT shows a 50% detection rate at a BER of 28%. To highlight the outstanding performance of DPPT at high BERs, we calculated the watermark detection rate for BERs greater than 17%: the DPPT-based method offers a 52% detection rate at these higher BERs, while the BCH and repetition codes have 47% and 41% detection rates, respectively.<br />
<br />
Figure 4. Order of complexity of each coding scheme used to code the watermark message.<br />

5. Conclusions<br />

In this paper, we compared FEC-based and chirp-based post-processing methods for watermarking. The robustness of the proposed techniques was tested against the Checkmark benchmark attacks. The DPPT-based and BCH-based methods were able to compensate for BERs of up to 17%. The DPPT-based post-processing offered the highest detection rate of 92%, and showed the highest detection rate for BERs higher than 17%. We also compared the computational complexity of the proposed methods; the BCH, repetition<br />

and DPPT-based methods have almost the same complexity, while the HRT has a complexity about 55 times higher than the other methods; this is because the HRT operates on the TF plane and calculates the accumulator value for all the cells in the Hough-Radon plane.<br />
<br />
Figure 5. Watermark detection under different bit error rates (successful watermark estimation (%) versus the BER of the received watermark message, for the HRT, REP, BCH and DPPT schemes).<br />

References<br />

[1] S. Erkucuk, S. Krishnan, and M. Zeytinoglu. Robust audio<br />

watermarking using a chirp based technique. IEEE Intl. Conf.<br />

on Multimedia and Expo, 2:513–516, July 2002.<br />

[2] D. Kirovski and H. Malvar. Spread-spectrum watermarking<br />

of audio signals. IEEE Transactions on <strong>Signal</strong> Processing,<br />

special issue on data hiding, 51:1020–1034, April 2003.<br />

[3] L. Le and S. Krishnan. Time-frequency signal synthesis<br />

and its application in multimedia watermark detection.<br />

EURASIP Journal on Applied <strong>Signal</strong> Processing, 2006:Article<br />

ID 86712, 14 pages, 2006.<br />

[4] J. Lee, H. Kim, and J. Lee. Information extraction method<br />

without original image using turbo code. Proc. International<br />

Conference on Image Processing, Greece, 3:880–883, October<br />

2001.<br />

[5] S. Peleg and B. Friedlander. The discrete polynomial-phase transform. IEEE Transactions on Signal Processing,<br />

43:1901–1914, August 1995.<br />

[6] S. Pereira, S. Voloshynovskiy, M. Madueno, S. Marchand-<br />

Maillet, and T. Pun. Second generation benchmarking and<br />

application oriented evaluation. In Information Hiding Workshop<br />

III, Pittsburgh, PA, USA, April 2001.<br />

[7] A. Ramalingam and S. Krishnan. Robust image watermarking<br />

using a chirp detection-based technique. IEE Proceedings on<br />

Vision, Image and <strong>Signal</strong> Processing, 152:771–778, December<br />

2005.<br />

[8] R. Rangayyan and S. Krishnan. Feature identification in the<br />

time-frequency plane by using the Hough-Radon transform.<br />

IEEE Trans. on Pattern Recognition, 34:1147–1158, 2001.<br />



2006 International Joint Conference on Neural Networks<br />

Sheraton Vancouver Wall Centre Hotel, Vancouver, BC, Canada<br />

July 16-21, 2006<br />

Automatic Content-Based Image Retrieval Using Hierarchical<br />

Clustering Algorithms<br />

Kambiz Jarrah, Student Member, IEEE, Sri Krishnan, Senior Member, IEEE, and Ling Guan, Senior<br />

Member, IEEE<br />

Abstract-The overall objective of this paper is to present a methodology for guiding adaptations of an RBF-based relevance feedback network, embedded in automatic content-based image retrieval (CBIR) systems, through the principle of unsupervised hierarchical clustering. The self-organizing tree map (SOTM) is especially attractive for our approach since it not only extracts global intuition from an input pattern space but also injects some degree of localization into the discriminative process, such that maximal discrimination becomes a priority at any given resolution. The main focus of this paper is twofold: introducing a new member of the SOTM family, the Directed SOTM (DSOTM), which not only provides partial supervision of cluster generation by forcing divisions away from the query class but also presents a flexible verdict on the resemblance of the input pattern as its tree structure grows; and modifying the current structure of the normalized graph cuts (Ncut) process by enabling the algorithm to determine the appropriate number of clusters within an unknown dataset prior to its recursive clustering scheme, through the principle of self-organizing normalized graph cuts (SONcut). Comprehensive comparisons with the self-organizing feature map (SOFM), SOTM, and Ncut algorithms demonstrate the feasibility of the proposed methods.<br />

I. INTRODUCTION<br />

Content-based image retrieval (CBIR) relies on the characterization of images based on their visual contents. These visual contents consist of low-level features, including colour, texture, and shape, that offer a multidimensional vector representation of an image within the feature space.<br />

One of the major requirements for designing an effective CBIR system is to reduce the gap between low-level features and high-level concepts by tailoring the human perceptual subjectivity to the retrieval process. Human-computer interaction (HCI) systems have demonstrated successful behavior for this purpose [1]. In such systems, users directly supervise the learning process by constantly providing suitable samples (images) and training the system with them. This dependency of the system on users' inputs may add excessive human errors to the adaptation process, due to subjective interpretations of image contents by each individual.<br />

To overcome this problem, an unsupervised learning approach with a hierarchical architecture is required to guide these adaptations automatically and more toward relevant samples. The SOTM has shown effective behavior in minimizing human interactions and automating the search process by efficiently classifying an unknown and non-uniform data space into more meaningful clusters.<br />

The main focus of this paper is as follows: a) introducing a new member of the SOTM family, the Directed SOTM (DSOTM), which dynamically controls the generation of new centres and decides on the resemblance of input samples, with respect to the query, during the learning phase of the algorithm; and b) modifying the current structure of the normalized graph cuts (Ncut) algorithm [2] to make it more adaptive to the nature of the input pattern, by adding a self-determination mechanism that decides on the appropriate number of clusters prior to its iterative clustering process. We call the modified Ncut the self-organizing Ncut (SONcut).<br />

This paper provides details on both the DSOTM and SONcut algorithms in Section 2; Section 3 presents an overall description of the structure of the CBIR system used in this work; a comprehensive comparison between the proposed classifiers and their conventional counterparts is presented in Section 4; Section 5 concludes the paper with some remarks.<br />

II. UNSUPERVISED CLUSTERING APPROACHES<br />
<br />
In this section, an overview of the proposed unsupervised hierarchical clustering algorithms using DSOTM and SONcut is presented.<br />
<br />
A. Directed Self-Organizing Tree Map (DSOTM)<br />
<br />
The SOTM [3] is an unsupervised machine learning algorithm inspired by principles found in Kohonen's self-organizing feature map (SOFM) [4]. The tree structure of the SOTM is constructed by randomly selecting an isolated root node (centre) and repeatedly presenting the remaining patterns to the network. The pattern (sample) which is found to be closest to the centre with respect to the current similarity measurement<br />
<br />
This work was supported by the Natural Sciences and Engineering Research Council of Canada (NSERC) and the Ryerson Graduate Scholarship Program. K. Jarrah (barrak@ee.ryerson.ca) is affiliated with the Multimedia Research Laboratory (RML) and the Signal Analysis Research Group (SAR), Ryerson University (www.ryerson.ca), Toronto, Canada. S. Krishnan (krishnan@ee.ryerson.ca) is the chair of the Department of Electrical and Computer Engineering, Ryerson University, Toronto, Canada, and is affiliated with the Signal Analysis Research Group (SAR) at the same university. L. Guan (lguan@ee.ryerson.ca) is the Canada Research Chair, Ryerson University, Toronto, Canada, and is affiliated with the Multimedia Research Laboratory (RML) at the same university.<br />
<br />
0-7803-9490-9/06/$20.00 © 2006 IEEE<br />


Fig. 1. Two-dimensional mapping: (left) clustering using SOFM and (right) clustering using SOTM. It is evident that redundant nodes in the lattice topology of SOFM can produce unnecessary boundaries by having some of the centres trapped in low-density regions of the input pattern.<br />

is declared the winner. Every such presentation of input patterns slightly modifies the winning node's position in the network: a position that eventually evolves toward the centre of mass of the current class. This gradual adaptation of the node's position is controlled by an exponentially decaying function called the learning rate. The learning rate resets to its initial value each time a new centre is generated. Therefore, sufficient time is given to the network to adapt itself to the presence of new samples; thus, the tree grows larger and the similarity measurement tends to become more accurate. The generation of new centres (branches of the tree) is controlled by a hierarchy function, called the threshold function, which decreases over time. If an input node is encountered whose distance to the existing centres exceeds this threshold function (i.e., it is significantly different from all nodes in the current SOTM map), a new node is generated. The new node is attached as a leaf node of its closest representation in the current SOTM mapping; thus, over time, a tree structure evolves [3].<br />
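The construction just described can be condensed into a small sketch (an assumption-laden toy: a fixed threshold and made-up learning-rate constants stand in for the paper's time-decaying threshold function, and the explicit parent-child tree is omitted):<br />

```python
import numpy as np

def sotm_cluster(X, thresh, lr0=0.1, decay=0.99):
    """Minimal SOTM-style clustering: the first pattern seeds the root
    centre; each later pattern either moves its nearest centre toward
    itself (winner update, decaying learning rate) or, if farther than
    `thresh`, spawns a new centre."""
    centres = [X[0].astype(float)]
    lr = lr0
    for x in X[1:]:
        d = [np.linalg.norm(x - c) for c in centres]
        w = int(np.argmin(d))
        if d[w] > thresh:
            centres.append(x.astype(float))
            lr = lr0                            # learning rate resets on a new centre
        else:
            centres[w] += lr * (x - centres[w])  # winner drifts toward the sample
            lr *= decay                          # exponential decay of the learning rate
    return np.array(centres)
```

On data with two well-separated dense regions, this grows exactly one centre per region, with each centre settling near its region's centre of mass.<br />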

Similar to SOFM, SOTM aims at preserving the<br />

topological relationships between patterns in the original<br />

input space. However, unlike SOFM, SOTM classifies a<br />

large group of patterns by building and evolving a tree<br />

structure that tends to form neighborhood relationships by<br />

reflecting a degree of similarity between the new and<br />

already classified patterns.<br />

Although image indexing with the SOFM was perceived to be a robust and effective solution that tolerates even very high input vector dimensionalities [5], the lattice topology of the SOFM network makes it essentially undesirable for clustering purposes, due to the concentration of a fraction of the nodes in the map resulting from the best-matching unit computation [6]. The SOFM has nodes that can easily get trapped in regions of low density and can therefore simply lose the ability to represent the underlying topology of the input pattern.<br />

input pattern. For instance, let us assume there are two high<br />

density regions in the input space, representing two distinct<br />

clusters. Let us also assume that there are maximum two<br />

nodes in the structure of SOFM to correctly allocate both<br />

regions. If those two nodes were separated by a third node<br />

and each converged to the two adjacent regions of high<br />

density, then the third node could easily get trapped in<br />

between the regions. As a result, it can change the true<br />

boundaries of high density classes by pulling some of the<br />


samples from the two real clusters and allocating them to the<br />

middle node as is illustrated in Fig. 1. In this figure, the<br />

second node of the SOFM network has dragged some of the<br />

data points from the first cluster and has generated a new but<br />

redundant class. The tree structure of SOTM, however, is<br />

successful in determining the high-density regions.<br />

The problem with the SOTM algorithm is two-fold: it cannot suitably decide on the relevant number of classes,<br />

and it often loses track of the true query position. The decision on<br />

which clusters are relevant in the SOTM is postponed until<br />

after the algorithm has converged. This is because there is<br />

no innate controlling process available for the algorithm to<br />

influence cluster generation around the query centre (the<br />

SOTM clusters entirely independently). Losing a sense of<br />

query location within the input space can have an undesired<br />

effect on the true structure of the relevant class and can<br />

force the SOTM algorithm to spawn new clusters and form<br />

unnecessary boundaries within the query class as is<br />

illustrated in Fig. 2. In this figure, the SOTM forms a<br />

boundary near the query, contaminating relevant samples,<br />

whereas some supervision is maintained in the DSOTM<br />

case, preventing unnecessary boundaries from forming.<br />

Therefore, retaining some degree of supervision on the<br />

cluster generation around the query class seems to be vital.<br />

Due to the limitations of SOTM, we propose the<br />

Directed SOTM (DSOTM) algorithm in this work. In the<br />

DSOTM algorithm, the decision on the association of an input pattern with the query image is made gradually as<br />

each sample is presented to the system. It also contains a controlling<br />

mechanism that keeps track of the query centre by forcing<br />

the centre of relevant class to remain in the vicinity of the<br />

query position. Therefore, it can dynamically control<br />

generation of new centres and can determine the relevance<br />

of input samples, with respect to the query, as the tree<br />

structure grows. On the other hand, it limits the synaptic<br />

vector adjustments according to its reinforced learning rules<br />

and constrains cluster generation by preventing the<br />

spawning of redundant centres around the query position;<br />

since this part of the map is already occupied by relevant<br />

class centre.<br />

The DSOTM algorithm is summarized as follows:<br />

Initialization: Choose a root node at random from the available set of input vectors, {x_k}, k = 1, ..., K.<br />

J is the total number of centroids (initially set to 1) and K is<br />

the total number of inputs (i.e., images);<br />

Authorized licensed use limited to: <strong>Ryerson</strong> <strong>University</strong> Library. Downloaded on July 7, 2009 at 11:49 from IEEE Xplore. Restrictions apply.


Fig. 2: Two-dimensional mapping. (Left) Input pattern with 5 distinct clusters; (Middle) 14 centres are generated using SOTM; (Right) 5 centres are<br />

generated using DSOTM. Over-classification around the query (triangle) will result in erroneous relevance identification.<br />

Similarity Measurement: Randomly select a new data point, x, and find the best-matching (winning) centroid, j*, by<br />

minimizing a predefined Euclidean distance criterion in (1).<br />

Updating: If ||x(t) - w_{j*}(t)|| <= H(t), where H(t) is the hierarchy function used to control the levels of the tree and<br />

decays exponentially over time from its initial value, H(t_0) > alpha, according to H(t+1) = lambda * H(t) * exp(-t/rho),<br />

where rho = max(t) / log10[H(t_0)] and lambda is the threshold constant, 0 < lambda < 1, then the winning<br />

centre is close enough to the query: mark x(t) as a relevant sample and update its centroid (winning neuron) toward the<br />

query position according to the degree of resemblance of the sample.<br />
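The hierarchy-driven spawn-or-update logic described here can be sketched in a few lines. This is an illustrative reconstruction, not the authors' implementation: the function names, the learning rates `alpha` and `beta`, and the simplified one-step decay are all assumptions.

```python
import numpy as np

def hierarchy(h0, lam, rho, t):
    """Hierarchy function H(t): decays exponentially from its initial
    value h0, controlling when the tree may spawn a new level."""
    return lam * h0 * np.exp(-t / rho)

def directed_update(x, centroids, query, h, alpha=0.1, beta=0.5):
    """One DSOTM-style step.  If the sample x is farther than H(t) from
    every centre, spawn a new centre (a new tree node); otherwise move
    the winning centre toward the sample and, in the 'directed' spirit,
    also pull it toward the query so the relevant class stays anchored
    near the query position."""
    d = np.linalg.norm(centroids - x, axis=1)
    j = int(np.argmin(d))
    if d[j] > h:
        centroids = np.vstack([centroids, x])   # spawn a new centroid
        return centroids, len(centroids) - 1
    centroids[j] = centroids[j] + alpha * (x - centroids[j])
    centroids[j] = centroids[j] + beta * alpha * (query - centroids[j])
    return centroids, j
```

A far-away sample spawns a second node, while a nearby sample only nudges the existing centre toward the sample and the query.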


Fig. 3: Two-dimensional mapping. (Left) Input pattern with 7 distinct clusters; (Middle) 8 centres are generated using Ncut; (Right) 7 centres are<br />

generated using SONcut. Over-classification around the query (triangle) will result in erroneous classification of the relevant class.<br />

nodes in the input pattern), and assoc(A, A) and assoc(B, B) measure the total intra-cluster similarity<br />

(association) in A and B; assoc(A, V) is the total connection from nodes in cluster A to all nodes in the<br />

graph, and assoc(B, V) is defined similarly. w_mn is a nonnegative weight function measuring the degree of<br />

similarity between two samples of the input pattern and is defined in (8), where d(p, q) is a pre-defined<br />

distance metric (i.e., the Euclidean distance) and k is a user-defined constant that controls the decreasing<br />

rate of the weight function; it is empirically set to 0.2. By using this function, the smallest eigenvector<br />

remains constant and Ncut can find relatively correct partitions [2].<br />

As Shi and Malik have also discussed, the optimal partitioning (minimum possible Ncut) can be computed by<br />

solving the generalized eigenvalue system. The second smallest eigenvector of the generalized eigensystem is<br />

then used to partition the graph.<br />

In this paper we have used the Ncut algorithm [2] for the purpose of unsupervised data clustering. The Ncut<br />

partitioning method can be recursively applied to the input pattern to generate more than two clusters.<br />

Deciding on the maximum number of centres in the input pattern at which to stop the clustering process is a<br />

challenging problem. In this work, we have integrated the Ncut algorithm with the principles found in DSOTM<br />

to automatically estimate the appropriate number of clusters in the input pattern and set the maximum allowed<br />

Ncut accordingly. We call this Ncut algorithm with self-oriented centre detection the Self-Organizing<br />

Normalized cuts, SONcut.<br />

The proposed SONcut algorithm is as follows:<br />

Initialization: Choose a root node at random from the available set of input vectors, {x_k}, k = 1, ..., K.<br />

N is the maximum allowed Ncut (initially set to 1) and K the total number of inputs;<br />

Similarity Measurement: Randomly select a new data point, x, and find the winning centroid, n*, by minimizing<br />

a predefined distance criterion in (1);<br />

Maximum Allowed Ncut Estimation: If ||x(t) - w_{n*}|| > H(t), where H(t) is defined similarly to the hierarchy<br />

function used in the DSOTM algorithm, then increment the maximum allowed Ncut by 1;<br />

Continuation: Continue with the Similarity Measurement step until no noticeable changes in the feature map are<br />

observed;<br />

Graph Generation: Given the input pattern, set up a weighted graph, G = (V, E), compute the weight, w_mn, on<br />

each edge, E_mn, using (8) and create the affinity, W, and diagonal, D, matrices;<br />

Eigensystem Transformation: Solve (D - W)x = lambda * D * x for the eigenvector with the smallest eigenvalue;<br />

Graph Bipartition: Use the eigenvector with the second smallest eigenvalue to bipartition the graph;<br />

Partitioning Continuation: Consider the current partitions for supplementary subdivision. Continue with<br />

repartitioning until the Ncut value reaches its maximum allowed.<br />

In summary, we have proposed an unsupervised hierarchical Ncut algorithm that is able to estimate the maximum<br />

number of allowed Ncuts by training the algorithm using the principles found in the DSOTM architecture. Thus,<br />

by dynamically adapting the Ncut algorithm to the nature of the input pattern, the problem of over-partitioning<br />

the relevant class can be prevented. Fig. 3 depicts the importance of adopting such a predictive mechanism for<br />

the Ncut clustering algorithm and illustrates the effectiveness of employing it to avoid over-classification<br />

around the query centre.<br />
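The graph-generation and bipartition steps of normalized cuts can be sketched numerically. This is a generic illustration with scipy: the Gaussian affinity and the `k = 0.2` scale follow the text, but everything else, including the sign-based split of the Fiedler vector, is a simplifying assumption.

```python
import numpy as np
from scipy.linalg import eigh

def ncut_bipartition(X, k=0.2):
    """Split the rows of X into two clusters with normalized cuts:
    build the affinity matrix W from pairwise distances, form the
    degree matrix D, solve the generalized eigensystem
    (D - W) v = lam * D * v, and threshold the second-smallest
    eigenvector (the Fiedler vector) at zero."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    W = np.exp(-(d ** 2) / k)        # nonnegative similarity weights
    D = np.diag(W.sum(axis=1))
    _, vecs = eigh(D - W, D)         # symmetric generalized eigensolver
    return vecs[:, 1] >= 0           # sign split on the Fiedler vector

# two tight, nearby 2-D blobs should land in different partitions
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.05, (5, 2)),
               rng.normal(1.0, 0.05, (5, 2))])
labels = ncut_bipartition(X)
```

`scipy.linalg.eigh(a, b)` solves the generalized symmetric problem directly, which matches the (D - W)x = lambda*D*x formulation in the text.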

Previously in [7], we proposed an automatic CBIR engine<br />

that was structured around an unsupervised learning<br />

algorithm, the DSOTM. To reduce the gap between high-<br />

level concepts (semantics) and low-level statistical features<br />

and to evolve the search process according to what the<br />

system believes to be the significant content within the<br />

query, the above engine was integrated with a process of<br />

feature weight detection using genetic algorithms (GA) as<br />

illustrated in Fig. 4b. In this paper we use a relatively simpler CBIR architecture (see Fig. 4a and Fig. 5)<br />

solely to compare the abilities of the proposed hierarchical clustering algorithms for data classification<br />

with those of three other techniques: SOTM, SOFM, and Ncut.<br />



Fig. 4a: Machine-controlled CBIR system [1]. Fig. 4b: Machine-controlled GA-based CBIR system [7].<br />

Fig. 5: Another representation of the CBIR system of Fig. 4a.<br />

The whole idea of the automatic image retrieval process is to unsupervisedly tailor the retrieval process to<br />

the user's notion of similarity by utilizing an unsupervised learning technique to perform the required<br />

decision making about the relevance of retrieved images instead of the human user.<br />

This automatic refinement mechanism is made possible by adopting a flexible architecture that lets the<br />

classifier learn from and re-adjust to the nature of the input pattern to a greater extent using predefined<br />

competitive learning algorithms: a method that is capable of giving the network a conforming ability to<br />

perform a variety of computational tasks [5].<br />

In Fig. 5, the second unit of the Initial Search module, Feature Extraction 1, deals with calculating features<br />

from a high-volume image database. Consequently, a standard set of content descriptors (an example might be<br />

MPEG-7) is extracted in this module to provide a more generic and rapid interface to existing databases based<br />

on the chosen standard. The extracted features are used to retrieve the most similar images (in the relative<br />

sense) based on a predefined distance metric. The top Q images are then displayed back to the user through an<br />

Interface block. Subsequently, the user may request an automatic search, which operates independently. Upon<br />

this request, control of the system is switched to the Automatic Search module, wherein another set of<br />

features with higher perceptual quality is extracted from the top Q retrieved images from the initial search,<br />

using Feature Extraction 2. Although computation of those features could be intensive, this module allows for<br />

the use of more proprietary or specialty features. Such a module may enhance perceptual discrimination beyond<br />

that which might otherwise be possible through standard features alone. These features are then used as seeds<br />

to train the Unsupervised Classifier. A new query, based upon the images selected from the previous<br />

iterations, is then chosen to substantially represent the relevant class through the Query Modification module.<br />

In this work, Colour Histograms, Colour Moments, Wavelet Moments, and Fourier Descriptors were used in Feature<br />

Extraction 1, whereas Hu's seven moment invariants (HSMI) and Gabor Descriptors, accompanied by Colour<br />

Histograms and Colour Moments, were used in Feature Extraction 2. Colour histograms and colour moments were<br />

computed in the HSV and RGB colour spaces, respectively. Wavelet Moments were extracted from the mean, mu, and<br />

standard deviation, sigma, of a three-level wavelet transform applied to an image. Boundary-based Fourier<br />

shape parameters were extracted by converting the edge parameters from Cartesian to Polar coordinates and,<br />

subsequently, applying the Fast Fourier Transform (FFT) to obtain the top 10 low-frequency components. Texture<br />

features were computed from the mu and sigma of Gabor-filtered images to construct 48-dimensional feature<br />

vectors, and finally, region-based HSMI shape parameters were extracted by converting the colour images into<br />

binary segmented ones and then extracting the shape parameters from those segmented images.<br />

A number of experiments were conducted to compare the behaviour of the automatic CBIR engine of Fig. 5 in the<br />

presence of various unsupervised clustering algorithms: the Ncut, SONcut, SOFM, SOTM, and DSOTM classifiers.<br />

The simulations were carried out using a subset of the Corel image database consisting of nearly 5,100 JPEG<br />



TABLE I<br />
EXPERIMENTAL RESULTS IN TERMS OF RR FOR THE CBIR SYSTEM WITH<br />
NO FEATURE WEIGHT DETECTION MECHANISM (FIG. 4)<br />

Classifier^a | Set A | Set B | Set C | Average<br />
Ncut         | 41.3% | 47.3% | 51.9% | 46.8%<br />
SOFM         | 51.1% | 44.3% | 56.6% | 50.7%<br />
SOTM         | 51.5% | 51.1% | 58.4% | 53.7%<br />
SONcut       | 51.9% | 52.8% | 57.0% | 53.9%<br />

TABLE II<br />
EXPERIMENTAL RESULTS IN TERMS OF RR FOR THE CBIR SYSTEM WITH<br />
GA-BASED FEATURE WEIGHT DETECTION ALGORITHM (FIG. 5) [7]<br />

Classifier^a | Set A | Set B | Set C | Average<br />
Ncut         | 67.3% | 64.5% | 65.1% | 65.6%<br />
SOFM         | 65.1% | 66.7% | 68.3% | 66.7%<br />
SOTM         | 66.8% | 72.1% | 74.4% | 71.1%<br />
SONcut       | 68.8% | 72.8% | 73.9% | 71.8%<br />
DSOTM        | 78.3% | 76.7% | 80.5% | 78.5%<br />

^a Ncut: Normalized Graph Cuts; SONcut: Self-Organizing Normalized Graph Cuts; SOFM: Self-Organizing Feature<br />
Maps; SOTM: Self-Organizing Tree Maps; DSOTM: Directed Self-Organizing Tree Maps.<br />

colour images, covering a wide range of real-life photos, from 51 different categories. Each category consisted<br />

of 100 visually associated objects to simplify measurements of the retrieval accuracy (RR) during the<br />

experiments. Three sets of 51 images were drawn from the database to form sets A, B, and C. Each set consists<br />

of randomly selected images such that no two images were from the same class. Retrieval results were<br />

statistically calculated from each of the 3 sets. In the simulations, a total of the 16 most relevant images<br />

were retrieved to evaluate the precision of the retrieval.<br />
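As a concrete reading of this protocol, the per-query retrieval rate is simply the fraction of the top-16 returned images that share the query's category; the function name below is an illustrative assumption.

```python
def retrieval_rate(retrieved_labels, query_label):
    """Precision of one retrieval: the fraction of returned images
    whose category label matches the query's category."""
    hits = sum(1 for lab in retrieved_labels if lab == query_label)
    return hits / len(retrieved_labels)

# e.g. 12 of the 16 retrieved images share the query's category
rr = retrieval_rate(['cat'] * 12 + ['dog'] * 4, 'cat')  # 0.75
```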

The experiment results are illustrated in Table 1. In the Ncut, SOFM, and SOTM algorithms, the maximum number<br />

of allowed clusters was set to P, P < Q, where Q is the total number of retrieved images from the initial<br />

search; P was empirically set to 8. A 4x2 grid topology was used in the SOFM structure to locate a maximum of<br />

8 possible cluster centres. A hard decision on the resemblance of the input samples was made: if a sample is<br />

closer to one centre than to any other centre, in terms of a predefined distance metric, it belongs to that<br />

centre. Table 2 illustrates the results after feature weight detection using the GA-based method described in [7].<br />

Although the Ncut algorithm is a top-down classification process that aims to extract global impressions of<br />

the input pattern and present a hierarchical description of it, employing a predictive mechanism to estimate<br />

the true number of clusters prior to spawning new neurons proves to be beneficial. This predictive mechanism<br />

enforces a frontier on the classification process and inhibits unnecessary centres from being generated around<br />

the query position. As a result, a more accurate impression of relevance can be achieved by using the SONcut<br />

algorithm.<br />

The SOTM algorithm not only extracts global intuitions of the input pattern, it also introduces some degree of<br />

localization into the discriminative process to achieve maximal discrimination at any given resolution (or<br />

number of classes). Moreover, the ability of the SOTM to span and force division in the extremes of the data<br />

early on, delaying the division of the most similar aspects until later stages of learning, together with its<br />

flexible tree-like topology (more plastic than that of the SOFM), makes it essentially sensitive to the most<br />

dominant differences in the data and, thus, less prone to classification errors and more attractive for<br />

retrieval applications.<br />

Despite all the above advantages of using SOTM-based classifiers, retaining some degree of supervision to<br />

prevent unnecessary boundaries from forming around the query class seems to be crucial. The DSOTM algorithm<br />

not only provides partial supervision of cluster generation by forcing divisions away from the query class, it also makes a<br />

soft decision on resemblance of the input patterns by<br />

constantly modifying each sample's membership during<br />

learning phase of the algorithm. As a result, a more robust<br />

tree structure as well as a better sense of likeness can be<br />

finally achieved.<br />

V. CONCLUSION<br />

The framework for a novel unsupervised clustering<br />

algorithm based on DSOTM was introduced in this work. A<br />

modification on the current structure of Ncut algorithm was<br />

also proposed in this paper. This modification provides a-<br />

priori knowledge for the algorithm to determine appropriate<br />

number of clusters, based on principles found in DSOTM,<br />

prior to its hierarchical clustering operation. Performance of<br />

the proposed methods was compared with other<br />

conventional clustering methods (i.e., Ncut, SOFM, and<br />

SOTM) using a computer-controlled CBIR system.<br />

SOTM outperforms both Ncut and SOFM and performs<br />

fairly close to SONcut even with its blind top-down data<br />

exploration. This is due to its flexible tree-shape structure as<br />

well as its competitive learning algorithm that injects some<br />

degree of localization into the discriminative process. The<br />

experimental results also illustrate promising performance of<br />

utilizing DSOTM in the structure of automatic CBIR<br />

engines.<br />

REFERENCES<br />

[1] P. Muneesawang and L. Guan, "Minimizing user interaction by automatic and semiautomatic relevance feedback<br />
for image retrieval," Proc. IEEE Int. Conf. on Image Processing, Rochester, USA, vol. 2, pp. 601-604, Sept. 2002.<br />

[2] J. Shi and J. Malik, "Normalized cuts and image segmentation," IEEE Trans. Pattern Analysis and Machine<br />
Intelligence, vol. 22, no. 8, pp. 888-905, Aug. 2000.<br />

[3] H. S. Kong, "The Self-Organizing Tree Map and its Applications in Digital Image Processing," PhD Thesis,<br />
<strong>University</strong> of Sydney, Australia, 1998.<br />

[4] T. Kohonen, "The self-organizing map," Proc. of the IEEE, vol. 78, no. 9, pp. 1464-1480, Sept. 1990.<br />

[5] S. Haykin, Neural Networks: A Comprehensive Foundation, 2nd ed., Prentice Hall, Inc., 1999.<br />

[6] J. Randall, L. Guan, X. Li, and W. Zhang, "Investigations of the self-organizing tree map," Proc. of 6th<br />
International Conference on Neural Information Processing, vol. 2, pp. 724-728, Nov. 1999.<br />

[7] K. Jarrah, M. Kyan, S. Krishnan, and L. Guan, "Computational intelligence techniques and their applications<br />
in content-based image retrieval," IEEE Int. Conf. on Multimedia &amp; Expo (ICME), submitted for publication, 2006.<br />



1-4244-0367-7/06/$20.00 ©2006 IEEE<br />

ICME 2006<br />



DISCRETE POLYNOMIAL TRANSFORM FOR DIGITAL IMAGE WATERMARKING<br />

APPLICATION<br />

Lam Le, Sridhar Krishnan, and Behnaz Ghoraani<br />

Dept. of Elec. & Comp. Eng., <strong>Ryerson</strong> <strong>University</strong>, Toronto, Canada<br />

E-mail: (lle)(krishnan)@ee.ryerson.ca and bghoraan@rnet.ryerson.ca<br />

ABSTRACT<br />

In this study, we propose a new way to detect the image watermark<br />

messages modulated as linear chirp signals. The<br />

spread spectrum image watermarking algorithm embeds linear<br />

chirps as watermark messages. The phase of the chirp<br />

represents watermark message such that each phase corresponds<br />

to a different message. We extract the watermark message<br />

using a phase detection algorithm based on Discrete<br />

Polynomial Phase Transform (DPT). The DPT models the signal<br />

as polynomial and uses ambiguity function to estimate the<br />

signal parameters. The proposed method not only detects the presence of the watermark, but also extracts the<br />

embedded watermark bits and ensures the message is received correctly. The robustness of the proposed<br />

detection scheme has been evaluated using Checkmark benchmark attacks, and we found that, up to a maximum bit<br />

error rate of 15%, the watermark message is still correctly detected using the DPT.<br />

Keywords: Image Watermarking, Spread Spectrum, Data Hiding,<br />

Discrete Phase Polynomial Transform, Hough-Radon Transform,<br />

Chirp Modulation<br />

1. INTRODUCTION<br />

Chirp signals are present ubiquitously in many areas of science<br />

and engineering, so the Discrete Polynomial Phase Transform<br />

(DPT) [1][2] has been extensively studied in recent years<br />

to estimate the phase parameters of the chirp signals. One of<br />

the recent applications of chirp signals is in data watermarking.<br />

The huge success of the Internet allows for the transmission,<br />

wide distribution, and access of electronic data in an<br />

effortless manner. Content providers are therefore faced with the challenge of how to protect their electronic<br />

data, and watermarking is one of the possible solutions: multimedia data creators and distributors are able to<br />

prove ownership of intellectual property rights without forbidding other individuals from copying the<br />

multimedia content itself. In<br />

this study, we propose a chirp-based detection method to detect<br />

watermark messages in an image watermarking scheme<br />

[3][4] which embeds linear chirps as imperceptible and statistically<br />

undetectable watermark messages. Different chirp<br />

rates, i.e., phases, represent watermark messages such that<br />

each phase corresponds to a different message. The narrowband<br />

watermark messages are spread with a watermark key<br />

(PN sequence) across a wider range of frequencies before embedding.<br />

The resulting wideband noise is added to the perceptually<br />

significant regions of the original image. We use<br />

the block-based discrete cosine transform (DCT) scheme for<br />

inserting the watermark. As a result of image manipulations, some message bits extracted by the detector may<br />

be in error, potentially resulting in the detection of the wrong watermark message. The proposed watermark<br />

detection algorithm detects the presence of the watermark and extracts the embedded watermark message bits<br />

even in the presence of bit errors in the received watermark message. Our motivation for using the DPT<br />

technique as a watermark detector is to achieve high estimation accuracy with less computational complexity.<br />

2. DISCRETE POLYNOMIAL PHASE TRANSFORM<br />

(DPT)<br />

DPT is a parametric signal analysis approach for estimating<br />

the phase parameters of polynomial phase signals. The phase<br />

of many man-made signals such as those used in radar, sonar<br />

and communications can be modeled as a polynomial. The<br />

discrete version of a polynomial phase signal can be expressed as:<br />

x(n) = b_0 \exp\left( j \sum_{m=0}^{M} a_m (n\Delta)^m \right)   (1)<br />

where M is the polynomial order (M = 2 for a chirp signal),<br />

0 ≤ n ≤ N − 1, N is the signal length and ∆ is the sampling<br />

interval. The principle of DPT is as follows. When the DPT is<br />

applied to a mono-component signal with polynomial phase<br />

of order M, it produces a spectral line [1]. The position of this spectral line at frequency ω0 provides an<br />

estimate of the coefficient â_M. After â_M is estimated, the order of the polynomial is reduced from M to M-1<br />

by multiplying the signal with exp(-j â_M (nΔ)^M). This reduction of order is called phase unwrapping. The<br />

next coefficient â_{M-1} is estimated in the same way by taking the DPT of the polynomial phase signal of<br />

order M-1 above. The procedure is repeated until all the coefficients of the polynomial phase are estimated.<br />

The DPT of order<br />



M of a continuous phase signal x(n) is defined as the Fourier transform of the higher-order DP_M[x(n), τ]<br />
operator:<br />

DPT_M[x(n), \omega, \tau] \equiv \mathcal{F}\{ DP_M[x(n), \tau] \} = \sum_{n=(M-1)\tau}^{N-1} DP_M[x(n), \tau] \exp(-j\omega n\Delta),   (2)<br />

where τ is a positive number and:<br />

DP_1[x(n), \tau] := x(n)   (3)<br />

DP_2[x(n), \tau] := x(n)\, x^*(n - \tau)   (4)<br />

DP_M[x(n), \tau] := DP_2[\, DP_{M-1}[x(n), \tau],\ \tau \,]   (5)<br />

The coefficients a_M (a_1 and a_2) are estimated by applying the following formula [1]:<br />

\hat{a}_M = \frac{1}{M!\,(\tau_M \Delta)^{M-1}} \arg\max_{\omega} \{ |DPT_M[x(n), \omega, \tau]| \},   (6)<br />

where<br />

DPT_1[x(n), \omega, \tau] = \mathcal{F}\{ x(n) \},   (7)<br />

DPT_2[x(n), \omega, \tau] = \mathcal{F}\{ x(n)\, x^*(n - \tau) \},   (8)<br />

and<br />

\hat{a}_0 = \mathrm{phase}\left( \sum_{n=0}^{N-1} x(n) \exp\left( -j \sum_{m=1}^{M} \hat{a}_m (n\Delta)^m \right) \right)   (9)<br />

\hat{b}_0 = \frac{1}{N} \left| \sum_{n=0}^{N-1} x(n) \exp\left( -j \sum_{m=1}^{M} \hat{a}_m (n\Delta)^m \right) \right|   (10)<br />

The estimated coefficients are used to synthesize the polynomial phase signal:<br />

\hat{x}(n) = \hat{b}_0 \exp\left( j \sum_{m=0}^{M} \hat{a}_m (n\Delta)^m \right)   (11)<br />
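For a linear chirp (M = 2), the estimate in (6) reduces to locating the spectral peak of the DP2 product x(n)x*(n - τ). The sketch below, with Δ = 1 and a simple zero-padded FFT peak search (both simplifying assumptions), recovers the chirp rate this way.

```python
import numpy as np

def estimate_chirp_rate(x, tau):
    """Order-2 DPT estimate of a2 for x(n) = exp(j(a1*n + a2*n^2)),
    with Delta = 1.  DP2[x](n) = x(n) * conj(x(n - tau)) is a pure
    tone at omega0 = 2*a2*tau, so the spectral peak gives
    a2 = omega0 / (2*tau)."""
    dp2 = x[tau:] * np.conj(x[:-tau])        # the DP2 operator
    nfft = 8 * len(dp2)                      # zero-pad for a finer grid
    spec = np.abs(np.fft.fft(dp2, nfft))
    omega0 = np.fft.fftfreq(nfft)[np.argmax(spec)] * 2 * np.pi
    return omega0 / (2 * tau)

# synthesize a unit-amplitude linear chirp and recover its rate
N, a1, a2 = 512, 0.3, 0.002
n = np.arange(N)
x = np.exp(1j * (a1 * n + a2 * n ** 2))
a2_hat = estimate_chirp_rate(x, tau=N // 2)
```

The multiply-and-conjugate step is exactly the phase-unwrapping idea of the text: it collapses the quadratic phase into a single tone whose frequency encodes a2.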

3. CHIRP-BASED WATERMARKING<br />

The watermarking method used in this study is a novel linear-chirp-based technique applied to image and audio<br />

signals [3][4]. The chirp signal x(t) (or m) is quantized to the values -1 and 1, yielding m^q, which is then<br />

embedded into the multimedia files. The quantization process introduces harmonics in the time-frequency<br />

representation, but the slope of the quantized chirp is the same as that of the chirp signal x(t). The details<br />

of watermark embedding and extraction follow.<br />


3.1. Watermark embedding<br />

Each bit m_k^q of m^q is spread with a cyclically shifted version p_k of a binary PN sequence called the<br />

watermark key. The results are summed together to generate the wideband noise vector w:<br />

w = \sum_{k=0}^{N-1} m_k^q\, p_k,   (12)<br />

where N is the number of watermark message bits in m^q.<br />
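Equation (12) can be sketched directly. The despreading step, correlating with each shifted key, is not spelled out in this section and is included here only as an illustrative assumption of how the bits come back out.

```python
import numpy as np

def spread(m_q, p):
    """Sum of +/-1 message bits times cyclic shifts of the PN key p
    (Eq. 12)."""
    w = np.zeros(len(p))
    for k, bit in enumerate(m_q):
        w += bit * np.roll(p, k)     # p_k: the key cyclically shifted by k
    return w

def despread(w, p, n_bits):
    """Recover each bit from the sign of the correlation with its
    shifted key; for a good PN sequence, the other bits' contributions
    largely average out."""
    return [int(np.sign(w @ np.roll(p, k))) for k in range(n_bits)]

rng = np.random.default_rng(1)
p = rng.choice([-1.0, 1.0], size=255)    # binary PN watermark key
bits = [1, -1, 1, 1, -1]
w = spread(bits, p)
recovered = despread(w, p, len(bits))
```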

The wideband noise w is then carefully shaped and added to the audio signal or to the DCT blocks of the image<br />

so that it causes an imperceptible change in signal quality. In the audio watermarking<br />

application, to make the watermark message imperceptible,<br />

the amplitude level of the wideband noise w is scaled<br />

down to be about 0.3 of the dynamic range of the signal. In<br />

the image watermarking application, the length of w to be embedded<br />

depends on the perceptual entropy of the image. To<br />

embed the watermark into the image, the model based on the<br />

just noticeable difference (JND) paradigm was utilized. The<br />

JND model based on DCT was used to find the perceptual entropy<br />

of the image and to determine the perceptually significant<br />

regions to embed the watermark. In this method, the image<br />

is decomposed into 8×8 blocks. Taking the DCT on the<br />

block b results in the matrix Xu,v,b of the DCT coefficients.<br />
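The spreading step of Eq. (12) can be sketched in numpy as follows; the message length and random key below are illustrative assumptions, not the paper's actual parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: 4 message bits spread by a length-256 binary PN key
pn = rng.choice([-1.0, 1.0], size=256)   # watermark key
bits = rng.choice([-1.0, 1.0], size=4)   # quantized chirp bits m^q_k

# Eq. (12): each bit modulates a cyclically shifted copy of the key,
# and the modulated copies are summed into the wideband noise vector w
w = sum(b * np.roll(pn, k) for k, b in enumerate(bits))

# Despreading with the same shifts recovers each bit's sign (cf. Eq. 16)
recovered = np.sign([w @ np.roll(pn, k) for k in range(len(bits))])
```

Because distinct cyclic shifts of a long random ±1 key are nearly orthogonal, the inner product with $p_k$ is dominated by the $k$-th bit.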

The watermark encoder for the DCT scheme is described as<br />

$X^*_{u,v,b} = \begin{cases} X_{u,v,b} + t^C_{u,v,b}\, w_{u,v,b}, & \text{if } X_{u,v,b} > t^C_{u,v,b}; \\ X_{u,v,b}, & \text{otherwise} \end{cases}$ (13)<br />

where $X_{u,v,b}$ are the DCT coefficients, $X^*_{u,v,b}$ are the watermarked DCT coefficients, $w_{u,v,b}$ is obtained from the wideband noise vector $w$, and the threshold $t^C_{u,v,b}$ is the computed JND determined for various viewing conditions such as minimum viewing distance, luminance sensitivity and contrast masking. Fig. 1 shows the block diagram of the described watermark embedding scheme.<br />

[Block diagram: linear chirp message $m^q$ and PN sequence $p$ → circular shifter ($p_k$) → modulator ($w$) → watermark insertion, using JNDs $J_{u,v}$ computed from the block-based DCT $X_{u,v}$ of the original image → watermarked coefficients $X^*_{u,v}$ → watermarked image.]<br />
Fig. 1. Watermark embedding scheme.<br />
Authorized licensed use limited to: <strong>Ryerson</strong> <strong>University</strong> Library. Downloaded on July 7, 2009 at 11:25 from IEEE Xplore. Restrictions apply.<br />


3.2. Watermark detection<br />

Fig.2 shows the block diagram of the described watermark<br />

decoding scheme. The detection scheme for the DCT based<br />

watermarking can be expressed as<br />

$\hat{w}_{u,v,b} = \dfrac{\hat{X}^*_{u,v,b} - X_{u,v,b}}{t^C_{u,v,b}}$, (14)<br />
$\hat{w} = \begin{cases} \hat{w}_{u,v,b}, & \text{if } X_{u,v,b} > t^C_{u,v,b}; \\ 0, & \text{otherwise} \end{cases}$ (15)<br />

where $\hat{X}^*_{u,v,b}$ are the coefficients of the received watermarked image, and $\hat{w}$ is the received wideband noise vector. Due to intentional and non-intentional attacks such as lossy compression, shifting and down-sampling, the received chirp message $\hat{m}^q$ will differ from the original message $m^q$ by a bit error rate (BER). We use the watermark key $p_k$ to despread $\hat{w}$, and integrate the resulting sequence to generate a test statistic $\langle \hat{w}, p_k \rangle$. The sign of the expected value of the statistic depends only on the embedded watermark bit $m^q_k$. Hence the watermark bits can be estimated using the decision rule:<br />

$\hat{m}^q_k = \begin{cases} +1, & \text{if } \langle \hat{w}, p_k \rangle > 0; \\ -1, & \text{if } \langle \hat{w}, p_k \rangle < 0. \end{cases}$ (16)<br />

We repeat the bit estimation process until we have an estimate of all the transmitted bits. Though it is possible to form an estimate of the chirp sequence from the received bits, we improve the robustness of the detection algorithm by detecting the chirp using the Discrete Polynomial Phase Transform (DPT), a phase detection algorithm.<br />
Fig. 2. The proposed watermark detection scheme.<br />

3.3. DPT-based watermark estimation method<br />

The embedded watermarks in this algorithm are linear chirps, and the received watermark can be represented as<br />
$x(n) = \exp\left( j\, a_1 (n\Delta) + j\, a_2 (n\Delta)^2 \right)$ (17)<br />

Since the DPT is able to estimate the polynomial coefficients of chirp signals with a very short computation time, we apply it to estimate the coefficients $a_1$ and $a_2$. Fig. 3 shows the original and estimated watermark messages at bit error rates of 13.6% and 19.3%; the correlation coefficients between the original and estimated watermarks are 0.9891 and 0.9516, respectively.<br />
Fig. 3. Original and estimated watermarks at (a) a BER of 13.6% and a correlation coefficient of 0.9891, and (b) a BER of 19.3% and a correlation coefficient of 0.9516.<br />

Our computer simulations show that the required calculation time is about 630 times shorter than that of a similar chirp watermark detection scheme [3][4][5].<br />

4. RESULTS AND DISCUSSION<br />

We evaluated the proposed scheme using 10 different images of size 512×512. The sampling frequency $f_{sb}$ of the watermarks equals 1 kHz. Hence the initial and final frequencies $f_{0b}$ and $f_{1b}$ of the linear chirps representing all watermark messages are constrained to [0, 500] Hz. We embed these chirps into the images with a chip length of 10000 samples. In our tests, we used a single watermark sequence having 182 message bits. To measure the robustness of the watermarking algorithm, we performed the attacks specified in the Checkmark benchmark [6]. Table 1 shows the watermark detection results for ten watermarked images after performing these attacks. The numbers in brackets under the 'Attack' category represent the number of attacks in that particular class. The 'Detection Average' represents the percentage of attacks for which the watermark was detected under each class. We make the decision on correct detection of the watermark based on the correlation between the estimated chirp and the embedded watermark. Experimentally, the threshold for the correlation coefficient is set to 0.9. The maximum BER for the MAP attack with a 100% detection rate is 15%, and in the case of the JPEG attack, in which the maximum BER is 19.9%,<br />



the DPT was able to detect 100% of the watermark messages. Table 2 shows the performance of the DPT-based technique for one of the images under the specified attacks. The results demonstrate that the proposed DPT-based scheme reliably estimates the watermark messages up to a BER of 15%; in many cases it also detects the watermark up to a BER of 19%.<br />
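The detection decision just described (a correlation-coefficient threshold of 0.9 between the estimated and embedded chirps) amounts to the following check; the function name is our own, while the threshold comes from the text.

```python
import numpy as np

def watermark_detected(w_embedded, w_estimated, thresh=0.9):
    """Declare a detection when the correlation coefficient between the
    embedded and estimated watermark chirps exceeds the threshold."""
    r = np.corrcoef(w_embedded, w_estimated)[0, 1]
    return r >= thresh
```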

Attack             | DPT  | HRT<br />
Remodulation (4)   | 65   | 57.5<br />
Copy (1)           | 100  | 97<br />
MAP (6)            | 100  | 100<br />
Wavelet (10)       | 84   | 84<br />
JPEG (12)          | 100  | 97.5<br />
ML (7)             | 57   | 56<br />
Filtering (3)      | 100  | 100<br />
Resampling (2)     | 85   | 90<br />
Colour Reduce (2)  | 35   | 45<br />
Table 1. Watermark detection averages (%) of 10 images for the Checkmark benchmark attacks.<br />

Attack           | BER (%) | Correlation<br />
dpr(3)           | 2.84    | 0.9988<br />
dpr(5)           | 11.93   | 0.9954<br />
dprcorr(3)       | 6.25    | 0.9871<br />
dprcorr(5)       | 14.77   | 0.9916<br />
medfilt(2)       | 1.7     | 0.9950<br />
medfilt(3)       | 3.4     | 0.9884<br />
medfilt(4)       | 23.3    | 0.1355<br />
trimmedmean(3)   | 13.63   | 0.9891<br />
trimmedmean(5)   | 31.82   | 0.1274<br />
midpoint(3)      | 3.98    | 0.9965<br />
midpoint(5)      | 23.4    | -0.0008<br />
dither           | 6.25    | 0.9936<br />
thresh           | 19.31   | 0.9516<br />
Table 2. Bit error rates and correlation coefficients of the proposed method for image1 under the specified attacks.<br />

The performance of the algorithm is compared with a similar approach based on the Hough-Radon Transform (HRT) [3][4][5]. Table 1 compares the detection results for the DPT- and HRT-based methods at the same watermarking capacity. The DPT-based algorithm achieves a higher or equal detection rate for seven of the attack types, and has lower computational complexity than the HRT-based method. Typically, the running time of the DPT-based method is about 6000 times less than that of the HRT-based method in Matlab. The watermarking capacity of the DPT-based technique depends on the values of the coefficients $a_1$ and $a_2$. As expected, using a higher resolution for $a_1$ and $a_2$ increases the watermarking capacity; however, it also reduces the robustness of the method. Compared to the previous HRT-based method, the proposed method has a high capacity of 182×182; it is also more robust, as indicated in Table 1.<br />

5. CONCLUSION<br />

In this paper, we proposed a watermark detection method for an image watermarking algorithm that embeds linear chirps as watermark messages. The watermark message is added to the perceptually significant regions of the image to ensure robustness of the watermark to common image processing attacks. A phase detection algorithm based on the DPT detects the phase of the watermark message. The proposed technique is able to detect the chirp message embedded in signals subjected to different BERs due to attacks on the image watermark, and provides fast detection with high accuracy. Our studies confirm the robustness of the algorithm to the Checkmark benchmark attacks.<br />

6. REFERENCES<br />


[1] S. Peleg and B. Friedlander, "The discrete polynomial-phase transform," IEEE Transactions on Signal Processing, vol. 43, no. 8, pp. 1901–1914, 1995.<br />
[2] S. Peleg and B. Friedlander, "Multicomponent signal analysis using the polynomial-phase transform," IEEE Transactions on Aerospace and Electronic Systems, vol. 32, no. 1, pp. 378–387, 1996.<br />
[3] S. Erkucuk, S. Krishnan and M. Zeytinoglu, "Robust Audio Watermarking Using a Chirp Based Technique," IEEE Intl. Conf. on Multimedia and Expo, vol. 2, pp. 513–516, 2002.<br />
[4] A. Ramalingam and S. Krishnan, "A Novel Robust Image Watermarking Using a Chirp Based Technique," IEEE Canadian Conf. on Electrical and Computer Engineering, vol. 4, pp. 1889–1892, 2004.<br />
[5] R.M. Rangayyan and S. Krishnan, "Feature identification in the time-frequency plane by using the Hough-Radon transform," Pattern Recognition, vol. 34, pp. 1147–1158, 2001.<br />
[6] S. Pereira, S. Voloshynovskiy, M. Madueno, S. Marchand-Maillet, and T. Pun, "Second Generation Benchmarking and Application Oriented Evaluation," Information Hiding Workshop III, Pittsburgh, PA, USA, April 2001.<br />


IMPROVING POSITION ESTIMATES FROM A STATIONARY<br />

GNSS RECEIVER USING WAVELETS AND CLUSTERING<br />

Mohammad Aram, Baice Li, Sridhar Krishnan, Alagan Anpalagan<br />

<strong>Ryerson</strong> <strong>University</strong> Multipath Mitigation (RUMM) Lab<br />

Department of Electrical and Computer Engineering, <strong>Ryerson</strong> <strong>University</strong>, Toronto, ON, M5B 2K3<br />
{maram, bli, krishnan, alagan}@ee.ryerson.ca<br />
Bern Grush<br />
Applied Location Corporation, 34 Dodge Rd, Toronto, ON, M1N 2A7, bgrush@appliedlocation.com<br />

Abstract<br />

Many positioning applications utilize global navigation<br />

satellite systems (GNSS) derived position estimates for<br />

stationary positions. Inexpensive navigation-grade receivers<br />

provide estimates within a few meters in relatively open skies,<br />

while more specialized devices, typically distinguished by<br />

specialized antenna design and additional post processing can<br />

achieve sub-meter accuracy. These latter devices can be two<br />

orders of magnitude more expensive than navigation-grade<br />

receivers but are still subject to measurement error due to<br />

severe multipath in built-up areas.<br />

In our experiments we post-process positions computed by<br />

an inexpensive receiver by applying wavelet filtering followed by clustering and characterization. This produces a reliable<br />

and significant reduction in variance of the estimate, a<br />

normalization of the data scatter-distribution and a<br />

characterization of the estimate that is amenable to a wider<br />

range of statistical comparisons and tests than would be<br />

possible for unfiltered, highly non-Gaussian distributions,<br />

especially as occur in urban canyon circumstances.<br />

Keywords. GNSS; Urban canyon; Multipath mitigation;<br />

Wavelets; k-means; RAIM<br />

1. Introduction<br />

Ongoing developments in GNSS space segment (Galileo and<br />

GPS modernization) are poised to provide significantly more<br />

and better ranging signals for positioning applications. Recent<br />

innovation in high-sensitivity receiver technology (HSGNSS)<br />

enables the acquisition of attenuated and obstructed signals.<br />

These additional signals dramatically lower the probability of a<br />

gap [5,6,7] (loss of lock on enough signals to compute a<br />

position) in challenging signal environments such as in "urban<br />

canyon", heavy foliage, indoors, etc. While inertial navigation<br />

may fill in those gaps in dynamic applications (navigation,<br />

logistics tracking), it cannot help stationary or near-stationary<br />

applications such as survey, E911, asset or personnel location,<br />

or metered parking.<br />

HSGNSS signal measurements are biased and especially<br />

noisy due to excessive multipath and low-power signals [2].<br />

Taken together, GPS modernization, Galileo and HSGNSS,<br />

means the potential opportunity of many more applications, but<br />

generally in harsh signal environments. Specific noise sources<br />

are entirely dependent on conditions local to the antenna of the<br />


IEEE CCECE/CCGEI, Ottawa, May 2006<br />


receiver in question and are not addressable by augmentation<br />

such as differential GPS (DGPS) or wide area augmentation<br />

systems (WAAS), or broad-area correction, such as atmospheric,<br />

etc. Even traditional receiver autonomous integrity<br />

monitoring (RAIM) has diminished utility since it was<br />

developed for signal environments with an assumption of zero<br />

or one fault in a field of 5 to 11 signals. We can now project<br />

near-future, integrated GPS/Galileo applications with 4 to 22<br />

signals in harsh environments where many or all signals are<br />

disturbed.<br />

To tackle these harsher signal environments, new antenna<br />

designs [2] and new fault detection and elimination techniques<br />

(FDE) that extend RAIM approaches [6,7,8] are being<br />

developed. Specialized antennas add system costs and the FDE<br />

techniques are computationally complex so that they may be<br />

impractical for larger signal sets.<br />

This paper describes an alternate approach: a process that<br />

includes wavelet filtering, weighted clustering and<br />

characterization of position estimates from a stationary<br />

receiver. This approach results in reduced variance of the<br />

estimate and a normalization of the data-scatter which, in turn,<br />

provides an inexpensive method for applications that require<br />

accuracy of 1-2m for short-dwell readings (under ten minutes)<br />

in many multipath circumstances. As space segment<br />

improvements (Galileo, GPS modernization) and receiver<br />

design improvements (high sensitivity) continue to come onstream,<br />

multipath mitigation such as we propose here tends to<br />

reduce the relative difference in accuracy between open skies<br />

and urban canyon.<br />

The next section of this paper describes our experimental<br />

methods, including data collection and processing algorithms.<br />

The third section describes and demonstrates our results for<br />

each of wavelet filtering, windowing and clustering.<br />

2. Experimental Methods<br />
2.1. Data Collection<br />

To support a variety of experiments, we gathered street-level,<br />

urban canyon, carrier phase and position data at multiple<br />

locations in downtown Toronto (Canada). Four sites were<br />

selected to represent distinct levels of urban canyon effects<br />

ranging from moderate to extreme multipath interference. At<br />



each location we collected ten 15-minute samples over five<br />

sidereal days, for a total of forty 900-second data sets.<br />

Figure 1 shows the data collection setup we used.<br />

For this particular experiment, we simply used the 3D<br />

position estimates generated by the receiver without<br />

consideration for outlier removal. Fig 2 shows a typical sample<br />

showing high positioning variability. Fig 3 demonstrates that<br />

even the geometric mean of a 15-minute sample can be highly<br />

variable in severe multipath. We wish to mitigate both forms of<br />

variability.<br />

Figure 1: Data collection equipment consisted of: u-blox TM-LP 15 (not HS) evaluation kit with u-center ANTARIS software and a laptop. An active antenna was mounted on a portable antenna mount 1 m above the ground. We did not use an external ground plane.<br />

[Fig. 2/Fig. 3 plot area: northing deviation (m) vs. easting deviation (m); histogram of easting deviation of geometric means of 15-min samples (m).]<br />

Figure 3: We sampled the same locations in 15-minute samples over<br />

5 sidereal days. The geometric mean of each of these samples can<br />

drift considerably. At this location, a spread of about 40 meters in<br />

both Easting and Northing is apparent over the 10 samples taken.<br />

2.2. Processing Algorithms<br />

Our process comprises two fundamental steps: filtering<br />

using wavelet analysis, and an inverse-variance weighted<br />

estimate of the mean position using either a moving window or<br />

a k-means algorithm to cluster the data.<br />

[Process flow: position data from receiver → wavelet filtering → moving-window or k-means inverse-variance weighting → Gaussified data scatter with lower variance.]<br />

Since multipath error is a time varying process, wavelet<br />

analysis can be used effectively to mitigate its effects. We<br />

tested various wavelets including Daubechies, Coiflets,<br />

Symlets, Morlet, and Meyer. Although the results from<br />

Symlets and Daubechies were very similar, the analysis was<br />

carried out using the 'Daubechies order 7 (db7)' filter and<br />

wavelet coefficients were modified based on thresholding [4].<br />

Outlier removal was applied to the wavelet output by excluding all points exceeding 3σ from the mean of the filtered data (where σ is the standard deviation).<br />
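The filtering and outlier-rejection steps above can be sketched as follows. This is a simplified stand-in under stated assumptions: a one-level Haar transform with soft thresholding replaces the paper's db7 filtering, and the threshold value is a toy choice rather than a rule from Donoho [4].

```python
import numpy as np

def haar_denoise(x, thresh):
    """One-level Haar wavelet soft-threshold denoiser (toy stand-in for db7).

    x must have even length; `thresh` plays the role of the wavelet
    coefficient thresholding rule cited from [4].
    """
    a = (x[0::2] + x[1::2]) / np.sqrt(2.0)                 # approximation
    d = (x[0::2] - x[1::2]) / np.sqrt(2.0)                 # detail
    d = np.sign(d) * np.maximum(np.abs(d) - thresh, 0.0)   # soft threshold
    y = np.empty_like(x)
    y[0::2] = (a + d) / np.sqrt(2.0)                       # inverse Haar
    y[1::2] = (a - d) / np.sqrt(2.0)
    return y

def remove_outliers(x, k=3.0):
    """Drop points farther than k standard deviations from the mean."""
    return x[np.abs(x - x.mean()) <= k * x.std()]
```

Shrinking only the detail coefficients reduces the variance of the track while leaving its mean untouched, which matches the behaviour reported in Section 3.1.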

Rather than simply computing the geometric mean of the<br />

wavelet filtered data as a new position estimate, a subsequent,<br />

independent process was applied to the position data output<br />

from the wavelet filter. Noting that the variance of positioning<br />

data, especially in urban canyon, is non-stationary (varies with<br />

time), we reasoned that weighting each datum inversely with<br />

its local variance would tend to suppress the contribution to the mean estimated position from high-velocity data segments. Such data segments can be caused by a satellite rising or falling at the horizon, or changing from line-of-sight to non-line-of-sight multipath (or vice versa), and other biasing effects.<br />
Figure 2: 15 minutes of data collected with the equipment in Fig 1. This is typical of about half of our street-level readings in downtown Toronto.<br />

and spatial. Temporal variance weighting is easily achieved by<br />

computing the variance of short temporal data segments<br />

(windows) and then by inversely weighting the local means of<br />

those temporal windows relative to their local variance.<br />

Spatial variance weighting can be achieved via spatial data<br />

clustering. Over a 15 minute sample in a harsh signal<br />

environment one can observe the spatial non-stationarity of the<br />

position estimate as two or more clusters of points in the<br />

scatter (fig 4). If we use a statistical clustering algorithm, such<br />

as k-means, we would tend to group spatially similar estimates<br />

regardless of whether they are temporally adjacent. The mean<br />

of each such cluster can then be weighted by the inverse of its<br />

variance. k-means is more computationally intensive than a<br />

moving temporal window, but it can be expected to perform<br />

somewhat better. This is because a cluster is unlikely to span a<br />

positioning discontinuity, while a temporal window is more<br />

likely to do so.<br />

3. Experimental Results<br />

3.1. Wavelet filtering<br />

The effect of our wavelet filtering was to always reduce<br />

variance (fig 4) and to often Gaussify a sample (normalize its<br />

data scatter) by reducing both skew and kurtosis (table 1).<br />

[Figure 4: position data scatter before (+) and after (o) wavelet filtering.]<br />


The output of this wavelet filtering is consistent: lower<br />

variance (fig 5) and Gaussified data (fig 6), centered very<br />

nearly at the same geometric mean. However we know that the<br />

mean itself "wanders" over time (fig 3), so that the first<br />

moment still retains a bias effect that we now wish to reduce.<br />

3.2. Windowing<br />

Recognizing that the variance process for these data sets is<br />

non-stationary, we wish to weigh more heavily data segments<br />

that are momentarily stationary (low instantaneous velocity)<br />

and weigh less heavily data segments that exhibit high<br />

instantaneous velocity. While such a process cannot<br />

necessarily discriminate between multipath and non-multipath<br />

contaminated data, it does take advantage of the fact that<br />

ranging signals undergoing rapid change in multipath<br />

circumstances exhibit more high-velocity bursts, hence we can<br />

reduce the impact of these momentary data subsets for a<br />

stationary receiver.<br />

For our temporal windowing process, we experimented with<br />

several window lengths and window overlaps. We report here<br />

using windows of 20 seconds that overlap by 10 seconds. We<br />

then inversely weighted each local mean by the local variance<br />

and computed a new weighted mean for the full sample as:<br />

$\bar{x} = \left( \sum_{i=1}^{N} \frac{x_i}{\sigma_i^2} \right) \Big/ \left( \sum_{i=1}^{N} \frac{1}{\sigma_i^2} \right)$,<br />
where $x_i$ and $\sigma_i^2$ are the mean and variance of the $i$-th window.<br />
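The windowed inverse-variance estimate with the 20 s window and 10 s hop reported above can be sketched as follows; the 1 Hz sampling rate and the variable names are our own assumptions.

```python
import numpy as np

def inverse_variance_mean(x, win=20, hop=10):
    """Inverse-variance weighted mean of sliding-window local means.

    x is one coordinate (e.g. easting) of the wavelet-filtered track,
    sampled at 1 Hz; windows of `win` samples advance by `hop` samples.
    """
    means, weights = [], []
    for start in range(0, len(x) - win + 1, hop):
        seg = x[start:start + win]
        if seg.var() > 0:                    # skip degenerate windows
            means.append(seg.mean())
            weights.append(1.0 / seg.var())
    w = np.asarray(weights)
    return float(np.asarray(means) @ w / w.sum())
```

Quiet (low-velocity) windows receive large weights, so a burst of high-velocity multipath data contributes little to the final position estimate.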

This process has the effect of causing a set of means from a<br />

single location to show reduced scatter. In other words, this<br />

process tends to remove noise from the mean of a 15-minute<br />

data set collected in urban canyon (fig 7).<br />

[Figure 7 plot area: Northing (m) vs. Easting (m).]<br />
Figure 7: Shows the relative shift in final position estimates for all 40 15-minute datasets in our experiment. There is one black and one red ellipse representing the 3σ bounds for each of the four locations, with 10 means each calculated from 900 per-second samples for each location. The black points and ellipses are for the wavelet output and the red are the same for the output after the windowing process.<br />


3.3. k-means clustering<br />

When examining raw GPS data plots, especially the noisier<br />

ones, one often sees areas of two or more clusters of data<br />

connected by sparse, high-velocity segments. We reasoned that<br />

if we could isolate those clusters, calculate local means and<br />

once again weight them by their inverse variance we would see<br />

an even greater improvement in the ability to reduce the spread<br />

in the geometric means, sample-over-sample.<br />

To do this, we applied a k-means clustering algorithm<br />

(k=15), calculated the mean and variance for each of these 15<br />

clusters and computed a weighted mean for the entire dataset,<br />

as before.<br />
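The cluster-then-weight step can be sketched with a plain Lloyd's k-means; for brevity this sketch uses k = 2 and a deterministic spread-out initialization of our own choosing, whereas the paper uses k = 15.

```python
import numpy as np

def kmeans(points, k, iters=50):
    """Plain Lloyd's k-means with deterministic spread-out initialization."""
    centers = points[np.linspace(0, len(points) - 1, k).astype(int)].copy()
    for _ in range(iters):
        # Squared distance from every point to every center
        d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = points[labels == j].mean(axis=0)
    return labels, centers

def cluster_weighted_position(points, k=2):
    """Weight each cluster's mean position by the inverse of its variance."""
    labels, _ = kmeans(points, k)
    est, wsum = np.zeros(points.shape[1]), 0.0
    for j in range(k):
        c = points[labels == j]
        if len(c) < 2:                       # skip near-empty clusters
            continue
        w = 1.0 / c.var(axis=0).sum()
        est += w * c.mean(axis=0)
        wsum += w
    return est / wsum
```

A tight cluster of low-multipath fixes dominates the weighted estimate, while a sparse, high-variance cluster is largely discounted.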

The overall result of this latter approach (fig 8) provides a<br />

further improvement over the windowing approach (fig 7)<br />

reducing the variance in the "wandering means" (fig 3). By<br />

examining the concentric black-red 3σ ellipses one can see a<br />

reduction of 20-35%.<br />

4. Conclusions<br />

Wavelet filtering can be used to reduce variance, skew and<br />

kurtosis in GPS position data collected by a stationary receiver.<br />

Temporal windowing and spatial clustering of that output<br />

can be used to further reduce data biases in urban canyon that<br />

tend to make even aggregated mean estimates "wander" about<br />

their true position.<br />

These experiments, while successful, leave considerable<br />

room for refinement. Future work includes: setting dynamic<br />

thresholds for wavelet filtering, non-linear treatment of the<br />

inverse-weighting for the moving windows, determining k<br />

dynamically for the k-means algorithm, or using fuzzy c-means.<br />

Indeed, the fixed, 15-minute sampling period of this<br />

experiment can also be dynamic allowing greater accuracy<br />

when time/cost permits and more rapid results in locations of<br />

modest multipath.<br />

[Figure 8 plot area: Northing (m) vs. Easting (m).]<br />
Figure 8: Shows the same information as in fig 7, except that the red data and ellipses represent the output after the k-means process. It is as evident in individual results as it is in these summary plots that k-means out-performs the moving window process in our experiment.<br />



Acknowledgements<br />

This work of <strong>Ryerson</strong> <strong>University</strong> Multipath Mitigation<br />

(RUMM) labs was supported by <strong>Ryerson</strong> <strong>University</strong> (Toronto,<br />

Canada) and a grant from the Natural Sciences and<br />

Engineering <strong>Research</strong> Council of Canada (NSERC).<br />

References<br />

[1] M. Aram, S. Krishnan, A. Anpalagan, and B. Grush, "Wavelet<br />

<strong>Analysis</strong> and Data Processing of GPS <strong>Signal</strong>s for High Precision<br />

Position Computation", unpublished.<br />

[2] T. H. D. Dao, H. Kuusniemi, and G. Lachapelle, "HSGPS<br />

Reliability Enhancements Using a Twin-Antenna System",<br />

Proceedings of The European Navigation Conference GNSS<br />

2004, Rotterdam, 17-19 May 2004.<br />


[3] I. Daubechies, "Ten Lectures on Wavelets", CBMS-NSF Regional Conference Series in Applied Mathematics, vol. 61, SIAM,<br />

Philadelphia, 1992.<br />

[4] D. L. Donoho, "De-noising by Soft-Thresholding" IEEE Trans.<br />

on Information Theory, Volume 41, Issue 3, May 1995 p.613.<br />

[5] Y. Feng, "Predictions Using GPS with a Virtual Galileo Constellation<br />

- Future GNSS Performance", GPS World, March 2005<br />

[6] H. Kuusniemi, "User-Level Reliability and Quality Monitoring<br />

in Personal Satellite Navigation", PhD Thesis, Tampere<br />

<strong>University</strong> of Technology, Finland, 2005.<br />

[7] A. Morrison, S. Krishnan, A. Anpalagan, and B. Grush,<br />

"Receiver Autonomous Mitigation of GPS Non Line-of-Sight<br />

Multipath Errors", Institute of Navigation, National Technical<br />

Meeting (ION-NTM) 2006<br />

[8] R. Puri, A. El Kaffas, A. Anpalagan, S. Krishnan, and B. Grush,<br />

"Multipath Mitigation of GNSS Carrier Phase <strong>Signal</strong>s for an On-<br />

Board Unit for Mobility Pricing," CCECE, 2004.<br />

Authorized licensed use limited to: <strong>Ryerson</strong> <strong>University</strong> Library. Downloaded on July 7, 2009 at 10:57 from IEEE Xplore. Restrictions apply.


KEYSTROKE IDENTIFICATION BASED ON GAUSSIAN MIXTURE MODELS<br />

Danoush Hosseinzadeh, Sridhar Krishnan, April Khademi<br />

Department of Electrical and Computer Engineering<br />

<strong>Ryerson</strong> <strong>University</strong>, Toronto, ON - M5B 2K3 Canada<br />

Email: (danoushh@hotmail.com) (krishnan@ee.ryerson.ca) (akhademi@ieee.org)<br />

ABSTRACT<br />

Many computer systems rely on the username and password<br />

model to authenticate users. This method is widely used, yet it<br />

can be highly insecure if a user’s login information has been<br />

compromised. To increase security, some authors have proposed<br />

keystroke patterns as a biometric tool for user authentication;<br />

they can be used to recognize users based on how<br />

they type. This paper introduces a novel method that applies<br />

GMMs to keystroke identification. The major benefit of this<br />

method is the ability to update the user’s model each time he<br />

or she is authenticated. Therefore, as time goes on, each user<br />

model accurately reflects the changes in that user’s keystroke<br />

pattern. Using this method, an FAR and an FRR of approximately 2% were achieved. However, it should be noted that 50% of the test subjects were traditional "two finger" typists and, therefore, this had a disproportionately negative impact on the results.<br />

1. INTRODUCTION<br />

Undeniably, computers have become an essential part of daily<br />

life for many people around the world. One of the main reasons<br />

for this trend is that computers allow us to access information<br />

from any part of the globe. Additionally, they allow<br />

us to perform many functions that would otherwise require a<br />

physical presence else where, such as banking, shopping and<br />

some personal tasks such as online chatting and so on.<br />

Despite their importance, computer systems are generally<br />

protected with primitive techniques, such as usernames and<br />

passwords. Since passwords can be stolen, accidentally revealed<br />

or even cracked by dictionary programs, there have been<br />

a great number of electronic crimes in recent years. In fact,<br />

reports indicate that in 2002, online retailers lost an estimated<br />

US$1.64 billion in fraudulent sales and an additional<br />

US$1.82 billion in legitimate sales that looked suspicious [1].<br />

To prevent crime and increase security, access should only<br />

be given to the correct users. To achieve this goal, some authors<br />

have suggested the use of keystroke identification as a<br />

method of preventing unauthorized users from accessing a<br />

computer system [2][3][4][5]. Keystroke identification is a<br />

biometric tool based on the principle that every person has<br />

a unique typing pattern, similar to a handwritten signature [2][5]. In particular, for regularly typed strings, this pattern<br />

can be very consistent and therefore, it can be effective for<br />

user identification. Furthermore, we argue that a person’s<br />

keystroke pattern would be harder to duplicate than a signature<br />

because an intruder does not have an unlimited number of<br />

tries to practice it. In a commercial system, a user who cannot<br />

successfully log in after a predetermined number of attempts could be locked out of the system, thereby limiting the<br />

intruder’s practice time. Studies have also shown that even<br />

among professional typists there is a great deal of variability<br />

in the keystroke patterns [6]. This makes user forgery very<br />

difficult.<br />

By exploiting these keystroke patterns, we can add an additional<br />

layer of security to the username/password model.<br />

Even if authorized persons reveal their passwords, no unauthorized<br />

user can gain access to the system. This idea has<br />

many internet-based applications, especially for online banking,<br />

email and user account protection, just to name a few. In<br />

fact, we can completely change the username/password security<br />

model to a model which only relies on keystroke patterns.<br />

Aside from increased security, this model would benefit users<br />

because they will not have to remember many different username/password pairs for different accounts. Also, the possibility<br />

of a user forgetting their password or a user having a<br />

password that is easy to decipher would be reduced.<br />

In this paper, a brief review will be presented on what features<br />

could be extracted from keystroke patterns and under<br />

what conditions good features can be acquired. Also, a new<br />

method for modeling these features based on Gaussian Mixture<br />

Models (GMMs) is proposed. For completeness, a brief<br />

review of GMMs is presented before describing the novel<br />

algorithm used. Lastly, the results and conclusions are presented.<br />

2. KEYSTROKE FEATURES<br />

2.1. Features From Keystrokes<br />

It has been shown that for a given user at least two unique features<br />

can be extracted from keystroke patterns [6]. Keystroke<br />

patterns, which are produced by the user during typing, exhibit<br />

unique timing characteristics. One such characteristic is<br />

the keystroke latencies (KL), which is the time between striking two consecutive keys.<br />

1­4244­0469­X/06/$20.00 ©2006 IEEE. ICASSP 2006.<br />

Another characteristic (feature) is<br />

the key down time (KD), which is the time a particular key<br />

is held down. These features have been used in previous research<br />

to produce good results in user identification.<br />

For a string of length N, there are N − 1 KL data points<br />

and N KD data points. These data points can be used to create<br />

two feature vectors. Fig. 1 shows the KL and KD plot<br />

for a particular user (one of the authors) that has typed his<br />

name repeatedly. Fig. 1 is included to illustrate the stability<br />

and strong correlation that exists between each of the feature<br />

vectors, KD and KL.<br />

[Figure: plots of the KL feature vector (top) and KD feature vector (bottom) for repeated typings of "danoush_hosseinzadeh".]<br />

Fig. 1. Several plots of the keystroke latency (KL) and key down time (KD) feature vectors for one user. The bold line is the average of the vectors. The space character is represented by "_".<br />
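As a concrete illustration of the two features, the KL and KD vectors can be computed from timestamped key events. The sketch below is not from the paper; it assumes each keystroke is logged as a (key, press, release) tuple in milliseconds, and the names are illustrative.<br />

```python
def keystroke_features(events):
    """Compute KD and KL feature vectors from keystroke events.

    events: list of (key, press_ms, release_ms) tuples, in typing order.
    Returns (kd, kl): N key-down times and N - 1 keystroke latencies,
    matching the counts stated in the text for a string of length N.
    """
    # KD: how long each key is held down.
    kd = [release - press for _, press, release in events]
    # KL: time between striking two consecutive keys (press-to-press).
    kl = [events[i + 1][1] - events[i][1] for i in range(len(events) - 1)]
    return kd, kl

# Toy example: three keys struck 150 ms apart, each held for 100 ms.
events = [("d", 0, 100), ("a", 150, 250), ("n", 300, 400)]
kd, kl = keystroke_features(events)
# kd == [100, 100, 100], kl == [150, 150]
```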

2.2. Designing Good Features<br />

For keystroke identification, a robust feature pattern is one<br />

that is stable over repeated trials. To produce a stable feature<br />

pattern, the typist should be able to type the given text<br />

without any hesitation. Strings that require the typist to stop<br />

and think about the next letter or cause them to pause and<br />

search for a certain key, will result in an unstable pattern. As<br />

mentioned before, research has shown that the best results are<br />

obtained when users type familiar text, such as their first and<br />

last names. Such strings are intuitively easy to type because<br />

they have been used for many years. Therefore, a distinct pattern<br />

can be seen when users type their name.<br />

Another important consideration when selecting appropriate<br />

text, is the number of characters. Shorter text tends to<br />

increase classification error because it can be more easily reproduced by others [5]. This is true because a smaller number of characters produces a less complex pattern that can be imitated<br />

more easily by imposters. The same problem exists with hand<br />

written signatures, where short and simple signatures are often<br />

easy to copy.<br />

In previous work, it has been suggested that no less than<br />

ten characters should be used for keystroke identification [5].<br />

In this work, the user is required to enter at least ten characters,<br />

which can be easily accomplished with the first and last<br />

name of the individual. At the same time, no additional effort was made to increase the minimum character<br />

length, because it might be difficult or annoying for some<br />

users to meet the requirement. This would also pose a strict<br />

requirement if the user’s full name does not meet the minimum<br />

character requirement, or if the user chooses a different<br />

string. These factors could have a negative impact on false<br />

acceptance rates (FAR) and false rejection rates (FRR).<br />

2.3. Data Acquisition Model<br />

To collect timing information, a data acquisition application<br />

named 'KbApp' was designed for the Windows operating system. With this application, keystroke timing error was minimized to less than 0.5 milliseconds, with the option of reducing it to 100 nanoseconds. However, this error does not have a significant impact on the results because the average feature has a time value on the order of 100 milliseconds.<br />
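The timing bookkeeping behind such a logger can be sketched as follows. KbApp itself is a Windows application whose internals are not described in the paper, so the class name and hook methods below are purely illustrative; only the use of a nanosecond-resolution clock reflects the text.<br />

```python
import time


class KeyTimer:
    """Minimal keystroke timing logger sketch. A real event source
    (e.g. an OS keyboard hook) would call key_down/key_up.
    """

    def __init__(self):
        self.events = []   # completed (key, press_ms, release_ms) tuples
        self._down = {}    # keys currently held, mapped to press time

    def _now_ms(self):
        # Nanosecond clock converted to milliseconds, comfortably below
        # the ~0.5 ms error budget reported for the KbApp logger.
        return time.perf_counter_ns() / 1e6

    def key_down(self, key):
        self._down[key] = self._now_ms()

    def key_up(self, key):
        self.events.append((key, self._down.pop(key), self._now_ms()))
```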

2.4. Review of GMMs<br />

GMMs are a well known method for modeling the probability<br />

distribution of random events. With a weighted sum of several L-dimensional Gaussian functions, it is possible to closely approximate any distribution, provided that enough training data is<br />

available. The complete GMM can be expressed by the mean<br />

vector µi, covariance matrix Σi and mixture weights wi as<br />

given below:<br />

λ = {wi, µi, Σi}, i = 1, ..., K. (1)<br />

Using the model λ, we can obtain the likelihood that x belongs to the model λ by<br />

p(x|λ) = Σ_{i=1}^{K} wi bi(x), (2)<br />

where bi is given by an L-dimensional Gaussian PDF as shown below:<br />

bi(x) = (2π)^{−L/2} |Σi|^{−1/2} exp( −(1/2) (x − µi)^t Σi^{−1} (x − µi) ). (3)<br />

GMMs can be very effective in modeling the type of distributions<br />

found in keystroke patterns, which are shown in Fig. 1.<br />

To verify the likelihood that a given feature vector x belongs<br />

to a model λ, the natural logarithm of the associated<br />

probability is used. This value, which we call the Log-Likelihood (LL), is given below:<br />

LL = log{p(x|λ)} = log( Σ_{i=1}^{K} wi bi(x) ). (4)<br />
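Equations (2)-(4) can be illustrated with a minimal univariate sketch. The paper uses full L-dimensional Gaussians; the 1-D simplification below is an assumption made for brevity, and the function name is illustrative.<br />

```python
import math


def gmm_log_likelihood(x, weights, means, variances):
    """Log-likelihood log p(x|λ) of a 1-D GMM λ = {w_i, µ_i, σ²_i}, i = 1..K.

    A univariate special case of Eqs. (2)-(4); the paper's full-covariance
    form follows the same pattern with matrix algebra.
    """
    p = 0.0
    for w, mu, var in zip(weights, means, variances):
        # b_i(x): Gaussian density of component i, as in Eq. (3).
        b = math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2 * math.pi * var)
        p += w * b  # mixture: p(x|λ) = Σ w_i b_i(x), Eq. (2)
    return math.log(p)


# Single standard-normal component evaluated at its mean.
ll = gmm_log_likelihood(0.0, [1.0], [0.0], [1.0])
```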


3. A NOVEL KEYSTROKE IDENTIFICATION<br />

METHOD<br />

The novel method proposed in this paper uses GMMs to model<br />

keystroke timing information and uses the log-likelihood measure<br />

to authenticate the user based on a threshold.<br />

3.1. GMM Training and Verification<br />

To produce a GMM, the user is first required to enroll into the<br />

system by typing their full name ten times. These ten samples<br />

produce twenty feature vectors; ten KL vectors and ten<br />

KD vectors. From these two sets of ten sample vectors, two<br />

GMMs can be trained, one for the KD feature and one for the<br />

KL feature. The expectation maximization (EM) algorithm<br />

was used to train the GMMs.<br />

Upon verification, the user is required to re-enter their full<br />

name. From this test vector, the KL and KD feature vectors<br />

are extracted and compared with the user’s model, which is<br />

obtained from the enrolment session. Equation (4) is used to calculate the log-likelihood that the test vector x belongs to the given model. This result is then compared with the user's<br />

threshold before access is granted or denied.<br />

The results show the statistics for the system when access<br />

is based on using the KD feature, the KL feature and a combination<br />

of KL and KD features. In the latter case, the test<br />

vector is compared with both user models (KL model and KD<br />

model) before access is granted. Also, each time the user is<br />

authenticated successfully, both GMM models and thresholds<br />

are updated with the new information.<br />
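The combined KL & KD decision described above can be sketched as follows. The function names and the convention that a log-likelihood at least as large as the stored threshold grants access are illustrative assumptions, with loglik_fn standing in for Equation (4).<br />

```python
def verify(test_kl, test_kd, model_kl, model_kd, thr_kl, thr_kd, loglik_fn):
    """Combined KL & KD check: the test sample must pass both per-feature
    thresholds for access to be granted (the strictest of the three
    schemes compared in the paper).
    """
    ok_kl = loglik_fn(test_kl, model_kl) >= thr_kl
    ok_kd = loglik_fn(test_kd, model_kd) >= thr_kd
    return ok_kl and ok_kd
```

With a toy similarity score in place of the GMM log-likelihood, `verify(1.0, 1.0, 0.0, 0.0, -2.0, -2.0, lambda x, m: -abs(x - m))` grants access, while a test value far from the model is rejected.<br />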

3.2. Calculating Model Thresholds<br />

To obtain the user’s threshold, the Leave-One-Out-Method<br />

(LOOM) is used. The LOOM is as follows: for N feature<br />

vectors, N − 1 vectors are used to train the model and the<br />

last vector is used to test the likelihood that it belongs to that<br />

model, using Equation 4. This test can be performed N times,<br />

where each time a different vector is used to test the model.<br />

The LOOM produces N likelihood measures<br />

and can be expressed by<br />

LLj = log{p(xj|λ)}, j = 1, 2, ..., N (5)<br />

where λ is a GMM that has been trained with N − 1 vectors<br />

not including the jth vector and xj is the test vector.<br />

These N log-likelihood results are further processed before<br />

selecting the model's threshold. From these likelihood<br />

values, the minimum value that falls within the range of three<br />

standard deviations away from the mean is set as the model<br />

threshold, as given below:<br />

Threshold = min_{j} { LLj : |LLj − mean(LL)| < 3σ }, (6)<br />

where mean(LL) is the mean and σ is the standard deviation of the LL values obtained from the leave-one-out method.<br />
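The LOOM threshold computation of Equations (5)-(6) can be sketched as below. Here train_fn and loglik_fn are placeholders for the paper's EM-trained GMM and Equation (4); they are not part of the original text.<br />

```python
import statistics


def loom_threshold(vectors, train_fn, loglik_fn):
    """Leave-one-out threshold per Eqs. (5)-(6).

    vectors: the N enrolment feature vectors.
    train_fn: trains a model from N - 1 vectors (e.g. EM for a GMM).
    loglik_fn: log-likelihood of one vector under a model.
    """
    lls = []
    for j in range(len(vectors)):
        # Train on everything except vector j, then score vector j: Eq. (5).
        model = train_fn(vectors[:j] + vectors[j + 1:])
        lls.append(loglik_fn(vectors[j], model))
    mean, sigma = statistics.mean(lls), statistics.pstdev(lls)
    # Smallest LL still within three standard deviations of the mean: Eq. (6).
    return min(ll for ll in lls if abs(ll - mean) < 3 * sigma)
```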

The model generation and threshold calculation procedures<br />

are repeated every time the user has been verified so that<br />

the model and threshold are adaptive and can change with the<br />

user over time.<br />

3.3. Authenticating The User<br />

User authentication is the main goal of this work. To achieve<br />

this, the keystroke model should be robust enough to produce<br />

a low false rejection rate (FRR) and a low false acceptance<br />

rate (FAR). FAR is the rate at which intruders can gain access<br />

to a valid user’s account, and FRR is the rate at which valid<br />

users are denied access to their own account. Obviously, both<br />

these measures should be as low as possible.<br />

In this approach, authentication is performed in two stages.<br />

If the user is denied access in the first stage, they are given a second chance to enter their name. Using this method, a significant improvement was seen in the FRR, as discussed in the results section.<br />

4. EXPERIMENTAL RESULTS<br />

Before presenting the results, the reader is reminded that the<br />

number of initial training vectors used to calculate the model<br />

thresholds was ten. Because it is desired to have an accurate<br />

threshold based on the training vectors, the LOOM was used,<br />

as described in Section 3.2. It has been shown that the LOOM<br />

provides the least biased estimate for small databases [7].<br />

Therefore, the model thresholds used to authenticate the users<br />

are optimal given the size of the database.<br />

The results for FRRs and FARs for three different cases<br />

are presented in Table 1. It should be noted by the reader<br />

that the algorithm should also function well in terms of FRR<br />

and FAR, over time. The main reason for this behavior is that<br />

the proposed method adaptively selects the threshold that best<br />

suits the individual user, based on the LOOM. Also, the algorithm<br />

has shown that using a two-stage verification process (i.e., the user is given two chances for authentication) decreases the FRR significantly.<br />

To perform imposter tests, two typists were chosen to observe<br />

and imitate the other users’ typing pattern. The results<br />

indicate an average FAR and FRR of about 2% using both<br />

features. These figures are comparable to other techniques<br />

however, a direct comparison with other methods cannot be<br />

justified because in each experiment a different database has<br />

been used. In our database, four out of the eight typists were traditional "two-finger" typists. We believe this led to<br />

poor performance in both the FAR and the FRR because these<br />

types of users do not produce a very stable keystroke pattern<br />

and at the same time can be copied easily. Therefore, because<br />

their finger patterns can be easily seen and imitated by the<br />

imposter users, the FAR results presented here are skewed. In<br />

terms of FRR, these users also do not perform well because<br />

they have a lot of variation in their typing pattern. In fact,<br />



Table 1. Experimental Results for FRR and FAR<br />

KL Feature KD Feature KL & KD Features<br />

User FRR(%) FAR(%) FRR(%) FAR(%) FRR(%) FAR(%)<br />

1 0 0 0 0 0 0<br />

2 0 0 0 0 0 0<br />

3 5.3 14.3 0 14.3 5.3 7.1<br />

4 0 9.5 0 0 0 0<br />

5 0 0 0 0 0 0<br />

6 5.6 0 5.6 0 8.3 0<br />

7 0 50 0 10 0 10<br />

8 0 0 5.9 20 5.9 0<br />

Average(%) 1.4 9.2 1.4 5.5 2.4 2.1<br />

more users should be enrolled before the performance can be<br />

fully evaluated.<br />

This experiment obtained two features from the keystroke<br />

data and performed three similarity tests. The combination<br />

of the KL and KD features should produce a lower FAR and<br />

higher FRR compared to using either of the features individually.<br />

This is due to the fact that a user must correctly produce<br />

both features simultaneously. These trends were observed in<br />

the results, as can be seen from Table 1. A major benefit of<br />

this method over existing techniques is the ability to update<br />

the user model each time he or she is successfully authenticated.<br />

Therefore, as time goes on, each user’s model accurately<br />

reflects the changes in that person’s keystroke pattern.<br />

5. CONCLUSIONS<br />

A novel method for authenticating computer users based on<br />

keystroke identification was presented. Upon verification, the<br />

keystroke latencies and key hold-down times for the user’s<br />

keyboard inputs were recorded and compared with a predefined<br />

individualistic model. Access was granted if the user’s<br />

input reached a certain threshold. A new method for calculating<br />

the model threshold was also introduced using the LOOM<br />

and log-likelihood of the feature vectors.<br />

Ideally the FAR and the FRR should be very small with<br />

more emphasis given to the former because a security breach is<br />

more critical than a valid user being forced to re-authenticate.<br />

Based on this logic, the best results were obtained using both<br />

the KL and KD features simultaneously, which produced a<br />

FRR of 2.4% and a FAR of 2.1%.<br />

Despite the fact that these results are based on a small<br />

database, it has been shown by this work that GMMs can be<br />

used effectively to identify users based on their keystroke patterns.<br />

Furthermore, despite the fact that 100% classification<br />

accuracy was not achieved, more users should be enrolled<br />

using this approach before a definitive answer can be given<br />

about the capability of the system. As mentioned earlier, the<br />

results presented are skewed because of the type of users enrolled<br />

(50% of the users were two-finger typists). This technique<br />

could be further improved to incorporate the varied nature of<br />

the different typists.<br />

GMMs may be used with other metrics to improve both<br />

the FAR and FRR, or the threshold procedure can be modified<br />

to produce more accurate results. In future work, we<br />

intend to investigate these areas with a larger database.<br />

6. REFERENCES<br />

[1] Alen Peacock, Xian Ke, and Matthew Wilkerson, “Typing<br />

patterns: A key to user identification,” IEEE Security<br />

& Privacy Magazine, vol. 2, no. 5, pp. 40–47, Oct. 2004.<br />

[2] Rick Joyce and Gopal Gupta, “Identity authentication<br />

based on keystroke latencies,” Communications of the<br />

ACM, vol. 33, no. 2, pp. 168–176, February 1990.<br />

[3] Oscar Coltell, Jose M. Dabia, and Guillermo Torres,<br />

“Biometric identification system based on keyboard filtering,”<br />

in Proc. IEEE 33rd Int. Carnahan Conf. on<br />

Security Technology, Madrid, Oct. 1999, pp. 203–209.<br />

[4] Saleh Bleha, Charles Slivinsky, and Bassam Hussien,<br />

“Computer-access security systems using keystroke dynamics,”<br />

Pattern <strong>Analysis</strong> and Machine Intelligence,<br />

IEEE Transactions on, vol. 12, no. 12, pp. 1217–1222,<br />

December 1990.<br />

[5] Livia C. F. Araujo, Luiz H. R. Sucupira Jr., Miguel G.<br />

Lizarraga, Lee L. Ling, and Joao B. T. Yabu-Uti, “User<br />

authentication through typing biometrics features,” <strong>Signal</strong><br />

Processing, IEEE Transactions on, vol. 53, no. 2, pp.<br />

851–855, Feb. 2005.<br />

[6] R. Gaines, W. Lisowski, S. Press, and N. Shapiro, “Authentication<br />

by keystroke timing: Some preliminary results,”<br />

Tech. Rep. R-256-NSF, Rand Corporation, Santa<br />

Monica, CA, USA, May 1980.<br />

[7] Keinosuke Fukunaga, Introduction to Statistical Pattern<br />

Recognition (2nd ed.), Academic Press Professional Inc.,<br />

San Diego, CA, USA, 1990.<br />



SOCCER VIDEO RETRIEVAL USING ADAPTIVE TIME-FREQUENCY METHODS<br />

Jonathan Marchal*, Cornel Ioana*, Emanuel Radoi*, André Quinquis*, Sridhar Krishnan**<br />

* : ENSIETA, E3I2 Laboratory, 2 rue François Verny, Brest - FRANCE<br />

E-mails : marchajo@ensieta.fr, ioanaco@ensieta.fr, radoiem@ensieta.fr, quinquis@ensieta.fr<br />

** : Dept. Electrical and Computer Engineering, <strong>Ryerson</strong> <strong>University</strong>, Toronto – CANADA<br />

E-mail : krishnan@ee.ryerson.ca<br />

ABSTRACT<br />

The retrieval of soccer highlights is a suitable technique<br />

for video indexing, required by the multimedia database<br />

management or for the development of television on<br />

demand. For these purposes, it would be useful to have an automatic annotation of the events that happen in soccer games. One solution consists of analyzing the audio soundtrack associated with the soccer video to detect the interesting frames.<br />

In this paper we use the adaptive time-frequency<br />

decomposition of the soundtrack as a feature extraction<br />

procedure. This decomposition is based on the Matching<br />

Pursuit concept and a dictionary composed of Gabor<br />

functions. The parameters provided by these<br />

transformations constitute the input of the classification<br />

stage. The results obtained on real soccer video demonstrate the efficiency of the adaptive time-frequency representation<br />

as a feature extraction stage.<br />

1. INTRODUCTION<br />

Soccer video highlights retrieval is not only a subject of<br />

research but also a need considering the huge amount of<br />

data that we can find on the internet. Most of the video<br />

archives are not indexed, and an automatic parsing approach is a marketable proposition. The development of television on demand also creates this need for video indexing. Viewers could<br />

have access to the information they need, without having to<br />

watch hours and hours of videos.<br />

Several methods have been proposed, based on the<br />

information provided by video frames, such as camera<br />

motion, court lines detection, motion vectors, location and<br />

movements of the players as in [1], others on audio/video<br />

features extraction [2] or only on audio features, for instance<br />

dominant and excited speech [3].<br />

In this paper, we propose a method based on audio<br />

feature extraction. There is typically a tremendous amount<br />

of crowd activity, which differs depending on the type of<br />

highlight in a game. For instance, when a goal is scored, the<br />

crowd cheering increases progressively before it and continues for a few seconds after. For a penalty or free-kick<br />

goal, the crowd cheers are sudden, whereas when a goal is<br />

missed, crowd cheers begin and stop soon after. Finally,<br />

during a normal game sequence, crowd activity is usually<br />

not particularly intense. Considering these observations, we<br />

assert that if the human ear is able to distinguish the crowd reaction, signal processing tools should be able to do so as well.<br />

The idea behind this work is to use an adaptive time<br />

frequency decomposition (ATFD) [4,5] on the audio<br />

soundtrack of the sequences as a starting point for the<br />

feature extraction and classification.<br />

The paper is organized as follows. In Section 2 a brief<br />

presentation of the adaptive time-frequency decomposition concept is given. The classification of the soccer sequences,<br />

based on the parameters provided by the ATFD, is described<br />

in Section 3. The efficiency of the proposed method is<br />

analyzed through the results in Section 4. We conclude<br />

our discussions in Section 5.<br />

2. ADAPTIVE TIME-FREQUENCY TECHNIQUES<br />

Most of the natural signals are non-stationary. Since<br />

their structure is generally complex, transformations into more intuitive representation spaces are usually well<br />

suited. Linear expansions in a single parameter basis,<br />

whether it is a Fourier, wavelet, or any other basis are not<br />

flexible enough. A Fourier basis provides a poor<br />

representation of functions well localized in time, and<br />

wavelet bases are not well adapted to represent functions<br />

whose Fourier transforms have a narrow high frequency<br />

support. Thus, a flexible decomposition technique can be<br />

considered for representing signal components whose<br />

localization in time and frequency vary widely.<br />

Matching pursuit (MP), introduced in [4], is a technique<br />

that decomposes a signal into a linear expansion<br />

of waveforms that belong to a redundant dictionary of<br />

functions. These waveforms, selected in order to best match<br />

the signal structure are selected among a dictionary of timefrequency<br />

atoms. The aim of the algorithm is to obtain a<br />

parsimonious description in order to estimate the original<br />

signal with as fewer coefficients as possible. Generally,<br />

considering a signal x and a dictionary D = { g_γ : γ ∈ Γ, ||g_γ|| = 1 }, the signal decomposition is expressed as<br />

x = Σ_{γ∈Γ} λ_γ g_γ, (1)<br />

where the decomposition coefficients λ_γ are obtained by the inner product between the signal x and the function g_γ: λ_γ = ⟨x, g_γ⟩.<br />

The MP builds up the signal decomposition one element<br />

at a time, picking up the most energy dominant component<br />

first. The MP begins by projecting the signal x on a function<br />

gγD 0<br />

∈ and computes the residue Rx = x − x, gγ g .<br />

0 γ0<br />

Thus, the Rx is orthogonal to gγ . The MP algorithm<br />

0<br />

chooses gγ∈ D such that xg , is maximum:<br />

γ<br />

0<br />

γ<br />

γ α 0<br />

γ<br />

γ ∈Γ<br />

γ<br />

0<br />

xg , ≥ sup xg , , (2)<br />

where α ∈ ( 0,1]<br />

is an optimal factor. The MP iterates this<br />

procedure by decomposing the residue. If we suppose the m-th order residue R^m x has been computed, the next iteration chooses g_γm ∈ D such that<br />

|⟨R^m x, g_γm⟩| ≥ α sup_{γ∈Γ} |⟨R^m x, g_γ⟩|, (3)<br />

and deduces R^{m+1} x by<br />

R^{m+1} x = R^m x − ⟨R^m x, g_γm⟩ g_γm. (4)<br />

Summing for m between 0 and M−1 yields<br />

x = Σ_{m=0}^{M−1} ⟨R^m x, g_γm⟩ g_γm + R^M x = Σ_{m=0}^{M−1} am g_γm + R^M x, (5)<br />

where am (m = 0, ..., M−1) are the decomposition coefficients.<br />
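The greedy selection loop of Equations (2)-(5) can be sketched over a finite dictionary. This illustrative version fixes α = 1 and works on plain Python lists of unit-norm atoms rather than the Gabor dictionary used in the paper.<br />

```python
def matching_pursuit(x, dictionary, n_iter):
    """Greedy MP sketch (Eqs. (2)-(5)) over a finite list of unit-norm
    atoms. Returns the coefficients a_m and chosen atom indices.
    """
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))

    residue = list(x)
    coeffs, chosen = [], []
    for _ in range(n_iter):
        # Pick the atom most correlated with the current residue: Eqs. (2)-(3).
        idx = max(range(len(dictionary)),
                  key=lambda k: abs(dot(residue, dictionary[k])))
        a = dot(residue, dictionary[idx])
        # Subtract the projection to form the next residue: Eq. (4).
        residue = [r - a * g for r, g in zip(residue, dictionary[idx])]
        coeffs.append(a)
        chosen.append(idx)
    return coeffs, chosen


# Orthonormal toy dictionary: MP recovers the exact expansion in 2 steps,
# picking atom 1 first (|4| > |3|), then atom 0.
D = [[1.0, 0.0], [0.0, 1.0]]
coeffs, chosen = matching_pursuit([3.0, 4.0], D, 2)
```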

There are two major factors which guarantee the success of an MP algorithm. The first one is the choice of the stopping criterion. Since the MP is an iterative decomposition,<br />

establishing the number of iterations, M, is very important<br />

for the considered application. One of the most used criteria<br />

is the choice of M such that the residual energy is smaller than a fraction ε of the signal energy:<br />

||R^{M+1} x||^2 ≤ ε ||x||^2, with M minimal. (6)<br />
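Criterion (6) amounts to picking the first iteration whose residual energy drops below a fraction ε of the signal energy; a minimal sketch (with illustrative names) is:<br />

```python
def stopping_order(x_energy, residue_energies, eps):
    """Smallest m with ||R^m x||^2 <= eps * ||x||^2, per criterion (6).

    residue_energies[m] holds ||R^m x||^2 as produced by the MP iterations
    (residue_energies[0] is the energy of the signal itself).
    """
    for m, e in enumerate(residue_energies):
        if e <= eps * x_energy:
            return m  # first residue small enough
    return len(residue_energies)
```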

This criterion is not well adapted when the signal-to-noise ratio (SNR) is relatively small [5]. In this case, the signal energy contains the noise contribution and, consequently, the correct choice of M is almost impossible.<br />

However, since in our application the signals of interest<br />

are the soccer soundtracks, we can assume that the noise is<br />

relatively small and, more importantly, its level and<br />

properties are almost the same for all signals. In these<br />

V ­ 510<br />

56<br />

conditions, we can use criterion (6), whose ratio ε is set empirically.<br />

The second factor which guarantees the efficiency of the MP is the choice of the elementary functions g_γ. Intuitively,<br />

the parameters of these functions should ensure a good<br />

matching on the signal’s time-frequency structures. A<br />

common choice is to design a function with as many<br />

degrees of freedom (i.e., control parameters) as possible. On<br />

the other hand, the time-frequency resolution is another<br />

property of interest, especially for feature extraction<br />

applications. According to these requirements, we consider<br />

for our application an elementary function defined as<br />

g_γm(t) = (1/√sm) g((t − um)/sm) e^{j(2π fm t + φm)}. (7)<br />

These atoms are obtained by dilation (sm), modulation (fm) and translation (um) of the Gaussian window<br />

g(t) = 2^{1/4} e^{−π t^2}. (8)<br />

The fourth parameter, φm, stands for the initial phase. According to this definition, inspired from [4], the atoms (called Gabor functions) are characterized by four parameters: um, sm, fm and φm. This type of elementary function makes it possible to define an adaptive time-frequency tiling, unlike the Gabor or wavelet transforms (Fig. 1).<br />
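A sampled version of the atom in Equations (7)-(8) can be generated as follows. The renormalization to unit norm after sampling is an implementation choice, not something stated in the paper, and the function name is illustrative.<br />

```python
import cmath
import math


def gabor_atom(n_samples, s, f, u, phi, fs=1.0):
    """Sampled Gabor atom per Eqs. (7)-(8): a Gaussian window dilated by s,
    translated to u, and modulated to frequency f (Hz at sample rate fs),
    with initial phase phi.
    """
    atom = []
    for n in range(n_samples):
        t = n / fs
        # Gaussian window of Eq. (8), evaluated at (t - u)/s per Eq. (7).
        g = (2 ** 0.25) * math.exp(-math.pi * ((t - u) / s) ** 2)
        atom.append((g / math.sqrt(s)) * cmath.exp(1j * (2 * math.pi * f * t + phi)))
    # Renormalize so the sampled atom has unit norm, as the dictionary requires.
    norm = math.sqrt(sum(abs(a) ** 2 for a in atom))
    return [a / norm for a in atom]
```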

Fig. 1. T-F tiling: MP versus wavelet decompositions<br />

The freedom to choose an arbitrary time-frequency tiling constitutes the main property of interest for characterization purposes. It will be used in the next section for the<br />

separation of soccer events based on the MP analysis of the<br />

corresponding soundtracks.<br />

3. CLASSIFICATION OF SOCCER HIGHLIGHTS<br />

Most of the time, what interests a soccer viewer are the highlights: goals, of course, but also missed goals and scored free-kicks and penalties. Hence, we have chosen sequences of these three types, adding a<br />

"normal game" class, which is relevant in order to<br />

differentiate an interesting sequence (from the 3 classes<br />

above) from an "uninteresting" one, in terms of highlight<br />

retrieval. This defines 4 classes, as illustrated in Fig. 2:<br />

goals, missed goals, penalties/free-kicks, normal game.<br />



Fig. 2. Video sequences isolated for soccer retrieval experiments<br />

Once the classes had been rigorously defined, the sequences were grabbed from the Internet and from TV recordings. We retained a duration of 5 seconds to analyse<br />

each highlight video sequence. Indeed, when watching these<br />

sequences, we can note that usually the crowd cheering grows during the 2 seconds before the ball crosses the goal line and continues until 3 seconds after, in most cases.<br />

All the audio soundtracks of the video sequences have been<br />

extracted in a mono, 8-bit, 44.1 kHz format, which corresponds to 220500 samples per sequence. The scheme for the feature<br />

extraction and classification of the audio soccer sequences<br />

is shown in Fig. 3 and proceeds as follows:<br />

Soccer sequences → Soundtrack extraction → MP decomposition → Dimensionality reduction → Classification<br />

Fig. 3. Scheme of the soccer event classification<br />

The first step is the extraction of the soccer sequence<br />

soundtrack. The extracted signals are inputs of the MP<br />

decomposition. Knowing that we consider N=220500<br />

samples per sequence, we set up the parameters of the elementary function dictionary as follows:<br />

- the time parameter, un, ranges from 0 to 220499;<br />

- the scale parameter, sn, ranges from 2 to ⌊log2(N)⌋ (17 in our case);<br />

- the frequency parameter, fn, ranges from 0 to 22050 Hz<br />

(half the sampling frequency). The number of frequency parameters is given by the required spectral resolution. In our application we consider 8192 values, which corresponds to a spectral resolution of 2.65 Hz.<br />

With the parameters ranging over the intervals given previously, the experimental results obtained on real data show that the stopping criterion (6) requires fewer than 2200 iterations. For this reason we limit the number of iterations to this value. The decomposition parameters are organized in a matrix structure as indicated in (9).<br />

$$\begin{bmatrix} E_1 & s_1 & f_1 & t_1 & \varphi_1 \\ E_2 & s_2 & f_2 & t_2 & \varphi_2 \\ \vdots & \vdots & \vdots & \vdots & \vdots \end{bmatrix} \quad (9)$$<br />
where each row corresponds to one iteration index and the columns hold the energy, octave (scale), frequency, time and phase parameters.<br />
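The greedy decomposition that fills this matrix can be sketched generically: at each iteration the atom best correlated with the residual is picked and subtracted, until a residual-energy rule (standing in for the paper's criterion (6)) or the iteration cap fires. This is our own illustration, not the LastWave MP routines used in the paper.<br />

```python
import numpy as np

def matching_pursuit(x, D, max_iter=2200, tol=1e-3):
    """Greedy MP over a dictionary D whose columns are unit-norm atoms."""
    residual = x.astype(float)
    rows = []                                  # one matrix row per iteration
    for _ in range(max_iter):
        corr = D.T @ residual                  # correlate atoms with residual
        k = int(np.argmax(np.abs(corr)))       # best-matching atom
        a = corr[k]                            # expansion coefficient
        residual = residual - a * D[:, k]
        rows.append((a * a, k))                # (energy, atom index)
        if np.linalg.norm(residual) < tol * np.linalg.norm(x):
            break                              # relative-residual stop rule
    return np.array(rows), residual

# toy check: orthonormal dictionary, two-atom signal -> recovered in 2 steps
D = np.eye(8)
x = 3.0 * D[:, 2] + 1.0 * D[:, 5]
rows, resid = matching_pursuit(x, D)
```

On this toy orthonormal dictionary the signal is recovered exactly in two picks, the first being the highest-energy atom.<br />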

Experimentally, we observed that these parameters have<br />
different discrimination efficiency when applied to audio<br />
signals. For example, the time parameter is difficult to<br />
use in our case since it is impossible to synchronize the<br />
crowd reaction of all sequences at a fixed sample.<br />
Therefore, only the frequency and scale parameters are<br />
used for the classification of the audio sequences, which<br />
constitutes a first step of data size reduction. Nevertheless,<br />
since we work with 2500 iterations, the number of<br />
classification parameters is about 5000. In order to reduce<br />

the dimensionality of the input data, the linear discriminant<br />
analysis (LDA) technique is applied [6]. LDA is a<br />
supervised learning projection that uses information on the<br />
within-class scatter and between-class scatter to construct a<br />
projection matrix into the reduced space. It maximizes the<br />
ratio of between-class variance to within-class variance<br />
in any particular data set, thereby guaranteeing maximal<br />
separability. As will be shown in the next section, LDA<br />
improves the classification performance compared to<br />
classification in the original space.<br />
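The LDA projection step can be sketched with scikit-learn's implementation (our choice of toolbox; the paper does not name one). With 4 classes the reduced space has at most C − 1 = 3 axes, matching the 3-D space used in Section 4. The mock feature vectors below are ours, standing in for the MP-derived features.<br />

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
# 40 mock feature vectors of length 20 (the paper's are MP-derived and
# much longer), 10 per class for 4 classes, with class-dependent means
X = rng.normal(size=(40, 20)) + np.repeat(np.arange(4), 10)[:, None]
y = np.repeat(np.arange(4), 10)

lda = LinearDiscriminantAnalysis(n_components=3)   # at most C - 1 = 3 axes
Z = lda.fit_transform(X, y)                        # reduced feature vectors
```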

Finally, using the feature vectors provided by LDA, we<br />
use the nearest neighbor classifier for the classification<br />
task [6]. This operation is performed in two phases.<br />
The first, the learning stage, consists of processing a training<br />
set of features with a priori known classes. The second step,<br />
testing, is based on the computation of the distance between<br />
a new unknown feature vector and each feature vector from<br />
the training set. The shortest distance identifies the<br />
nearest neighbor. This algorithm becomes more<br />
computationally intensive as the size of the training set<br />
grows, but the performance improves.<br />
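The two phases above reduce to a few lines for a 1-NN rule. This is an illustrative sketch with made-up feature vectors and labels; the paper's actual features come from the MP histograms, and Section 4 replaces the Euclidean distance used here by a Mahalanobis distance.<br />

```python
import numpy as np

def nn_train(features, labels):
    # learning phase: simply store the labelled training vectors
    return np.asarray(features, float), np.asarray(labels)

def nn_classify(model, x):
    # testing phase: return the label of the closest stored vector
    train_X, train_y = model
    d = np.linalg.norm(train_X - np.asarray(x, float), axis=1)
    return train_y[int(np.argmin(d))]

# toy features and class names (made up for illustration)
model = nn_train([[0, 0], [0, 1], [5, 5]], ["goal", "goal", "normal"])
```

For example, `nn_classify(model, [4.5, 5.2])` returns `"normal"`, the label of the closest stored vector.<br />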

The method proposed in this section has been applied to<br />
the classification of soccer sequences from a significant<br />
dataset. The most important results are presented in the<br />
next section.<br />

4. RESULTS<br />

In this section we present the results obtained for a data<br />
set of 47 sequences: 10 goal sequences, 9 missed goals,<br />
21 normal game sequences and 7 penalties. The sequences<br />
are decomposed with the MP algorithm, which returns the<br />
parameters illustrated in matrix (9).<br />

The main idea behind the classification process is to<br />
compare the modulation frequencies of the Gabor functions<br />
with comparable scales for each sequence. This principle<br />
has been established by comparing the histograms of the<br />

Authorized licensed use limited to: <strong>Ryerson</strong> <strong>University</strong> Library. Downloaded on July 7, 2009 at 11:16 from IEEE Xplore. Restrictions apply.<br />



frequency parameters obtained from the MP algorithm applied<br />
to the data of each class. This is illustrated in<br />
Fig. 4 for scale 13.<br />

Fig. 4. Mean frequency distributions for the 4 classes<br />

As feature parameters we use the vectors of bin centers.<br />
Empirically, we found that the best bin-center vector is of<br />
size 12. Applying the LDA algorithm to the vectors<br />
obtained for our dataset, three non-zero eigenvalues are<br />
obtained. We choose the corresponding 3 eigenvectors as a 3D<br />
projection space. The classification is then performed using<br />
the nearest neighbor (NN) method coupled with the<br />
Mahalanobis distance. The LOO (leave-one-out) cross-validation<br />
technique [7] has been used because of the<br />
reduced number of examples in the database. The results are<br />
shown in Fig. 5.<br />
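The LOO protocol with a Mahalanobis 1-NN rule can be sketched as follows. This is our own illustration on a toy 2-D set; in the paper the distance acts on the 3-D LDA features, and the exact covariance estimate used there is not specified (we estimate it on the full set, an assumption).<br />

```python
import numpy as np

def loo_accuracy(X, y):
    """Leave-one-out accuracy of a 1-NN rule with Mahalanobis distance
    (covariance estimated on the full set, an assumption of this sketch)."""
    X, y = np.asarray(X, float), np.asarray(y)
    VI = np.linalg.inv(np.cov(X, rowvar=False))    # inverse covariance
    hits = 0
    for i in range(len(X)):
        others = np.delete(X, i, axis=0)
        diffs = others - X[i]
        d2 = np.einsum('ij,jk,ik->i', diffs, VI, diffs)  # squared distances
        hits += int(np.delete(y, i)[int(np.argmin(d2))] == y[i])
    return hits / len(X)

# two tight, well-separated toy clusters -> every left-out point is
# closest to a member of its own class
X = [[0, 0], [0.1, 0], [0, 0.1], [0.1, 0.1],
     [5, 5], [5.1, 5], [5, 5.1], [5.1, 5.1]]
y = [0, 0, 0, 0, 1, 1, 1, 1]
```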

Fig. 5. Clusters given by classification in the reduced space<br />

Note that the four classes are close but properly<br />
separated in Fig. 5.a, whereas the points corresponding to<br />
penalty sequences are spread among the other classes in<br />
Fig. 5.b. This comes from the fact that all the penalty<br />
sequences chosen in the set are also scored penalties: so,<br />
although the crowd cheers are more sudden than for a goal<br />
sequence, they are very similar. Table 1 provides the<br />
classification results with and without dimensionality<br />
reduction (provided by LDA and PCA, principal<br />
component analysis).<br />

Table 1. Classification rates<br />

The classification accuracies obtained clearly show that<br />
LDA is well adapted to transform the data provided by<br />
the MP algorithm.<br />

5. CONCLUSION<br />

In this paper, we have proposed a new technique for<br />
soccer event classification based on the Matching Pursuit<br />
algorithm. The dictionary of elementary functions has been<br />
tailored to the application at hand.<br />
After MP decomposition, the feature parameters have<br />
been projected via a dimensionality reduction stage. The<br />
LDA technique, combined with the nearest neighbor method,<br />
yields better classification performance and improves the<br />
computational efficiency of the classification stage. In<br />
future work, we intend to use other parameters of the Gabor<br />
functions with a larger dataset.<br />

ACKNOWLEDGEMENTS<br />

The authors would like to thank the LastWave software<br />
developers and Karthi Umapathy of <strong>Ryerson</strong> <strong>University</strong> for<br />
providing the Matching Pursuit routines.<br />

6. REFERENCES<br />

[1] Y. Gong, L. T. Sin, C. H. Chuan, H. Zhang, and M. Sakauchi,<br />
“Automatic parsing of TV soccer programs”, Proc. ICMCS '95,<br />
Washington DC, USA, 1995.<br />
[2] K. Wan and C. Xu, “Efficient multimodal features for automatic<br />
soccer highlight generation”, Proc. 17th ICPR, 2004.<br />
[3] K. Wan and C. Xu, “Robust soccer highlight generation with a<br />
novel dominant speech feature extractor”, IEEE International<br />
Conference on Multimedia and Expo (ICME), 2004.<br />
[4] S. Mallat and Z. Zhang, “Matching pursuits with time-frequency<br />
dictionaries”, IEEE Trans. <strong>Signal</strong> Processing, vol. 41, pp. 3397-3415, Dec. 1993.<br />
[5] S. Mallat, A Wavelet Tour of Signal Processing, Academic<br />
Press, 1998.<br />
[6] R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification (2nd<br />
ed.), Wiley Interscience, 2000.<br />
[7] K. Fukunaga, Introduction to Statistical Pattern Recognition<br />
(2nd ed.), Academic Press Professional, Inc., San Diego, CA, USA,<br />
1990.<br />



SUPPORT VECTOR MACHINES BASED APPROACH FOR CHEMICAL<br />

PHOSPHORUS REMOVAL PROCESS IN WASTEWATER TREATMENT PLANT<br />

Talieh Seyed Tabatabaei<br />
Department of Electrical and Computer Engineering, <strong>Ryerson</strong> <strong>University</strong>, Toronto<br />
tseyedta@ee.ryerson.ca<br />
Tahir Farooq<br />
Department of Electrical and Computer Engineering, <strong>Ryerson</strong> <strong>University</strong>, Toronto<br />
tfarooq@ee.ryerson.ca<br />
Abstract<br />

In this research, support vector machines (SVMs) are<br />
investigated to model the uncertainty in chemical phosphorus<br />
removal processes in wastewater treatment plants. SVM is a<br />
machine-learning method based on the principle of structural<br />
risk minimization, which performs well when applied to data<br />
outside the training set. The prediction of whether or not the<br />
concentration of total phosphorus as P in the effluent will<br />
exceed the maximum allowable limit (1.0 mg/L) for a certain<br />
input is considered a supervised-learning problem. The least-<br />
squares support vector machine (LS-SVM) algorithm, which<br />
is a reformulation of standard SVMs, is used to design the<br />
classifier. The performance of radial basis function (RBF),<br />
polynomial and multi-layer perceptron (MLP) kernels has<br />
been evaluated, and a high classification rate of 88.52% was<br />
achieved using the RBF kernel.<br />

Keywords: Wastewater, phosphorus removal, SVM<br />

1. Introduction<br />

Nature has an amazing ability to cope with small amounts<br />

of water wastes and pollution, but it would be overwhelmed if<br />

we did not treat the billions of gallons of wastewater and<br />

sewage produced every day before releasing it back to the<br />

environment. Treatment plants reduce pollutants in wastewater<br />

to a level that nature can handle.<br />

Wastewater can be defined as the liquid, or water-carried,<br />
wastes removed from residences, institutions, and commercial<br />
and industrial establishments, together with such ground<br />
water, surface water, and storm water as may be present [1].<br />

Collecting, treating and reusing wastewater is<br />
receiving increasing interest these days. In addition to its<br />
aesthetic and sanitary advantages, it offers a significant<br />
financial benefit, since treated wastewater can be reused in<br />
many applications (e.g. agricultural irrigation, urban irrigation,<br />
industrial reuse, groundwater recharge, street cleaning, car<br />
washing, toilet flushing, and many more [2]).<br />

Wastewater consists of physical, chemical, and biological<br />

components. Some of the contaminants of concern in<br />

1-4244-0038-4 2006<br />
IEEE CCECE/CCGEI, Ottawa, May 2006<br />
Aziz Guergachi<br />
Department of Information Technology Management, <strong>Ryerson</strong> <strong>University</strong>, Toronto<br />
a2guerga@ee.ryerson.ca<br />
Sridhar Krishnan<br />
Department of Electrical and Computer Engineering, <strong>Ryerson</strong> <strong>University</strong>, Toronto<br />
krishnan@ee.ryerson.ca<br />

wastewater to be removed are suspended solids, biodegradable<br />
organics, pathogens, nutrients, priority pollutants, refractory<br />
organics, heavy metals, and dissolved inorganics. Nutrients<br />
(i.e. nitrogen and phosphorus) are among the most important<br />
contaminants of wastewater. Both nitrogen and<br />
phosphorus are essential nutrients for growth [1, 2]. When<br />
discharged into the aquatic environment, these nutrients can lead<br />
to the growth of undesirable aquatic life. When discharged in<br />
excessive amounts on land, they can also lead to the pollution<br />
of groundwater.<br />

Phosphorus is essential to the growth of algae and other<br />
biological organisms. The usual forms of phosphorus found in<br />
aqueous solutions include orthophosphate, polyphosphate,<br />
and organic phosphate [1, 2]. Due to the negative<br />
effects of the phosphorus present in wastewater, along with<br />
the stringent discharge limits imposed on wastewater<br />
treatment plants, there has recently been increasing demand to<br />
achieve very low effluent total phosphorus. According to the<br />
phosphorus removal requirements imposed (by the<br />
International Joint Commission's Phosphorus Management<br />
Strategies Task Force) in Ontario, the typical effluent<br />
concentration limit should be 1.0 mg/L, based on total<br />
phosphorus [3]. However, in all provinces, site-specific conditions<br />
may dictate more stringent requirements on the effluent<br />
total phosphorus limit.<br />

The process of phosphorus removal can be carried out either<br />
biologically or chemically. The data used in this paper come<br />
from the Ashbridges Bay Treatment Plant in Toronto, which uses<br />
the chemical method. Chemicals used in chemical<br />
phosphorus removal include metal salts and lime. The<br />
most commonly used metal salts are ferric chloride, ferrous<br />
chloride, and aluminum sulfate. In this treatment<br />
plant, ferrous chloride (FeCl2) is used as the chemical<br />
precipitant for phosphorus removal.<br />

The theory of chemical precipitation reactions is very<br />
complex. Many uncertainties underlie all the<br />
chemical reactions. Due to the presence of numerous other<br />
particles, other concurrent side reactions may occur in<br />
the wastewater as well [1]. All these uncertainties bring about the<br />
need for prediction and control, and therefore for some kind of<br />
intelligent system.<br />

In the last few years, numerous studies have dealt<br />
with applications of artificial neural networks and<br />
fuzzy neural networks for modeling biological nutrient<br />
removal systems [18], fuzzy-logic based control strategies for<br />
biological nitrogen removal and dynamic enhanced biological<br />
phosphorus removal [20, 21], and fuzzy control of the level of<br />
biogas in treated wastewater [19], whereas the amount of<br />
work targeting the applications of chemical processes in<br />
wastewater treatment, especially chemical phosphorus<br />
removal, has been insufficient.<br />

In this paper, a novel approach based on support vector<br />

machines (SVMs) is proposed to control and classify the<br />

quality of the final effluent of wastewater treatment plants<br />

according to the International Joint Commission (IJC)<br />

phosphorus concentration standards.<br />

The paper is organized as follows: Section 2 discusses the<br />
theory of support vector machines in both the linear and<br />
non-linear cases. Section 3 explains the data set preparation<br />
for classification. Section 4 presents the classification<br />
results and graphs, and Section 5 gives the<br />
conclusion.<br />

2. Support Vector Machines<br />

SVM was first introduced by Vapnik and co-workers, and it<br />
is such a powerful method that, in the few years since its<br />
introduction, it has outperformed most other systems in a wide<br />
variety of applications. SVM has different applications;<br />
however, it is mostly used as a binary classifier. Given a<br />
class-labeled training set, which in this work is a set of labeled<br />
feature vectors composed of input and control variables, the<br />
boundary between the two classes is learnt using SVM.<br />

2.1. Linear Support Vector Machine<br />

Consider a binary classification problem with $x_i \in \mathbb{R}^d$ as<br />
its feature vectors and $y_i \in \{-1, +1\}$ as the class labels (i.e.<br />
$(x_1, y_1), \ldots, (x_n, y_n)$ is the training set). The hyperplane<br />
which separates the two classes is<br />
$$f(x) = w^T x + b = 0 \quad (1)$$<br />
SVM chooses the hyperplane which maximizes the margin<br />
between the two classes (Figure 1) [4, 5, 6]. Thus, the<br />
hyperplane $(w, b)$ solves the optimization problem<br />
$$\min_{w,b} \ \tfrac{1}{2}\|w\|^2 \quad \text{subject to} \quad y_i(\langle w \cdot x_i \rangle + b) \ge 1, \ i = 1, \ldots, n \quad (2)$$<br />


and realizes the maximal margin hyperplane with geometric<br />
margin $\gamma = 1/\|w\|$. The primal Lagrangian is<br />
$$L(w, b, \alpha) = \tfrac{1}{2}\|w\|^2 - \sum_{i=1}^{n} \alpha_i \left[ y_i(\langle w \cdot x_i \rangle + b) - 1 \right] \quad (3)$$<br />
where $\alpha_i \ge 0$ are the Lagrange multipliers.<br />

The corresponding dual is found by differentiating with<br />
respect to $w$ and $b$:<br />
$$\max_{\alpha} \ W(\alpha) = \sum_{i=1}^{n} \alpha_i - \tfrac{1}{2} \sum_{i,j=1}^{n} y_i y_j \alpha_i \alpha_j \langle x_i \cdot x_j \rangle \quad (4)$$<br />
subject to<br />
$$\sum_{i=1}^{n} \alpha_i y_i = 0, \quad \alpha_i \ge 0, \ i = 1, \ldots, n$$<br />

But in many real-world problems the data is noisy; therefore<br />
there will in general be no linear separation. In this case,<br />
instead of a hard margin we use a soft margin (the noise-tolerant<br />
version), and slack variables, denoted by $\xi_i$, are introduced<br />
to relax the constraints [4, 5, 6].<br />

So our optimization problem becomes<br />
$$\min_{w,b} \ \tfrac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \xi_i \quad \text{subject to} \quad y_i(\langle w \cdot x_i \rangle + b) \ge 1 - \xi_i, \ \xi_i \ge 0, \ i = 1, \ldots, n \quad (5)$$<br />
where $C$ is a regularization parameter that trades off<br />
the empirical risk (reflected by the second term in<br />
(5)) against the model complexity (reflected by the first term in (5)).<br />

The dual form of this case is the same as (4) except that the<br />
constraint becomes $0 \le \alpha_i \le C$. The resulting decision function is<br />
$$f(x) = \sum_{i=1}^{N_s} y_i \alpha_i \langle x_i \cdot x \rangle + b_0 \quad (6)$$<br />
where $N_s$ is the number of support vectors.<br />
This result shows that points that are not support vectors<br />
have no influence on the solution.<br />
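The support-vector property stated above is easy to observe with an off-the-shelf solver (scikit-learn here, our choice for illustration; the toy data below is not the paper's): only the points on the margin boundaries end up in the support set.<br />

```python
import numpy as np
from sklearn.svm import SVC

# tiny separable toy problem (illustrative; not the paper's data)
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [3.0, 3.0], [3.0, 4.0], [4.0, 3.0]])
y = np.array([-1, -1, -1, 1, 1, 1])

clf = SVC(kernel="linear", C=10.0).fit(X, y)
# only the margin points are kept as support vectors; deleting any of
# the remaining points would leave the decision boundary unchanged
print(clf.support_)
```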

2.2. Non-linear Support Vector Machines<br />


In most real-world cases the data points are not<br />
linearly separable. In this case we use a non-linear<br />
operator $\varphi(\cdot)$ to map the data to a higher-dimensional space F<br />
(the feature space), where they can be classified linearly (Figure 2).<br />



Figure 1. A linear SVM classifier. Support vectors are<br />

those elements of the training set which are on the<br />

boundary hyperplanes of two classes.<br />

So the hypothesis in this case would be<br />
$$f(x) = \langle w \cdot \varphi(x) \rangle + b \quad (7)$$<br />
which is linear in terms of the mapped data $\varphi(x)$.<br />
Now we can extend all the presented optimization problems<br />
for the linear case to the transformed data in the feature<br />
space.<br />
We define the kernel function as<br />
$$K(x, y) = \langle \varphi(x) \cdot \varphi(y) \rangle \quad (8)$$<br />
where $\varphi$ is a mapping from the input space to an (inner product)<br />
feature space F.<br />
The corresponding dual form is then<br />


$$\max_{\alpha} \ W(\alpha) = \sum_{i=1}^{n} \alpha_i - \tfrac{1}{2} \sum_{i,j=1}^{n} y_i y_j \alpha_i \alpha_j K(x_i, x_j) \quad (9)$$<br />
subject to<br />
$$\sum_{i=1}^{n} \alpha_i y_i = 0, \quad \alpha_i \ge 0, \ i = 1, \ldots, n$$<br />

And our final decision rule can be expressed as<br />
$$f(x) = \sum_{i=1}^{N_s} y_i \alpha_i K(x_i, x) + b_0 \quad (10)$$<br />
where $N_s$ and $\alpha_i$ denote the number of support vectors and the<br />
non-zero Lagrange multipliers corresponding to the support<br />
vectors, respectively. Note that we do not have to know the<br />
underlying mapping function; however, it is necessary to<br />
define the kernel function. Among the different kernel<br />
functions, the most common are the polynomial, Gaussian<br />
radial basis function (RBF) and multi-layer perceptron (MLP) kernels.<br />
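The three kernels named above can be written directly. A hedged sketch: the RBF denominator is taken as σ², matching the width parameter σ used in Section 4 (some texts use 2σ² instead), and the parameter names k and θ for the MLP kernel follow the paper.<br />

```python
import numpy as np

def poly_kernel(x, z, d=3):
    return (x @ z + 1.0) ** d                  # (x^T z + 1)^d

def rbf_kernel(x, z, sigma=0.5):
    # Gaussian RBF; denominator sigma^2 (an assumption, see lead-in)
    return np.exp(-np.sum((x - z) ** 2) / sigma ** 2)

def mlp_kernel(x, z, k=1.0, theta=0.0):
    # satisfies Mercer's condition only for some (k, theta)
    return np.tanh(k * (x @ z) + theta)

x = np.array([1.0, 0.0])
z = np.array([1.0, 1.0])
```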

In LS-SVMs the inequality constraints of the SVM are replaced<br />
with equality constraints. As a result, the solution follows from<br />
solving a set of linear equations instead of the quadratic<br />


Figure 2. Mapping data from the input space to a higher-<br />
dimensional feature space by a non-linear operator $\varphi(\cdot)$, in<br />
order to classify them with a linear function<br />

programming problem, as in Vapnik's original SVM<br />
formulation, which obviously yields a faster<br />
algorithm.<br />

The primal problem of the LS-SVM is defined as<br />
$$\min_{w,b,e} \ J_p(w, b, e) = \tfrac{1}{2}\|w\|^2 + \gamma \tfrac{1}{2} \sum_{i=1}^{d} e_i^2 \quad (11)$$<br />
subject to<br />
$$y_i \left[ w^T \varphi(x_i) + b \right] = 1 - e_i, \quad i = 1, \ldots, d$$<br />


where $\gamma$ is a parameter analogous to the SVM's regularization<br />
parameter. The main characteristic of LS-SVMs is their low<br />
computational complexity compared to SVMs, without loss of<br />
quality in the solution.<br />
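The set of linear equations alluded to above is, in Suykens' formulation, $[[0, y^T], [y, \Omega + I/\gamma]][b; \alpha] = [0; 1]$ with $\Omega_{ij} = y_i y_j K(x_i, x_j)$. A small self-contained sketch (our own code, not the LS-SVMlab toolbox of [16]; the toy data is illustrative):<br />

```python
import numpy as np

def rbf_gram(A, B, sigma=1.0):
    # Gram matrix of the RBF kernel between row-sets A and B
    sq = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq / sigma ** 2)

def lssvm_train(X, y, gamma=10.0, sigma=1.0):
    # solve [[0, y^T], [y, Omega + I/gamma]] [b; alpha] = [0; 1]
    n = len(y)
    Omega = np.outer(y, y) * rbf_gram(X, X, sigma)
    A = np.zeros((n + 1, n + 1))
    A[0, 1:], A[1:, 0] = y, y
    A[1:, 1:] = Omega + np.eye(n) / gamma
    sol = np.linalg.solve(A, np.r_[0.0, np.ones(n)])
    return sol[0], sol[1:]                     # bias b, multipliers alpha

def lssvm_predict(X, y, b, alpha, Xnew, sigma=1.0):
    return np.sign(rbf_gram(Xnew, X, sigma) @ (alpha * y) + b)

# toy demo: two separable clusters (illustrative data, not the plant's)
X = np.array([[0., 0.], [0., 1.], [1., 0.], [4., 4.], [4., 5.], [5., 4.]])
y = np.array([-1., -1., -1., 1., 1., 1.])
b, alpha = lssvm_train(X, y)
```

One linear solve replaces the quadratic program, which is the source of the speed advantage claimed above.<br />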

3. Dataset Preparation and Processing<br />

The dataset used in this study was obtained from<br />

Ashbridges Bay Wastewater Treatment Plant, Toronto. This<br />

dataset consists of 123 records. Each record is an observation<br />

of the input, control and output variables. Every record<br />

represents the average values of the variables over a period of<br />

one day. The input and control variables used in this study<br />

were selected after consultation with senior plant<br />

management. Total daily volume treated, peak flow rate,<br />

carbonaceous biochemical oxygen demand (CBOD),<br />

suspended solids (SS) and total phosphorus as P in influent are<br />

used as input variables. Ferrous chloride is used as the control<br />
variable and is included in the input feature vector for the training<br />

and testing of LS-SVM and LDA classifiers. Concentration of<br />

total phosphorus as P in effluent is used as the output variable.<br />

The dataset was randomly divided into two separate subsets.<br />
One subset, with 62 examples, was used exclusively<br />
for training the algorithms, and the other, with 61 examples,<br />
was used exclusively for testing. No example from the<br />
training set was ever used during the testing phase, and vice<br />
versa. A class label $y_i \in \{-1, +1\}$ was assigned to every output<br />
value based on the threshold value of 1.0 mg/L. If the output<br />
variable exceeds the threshold, the +1 class label is assigned to<br />
the output value; otherwise the -1 class label is assigned. Class<br />
label assignment was done for both the training and testing<br />
datasets before designing the LS-SVM and LDA classification<br />
algorithms.<br />
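The preparation steps above (thresholding the output at 1.0 mg/L and a disjoint 62/61 random split) can be sketched as follows. The plant data is not public, so random mock records stand in for the real 123-row table; everything except the threshold and split sizes is our invention.<br />

```python
import numpy as np

rng = np.random.default_rng(42)
# mock stand-ins for the plant records (the real data is not public):
# 123 daily records, 6 input/control variables, one output (effluent P, mg/L)
inputs = rng.normal(size=(123, 6))
effluent_P = rng.uniform(0.2, 1.8, size=123)

labels = np.where(effluent_P > 1.0, 1, -1)     # threshold at 1.0 mg/L

idx = rng.permutation(123)                     # disjoint random split
train_idx, test_idx = idx[:62], idx[62:]       # 62 training / 61 testing
X_train, y_train = inputs[train_idx], labels[train_idx]
X_test, y_test = inputs[test_idx], labels[test_idx]

def classification_rate(y_true, y_pred):
    # figure of merit from Section 4: percent correctly classified
    return 100.0 * float(np.mean(y_true == y_pred))
```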



4. Experimental Results<br />

The objective of the LS-SVM and LDA classification<br />
algorithms is to correctly classify whether or not the<br />
concentration of total phosphorus as P in the effluent will exceed<br />
the threshold for a given set of yet-to-be-seen input patterns.<br />
The classification rate was used as the figure of merit,<br />
defined as the total number of correctly<br />
classified examples divided by the total number of examples<br />
classified, times one hundred. The results of LS-SVM<br />

classification have been obtained using three different kernel<br />
functions: the polynomial kernel, $K(x_i, x_j) = (x_i^T x_j + 1)^d$, where<br />
$d$ is the degree of the polynomial; the radial basis function<br />
kernel, $K(x_i, x_j) = \exp(-\|x_i - x_j\|^2 / \sigma^2)$, where $\sigma$ is the width<br />
of the RBF kernel; and the multilayer perceptron (MLP) kernel,<br />
$K(x_i, x_j) = \tanh(k\, x_i^T x_j + \theta)$. The MLP kernel does not satisfy<br />
Mercer's condition for all $k$ and $\theta$.<br />

Fig. 4 shows the estimated classification rate achieved by the<br />
LS-SVM classifier using the RBF kernel with kernel width $\sigma$ =<br />
0.5, 1 and 2.5. The best classification rate achieved was<br />
88.52% when $\sigma$ = 0.5 and C is between 0.5 and 0.8. A<br />
similar classification rate was achieved when $\sigma$ = 1 and C =<br />
0.5. The classification rate dropped to 86.88% when the value<br />
of $\sigma$ was changed to 2.5 and 0.1.<br />

For the polynomial kernel, a consistent classification rate of<br />
86.88% was achieved for a wide range of parameter settings.<br />
Although the polynomial kernel did not achieve as good a<br />
performance as the RBF kernel, its performance was insensitive<br />
to a very wide range of parameter settings.<br />

Fig. 5 shows the classification rate obtained with the MLP<br />
kernel with $k$ = 0.5, 1 and 2.5. The value of $\theta$ was kept<br />
constant at 1.<br />
The MLP kernel achieved its best classification rate of 86.88%<br />
for all three values of $k$, at different values of C. However,<br />
the results obtained with the MLP kernel were very sensitive to the<br />

Figure 4. Plot of LS-SVM classification rate versus<br />
regularization parameter C using the RBF kernel with<br />
$\sigma$ = 0.5, 1 and 2.5<br />


Figure 5. Plot of LS-SVM classification rate versus<br />

regularization parameter C using MLP kernel with<br />

k = 0.5, 1 and 2.5<br />

parameter settings. Hence the polynomial kernel could be a better<br />
choice than the MLP kernel.<br />

The same training and testing datasets were used to design<br />
and test the LDA classifier, and the best classification rate<br />
achieved with optimal parameter settings over the testing<br />
dataset was 68.85%. These results indicate the strong<br />
generalization ability of the LS-SVM classifier.<br />

5. Conclusion<br />

We have presented an SVM-based approach that utilizes the<br />
principle of structural risk minimization to model the<br />
uncertainty that underlies the chemical phosphorus removal<br />
process in wastewater treatment plants. A real dataset of 123<br />
examples was obtained from the Ashbridges Bay Wastewater<br />
Treatment Plant, Toronto. A classifier based on LS-SVM has<br />
been designed through supervised learning to classify whether<br />
or not the concentration of total phosphorus as P in the<br />
effluent will exceed the maximum allowable limit.<br />
The performance of different kernel functions has been evaluated;<br />
all three kernel functions performed well, and in particular<br />
the RBF kernel achieved a very promising classification rate<br />
of 88.52% over the unseen testing dataset. For comparison, the<br />
LDA classifier was also used in the study. The classification<br />
results showed that the LS-SVM based approach outperformed the<br />
LDA method.<br />

Acknowledgements<br />

We are thankful to Mark Rupke, Chris Monteith, Colin<br />

Marshall and Filemon Basa at Ashbridges Bay Treatment<br />

Plant, Toronto for providing us with valuable information.<br />



References<br />

[1] Metcalf and Eddy, Wastewater Engineering: Treatment and<br />
Reuse. New York: McGraw-Hill, 1991.<br />

[2] M. J. Hammer and M. J. Hammer Jr., Water and<br />

wastewater technology. New Jersey, Columbus: Prentice<br />

Hall, 2003.<br />

[3] N. W. Schmidtke and Assoc. Ltd. and D. I. Jenkins and<br />
Assoc. Inc., Retrofitting Municipal Wastewater Treatment<br />
Plants for Enhanced Biological Phosphorus Removal.<br />
Canada: Minister of Supply and Services Canada, 1986.<br />

[4] N. Cristianini and J. Shawe-Taylor, An Introduction to<br />
Support Vector Machines and Other Kernel-Based<br />
Methods. United Kingdom: Cambridge <strong>University</strong> Press,<br />
2000.<br />

[5] C. J. Burges, "A tutorial on support vector machines for<br />
pattern recognition," Data Mining and Knowledge<br />
Discovery, vol. 2, pp. 121-167, June, 1998.<br />

[6] I. El Naqa, Y. Yang, M. N. Wernick, N. P. Galatsanos, and<br />
R. M. Nishikawa, "A support vector machine approach for<br />
detection of microcalcifications," IEEE Trans. Med. Imag.,<br />
vol. 21, no. 12, December, 2002.<br />

[7] P. H. Chen, C. J. Lin, and B. Scholkopf, "A tutorial on<br />
ν-support vector machines," unpublished.<br />

[8] J. Salmon, S. King, and M.Osborne,"Framewise phone<br />

classification using support vector machines,"<br />

unpublished.<br />

[9] S. Z. Li and G. Guo, "Content-based audio classification<br />

and retrieval using SVM learning," unpublished.<br />

[10] K. R. Muller, S. Mika, G. Ratsch, K. Tsuda, and B.<br />

Scholkopf, "An introduction to kernel-based learning<br />

algorithm," IEEE Trans. Neural Networks, vol. 12, pp.<br />

181-201, Mar. 2001.<br />

[11] B. Scholkopf and A. J. Smola, Learning with kernels -<br />

support vector machines, regularization, optimization,<br />

and beyond. Cambridge, MA: MIT press, 2002.<br />


[12] V. Kecman, Learning and Soft Computing: Support<br />
Vector Machines, Neural Networks, and Fuzzy Logic.<br />
Cambridge, MA: MIT Press, 2001.<br />

[13] U. Jeppsson, Modelling Aspects of Wastewater Treatment<br />
Processes. Lund, Sweden: Reprocentralen, Lund <strong>University</strong>,<br />
1996.<br />

[14] J. C. Principe, N. R. Euliano, and W. C. Lefebvre,<br />

Neural and adaptive systems -fundamentals through<br />

simulation. United States of America: John Wiley & sons<br />

Inc., 1999.<br />

[15] S. Haykin, Neural Networks - a comprehensive<br />

foundation. Hamilton, ON., Canada: Prentice Hall, 1999.<br />

[16] K. Pelckmans et al, "LS-SVMlab toolbox user's guide,"<br />

unpublished.<br />

[17] J. A. K. Suykens, T. V. Gestel, J. D. Brabanter, B. D.<br />

Moor, and J. Vandewalle, Least square support vector<br />

machines. Singapore: World scientific publishing Co. Pte.<br />

Ltd., 2002.<br />

[18] D. S. Lee and J. M. Park, "Neural network modeling for<br />
on-line estimation of nutrient dynamics in a<br />
sequentially-operated batch reactor," Journal of<br />
Biotechnology, vol. 75, pp. 229-239, June, 1999.<br />

[19] O. C. Pires, C. Palma, J. C. Costa, I. Moita, M. M. Alves,<br />

and E. C. Ferreira, "Knowledge-based fuzzy system for<br />

diagnosis and control of an integrated biological<br />

wastewater treatment process," the 2nd IWA conference on<br />

instrumentation, control, and automation, June, 2005.<br />

[20] S. T. Yordanova, "Fuzzy two-level control for an aerobic<br />
wastewater treatment," Proceedings of the 2nd International<br />
IEEE Conference, vol. 1, pp. 348-352, June, 2004.<br />

[21] S. Marsili-Libelli and L. Giunti, "Fuzzy predictive control for<br />
nitrogen removal in biological wastewater treatment,"<br />
Water Science and Technology, vol. 45, pp. 37-44, June, 2002.<br />


This full text paper was peer reviewed at the direction of IEEE Communications Society subject matter experts for publication in the IEEE GLOBECOM 2005 proceedings.<br />

Data Embedding in µ-law Speech with Spread<br />

Spectrum Techniques<br />

Libo Zhang, Heping Ding<br />

Institute for Microstructural Sciences,<br />

National <strong>Research</strong> Council,<br />

Ottawa, Ontario, Canada<br />

heping.ding@nrc-cnrc.gc.ca<br />

Abstract: This paper explores data embedding in G.711 µ-law<br />
speech signals with spread spectrum techniques. Based on an<br />

optimized spread spectrum scheme, a simple but effective solution<br />

is presented for high-capacity embedding. Simulations show that<br />

the proposed scheme, when incorporated with the measure of the<br />

frequency masking effects, can achieve an embedding rate of<br />

about 100 bits per second with a 7% Bit Error Rate (BER), or<br />

1000 bps with a 10% BER.<br />

Keywords- µ-law speech, data embedding, speech coding, spread<br />

spectrum communication<br />

I. BACKGROUND<br />

Techniques to embed additional digital information<br />
imperceptibly into host signals have many applications. For<br />
example, in digital watermarking, copyright<br />
information is embedded into audio signals to<br />
protect intellectual property. In another example, shown in<br />

[1], the wide-band components are embedded into narrow-band<br />

speech signals to enhance the quality and intelligibility.<br />

The µ-law companded signal format, which is defined in<br />

ITU-T G.711, is the telephony standard in North America. For<br />

high capacity embedding in such signals, it is required to<br />

reliably transmit the embedded information, along with the host<br />

speech signal, across both the analog and digital telephony<br />

channels. Thus, the embedding should be robust against both<br />

band-pass filtering and Additive White Gaussian Noise<br />

(AWGN), which occur in normal telephony channels.<br />

In general, three conflicting criteria are used to evaluate such<br />

embedding systems. Imperceptibility means that the composite<br />

signal should be perceptually equivalent to the host signal;<br />

robustness refers to a reliable extraction even if the composite<br />

signal is degraded; and embedding rate is a measure of how<br />

much information can be embedded and transmitted. For our research in µ-law embedding, the embedding rate receives more emphasis than in other embedding applications.<br />

Little has been published on this research topic. Currently, two categories of techniques could be used for this kind of data embedding: those based on spread spectrum techniques [2] and those based on quantization-bin techniques [3]. The conventional spread spectrum techniques usually cannot achieve high embedding rates, because a long spreading sequence is required just to reduce the host impact. Reference [4] proposed a modified spread spectrum embedding algorithm that inherently suppresses the host impact; the scheme shows very high robustness when applied to digital audio watermarking.<br />

Sridhar Krishnan<br />

Electrical and Computer Engineering Department,<br />

<strong>Ryerson</strong> <strong>University</strong>,<br />

Toronto, Ontario, Canada<br />

krishnan@ee.ryerson.ca<br />

In this paper, we optimize this modified scheme for the<br />

purpose of high capacity embedding in µ-law speech signals.<br />

The rest of the paper is organized as follows. Section II presents<br />

a generalized view of spread spectrum embedding schemes,<br />

with the modified scheme and its optimization being special<br />

cases. Section III incorporates the frequency masking effect to<br />

implement the proposed scheme. Section IV presents the<br />

simulation results and Section V gives a summary.<br />

II. SPREAD SPECTRUM SCHEMES<br />

Supposing that one bipolar information bit b ∈ {−1, +1} is to be embedded into x, an N-sample time- or transform-domain sequence of the host signal, the generalized spread spectrum embedding can be expressed as<br />

y = x − β(x • w)w + αbw,  0 ≤ β ≤ 1,  (1)<br />

where y represents the composite signal; the pseudo-random spreading sequence w is of length N and zero-mean; the scalar α > 0 controls the embedding strength; β = 0 and β = 1 result in the conventional and the modified schemes, respectively; and the dot operator represents the normalized correlation of two length-N sequences, defined as<br />

u • v ≡ (1/N) Σ_{i=1}^{N} u_i v_i.  (2)<br />

Degraded by the additive noise n during transmission, the received signal can be expressed as<br />

r = y + n = x − β(x • w)w + αbw + n.  (3)<br />

The normalized correlation between the received signal and the spreading sequence can be found as<br />

c = r • w = αb + (1 − β)(x • w) + n • w.  (4)<br />

Assume that both the host signal and the noise are Gaussian, with x ~ N(0, σ_x²) and n ~ N(0, σ_n²). According to (4), the correlation is also Gaussian, with<br />

c ~ N( αb, [(1 − β)²σ_x² + σ_n²]/N ).  (5)<br />
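The embedding rule (1) and the polarity-based extraction from (4) can be sketched as follows. This is an illustrative sketch, not the authors' implementation; a bipolar ±1 spreading sequence is assumed so that w • w = 1 exactly.<br />

```python
import numpy as np

def embed_bit(x, w, b, alpha, beta):
    """Generalized spread spectrum embedding, Eq. (1); the dot operator
    is the normalized correlation of Eq. (2)."""
    return x - beta * np.mean(x * w) * w + alpha * b * w

def extract_bit(r, w):
    """Estimate the bit from the polarity of c = r . w, Eq. (4)."""
    return int(np.sign(np.mean(r * w)))

rng = np.random.default_rng(0)
N = 64
x = rng.standard_normal(N)            # host segment (sigma_x = 1)
w = rng.choice([-1.0, 1.0], size=N)   # bipolar spreading sequence, so w . w = 1
b, alpha = -1, 0.1

# beta = 1 (the modified scheme) cancels the host term completely, so the
# weak alpha*b*w term alone decides the correlation in the noise-free case
y = embed_bit(x, w, b, alpha, beta=1.0)
print(extract_bit(y, w))  # -1
```

With beta = 0 the same weak alpha would often be swamped by the host correlation (x • w), which is the poor-performance case discussed below.<br />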

IEEE Globecom 2005 2160 0-7803-9415-1/05/$20.00 © 2005 IEEE<br />

64<br />

Authorized licensed use limited to: <strong>Ryerson</strong> <strong>University</strong> Library. Downloaded on July 7, 2009 at 11:28 from IEEE Xplore. Restrictions apply.


This full text paper was peer reviewed at the direction of IEEE Communications Society subject matter experts for publication in the IEEE GLOBECOM 2005 proceedings.<br />

Thus, the embedded information bit b can be estimated by b̂ = sign(c), and the performance, in terms of Bit Error Rate (BER), is<br />

p = Q(µ_c/σ_c) = Q( √( Nα² / [(1 − β)²σ_x² + σ_n²] ) ),  (6)<br />

where Q(x) = (1/√(2π)) ∫_x^∞ e^{−u²/2} du is the tail error function.<br />
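Equation (6) is straightforward to evaluate numerically, since Q(x) can be written via the complementary error function. A small illustrative sketch (parameter values are ours, not from the paper):<br />

```python
import math

def Q(x):
    """Tail probability of the standard normal: Q(x) = 0.5*erfc(x/sqrt(2))."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def ber(N, alpha, beta, var_x, var_n):
    """Extraction BER of the generalized scheme, Eq. (6)."""
    return Q(math.sqrt(N * alpha ** 2 / ((1 - beta) ** 2 * var_x + var_n)))

# the modified scheme (beta = 1) removes the host term from the denominator,
# so its BER is far below that of the conventional scheme (beta = 0)
conventional = ber(64, 0.1, 0.0, 1.0, 0.001)
modified = ber(64, 0.1, 1.0, 1.0, 0.001)
print(conventional > modified)  # True
```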

As shown in (6), in the conventional scheme (β = 0), both the host and the external noise degrade the extraction, which results in the poor performance of this scheme. In the modified scheme (β = 1), the host impact is totally suppressed. However, the total embedding power, which determines the perceptual distortion, grows from α² to α² + σ_x²/N, as can be deduced directly from (1). In high capacity embedding, where a small N is preferred, even the minimal embedding power σ_x²/N, obtained by setting α = 0, may not be small enough to guarantee imperceptibility.<br />

In this paper, we propose to use a less-than-unity β to reduce the total embedding power to P = α² + β²σ_x²/N. The optimal β should minimize p, the extraction BER, while satisfying the power constraint that assures imperceptibility.<br />

We start with<br />

0 ≤ β ≤ 1;<br />

p = Q( √( Nα² / [(1 − β)²σ_x² + σ_n²] ) ) = Q( √( (N·P − β²σ_x²) / [(1 − β)²σ_x² + σ_n²] ) ),  (7)<br />

and, with the “embedded data to signal” and “signal to noise” ratios defined as DSR = P/σ_x² and SNR = σ_x²/σ_n², respectively, the BER in (7) can be expressed as<br />

p = Q( √( (N·DSR − β²) / [(1 − β)² + 1/SNR] ) ).  (8)<br />

Next, we want to find β*, the β that minimizes (8), or equivalently maximizes the quantity under the square root sign in (8). Since the noise n is not known at the time of embedding and normally σ_n² ≪ σ_x², the optimization is carried out with the noise term ignored:<br />

p = Q( √(N·DSR − β²) / (1 − β) ).  (9)<br />

When N·DSR ≥ 1, one can choose β = 1. When N·DSR < 1, β* can be found by letting<br />

∂/∂β [ (N·DSR − β²) / (1 − β)² ] = 2(N·DSR − β) / (1 − β)³ = 0;  (10)<br />

therefore, β* = N·DSR. To summarize, we have<br />

β* = min(N·DSR, 1).  (11)<br />

The corresponding α* is then<br />

α* = √( P − (β*)²σ_x²/N ).  (12)<br />

When N·DSR < 1, the best achievable BER with no noise considered is, according to (9),<br />

p* = Q( √( β*/(1 − β*) ) ).  (13)<br />

Equation (13) can be used to estimate the maximal embedding rate of the proposed scheme. For example, given a required BER p ≤ 3%, β* = N·DSR must be at least 0.8 as per Fig. 1 (approximate to this “no noise” case and to be discussed later). Thus, the maximum rate f_s/N is limited by f_s/N ≤ f_s·DSR/0.8 (bps), with f_s being the sampling frequency. For example, the rate limit is 100 bps when DSR = −20 dB. The limit is further decreased by the inherent noise from µ-law companding and by external noise in the telephony channel. In the sequel, the term N·DSR is called the composite embedding power for simplicity.<br />
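The closed-form choices (11) and (13) and the worked rate-limit example can be checked numerically. A sketch; the 8 kHz sampling rate is the usual telephony assumption, not stated at this point in the text:<br />

```python
import math

def Q(x):
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def optimal_beta(N, DSR):
    """Eq. (11): beta* = min(N*DSR, 1)."""
    return min(N * DSR, 1.0)

def best_ber_no_noise(beta_star):
    """Eq. (13): best achievable BER when channel noise is ignored."""
    return Q(math.sqrt(beta_star / (1.0 - beta_star)))

# the worked example from the text: p <= 3% needs beta* = N*DSR >= 0.8,
# so the rate limit is fs*DSR/0.8 = 100 bps at DSR = -20 dB and fs = 8 kHz
fs, DSR = 8000, 10.0 ** (-20 / 10)
print(round(best_ber_no_noise(0.8), 3), round(fs * DSR / 0.8))  # 0.023 100
```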

Figure 1. BER of spread spectrum embedding (SNR=30 dB)<br />

To show the improvement due to the optimization of β, (8) is plotted in Fig. 1 for SNR = 30 dB and different composite embedding powers. It can be seen that, when the power takes intermediate values, the performance can be improved significantly, e.g., from p = 18% for the conventional spread spectrum scheme to p = 3% when N·DSR = 0.8. In the case of<br />

watermarking, where a large N can be used, the composite<br />

embedding power is normally large enough such that the<br />

optimization is not necessary. However in high capacity<br />

embedding, the composite power is often smaller because N<br />




could not be very large; therefore, such optimization is<br />

necessary to achieve high capacity.<br />

By observing the optimization results in Fig. 1, we can see that, although derived with the noise ignored, β* given by (11) is still a simple and reasonable approximation for the case of SNR = 30 dB.<br />

III. MDCT DOMAIN IMPLEMENTATION<br />

As discussed in [5], frequency masking in the human auditory system refers to the masking phenomenon between two simultaneously occurring components that are close in frequency: the stronger component may make the weaker one imperceptible. A masking model uses this effect to derive a masking threshold from the signal power spectrum. The amplitude changes made by embedding are perceptually irrelevant as long as they stay under the threshold at each frequency. Thus, one can use the frequency masking effect to imperceptibly maximize the embedding power.<br />

The frequency masking effect is normally described in<br />

the Fourier frequency domain. The Modified Discrete Cosine<br />

Transform (MDCT) with 50% overlapping between successive<br />

frames can perfectly reconstruct the original signal. It was<br />

shown in [6] that MDCT coefficients can be approximated by<br />

the corresponding Fourier ones with a modulating term. This<br />

similarity indicates that a masking model based on MDCT can<br />

be borrowed from the Fourier-based model without causing too much error.<br />
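The 50%-overlap MDCT with a Princen-Bradley (sine) window can be sketched directly from its definition, and the perfect-reconstruction property is easy to verify numerically. This is an illustrative implementation, not the one used in the paper; the frame length of 128 samples (N = 64) matches the processing described below.<br />

```python
import numpy as np

def mdct_frames(x, N):
    """Analysis: 50%-overlapping 2N-sample frames, sine-windowed, then MDCT."""
    w = np.sin(np.pi / (2 * N) * (np.arange(2 * N) + 0.5))   # Princen-Bradley sine window
    n, k = np.arange(2 * N), np.arange(N)
    C = np.cos(np.pi / N * (n[:, None] + 0.5 + N / 2) * (k[None, :] + 0.5))
    return [(x[h:h + 2 * N] * w) @ C for h in range(0, len(x) - 2 * N + 1, N)]

def imdct_overlap_add(frames, N):
    """Synthesis: inverse transform each frame, window again, overlap-add."""
    w = np.sin(np.pi / (2 * N) * (np.arange(2 * N) + 0.5))
    n, k = np.arange(2 * N), np.arange(N)
    C = np.cos(np.pi / N * (n[:, None] + 0.5 + N / 2) * (k[None, :] + 0.5))
    out = np.zeros(N * (len(frames) + 1))
    for t, X in enumerate(frames):
        out[t * N: t * N + 2 * N] += (2.0 / N) * (C @ X) * w
    return out

rng = np.random.default_rng(0)
N = 64                                    # half-frame length; frames are 128 samples
x = rng.standard_normal(16 * N)
y = imdct_overlap_add(mdct_frames(x, N), N)
# interior samples are reconstructed up to float rounding; the first and last
# half-frames lack an overlap partner and are excluded from the check
print(np.max(np.abs(y[N:-N] - x[N:-N])) < 1e-8)  # True
```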

In this research, the MDCT domain is chosen for embedding<br />

and the frequency masking effect is used. Being a scaled-down<br />

version of Model 1 of Layer 3 in MPEG-1 [5], our model<br />

consists of merely 18 non-uniform critical bands – to<br />

accommodate the 0~4 kHz range only.<br />

The block diagram of embedding/extraction is shown in Fig.<br />

4. Each 128-sample frame of the µ-law signal is first expanded<br />

to 16-bit linear PCM and then transformed into MDCT<br />

coefficients.<br />
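The µ-law expansion step follows the G.711 companding characteristic. The sketch below implements the continuous law only; the standard itself quantizes a piecewise-linear approximation of this curve to 8 bits, which is the source of the quantization noise discussed later.<br />

```python
import numpy as np

MU = 255.0  # mu for North American telephony

def mu_law_compress(x):
    """Continuous mu-law companding curve (input normalized to [-1, 1])."""
    return np.sign(x) * np.log1p(MU * np.abs(x)) / np.log1p(MU)

def mu_law_expand(y):
    """Inverse of the continuous curve (the 'mu to linear expansion' step)."""
    return np.sign(y) * ((1.0 + MU) ** np.abs(y) - 1.0) / MU

x = np.linspace(-1.0, 1.0, 101)
print(np.max(np.abs(mu_law_expand(mu_law_compress(x)) - x)) < 1e-12)  # True
```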

The global masking threshold is computed from the MDCT<br />

power spectrum using the masking model discussed above. One<br />

further modification to that model is to relax the threshold in [5]<br />

by flattening the slopes of each component’s spreading function<br />

on both sides; therefore, the global threshold is raised and the<br />

embedding capacity is increased. As a result, we come up with<br />

the following two settings with different aggressiveness,<br />

• Perceptible but not annoying embedding artifacts with<br />

SDR≈17.0 dB, and;<br />

• Imperceptible embedding artifacts with SDR≈22.5 dB.<br />

Each to-be-embedded bipolar bit is spread by a<br />

pseudo-random spreading sequence of length N, which is<br />

determined by the required embedding rate, e.g., with a higher<br />

rate required, we need to embed more bits into a frame; therefore,<br />

a smaller N is adopted so that more N-sample spread sequences<br />

can be fitted into the frame. The resultant spread sequences are<br />

embedded into MDCT coefficients between 0.3 and 3.3 kHz according to (1). For each of the 18 critical bands, β* and α* are computed using (11) and (12), respectively, where P is the masking threshold in that critical band. The inverse MDCT and the µ-law compression then produce a µ-law signal that carries the embedded information but is also impaired by the µ-law quantization. The extraction is simply based on the polarity of (4), as discussed earlier.<br />

IV. SIMULATIONS<br />

The measured relationship between the BER and the<br />

embedding rate can characterize the embedding capability and<br />

the robustness of the scheme. All information sequences are at<br />

least 200 bits long and the results are averaged over 10 runs, so the BERs are actually computed from at least 2000 bits to ensure high accuracy. The telephony channel is simulated by AWGN<br />

with SNR=35 dB and 0.3~3.3 kHz band-pass filtering.<br />

Simulation results are shown in Fig. 2 and Fig. 3, for<br />

SDR≈17.0 dB and SDR≈22.5 dB, respectively. It can be seen<br />

that the optimization of β does improve the performance of both the conventional and the modified schemes. With slightly<br />

perceptible embedding artifacts, i.e., the case in Fig. 3, the<br />

proposal, with an optimal β , can achieve 100 bps with a BER<br />

less than 7%.<br />

Figure 2. Rate-Distortion at SDR=17.0 dB<br />

Figure 3. Rate-Distortion at SDR=22.5 dB<br />




V. CONCLUSIONS<br />

In this research, we explored the possibility of using spread<br />

spectrum techniques for high capacity data embedding in µ-law<br />

speech signals. Our proposal can achieve about 7% and 10%<br />

BERs at 100 and 1000 bps, respectively.<br />

We would like to make two observations here. First, the rate-distortion curves are quite flat, especially in the low embedding power case of Fig. 3: the BER decreases by less than 5% even for a large rate decrease from 1000 bps to 100 bps, as shown in both Fig. 2 and Fig. 3. In other words,<br />

increasing the spreading length N does not improve the BER<br />

significantly. Second, it is understood that the large quantization<br />

noise caused by the µ-law compression plays a major role in<br />

limiting the performance. Thus, it can be a future research topic<br />

to quantitatively study this signal-dependent noise and to find<br />

ways to compensate for its adverse impact in data embedding.<br />

ACKNOWLEDGMENT<br />

L. Zhang thanks the Institute for Microstructural Sciences, National <strong>Research</strong> Council of Canada, for its generous support while he carried out this research as a visiting researcher at the Acoustics & <strong>Signal</strong> Processing <strong>Group</strong>. He would also like to thank the Electrical and Computer Engineering Department of <strong>Ryerson</strong> <strong>University</strong> for the continuous support during his master's program.<br />

Figure 4. Block diagram of speech embedding (the µ-law speech is expanded to linear PCM, decomposed by the MDCT, analyzed by the masking model to obtain the masking threshold, embedded with the extra information, reconstructed by the inverse MDCT, and compressed back to µ-law; after channel noise, the receiver expands, decomposes, and extracts the estimated information using the spreading sequence)<br />

REFERENCES<br />

[1] H. Ding, “Backward compatible wideband voice over narrowband low-resolution media,” Acoustics <strong>Research</strong> Letters Online (http://scitation.aip.org/ARLO), vol. 6, no. 1, pp. 41-47, January 2005.<br />

[2] D. Kirovski and H. S. Malvar, “Spread spectrum watermarking of audio signals,” IEEE Transactions on <strong>Signal</strong> Processing, vol. 51, no. 4, pp. 1020-1033, April 2003.<br />

[3] J. Eggers, R. Bäuml, R. Tzschoppe and B. Girod, “Scalar Costa scheme for information embedding,” IEEE Transactions on <strong>Signal</strong> Processing, vol. 51, no. 4, pp. 1003-1019, April 2003.<br />

[4] L. Zhang, “Perceptual data embedding in audio and speech signals,” Master's thesis, <strong>Ryerson</strong> <strong>University</strong>, Toronto, September 2004.<br />

[5] T. Painter and A. Spanias, “Perceptual coding of digital audio,” Proceedings of the IEEE, vol. 88, no. 4, pp. 451-515, April 2000.<br />

[6] H. V. Azghandi and P. Kabal, “Improving perceptual coding of narrowband audio signals at low rates,” IEEE International Conference on Acoustics, Speech, and <strong>Signal</strong> Processing, vol. 2, pp. 913-916, March 1999.<br />




Proceedings of the 2005 IEEE<br />

Engineering in Medicine and Biology 27th Annual Conference<br />

Shanghai, China, September 1-4, 2005<br />

COMPARISON OF JPEG 2000 AND OTHER LOSSLESS COMPRESSION SCHEMES FOR<br />

DIGITAL MAMMOGRAMS<br />

April Khademi and Sridhar Krishnan<br />

Department of Electrical and Computer Engineering<br />

<strong>Ryerson</strong> <strong>University</strong>, Toronto, ON M5B 2K3 Canada<br />

E-mail: akhademi@ieee.org, krishnan@ee.ryerson.ca<br />

Abstract<br />

In this study, we propose JPEG 2000 as an algorithm for<br />

the compression of digital mammograms and the proposed<br />

work is the first real-time implementation of JPEG 2000 on<br />

a mammogram image database. Only the lossless compression<br />

mode of JPEG 2000 was examined to ensure that the<br />

mammogram is delivered without distortion. The performance<br />

of JPEG 2000 was compared against several other<br />

lossless coders: JPEG-LS, lossless-JPEG, adaptive Huffman,<br />

arithmetic with a zero order and a first order probability<br />

model and Lempel-Ziv Welch (LZW) with 12 and 15<br />

bit dictionaries. Each compressor was supplied the identical<br />

set of 50 mammograms, each having a resolution of<br />

8bits/pixel and dimensions of 1024×1024. Experimental<br />

results indicate JPEG 2000 and JPEG-LS provide comparable<br />

compression performance since their compression ratios<br />

differed by 0.72% and both compressors also superseded the<br />

results of the other coders. Although JPEG 2000 suffered<br />

from a slightly longer encoding and decoding delay than<br />

JPEG-LS (0.8s on average), it is still preferred for mammogram<br />

images due to the wide variety of features that aid<br />

in reliable image transmission, provide an efficient mechanism<br />

for remote access to digital libraries and contribute to<br />

fast database access.<br />

Keywords: JPEG 2000, mammogram image compression,<br />

lossless compression, medical images<br />

1. INTRODUCTION<br />

A particular technology which has proved to be a vital diagnostic<br />

tool for doctors and other healthcare workers is<br />

mammography, which provides x-ray images of the breast.<br />

Mammogram images allow the trained interpreter to detect<br />

any abnormal growths or changes within the breast tissue,<br />

which could be an indication of breast cancer [1]. Since<br />

early detection of breast cancer is the leading way to reduce<br />

mortality rates, it is imperative that the diagnosing professional<br />

has efficient means of accessing and viewing a patient’s<br />

mammogram [2].<br />

0-7803-8740-6/05/$20.00 ©2005 IEEE.<br />


By digitizing mammograms and applying a series of signal<br />

processing techniques to them, it is possible to utilize<br />

technological devices and methods to make the necessary<br />

diagnostic tools more readily available to healthcare workers,<br />

potentially speeding up the diagnosis.<br />

Since digital mammograms are used for diagnosis, high<br />

resolution images are required to ensure that even the smallest<br />

irregularities are represented. As a consequence, mammogram<br />

sizes are large, requiring significant bandwidth for transmission and substantial memory for storage. To accommodate this large file size, it is imperative to identify and make use of an optimal source encoding scheme dedicated to medical images.<br />
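As an illustration of how such a lossless comparison is typically measured (with Python stdlib coders standing in for the actual JPEG 2000/JPEG-LS implementations, and a small synthetic image standing in for a 1024×1024 mammogram):<br />

```python
import lzma
import zlib
import numpy as np

def compression_ratio(raw: bytes, compressed: bytes) -> float:
    """Ratio of original size to compressed size (higher is better)."""
    return len(raw) / len(compressed)

# a smooth synthetic 8-bit "image" stands in for a mammogram here
side = 256
yy, xx = np.mgrid[0:side, 0:side]
img = ((np.sin(xx / 20.0) + np.cos(yy / 30.0) + 2.0) * 60.0).astype(np.uint8).tobytes()

for name, packed in [("zlib", zlib.compress(img, 9)), ("lzma", lzma.compress(img))]:
    print(name, round(compression_ratio(img, packed), 2))
```

Decompressing and comparing byte-for-byte against the original would confirm the lossless property that matters for diagnostic use.<br />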

Primarily, this paper investigates JPEG 2000, the latest<br />

data compression technology, and applies it to mammogram<br />

images to provide lossless compression in a novel way.<br />

2. JPEG 2000<br />

This paper investigates the compression performance of<br />

JPEG 2000 on mammographic images and its rich feature<br />

set for a medical imaging application.<br />

Only JPEG 2000’s lossless compression mode was used<br />

since the application of interest is pertinent to mammograms<br />

that are to be used for diagnosis. For lossless compression<br />

of grayscale mammograms, JPEG 2000’s encoder and decoder<br />

are shown in Fig.1.<br />

A) Tiling: Tiling is performed to significantly reduce the<br />

computational overhead and memory requirements of some<br />

of the more demanding components within the JPEG 2000<br />

codec, since future processing is performed on the smaller<br />

tile components. This allows maximum interchange between<br />

devices with limited memory resources, like a Personal<br />

Digital Assistant (PDA), giving healthcare workers<br />

more versatility to manage, transmit and receive mammograms<br />

with little effort. Furthermore, each tile component<br />

can be extracted and reconstructed independently, permitting<br />

random access to portions of the bitstream. This is<br />

useful to doctors if a specific region within a mammogram<br />



GAUSSIAN MIXTURE MODELING USING SHORT TIME FOURIER TRANSFORM<br />

FEATURES FOR AUDIO FINGERPRINTING<br />

ABSTRACT<br />

In audio fingerprinting, an audio clip must be recognized by<br />

matching an extracted fingerprint to a database of previously<br />

computed fingerprints. The fingerprints should reduce the<br />

dimensionality of the input significantly, provide discrimination<br />

among different audio clips and, at the same time, be invariant to distorted versions of the same audio clip. In<br />

this paper, we design fingerprints addressing the above issues<br />

by modeling an audio clip by Gaussian mixture models<br />

(GMM) using a wide range of easy-to-compute short time<br />

Fourier transform features such as Shannon entropy, Renyi<br />

entropy, spectral centroid, spectral bandwidth, spectral flatness<br />

measure, spectral crest factor, and Mel-frequency cepstral<br />

coefficients. We test the robustness of the fingerprints<br />

under a large number of distortions. To make the system robust,<br />

we use some of the distorted versions of the audio for<br />

training. However, we show that the audio fingerprints modeled<br />

using GMM are not only robust to the distortions used<br />

in training but also to distortions not used in training. Using<br />

spectral centroid as feature, we obtain the highest identification<br />

rate of 99.1% with a false positive rate of 10⁻⁴.<br />

1. INTRODUCTION<br />

An audio fingerprint is a compact representation of the perceptually relevant portion of the audio content. An audio fingerprint<br />

should be able to identify audio files even if they<br />

are severely distorted by perceptual coding or common signal<br />

processing operations. The type of distortions a fingerprint<br />

should withstand depends on the application. For example,<br />

audio fingerprints designed for broadcast monitoring<br />

should withstand distortions such as time compression,<br />

dynamic range compression, and equalization. An audio fingerprinting system has two principal components: fingerprint extraction and a matching algorithm. The fingerprint requirements include computational simplicity, robustness to distortions, small size, and discrimination power over a large number of other fingerprints [1]. The matching algorithm should be efficient enough to identify an audio item from a database of hundreds of thousands of songs in a few seconds. A large number of fingerprinting schemes have been proposed; for some recent work, please see [2]-[5].<br />

Arunan Ramalingam and Sridhar Krishnan<br />

Department of Electrical and Computer Engineering<br />

<strong>Ryerson</strong> <strong>University</strong>, Toronto, ON, Canada, M5B 2K3<br />

E-mail: (aramalin)(krishnan)@ee.ryerson.ca<br />

We would like to acknowledge Micronet for their financial support.<br />

0-7803-9332-5/05/$20.00 ©2005 IEEE<br />

The overview of the proposed fingerprinting scheme is<br />

shown in Fig. 1. First the incoming audio clip is preprocessed<br />

and features are extracted from it. Then, using<br />

these features, the audio clip is modeled using Gaussian<br />

mixtures. In the training phase, the mixture models of all the<br />

audio clips are stored in the database along with the metadata<br />

information. In the identification phase, the features<br />

from an unknown audio clip are used to evaluate the likelihood<br />

of all the models in the database. Then the model<br />

that is most likely to generate the features is identified as<br />

the correct audio clip.<br />

Fig. 1. Proposed Fingerprinting System (the audio input is preprocessed and framed; features are extracted and modeled by GMM into the fingerprint database in the training phase, or scored by likelihood estimation against the database to produce the identification result in the identification phase)<br />

2. FEATURE EXTRACTION<br />

In this work, we use the following features extracted from<br />

the short time Fourier transform (STFT) of the signal for<br />

fingerprint extraction. Let F_i = f_i(u), u ∈ (0, M), be the Fourier transform of the i-th frame, where M is the index of the highest frequency band. To increase the robustness<br />

of the fingerprint, the features are not extracted on<br />



the whole spectrum but on non-overlapping logarithmically<br />

spaced bands. Let F_{i,b} = f_i(u_b), u_b ∈ (l_b, u_b), where l_b and u_b are the lower and upper edges of band b. In each frame, the following features are extracted. These features<br />

have been used successfully in audio fingerprinting [6]<br />

and music classification [7].<br />

1. Spectral Centroid (SC): The spectral centroid is the<br />

center of gravity of the magnitude spectrum of the<br />

STFT and is a measure of spectral shape and “brightness”<br />

of the spectrum. Spectral centroid is defined as<br />

SC_{i,b} = Σ_{u=l_b}^{u_b} u·|f_i(u)|² / Σ_{u=l_b}^{u_b} |f_i(u)|².  (1)<br />

2. Spectral Bandwidth (SB): The spectral bandwidth is<br />

measured as the weighted average of the squared distances between the spectral components and the spectral centroid.<br />

Spectral bandwidth is defined as<br />

SB_{i,b} = Σ_{u=l_b}^{u_b} (u − SC_{i,b})²·|f_i(u)|² / Σ_{u=l_b}^{u_b} |f_i(u)|².  (2)<br />

3. Spectral Band Energy (SBE): The spectral band energy<br />

is the energy in the frequency bands normalized<br />

by the energy in the whole spectrum. Spectral band<br />

energy is calculated as<br />

SBE_{i,b} = Σ_{u=l_b}^{u_b} |f_i(u)|² / Σ_{u=0}^{M} |f_i(u)|².  (3)<br />

4. Spectral Flatness Measure (SFM): The spectral flatness<br />

measure quantifies the flatness of the spectrum<br />

and distinguishes between noise-like and tone-like signal.<br />

Spectral flatness measure is defined as<br />

SFM_{i,b} = ( Π_{u=l_b}^{u_b} |f_i(u)|² )^{1/(u_b−l_b+1)} / ( (1/(u_b−l_b+1)) Σ_{u=l_b}^{u_b} |f_i(u)|² ).  (4)<br />

5. Spectral Crest Factor (SCF): The spectral crest factor<br />

is also a measure of the tonality of the signal. Spectral<br />

crest factor is defined as<br />

SCF_{i,b} = max_u |f_i(u)|² / ( (1/(u_b−l_b+1)) Σ_{u=l_b}^{u_b} |f_i(u)|² ).  (5)<br />

6. Shannon Entropy (SE): The Shannon entropy of a signal<br />

is a measure of its spectral distribution.<br />

Shannon entropy is defined as<br />

SE_{i,b} = Σ_{u=l_b}^{u_b} |f_i(u)| log₂ |f_i(u)|.  (6)<br />

7. Renyi Entropy (RE): The Renyi entropy of a signal is<br />

also a measure of its spectral distribution. Renyi entropy<br />

is defined as<br />

RE_{i,b} = (1/(1 − r)) log( Σ_{u=l_b}^{u_b} |f_i(u)|^r ).  (7)<br />

We used Rényi entropy of order r = 2.<br />

8. Mel-frequency Cepstral Coefficients (MFCC): MFCC<br />

are perceptually motivated features based on the STFT.<br />

After taking the log-amplitude of the magnitude spectrum,<br />

the FFT bins are grouped and smoothed according<br />

to the perceptually motivated Mel-frequency scaling.<br />

Finally, in order to decorrelate the resulting feature<br />

vectors a discrete cosine transform is performed.<br />

In this work, we used 13 coefficients since this parameterization<br />

has been shown to be quite effective for<br />

speech recognition and speaker identification [8].<br />
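Features (1)-(5) can be computed per band directly from their definitions. A sketch (the band edges `lb`, `ub` below are illustrative, not the paper's logarithmic band layout):<br />

```python
import numpy as np

def band_features(f, lb, ub):
    """Per-band spectral features from one frame's magnitude spectrum f,
    following Eqs. (1)-(5); the band covers bins lb..ub inclusive."""
    u = np.arange(lb, ub + 1)
    p = np.abs(f[lb:ub + 1]) ** 2                  # band power spectrum |f_i(u)|^2
    sc  = np.sum(u * p) / np.sum(p)                # spectral centroid, Eq. (1)
    sb  = np.sum((u - sc) ** 2 * p) / np.sum(p)    # spectral bandwidth, Eq. (2)
    sbe = np.sum(p) / np.sum(np.abs(f) ** 2)       # spectral band energy, Eq. (3)
    sfm = np.exp(np.mean(np.log(p))) / np.mean(p)  # geometric/arithmetic mean, Eq. (4)
    scf = np.max(p) / np.mean(p)                   # spectral crest factor, Eq. (5)
    return sc, sb, sbe, sfm, scf

# sanity check on a flat spectrum: the centroid sits mid-band and both
# flatness and crest factor equal 1, as expected for noise-like content
f = np.ones(32)
sc, sb, sbe, sfm, scf = band_features(f, 8, 15)
print(sc, sfm, scf)  # 11.5 1.0 1.0
```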

Let X_i be the set of features extracted for frame i; X_i can be any one of the features described above. In order to better characterize the temporal variations of the signal, the first derivatives of the features,<br />

δ_i = X_i − X_{i−1},  (8)<br />

are also included in the feature matrix. In an audio clip, successive frames are related in time. To include this time dependency, a time vector is added to the feature matrix; it is taken as an incremental counter from 0 to 1. Thus the feature matrix of the entire audio clip can be described as<br />

F′_M = [ X_1, δ_1, t_1 ; X_2, δ_2, t_2 ; … ; X_N, δ_N, t_N ],  (9)<br />

where N is the number of frames in the audio clip. Finally, the feature matrix is mean-subtracted and component-wise variance-normalized to obtain the normalized feature matrix F_M.<br />
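Assembling the feature matrix of (8)-(9) with the normalization step can be sketched as follows. Treating δ_1 = 0 is our assumption, since the boundary case is not specified in the text:<br />

```python
import numpy as np

def feature_matrix(X):
    """Clip-level feature matrix of Eq. (9): per-frame features X, their
    first differences (Eq. (8)), and a 0..1 time counter, followed by
    mean subtraction and per-column variance normalization."""
    N = len(X)
    delta = np.diff(X, axis=0, prepend=X[:1])   # delta_1 = 0 by this convention
    t = np.linspace(0.0, 1.0, N)[:, None]
    F = np.hstack([X, delta, t])
    return (F - F.mean(0)) / (F.std(0) + 1e-12)

X = np.arange(12, dtype=float).reshape(6, 2)    # 6 frames, 2 features
F = feature_matrix(X)
print(F.shape, bool(np.allclose(F.mean(0), 0)))  # (6, 5) True
```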

Authorized licensed use limited to: <strong>Ryerson</strong> <strong>University</strong> Library. Downloaded on July 7, 2009 at 11:22 from IEEE Xplore. Restrictions apply.<br />



3. GAUSSIAN MIXTURE MODELS<br />

Gaussian Mixture Models (GMM) have been successfully<br />

used in audio classification [7] and content based retrieval<br />

[9]. In this work, the technique is used to model an audio<br />

fingerprint as a probability density function (PDF), using a<br />

weighted combination of Gaussian component PDFs (mixtures).<br />

During the training phase, the GMM parameters of<br />

an audio fingerprint are estimated to maximize the probability<br />

of the audio frames present in the audio fingerprint.<br />

We use the Baum-Welch (Expectation-Maximization) algorithm<br />

to estimate the GMM parameters with initialization by<br />

k-means clustering. As the feature vectors in this work<br />
have reasonably uncorrelated components, computationally<br />
convenient diagonal covariance matrices can be used. We<br />
used GMMs with 16 mixtures. Thus, in the fingerprint extraction<br />
phase, each audio clip is modeled by a GMM. During the<br />
matching phase, the fingerprint from an unknown recording<br />
is compared with the database of pre-computed GMMs, and<br />
the GMM that gives the highest likelihood for the fingerprint<br />
is identified as the correct match.<br />
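The matching phase can be sketched as follows (a minimal diagonal-covariance GMM scorer; the parameter layout and names are illustrative, and in practice the model parameters would be fit with the Baum-Welch/EM procedure described above):<br />

```python
import numpy as np

def gmm_loglik(frames, weights, means, variances):
    """Total log-likelihood of feature frames under a diagonal-covariance GMM.

    frames:    (N, d) feature matrix of the unknown clip
    weights:   (K,) mixture weights summing to 1
    means:     (K, d) component means
    variances: (K, d) per-dimension variances
    """
    frames = np.asarray(frames, dtype=float)
    ll = np.zeros((frames.shape[0], len(weights)))
    for k, (w, m, v) in enumerate(zip(weights, means, variances)):
        # log N(x; m, diag(v)) + log w, per frame
        ll[:, k] = (np.log(w)
                    - 0.5 * np.sum(np.log(2.0 * np.pi * v))
                    - 0.5 * np.sum((frames - m) ** 2 / v, axis=1))
    # log-sum-exp over components, summed over frames
    mx = ll.max(axis=1, keepdims=True)
    return float(np.sum(mx[:, 0] + np.log(np.sum(np.exp(ll - mx), axis=1))))

def best_match(frames, models):
    """Return the id of the pre-computed model with the highest log-likelihood."""
    return max(models, key=lambda mid: gmm_loglik(frames, *models[mid]))
```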

4. RESULTS<br />

We used a database containing 250 five-second audio clips<br />

chosen from the categories of rock, pop, country, classical,<br />

and jazz. The audio clips are chosen from random portions<br />

of songs from Compact Discs.<br />

4.1. Robustness to Distortions<br />

We used several distorted versions of the audio clips to test<br />

the robustness of the proposed scheme. We used the following<br />

distorted versions in our tests.<br />

I. Compression – 1) MP3 at 32 kbps, 2) AAC at 32<br />

kbps, 3) WMA at 32 kbps, 4) Real encoding at 32<br />

kbps.<br />

II. Amplitude distortion – 1) 3:1 Compression above<br />
30 dB, 2) 3:1 Expander below 10 dB, 3) 3:1 Compression<br />
below 10 dB, 4) Limiter at 9 dB, 5) ‘Superloud’<br />

amplitude distortion, 6) Noise gate at 20 dB, 7)<br />

De-esser, 8) Nonlinear amplitude distortion.<br />

III. Frequency distortion – 1) Nonlinear bass distortion,<br />

2) Midrange frequency boost, 3) Notch Filter, 750 -<br />

1800 Hz, 4) Notch Filter 430 - 3400 Hz, 5) Telephone<br />

bandpass, 135 - 3700 Hz, 6) Bass cut, 7) Bass boost.<br />

IV. Change in pitch – 1) Lower pitch 2 - 6 %, 2) Raise<br />

pitch 2 - 6 %.<br />

V. Change in speed – 1) Linear speed increase 2 - 6%,<br />

2) Linear speed decrease 2 - 6%.<br />

VI. Resampling at 8 kHz<br />

VII. Echo addition<br />

To increase the robustness of the fingerprints, in addition<br />

to the original audio, some distorted versions of the<br />

audio are also used in training. We used the following distorted<br />

versions in our training: 1) Undistorted audio, 2) 3:1<br />
Compression above 30 dB, 3) Nonlinear amplitude distortion,<br />

4) Nonlinear bass distortion, 5) Midrange frequency<br />

boost, 6) Notch Filter, 750 - 1800 Hz, 7) Notch Filter 430<br />

- 3400 Hz, 8) Raise Pitch 1%, 9) Lower Pitch 1%. The<br />

log-likelihood of the test clips are evaluated for all the models<br />

in the database. Then the model that gives the highest<br />

log-likelihood is taken as the correct match. Table 1 shows<br />

the percentage of clips that are correctly identified for different<br />

features for distortions used in training as well as for<br />

distortions not used in training. The results show that it is<br />

not necessary to train the model for all possible distortions.<br />

By training the model to some representative distortions, we<br />

can obtain robustness to a wide variety of distortions.<br />

Table 1. Mean recognition rate (%) for distortions<br />

Feature Train Test Mean<br />

MFCC 99.0 98.5 98.7<br />

Spectral centroid 99.4 99.1 99.2<br />

Spectral bandwidth 99.4 98.9 99.1<br />

Spectral band energy 98.8 98.8 98.8<br />

Spectral flatness measure 99.4 98.6 98.9<br />

Spectral crest factor 99.2 98.6 98.8<br />

Shannon Entropy 99.4 98.8 99.0<br />

Renyi Entropy 99.4 98.9 99.0<br />

4.2. False Positive <strong>Analysis</strong><br />

In the previous section it was assumed that the test clip is<br />

present in the database. Hence the model that gives the<br />

highest log-likelihood value is identified as the correct match.<br />

However, it is possible that the test clip may not be in the<br />
database, so there should be a criterion to reject the audio<br />
clips that are not in the database. A suitable threshold<br />

for log-likelihood can be used to vary the false positive and<br />

false negative rates. The false positive and the corresponding<br />

identification rate are shown in Figs. 2 and 3. The percentage<br />

of audio clips correctly identified at different false<br />

positive rates are shown in Table 2. Among the different<br />

features used, spectral centroid gives the highest identification<br />

rate of 99.1% with a false positive rate of 10^-4. MFCC<br />
performs poorly, with an identification rate of 13%. All the<br />
features except the spectral flatness measure give an identification<br />
rate of more than 90% with a false positive rate of 10^-3.<br />
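The rejection rule described above can be sketched as (the threshold value and names are illustrative):<br />

```python
def identify(loglik_per_model, threshold):
    """Return the best-matching model id, or None if even the best
    log-likelihood falls below the rejection threshold (clip not in DB)."""
    best_id = max(loglik_per_model, key=loglik_per_model.get)
    if loglik_per_model[best_id] < threshold:
        return None  # reject: raising the threshold lowers the false positive rate
    return best_id
```

Sweeping the threshold trades false positives against false negatives, which is how curves such as those in Figs. 2 and 3 are generated.<br />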




Fig. 2. Identification rates at different false positive rates for<br />

MFCC, Spectral centroid, Spectral bandwidth, and Spectral<br />

band energy<br />

Table 2. Identification Rate at different false positive rates<br />

Feature 10^-4 10^-3 10^-2<br />

MFCC 13.5 98.4 99.3<br />

Spectral centroid 99.1 99.5 99.8<br />

Spectral bandwidth 93.2 98.4 99.3<br />

Spectral band energy 69.2 94.3 99.2<br />

Spectral flatness measure 31.8 56.4 96.6<br />

Spectral crest factor 93.0 98.4 99.3<br />

Shannon Entropy 71.1 93.9 99.4<br />

Renyi Entropy 64.0 99.3 99.7<br />

5. CONCLUSION<br />

Gaussian Mixture Models have been successfully used in<br />

many classification and identification problems in audio. In<br />

this work, we modeled audio recordings for audio fingerprinting<br />

by Gaussian mixtures using features extracted from<br />

the STFT of the signal. Even though we use some distorted<br />

samples of the audio during training, the system is robust to<br />

distortions not used in training. Using the spectral centroid as<br />
the feature, we obtain the highest identification rate of 99.1%<br />
with a false positive rate of 10^-4.<br />

6. REFERENCES<br />

[1] P. Cano, E. Batle, T. Kalker, and J. Haitsma, “A review<br />

of algorithms for audio fingerprinting,” in IEEE<br />

Workshop on Multimedia Signal Processing, December<br />
2002, pp. 169–173.<br />

[2] J. Herre, O. Hellmuth, and M. Cremer, “Scalable robust<br />
audio fingerprinting using MPEG-7 content description,”<br />
in IEEE Workshop on Multimedia Signal<br />
Processing, December 2002, pp. 165–168.<br />

Fig. 3. Identification rates at different false positive rates<br />
for Spectral flatness measure, Spectral crest factor, Shannon<br />
Entropy and Renyi Entropy<br />

[3] J. Haitsma and T. Kalker, “A highly robust audio fingerprinting<br />

system,” in Proc. of the 3rd Int. Symposium on<br />

Music Information Retrieval, October 2002, pp. 144–<br />

148.<br />

[4] V. Venkatachalam, L. Cazzanti, N. Dhillon, and<br />

M. Wells, “Automatic identification of sound recordings,”<br />

IEEE Signal Processing Magazine, vol. 21, no.<br />
2, pp. 92–99, March 2004.<br />

[5] C.J.C. Burges, J.C. Platt, and S. Jana, “Distortion<br />

discriminant analysis for audio fingerprinting,” IEEE<br />

Transactions on Speech and Audio Processing, vol. 11,<br />

no. 3, pp. 165–174, May 2003.<br />

[6] E. Allamanche, B. Fröba, J. Herre, T. Kastner,<br />
O. Hellmuth, and M. Cremer, “Content-based identification<br />
of audio material using MPEG-7 low level description,”<br />

in Proceeding of the International Symposium<br />

on Music Information Retrieval (ISMIR), Indiana,<br />

USA, October 2002.<br />

[7] G. Tzanetakis and P. Cook, “Musical genre classification<br />

of audio signals,” IEEE Trans. on Speech and Audio<br />
Processing, vol. 10, no. 5, pp. 293–302, July 2002.<br />

[8] L. R. Rabiner and B. H. Juang, Fundamentals of Speech<br />

Recognition, Prentice-Hall, Englewood Cliffs, NJ,<br />

1993.<br />

[9] D. Pye, “Content-based methods for the management of<br />

digital music,” in Proceedings of ICASSP, 2000, vol. 4,<br />

pp. 24–27.<br />



MULTIPATH MITIGATION OF GNSS CARRIER PHASE SIGNALS<br />

FOR AN ON-BOARD UNIT FOR MOBILITY PRICING<br />

Ronesh Puri, Ahmed El Kaffas, Alagan Anpalagan, Sridhar Krishnan<br />

Department of Electrical and Computer Engineering, <strong>Ryerson</strong> <strong>University</strong>, Toronto, ON, M5B 2K3<br />

rpuri | aelkaffa | alagan | krishnan @ee.ryerson.ca<br />

Bern Grush<br />

Applied Location Corporation, 34 Dodge Rd, Toronto, ON, M1N 2A7. bgrush@appliedlocation.com<br />

Abstract<br />

Inexpensive navigation-grade receivers are insufficiently<br />

accurate for the task of building a Global Navigation Satellite<br />

System [GNSS]-based parking meter for urban multipath<br />

conditions. Survey-grade instruments that demonstrate cm<br />

accuracy are inappropriate, and are two orders of magnitude<br />

too expensive, for this mass application. We identify three ways<br />

in which a digital signal processor added to a stationary,<br />

navigation-grade receiver can add considerable accuracy (in<br />

the range of 1-2 m, down from 5-10 m) for a parking meter.<br />

First, we apply a pseudo-multipath-based filter and a modified<br />

Receiver Autonomous Integrity Monitoring [RAIM]-derivative<br />

filter to the received carrier phase signals, allowing us to infer<br />

which signals are most affected by noise processes and to<br />

compute receiver position with the remaining signals for<br />

greater accuracy. Second, we take advantage of receiver<br />

stationarity to dwell on these signals for several minutes,<br />

allowing us to acquire a signal characterization metric that is<br />

more stable than might be possible with a non-stationary<br />

receiver. This is intended for non-repudiation. As a third step<br />

we will later experiment with ways to monitor the multipath<br />

behaviour of individual signals on approach to a parking event<br />

in a way that may allow us to more effectively weigh our initial<br />

signal selection criteria. Independent of these three<br />

opportunities, we also take advantage of dual GPS/Galileo<br />

receivers, a capability that we simulate in this experiment.<br />

Testing of the multipath mitigation filters described in this<br />

paper on two simulated GPS/Galileo datasets yielded<br />

reductions in the standard deviation of the position estimate<br />

that ranged from −4% to 61.6% (avg: 34.4%) when compared<br />

to the control (unfiltered) position calculation.<br />

Keywords: GPS; Galileo; GNSS; Multipath Mitigation; RAIM;<br />

Urban Canyon; Parking; Parklog; Road-Pricing.<br />

1. INTRODUCTION<br />

A number of countries seek solutions for reliable and cost<br />

effective metering for zone-based road pricing. For economic<br />

and other reasons, GNSS signals are the prime target for this<br />

solution [1-4]. An alternative to the commonly expected<br />

“tracklog” is the use of a “parklog”: a log of parking events<br />

with a minimal amount of data describing the intervening trip<br />

0-7803-8886-0/05/$20.00 ©2005 IEEE<br />

CCECE/CCGEI, Saskatoon, May 2005<br />


segments. The parklog is less data intensive, more accurate<br />
(i.e., more non-repudiable), and is a good proxy to a full<br />
tracklog in zone-pricing applications. In addition, whenever the<br />
accuracy of the endpoints of the trips (the parking events) is<br />

sufficient, this same meter could be used as a parking meter for<br />

that parking event, yielding a way to meter for any<br />

combination of road use, parking use and pay-as-you-drive<br />

insurance in a single system. The principal advantage of a<br />

three-in-one meter is the distribution of infrastructure costs<br />

over three sectors (road, parking and insurance) making it<br />

possible for a road-pricing meter to “pay for itself” in parking<br />

and insurance discounts from the motorist’s perspective [5].<br />

To enable a highly effective device, we have set a design<br />

goal of 1.5m-2m accuracy, 99% of the time in 75% of the<br />

parking lots in a city with the building density of Toronto,<br />

Canada. Even when a parking location cannot be known<br />

accurately enough to assess a fee, both road-pricing and<br />

insurance-pricing can still proceed. This gives the “parklog”<br />

the flavor of a disruptive technology, disrupting both dedicated<br />
short-range communication [DSRC] and the tracklog for road pricing.<br />

2. METHODS<br />

2.1 Multipath Mitigation<br />

Among the noise sources contributing to positioning error,<br />

multipath is the most difficult to characterize in a way that<br />

allows unambiguous mitigation. When other error sources are<br />

controlled, multipath can become the largest remaining<br />

contributor to unmodeled noise/interference. The causes and<br />

properties of this process are well described elsewhere [6-8].<br />

Of the four classes of mitigation techniques: antenna<br />

positioning, hardware compensation (antenna design), software<br />

mitigation and static antenna arrays with signal correlators [6],<br />

software mitigation is the only feasible approach for an onboard<br />

parking application. Antenna siting will seldom be<br />

optimal. Increased hardware size, complexity and expense are<br />

aesthetically, operationally and economically unacceptable,<br />

since the antenna for the target meter must be mounted in or on<br />

many millions of private vehicles.<br />

2.2 Simulating Galileo<br />

Collecting GNSS signals in densely built-up areas (“urban<br />

canyon”) often results in a diminished number of usable<br />



signals. On some occasions, when using only a single system<br />

such as GPS, there may be too few to calculate a horizontal<br />

position (a minimum of four signals is best). This accounts for<br />

the frequent loss of position lock requiring ancillary aids such<br />

as inertial navigation. In a parking application, we must rely on<br />

GNSS signals only, so that our process would frequently fail to<br />

fulfill our stated design goal without a redundant system,<br />

which in our case is the European Union’s Galileo, expected to<br />

be operational in 2008.<br />

Dual GPS/Galileo receivers are expected to improve<br />

position availability and accuracy considerably. As recently<br />

reported by Feng [9], dual receivers “will increase service<br />

coverage from 55% to 95% notably in the urban areas where<br />

most mass-market applications are developed.” The following<br />

table details the expected improvement.<br />

Scenario | Availability of 20 m 95% 2D accuracy, 28 GPS only (%) | 28 GPS + 27 Gal (%) | Accuracy and availability (satellites only), 28 GPS only (m/%) | 28 GPS + 27 Gal (m/%) | Accuracy/availability, differential, 28 GPS + 27 Gal (m/%)<br />
Open Sky | 90% | 100% | 7/95 | 4/95 | 1.5/95<br />
Suburban | 70% | 100% | 32/90 | 8/95 | 4/95<br />
Low-rise | 30% | 90% | 17/50 | 14/95 | 7/95<br />
High-rise | 15% | 80% | – | 42/90 | 25/90<br />

Table 1: Performance improvements resulting from both GPS<br />
and Galileo constellations for urban operations (adapted<br />
from [9])<br />

In our work we simulate a dual GPS/Galileo receiver by<br />

combining two sets of GPS data collected with a<br />

uBlox/ANTARIS TIM-LP receiver separated by three or more<br />

hours (i.e. not within three hours of an integer multiple of a<br />

sidereal day) so that the two visible satellite sub-constellations<br />

are essentially independent. A data sample from an actual dual<br />

receiver would exhibit at least as good a geometric distribution<br />

as would this “poor-man’s” simulation, hence we argue that<br />

this simulation technique does not unduly favor our approach.<br />
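The separation criterion above can be sketched as a small check (function and parameter names are illustrative; capture times are in seconds):<br />

```python
SIDEREAL_DAY_S = 86164.0905   # mean sidereal day in seconds

def constellations_independent(t1_s, t2_s, min_offset_h=3.0):
    """True when two capture times are far enough from an integer number of
    sidereal days apart that the two visible GPS sub-constellations are
    essentially independent (the GPS ground track repeats each sidereal day)."""
    off = abs(t2_s - t1_s) % SIDEREAL_DAY_S
    return min(off, SIDEREAL_DAY_S - off) >= min_offset_h * 3600.0
```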

Figure 1: An example of two GPS constellations viewed by a<br />
stationary receiver and separated by 3.5 hrs (SW corner, 10:25<br />
and 13:58). See also Figure 3.<br />

2.3 Software Mitigation<br />

The key assumption in software mitigation is that it is<br />

possible to infer, in near realtime, which of the pseudo-range<br />

signals available at a given moment are contributing more error<br />


than others. Extensive work in this area, Receiver Autonomous<br />

Integrity Monitoring (RAIM), focuses on real-time<br />

determination of failure of one of several SVs (space vehicles)<br />

in safety of life applications [10], This work has been extended<br />

to include multiple failures and has led to the development of<br />

related techniques to determine which signals may be more<br />

subject to multipath disturbance in a dynamic, unaided manner.<br />

Our work relies on some of these extensions.<br />

Bisnath and Langley [6] extend earlier methods to compute<br />

an inferred GPS observable they call pseudo-multipath,<br />

incorporating pseudorange multipath, tracking error and<br />

receiver noise, making it a good indicator of unmodeled error<br />

and noise for position estimation, the predominant component<br />

of which is multipath. Related work by Nayak et al. [7]<br />

develops this same measure, which they call code-carrier<br />

residual (r). We use this formulation to weigh each signal in a<br />

data sample to determine whether to include that signal in the<br />

position calculation.<br />

Misra and Bednarz [10] extend RAIM techniques to deal<br />

with multiple SV failures. Their method, referred to, in this<br />

paper, as Misra04, provides for randomly selecting numerous<br />

subsets of 6 or 7 signals from a larger set of available signals,<br />

such as would be available to a dual GPS/Galileo receiver.<br />

Pseudo-random selection is constrained to minimize dilution of<br />

precision (horizontal dilution of precision (HDOP) in our<br />

case), and repeated selection and position calculations are<br />

clustered and outliers are observed to de-weight SVs. We use<br />

this algorithmic approach to exclude noisy signals that were<br />

not filtered out by the code-carrier residual (pseudo-multipath).<br />

These two methods in combination select the least noisy<br />

signals, constrained by a requirement for a constellation subset<br />

yielding good geometry for subsequent position calculation.<br />

Merge two several-minute GPS readings, sufficiently<br />
separated in time, to simulate a dual receiver<br />
↓<br />
Drop signal(s) based on the code-carrier residual filter [6,7]<br />
↓<br />
Drop signal(s) based on the Misra04 (RAIM-derivative) filter [10]<br />
↓<br />
Compute LAT, LON using the remaining signals<br />
↓<br />
Compute the associated characterization<br />

Figure 2: The filtering and position calculation process is set<br />
up as illustrated here and detailed in the following section.<br />

The reader might question the efficacy of this degree of<br />

filtering given that the receivers are stationary. However,<br />

consider that in a complex multipath environment in which<br />

signals are collected for several minutes, the movement of the<br />

SVs, the movement of tree crowns, and passing vehicles might<br />

each affect the relative degree of multipath of each SV from<br />

moment-to-moment as it impinges on the stationary antenna.<br />

We will be exploring this further in subsequent work.<br />



3. THE PROPOSED ALGORITHM<br />

Following from the previous section, we detail each stage<br />

in the process: two filter stages, signal selection and final<br />

position calculation.<br />

Input to this process is carrier-phase data, captured every<br />

second.<br />

3.1 Pseudo-Multipath based filter<br />

For the first stage of our dual-filter method, we compute the<br />

pseudo-multipath observable, r, and its standard deviation, σ_r,<br />
for each visible SV:<br />

r = 2·d_ion − λN + ε(p) + ε(φ),<br />

where:<br />
d_ion  ionospheric delay (m)<br />
λ      wavelength of the L1 carrier (m)<br />
N      the integer cycle ambiguity (cycles)<br />
ε(p)   code noise (receiver noise + multipath) (m)<br />
ε(φ)   carrier phase noise (receiver noise + multipath) (m)<br />

The SVs are ranked in ascending order of the magnitude of<br />

σ_r. Since we suspect signals with higher variance, we simply<br />
discarded the single most suspect signal in this first<br />
experiment.<br />

The full derivation of r is developed in [7], and is described<br />

therein as containing:<br />

“twice the atmospheric error, the carrier phase ambiguity, code receiver<br />

noise, and code multipath. Carrier receiver noise and multipath can be<br />

neglected since they are very small compared to the code values. The<br />

ambiguity term is a constant if there are no cycle slips whereas the<br />

ionospheric error generally varies slowly over time. A piece-wise linear<br />

regression model can therefore be implemented to remove terms due to the<br />

ionosphere and ambiguity. Since the ionospheric error changes with time, a<br />

regression model [could be] implemented over predefined averaging<br />

intervals. … The resulting code-carrier residual, [r], will contain multipath<br />

and receiver noise and can be used for further analysis. Subtracting out the<br />

mean removes not only the integer ambiguity, but also the bias components<br />

present in all of the remaining terms. Code multipath is a nonzero mean<br />

process and this technique only isolates relative multipath effects and not<br />

the absolute multipath because the regression process removes the portion<br />

of multipath with nonzero mean.”<br />

In our application, we are using this as one of two<br />

“advisors” to help us select the signals least disturbed by<br />

multipath. Hence, the fact that this is only a relative indicator<br />

and that it also incorporates minor components of other error<br />

sources does not detract from its value as a way to identify the<br />

noisiest signals.<br />
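This first filter stage can be sketched as follows (a minimal version assuming per-SV time series of the code-carrier residual r are already available; function and variable names are illustrative):<br />

```python
import statistics

def drop_noisiest_sv(residuals):
    """First filter stage: rank SVs by the standard deviation of their
    code-carrier residual series and discard the single most suspect one.

    residuals: dict mapping SV id -> list of per-epoch residual values (m)
    Returns (discarded_sv, surviving_sv_ids).
    """
    sigma = {sv: statistics.pstdev(r) for sv, r in residuals.items()}
    worst = max(sigma, key=sigma.get)   # highest sigma_r = most multipath-suspect
    return worst, [sv for sv in residuals if sv != worst]
```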

3.2 Modified Misra04 (RAIM-derivative) filter<br />

The steps we used in our adaptation of the Misra04<br />

algorithm [10], are as follows:<br />

1. Set K as the number of SVs in view less the one<br />

rejected by the pseudo-multipath filter;<br />

2. Divide the sky into six bins as shown in Figure 3;<br />

3. Characterize each SV (space vehicle) as belonging to<br />

one of the bins, based on its elevation and azimuth;<br />

4. Select 4K subsets of SVs from the original set of SVs<br />

as follows:<br />

select one SV randomly from each of the six bins,<br />
select two more SVs from the remaining SVs, and<br />
if PDOP > 3, select one more from those remaining.<br />

5. Compute, then cluster 4K positions using these<br />

selections.<br />

6. Compute the mean of the cluster of positions;<br />

7. Compute the distance of each computed position from<br />

the mean of the cluster<br />

8. Divide the cluster into 5 concentric rings around the<br />

cluster mean each ring incremented by d=0.2M where<br />

M is the distance of the farthest position from the<br />

cluster mean. Hence the rings are at d, 2d, … 5d from<br />

the cluster mean.<br />

9. For each of the 4K positions, assign a value from 1 to<br />

5 to every contributing SV, based on the concentric<br />

ring that the position falls in.<br />

10. Sum these assigned values from each of the K SVs<br />

11. Discard the signal for the highest ranked SV<br />
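Steps 4 to 11 can be sketched as follows (a simplified version: the subset draw here is uniform rather than bin- and DOP-constrained, and `positions_for` stands in for the per-subset position solver; all names are illustrative):<br />

```python
import math
import random

def misra04_discard(svs, positions_for, n_subsets=None, rings=5):
    """Sketch of steps 4-11: repeatedly fix position from random SV subsets,
    score each SV by how far the fixes it contributed to land from the
    cluster mean, and return the most suspect SV to discard."""
    k = len(svs)
    n_subsets = n_subsets or 4 * k
    subsets, fixes = [], []
    for _ in range(n_subsets):
        subset = random.sample(svs, min(6, k))      # stand-in for bin/DOP-constrained draw
        subsets.append(subset)
        fixes.append(positions_for(subset))         # per-subset (lat, lon) solution
    mean = (sum(p[0] for p in fixes) / len(fixes),
            sum(p[1] for p in fixes) / len(fixes))  # mean of the cluster of positions
    dists = [math.hypot(p[0] - mean[0], p[1] - mean[1]) for p in fixes]
    d = (max(dists) or 1.0) / rings                 # ring width = 0.2 * max distance
    scores = {sv: 0 for sv in svs}
    for subset, dist in zip(subsets, dists):
        ring = min(rings, int(dist // d) + 1)       # concentric ring index 1..5
        for sv in subset:
            scores[sv] += ring                      # outlier fixes penalize their SVs
    return max(scores, key=scores.get)              # highest total = discard candidate
```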

Figure 3: We divided the sky into six bins (split at 40°<br />
elevation above the horizon) as described in [10]. The two<br />
symbols represent SVs from the two constellation<br />
configurations shown in Figure 1.<br />

3.3 Static Position calculation<br />


We are now left with the original, merged dataset (i.e., the<br />

dataset that simulates a dual GPS/Galileo receiver) less the<br />

signals from the two SVs that were rejected as the least<br />

trustworthy signals. We compute receiver position at each<br />

second from this filtered dataset, then compute mean and<br />

covariance for these position sets. The mean is our new<br />

position estimate for the position of the stationary receiver and<br />

the covariance matrix is an element of characterization data.<br />
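The final position step can be sketched as (a minimal version; `fixes` is an illustrative name for the per-second position solutions):<br />

```python
import numpy as np

def position_estimate(fixes):
    """Mean of the per-second (lat, lon) fixes is the position estimate;
    the covariance matrix serves as characterization data."""
    fixes = np.asarray(fixes, dtype=float)   # shape (T, 2): one fix per second
    return fixes.mean(axis=0), np.cov(fixes, rowvar=False)
```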

4. EXPERIMENTAL RESULTS<br />

In order to gauge the efficacy of our processing we will<br />

need to compare position calculations with and without our<br />

process. Since we are reading carrier phase signals with a<br />

commercial receiver (TIM-LP) prior to the application of<br />

proprietary filters to which we have no access, we are required<br />

to perform our own position calculations, both for our<br />

approach and for the control approach. This means that our<br />

position estimates may not be as accurate as those produced by<br />

the commercial receiver. However, it is the relative<br />

improvement in which we are interested.<br />



For our first test, we recorded two 15-minute data sets from<br />

a stationary receiver, 3.5 hours apart. The location was an older<br />

neighborhood, 3 or 4 meters from two 2-storey houses with a<br />

large-canopied tree about 6 meters away and other houses and<br />

mature trees somewhat further away. The effect of filtering this<br />

first GPS dataset can be seen in Figure 4.<br />

Figure 4: The larger scatter represents unfiltered position<br />

calculations, while the smaller scatter represents positions after<br />

filtering. The two ellipses represent the 3σ distance (in<br />

degrees) from the mean of each cluster.<br />

Covariance, unfiltered scatter: [ 0.3289  −0.5489 ; −0.5489  1.3408 ] × 10^-8<br />
Covariance, filtered scatter:   [ 0.0838  −0.0516 ; −0.0516  0.1977 ] × 10^-8<br />

Table 2: Covariance matrices from the two scatters in Figure 4<br />
(each element represents σ²; hence, units are degrees squared).<br />

The covariance matrices from these two scatters are shown<br />

in Table 2. By comparing the ratios of standard deviations for<br />

LAT and LON taken from these matrices we get a sense of the<br />

percentage level of reduction achieved by these filters.<br />

To illustrate with the first element, the variance in degrees<br />
LAT (σ²_LAT), we calculated the percentage change in standard<br />
deviation as:<br />

1 − (σ_LAT,filtered / σ_LAT,unfiltered)<br />

Hence the percentage change in standard deviation for LAT<br />

and LON are: 49.5% and 61.6%, respectively, representing a<br />

considerable reduction in sigma values.<br />
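These percentages follow directly from the diagonal (variance) entries of Table 2; as a quick check (the 10^-8 scale factor cancels in the ratio, and the function name is illustrative):<br />

```python
import math

def sigma_reduction(var_unfiltered, var_filtered):
    """Percentage change in standard deviation, 100 * (1 - sigma_f / sigma_u)."""
    return 100.0 * (1.0 - math.sqrt(var_filtered / var_unfiltered))

lat = sigma_reduction(0.3289, 0.0838)   # LAT variance entries from Table 2
lon = sigma_reduction(1.3408, 0.1977)   # LON variance entries from Table 2
```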

A second test, recorded similarly, with the two data subsets<br />

7 hours apart and several meters away, endured less multipath<br />

effects and provided good, but less dramatic results, shown in<br />

Figure 5.<br />

Figure 5: The second data set<br />


Covariance, unfiltered scatter: [ 0.0330  −0.0051 ; −0.0051  0.0519 ] × 10^-8<br />
Covariance, filtered scatter:   [ 0.0159  −0.0031 ; −0.0031  0.0561 ] × 10^-8<br />

Table 3: Covariance matrices from the two scatters in Figure 5.<br />

The percentage change in standard deviation values for<br />

LAT and LON are: 30.6% and -4%, respectively, representing<br />

a significant but mixed reduction in sigma values (LON did not<br />

improve, and may have worsened).<br />

5. CONCLUSIONS and FUTURE WORK<br />

We have shown that it is feasible to reduce positioning<br />

variance due to multipath in the case of a static GNSS receiver.<br />

In these two experiments the higher multipath data showed the<br />

greatest improvement; of course, much more testing is needed.<br />

By gathering signals for a modest amount of time (we propose<br />

7 to 10 minutes) and using techniques to isolate signals that<br />

contribute relatively more noise than others, and by taking<br />

advantage of the expected dual GPS/Galileo receivers, we are<br />

optimistic we can specify a processor that would be the<br />

positioning engine for a reliable in-vehicle meter for roadpricing,<br />

pay-as-you drive insurance, and most parking-pricing.<br />

For our first experiment with this approach to reduce<br />

variation in position error for a stationary GNSS receiver, we<br />

have successfully adapted and simplified two existing results<br />

from the literature. Clearly, making decisions to drop the least<br />

trustworthy signals helps, but it is also understood that which<br />

signals are best at any one moment can change, even for a<br />

stationary receiver. For this reason, we are currently exploring<br />

with good success several additional ideas. These include<br />

time-slicing the signals into numerous smaller windows,<br />

iterative removal of 0 or more SVs (rather than removal of<br />

exactly one SV per filter), dynamic thresholds, and others. We<br />

expect to be able to improve considerably on the current<br />

results.<br />


Authorized licensed use limited to: Ryerson University Library. Downloaded on July 7, 2009 at 11:06 from IEEE Xplore. Restrictions apply.<br />



A SIGNAL CLASSIFICATION APPROACH USING TIME-WIDTH VS FREQUENCY BAND<br />

SUB-ENERGY DISTRIBUTIONS<br />

Karthikeyan Umapathy<br />

Dept. of Electrical and Computer Engg.,<br />

The University of Western Ontario,<br />

London, ON N6A 5B8, Canada<br />

Email: kumapath@uwo.ca<br />

ABSTRACT<br />

Time-frequency (TF) signal decompositions provide us with ample<br />

information and extreme flexibility for signal analysis. By applying<br />

suitable processing on the TF decomposition parameters,<br />

even subtle signal characteristics can be revealed. In many real<br />

world applications, identification of these subtle differences makes<br />

a significant impact in signal analysis. Particularly in classification<br />

applications using TF approaches, there may be situations where<br />

a localized, highly discriminative signal structure is diluted due to<br />

the presence of other overlapping signal structures. To address<br />

this problem we propose a novel approach to construct multiple<br />

time-width vs frequency band mappings based on the energy decomposition<br />

pattern of the signal. These mappings are then analyzed<br />

to locate the highly discriminative features for classification.<br />

Initial results with two real-world biomedical signal databases, (1)<br />

Vibroarthrographic (VAG) signals and (2) Pathological speech signals,<br />

indicate high potential for the proposed technique.<br />

1. INTRODUCTION<br />

Time-frequency (TF) transformations have significantly contributed<br />

towards complex signal analysis and automatic classification. In<br />

classification applications using a TF approach, it is often a small<br />

area or pockets of areas in the TF plane that actually exhibit the<br />

difference between the classes of signals. Within these small areas,<br />

there may be overlapping multiple signal components with varying<br />

discriminative characteristics. The overall discriminative power of<br />

the area is normally decided by the high-energy signal components,<br />

which dilute the discriminative characteristics of lower-energy signal<br />

components. It may so happen that a highly discriminative but low-energy<br />

component is masked by a less discriminative but high-energy<br />

component. Typical biomedical signals contain a mixture of<br />

coherent and non-coherent signal structures with varying localized<br />

overlaps. Using some criteria, if we can separate these localized<br />

overlapping structures, it may lead to a better understanding of the<br />

signal, and thereby help extract highly discriminative features for classification<br />

applications.<br />

In general, all real world signals contain both coherent and<br />

non-coherent structures. Coherent structures have definite TF localization,<br />

unlike the non-coherent structures. Any iterative decomposition<br />

algorithm such as matching pursuits with TF dictionaries<br />

models the coherent structures during the initial iterations, as<br />

they correlate well with the dictionary elements. The non-coherent<br />

Thanks to NSERC for funding this research work.<br />

Sridhar Krishnan<br />

Dept. of Electrical and Computer Engg.,<br />

Ryerson University,<br />

Toronto, ON M5B 2K3, Canada<br />

Email: krishnan@ee.ryerson.ca<br />

structures on the other hand are broken into finer and finer structures<br />

until the information is diluted across the whole dictionary [1].<br />

The contribution of coherent and non-coherent structures in a signal<br />

decides the energy capture pattern of the decomposition algorithms.<br />

The authors' previous work [2] introduced a novel time-width<br />

vs frequency band mapping (constructed from the decomposition<br />

parameters) to identify the highly discriminative TF tilings<br />

between different classes of signals using the Local Discriminant Bases<br />

(LDB) algorithm. The proposed work uses a similar mapping,<br />

but splits it into multiple mappings to identify better<br />

discriminatory features.<br />

The paper is organized as follows: Section 2 covers the methodology,<br />

consisting of the adaptive time-frequency transform, multiple<br />

TFD slices, multiple sn vs fn mappings, databases, feature extraction<br />

and pattern classification. Results and discussion are given in<br />

Section 3, and conclusions in Section 4.<br />

2. METHODOLOGY<br />

2.1. Adaptive Time-frequency Transform (ATFT)<br />

The signal decomposition technique used in this work is based on<br />

the matching pursuit (MP) [1] algorithm. MP is a general framework<br />

for signal decomposition. The nature of the decomposition<br />

varies according to the dictionary of basis functions used. When<br />

a dictionary of TF functions is used, MP yields an adaptive time-frequency<br />

transformation [1]. In MP, any signal x(t) is decomposed<br />

into a linear combination of K TF functions g(t) selected<br />

from a redundant dictionary of TF functions as given by:<br />

x(t) = \sum_{n=0}^{K-1} \frac{a_n}{\sqrt{s_n}} \, g\!\left(\frac{t - p_n}{s_n}\right) \exp\{j(2\pi f_n t + \phi_n)\}, \qquad (1)<br />

where an is the expansion coefficient; the scale factor sn, also<br />

called the octave or time-width parameter, is used to control the<br />

width of the window function; and the parameter pn controls the<br />

temporal placement. The parameters fn and φn are the frequency<br />

and phase of the exponential function respectively. The signal<br />

x(t) is projected over a redundant dictionary of TF functions with<br />

all possible combinations of scaling, translation and modulation.<br />

The dictionary of TF functions can either suitably be modified or<br />

selected based on the application at hand. In our technique, we<br />

are using the Gabor dictionary (Gaussian functions) which has the<br />

best TF localization properties. At each iteration, the TF functions<br />

best correlated to the local signal structures are selected from<br />

0-7803-8874-7/05/$20.00 ©2005 IEEE V - 477<br />


ICASSP 2005<br />




the dictionary. The remaining signal, called the residue, is further<br />

decomposed in the same way at each iteration, subdividing it<br />

into TF functions.<br />
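As a toy illustration of this greedy selection-and-subtraction loop, the following Python sketch runs a few matching pursuit iterations over a small dictionary of real Gabor-like atoms. All atom parameters, the dictionary size and the test signal are chosen here for illustration only; the dictionary of Eq. (1) uses complex modulation and a much denser parameter grid.

```python
import math

def gabor_atom(length, scale, position, freq):
    """Real Gabor-like atom: Gaussian window modulating a cosine, unit norm."""
    atom = [math.exp(-math.pi * ((t - position) / scale) ** 2)
            * math.cos(2 * math.pi * freq * t) for t in range(length)]
    norm = math.sqrt(sum(v * v for v in atom)) or 1.0
    return [v / norm for v in atom]

def matching_pursuit(signal, dictionary, iterations):
    """Greedily pick the best-correlated atom and subtract its projection."""
    residue = list(signal)
    decomposition = []  # (expansion coefficient, atom index) per iteration
    for _ in range(iterations):
        best_coef, best_idx = 0.0, 0
        for idx, atom in enumerate(dictionary):
            coef = sum(r * a for r, a in zip(residue, atom))
            if abs(coef) > abs(best_coef):
                best_coef, best_idx = coef, idx
        decomposition.append((best_coef, best_idx))
        atom = dictionary[best_idx]
        residue = [r - best_coef * a for r, a in zip(residue, atom)]
    return decomposition, residue

# Tiny hypothetical dictionary: a few scales, positions and frequencies.
N = 64
dictionary = [gabor_atom(N, s, p, f)
              for s in (4, 16, 64)
              for p in (16, 32, 48)
              for f in (0.1, 0.25, 0.4)]
signal = [math.cos(2 * math.pi * 0.25 * t) for t in range(N)]
decomp, residue = matching_pursuit(signal, dictionary, 5)
```

Because each selected atom has unit norm, every iteration removes exactly the projected energy, so the residue energy is non-increasing, which is the property the energy capture pattern below relies on.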

2.2. Multiple TFD slices<br />

As explained in Section 1, in the initial iterations, the ATFT algorithm<br />

captures the coherent signal structures which have correlated<br />

TF dictionary elements; then, as the number of iterations grows,<br />

it tries to model the non-coherent structures by breaking them<br />

into finer and finer pieces until the information is diluted across the whole dictionary.<br />

The energy capture pattern can be extracted from the normalized<br />

decomposition parameter an. In order to explain how this energy<br />

capture pattern can be utilized to extract overlapping signal structures,<br />

let us take an example of a synthetic signal y(t) which is<br />

composed of a sinusoid, two chirps and random noise. The signal<br />

y(t) is given by:<br />

y(t) = w1 s(t) + w2 c1(t) + w3 c2(t) + w4 r(t) (2)<br />

where s(t) represents the sinusoid at approximately Fs/4, c1(t) is a<br />

linear chirp with increasing frequency cutting the sinusoid, c2(t) is<br />

another linear chirp with decreasing frequency again cutting both<br />

the sinusoid and c1(t). r(t) represents the random noise. The<br />

weight factors w1, w2, w3 and w4 are 1, 0.1, 0.01 and 0.001 respectively. We performed<br />

the ATFT decomposition (1000 iterations) of y(t) using a<br />

Gabor dictionary. Figures 3(a) and 3(b) show y(t) in time domain<br />

and TF domain (a spectrogram is used in order to show all three<br />

components at the same time). Here we deliberately introduced energy<br />

differences between the components so as to demonstrate the<br />

significance of the energy capture pattern. Most of the time, the first<br />

few iterations capture significant amount of signal energy (coherent<br />

structures). Thereafter with the increase in the number of iterations<br />

we move from modeling coherent structures to non-coherent<br />

structures. The energy capture pattern of the ATFT decomposition<br />

for y(t) is shown in Fig. 2 (the first 50 iterations). The curve<br />

represents the normalized energy captured per iteration. We can<br />

see the energy captured per iteration drops as we move along the<br />

iterations. In this work, as an example, we split the energy capture<br />

pattern into four parts: (E1) the number of iterations over which<br />

the energy captured per iteration drops to 10% of its initial value<br />

(initial value= 1), (E2) the number of iterations between 10% of<br />

Fig. 1. Time-width vs frequency band mapping: the full mapping ME5 (frequency bands F1–F4 vs time-widths s1 . . . sn) is split into four mappings ME1–ME4 such that ME5 = ME1 + ME2 + ME3 + ME4.<br />

initial value and 1% of initial value, (E3) the number of iterations<br />

between 1% of initial value and 0.1% of initial value, and (E4) the<br />

number of iterations from 0.1% of the initial value to the end of<br />

decomposition.<br />
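This four-way split can be sketched in Python as follows. The thresholds follow the 10%, 1% and 0.1% levels described above, while the geometrically decaying curve is only a stand-in for a real ATFT energy capture pattern.

```python
def split_energy_pattern(energy_per_iter, thresholds=(0.10, 0.01, 0.001)):
    """Partition iteration indices into stages E1..E4 by the fraction of the
    initial (normalized) captured energy remaining at each iteration."""
    initial = energy_per_iter[0]
    stages = [[] for _ in range(len(thresholds) + 1)]
    for n, e in enumerate(energy_per_iter):
        frac = e / initial
        # Count how many thresholds the fraction has fallen below: 0 -> E1 ... 3 -> E4.
        stage = sum(1 for th in thresholds if frac < th)
        stages[stage].append(n)
    return stages  # [E1, E2, E3, E4] as lists of iteration indices

# Stand-in for the normalized energy capture curve: geometric decay over 50 iterations.
curve = [0.7 ** n for n in range(50)]
E1, E2, E3, E4 = split_energy_pattern(curve)
```

Each stage's index list is then used to gather the TF functions selected at those iterations, yielding the four TFD slices.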

Fig. 2. Energy capture pattern of the sample signal y(t): normalized energy captured per iteration (log scale) over the first 50 iterations, partitioned into E1, E2, E3 and E4 at the 0.1, 0.01 and 0.001 levels.<br />

Following the energy capture pattern we accumulate the TF<br />

functions into the above explained four parts (E1-4). For this example,<br />

we had 5 TF functions for E1, 9 TF functions for E2, 16 TF<br />

functions for E3 and 970 TF functions for E4. The number of TF<br />

functions gives an idea that almost 99% of the signal energy<br />

needs only 30 (E1 to E3) TF functions, whereas the remaining 1% of the<br />

signal energy (mostly noise-like structures) needs 970 TF functions<br />

or more. Using these 4 sets of TF functions we construct 4<br />

different TFDs, i.e., splitting the original TFD of y(t) into 4 TFDs<br />

based on the energy capture pattern. The corresponding 4 TFDs<br />

are shown in Figs. 3(c), 3(d), 3(e) and 3(f). If we closely look at<br />

the TFDs, we can see that the TFD in Fig. 3(c) shows the sinusoid<br />

s(t) alone, the TFD in Fig. 3(d) shows the disappearing sinusoid,<br />

the TFD in Fig. 3(e) shows the evolving chirp c1(t) signal from the<br />

sinusoid background, and the TFD in Fig. 3(f) shows a stronger<br />

but noisy chirp c1(t), a faint evolving chirp c2(t) and the random<br />

noise. It is obvious to see that TFDs 3(c) to 3(f) are better individual<br />

representations of the signal components than the combined<br />

TFD 3(b). In this example if it so happens that one of the components<br />

that was masked by the overlapping strong component is<br />



Fig. 3. (a) Sample signal y(t) (amplitude vs time samples), (b) TFD of the sample signal (normalized frequency vs time samples), (c) TFD of the sample signal with TF functions of E1, (d) TFD with TF functions of E2, (e) TFD with TF functions of E3, and (f) TFD of the residue signal.<br />

the discriminator that we are looking for, then the proposed technique<br />

of generating multiple TF mapping using the energy capture<br />

pattern will be of immense help. Here it should be noted that the<br />

energy split shown in this example is not the best to show all the<br />

components individually and separately. This is just to give an<br />

idea about the possibility of using the energy capture pattern for<br />

removing overlapping structures in complex situations. Also this<br />

approach may not work in all situations unless there are hidden<br />

signal structures either with (a) different energy contributions or<br />

(b) different contributions from coherent and non-coherent structures<br />

or both (a) and (b). Extending this same concept of multiple<br />

TF mappings, we now apply it to a novel time-width vs frequency<br />

band mapping, as explained in Section 2.3.<br />

2.3. Multiple sn vs fn mappings<br />

In order to analyze them effectively for classification applications, the<br />

ATFT signal decomposition parameters need to be rearranged in a<br />

pseudo dictionary format. There are five parameters as explained<br />

in Section 2.1 viz. an, sn, fn, pn and φn that represent the index<br />

of each dictionary element. After a signal is decomposed<br />

into TF functions, we group the TF functions with the time-width<br />

parameter sn on the X axis and the fn parameter on the Y axis.<br />

To reduce the computational complexity, instead of using<br />

all possible values of the fn parameter we break the frequency<br />

range into M bands only, whereas sn takes all possible values<br />

(2^1 to 2^14) depending on the length of the signal. Each combination of<br />


sn with one of the M frequency bands forms a cell which contains<br />

the cumulative normalized energy of all the TF functions falling<br />

in that particular combination of sn and frequency band. The left<br />

side of the Fig. 1 shows a sample time-width vs frequency band<br />

mapping. In the proposed work we used 4 frequency bands, which<br />

means we transform the decomposition parameters of a signal into<br />

a 14 time-width (sn) vs 4 frequency band mapping.<br />
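A minimal Python sketch of accumulating the cells, assuming each TF function has been reduced to a (normalized energy, scale index, center frequency) triple; the atom list and the uniform band edges used here are hypothetical.

```python
def build_mapping(atoms, num_bands=4, num_scales=14):
    """Accumulate normalized atom energies into (frequency band, scale) cells.
    Each atom is (energy, scale_index, freq) with scale_index in 1..14
    (for time-widths 2^1..2^14) and normalized freq in [0, 0.5)."""
    cells = [[0.0] * num_scales for _ in range(num_bands)]
    band_width = 0.5 / num_bands  # uniform bands over the normalized range
    for energy, scale_idx, freq in atoms:
        band = min(int(freq / band_width), num_bands - 1)
        cells[band][scale_idx - 1] += energy
    return cells  # cells[band][scale] = cumulative normalized energy

# Hypothetical decomposition parameters: (energy, scale index, frequency).
atoms = [(0.6, 8, 0.26), (0.25, 4, 0.05), (0.1, 12, 0.26), (0.05, 2, 0.45)]
mapping = build_mapping(atoms)
```

Slicing the atom list by the E1–E4 stages before calling this function yields the multiple mappings ME1–ME4; summing the unsliced list yields ME5.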

From this time-width vs frequency band mapping we can readily<br />

obtain the energy distribution of the signal in terms of the time-width<br />

and frequency band (center frequency) decomposition parameters.<br />

Depending upon the application, one can choose, say,<br />

K cells that cover an area corresponding to a certain<br />

amount of signal energy. This area will provide the sn and fn<br />

ranges which are significant for that particular application. The<br />

area can be arrived at by averaging the time-width vs frequency band<br />

mappings of N sample signals. For classification applications this can be<br />

done using LDB, as demonstrated in the authors' previous work [2].<br />

Considering the benefits of multiple TFD slices in signal analysis<br />

as explained in Section 2.2, instead of using one time-width vs frequency<br />

band mapping that covers all the signal energy, we slice it<br />

into L time-width vs frequency band mappings as shown in Fig. 1<br />

(L = 4). These L sliced time-width vs frequency band mappings<br />

are expected to separate out the overlapping energy distribution of<br />

the TF functions based on the energy capture pattern and thereby<br />

enhance the discriminatory power of the cells. We performed classification<br />

on two biomedical signal databases to verify the effectiveness<br />

of the proposed technique of splitting the time-width vs<br />

frequency band mapping.<br />

2.4. Databases<br />

(1) Vibroarthrographic (VAG) signals: These are the vibration signals<br />

emitted from the human knee joints during an active movement<br />

of the leg, and can be used to detect early joint degeneration<br />

or defects. Extensive work [3] has been done using a time-frequency<br />

approach in analyzing these signals. A few important<br />

characteristics of the VAG signals which make them difficult to<br />

analyze are as follows: (i) Highly non-stationary, (ii) Varying frequency<br />

dynamics, and (iii) multi-component nature. The database<br />

consists of 36 signals with 19 normal and 17 abnormal signals.<br />

(2) Pathological speech signals: These are speech signals recorded<br />

from the pathological and normal talkers in a sound-attenuating<br />

booth at the Massachusetts Eye and Ear Infirmary. All signals<br />

were sampled at 25 kHz. The signals were the first sentence of<br />

the rainbow passage, 'when the sunlight strikes raindrops in the<br />

air, they act like a prism and form a rainbow’. More details on the<br />

classification of this database can be found in the authors' previous<br />

work [4]. The database used in this study consists of 30 signals<br />

with 15 normal and 15 pathological signals.<br />

2.5. Feature Extraction and Pattern Classification<br />

Signals from both databases were decomposed using the ATFT<br />

algorithm (5000 iterations) as explained in Section 2.1. For each<br />

signal, 4 time-width vs frequency band mappings were created using<br />

the decomposition parameters. The energy split used for generating<br />

these 4 mappings were same as the one used in the example<br />

of synthetic signals (E1-4). In these 4 mappings, each row<br />

of the mapping represents the signal energy distribution over all<br />

time-widths for a particular band of frequencies. Let us name the<br />

mappings as ME1, ME2, ME3 and ME4 and the frequency bands<br />

Authorized licensed use limited to: <strong>Ryerson</strong> <strong>University</strong> Library. Downloaded on July 7, 2009 at 11:11 from IEEE Xplore. Restrictions apply.


as F1, F2, F3 and F4 as shown in Fig. 1. Now for each combination<br />

of MEx and Fx we extract P × 14 energy values from<br />

the cells as a feature matrix, where P is the number of signals in<br />

the database. From the 16 combinations of MEx and Fx, only<br />

non-zero feature matrices are used for classification. In order to<br />

compare the results with the original non-split time-width vs frequency<br />

mapping (let it be ME5), another set of 4 feature matrices<br />

was generated using the same procedure. When tested with the<br />

Ho-Kashyap [5] algorithm, most of these 20 combinations (MEx<br />

and Fx) for both the databases favored non-linear separability to<br />

achieve maximum classification accuracies. However, as the main<br />

focus of the proposed technique is to demonstrate the relative improvement<br />

in discrimination between the split and non-split time-width<br />

vs frequency mappings, we restrict our analysis to a linear<br />

classifier. The extracted features were fed to a Linear Discriminant<br />

Analysis (LDA) based classifier using SPSS [6]. The classification<br />

accuracy was validated using the leave-one-out method, which is<br />

known to provide a least-biased estimate.<br />
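The leave-one-out protocol can be sketched as below. The paper uses an SPSS-based LDA classifier; this illustrative Python stand-in uses a simple nearest-class-mean linear rule instead, and the 2-D feature vectors are made up for the example.

```python
def nearest_mean_classify(train, labels, sample):
    """Assign `sample` to the class whose training mean is closest
    (a simple linear stand-in for the LDA classifier used in the paper)."""
    best_cls, best_dist = None, float("inf")
    for cls in sorted(set(labels)):
        members = [x for x, y in zip(train, labels) if y == cls]
        mean = [sum(col) / len(members) for col in zip(*members)]
        dist = sum((a - b) ** 2 for a, b in zip(sample, mean))
        if dist < best_dist:
            best_cls, best_dist = cls, dist
    return best_cls

def leave_one_out_accuracy(features, labels):
    """Hold out each signal once, train on the rest, and count correct calls."""
    correct = 0
    for i in range(len(features)):
        train = features[:i] + features[i + 1:]
        train_labels = labels[:i] + labels[i + 1:]
        if nearest_mean_classify(train, train_labels, features[i]) == labels[i]:
            correct += 1
    return correct / len(features)

# Hypothetical 2-D feature vectors for two well-separated classes.
features = [[0.1, 0.2], [0.2, 0.1], [0.15, 0.25],
            [0.9, 0.8], [0.8, 0.9], [0.85, 0.75]]
labels = ["normal"] * 3 + ["abnormal"] * 3
accuracy = leave_one_out_accuracy(features, labels)
```

Because every signal is classified by a model that never saw it, the resulting accuracy is a nearly unbiased estimate even for the small databases used here.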

3. RESULTS AND DISCUSSION<br />

A two-stage classification was performed for the VAG database.<br />

In the first stage, we performed a two-group classification, classifying<br />

the VAG signals into normal and abnormal. Table 1 shows<br />

the highest classification accuracy achieved out of the 20 combinations<br />

of MEx and Fx. We observed an overall classification accuracy<br />

of 88.9% using leave-one-out (Cross. V) based LDA for the<br />

combination of ME4 and F3. This is higher than the classification<br />

accuracies reported by existing works for this database. There is<br />

no difference in the classification accuracy compared with the<br />

combination of non-split ME5F3. This is because F3 is non-zero<br />

only for ME4, which means there is no overlap in F3. So effectively,<br />

ME4F3 and ME5F3 were the same. The results also gave a<br />

clue that the discriminatory information between normal and abnormal<br />

lies in F3.<br />

Table 1. Two-group classification accuracy achieved for the VAG database. Cross.V = leave-one-out method LDA; % = percentage of classification.<br />

Method    Groups     Normal  Abnormal  Total<br />
Cross.V   Normal       15        4       19<br />
          Abnormal      0       17       17<br />
%         Normal      78.9     21.1     100<br />
          Abnormal      0      100      100<br />

We now performed the second stage of classification on the 17<br />

abnormal VAG signals. The abnormal VAG signals in the database<br />

are from different kinds of knee pathologies. Chondromalacia patella<br />

(CMP) [3] is one of the pathologies which has four categories (I,<br />

II, III and IV) of grading based on the severity. It is a difficult<br />

task to classify them by their gradings using the VAG signals. Out<br />

of the 17 abnormal signals, 10 were CMP signals. We performed<br />

a three-group classification on these 10 signals, viz. grade (I and<br />

II), grade (II and III) and grade (III and IV). We observed a perfect<br />

classification of 100% using leave-one-out based LDA for the<br />

combination of ME2 and F1. None of the other combinations, including<br />

the non-split ME5F1 could achieve 100% classification.<br />

This result shows that splitting the time-width vs frequency<br />

band mappings does enhance the discriminatory power, and<br />

also indicates that the discriminatory features for CMP lie in the ME2<br />

and F1 mapping.<br />


Similarly, we performed a 2-group classification (normal and<br />

pathological) for the pathological speech database. Table 2 shows<br />

the highest classification accuracy achieved out of the 20 combinations<br />

of MEx and Fx. An overall classification accuracy of<br />

93.3% was achieved using the leave-one-out based LDA. The reported<br />

classification is for the combination of ME1F1 and non-split<br />

ME5F1, in which case the classification accuracy<br />

remains the same with or without splitting the time-width vs<br />

frequency mapping. However, the results give a clue that the discriminatory<br />

information lies in ME1 and F1.<br />

Table 2. Two-group classification accuracy achieved for the pathological speech database. Cross.V = leave-one-out method LDA; % = percentage of classification.<br />

Method    Groups        Normal  Pathological  Total<br />
Cross.V   Normal          13         2          15<br />
          Pathological     0        15          15<br />
%         Normal         86.7      13.3        100<br />
          Pathological     0       100         100<br />

4. CONCLUSIONS<br />

Enhancing the discriminatory power of the TF representations using<br />

a TFD splitting approach was proposed. The technique was explained<br />

using a synthetic signal and two real world signal databases.<br />

Using the technique on the VAG database showed a significant<br />

improvement in the sub classification of abnormal signals. Although<br />

the results are inconclusive for the real world databases,<br />

this approach may be better suited to identifying finer discriminatory<br />

features within global classifications. Adaptively choosing the energy<br />

split might improve the significance of the proposed technique.<br />

Future work involves arriving at a suitable energy split<br />

ratio based on the nature of the signal, increasing the number of frequency<br />

bands, and extracting visual features by treating the time-width vs<br />

frequency mapping as an image.<br />

5. REFERENCES<br />

[1] S. G. Mallat and Z. Zhang, “Matching pursuits with time-frequency<br />

dictionaries,” IEEE Trans. Signal Processing, vol.<br />

41, no. 12, pp. 3397–3415, 1993.<br />

[2] K. Umapathy, S. Krishnan, and A. Das, “Sub-dictionary selection<br />

using local discriminant bases algorithm for signal classification,”<br />

in Proceedings of the IEEE Canadian Conference on<br />

Electrical and Computer Engineering, Niagara Falls, Canada,<br />

May 2004, pp. 2001–2004.<br />

[3] S. Krishnan, “Adaptive signal processing techniques for analysis<br />

of knee joint vibroarthrographic signals,” Ph.D. dissertation,<br />

University of Calgary, June 1999.<br />

[4] K. Umapathy, S. Krishnan, V. Parsa, and D. Jamieson, “Time-frequency<br />

modeling and classification of pathological voices,”<br />

in Proceedings of IEEE Engineering in Medicine and Biology<br />

Society (EMBS) 2002 Conference, Houston, Texas, USA, Oct<br />

2002, pp. 116–117.<br />

[5] M. H. Hassoun and J. Song, “Adaptive Ho-Kashyap rules for<br />

perceptron training,” IEEE Trans. on Neural Networks, vol. 3,<br />

no. 1, pp. 51–61, 1992.<br />

[6] SPSS Inc., “SPSS Advanced statistics user’s guide,” in User<br />

manual, SPSS Inc., Chicago, IL, 1990.<br />



INDEXING OF NFL VIDEO USING MPEG-7 DESCRIPTORS AND MFCC FEATURES<br />

Syed G. Quadri, Sridhar Krishnan and Ling Guan<br />

Dept. of Electrical and Computer Engineering, Ryerson University<br />

Toronto, Canada, M5B 2K3<br />

{squadri,krishnan,lguan}@ee.ryerson.ca<br />

ABSTRACT<br />

In this paper, we propose an application system to classify<br />

American football (NFL) video shots into 4 categories,<br />

namely: Pass plays, Run plays, Field Goal/Extra Point plays<br />

(FG/XP) and Kickoff/Punt plays (K/P). The proposed system<br />

consists of two stages. The first stage is responsible for<br />

play event localization and the latter stage is responsible for<br />

feature mapping and classification. For play event localization<br />

we have proposed an algorithm that uses MPEG-7 motion<br />

activity descriptor and mean of the magnitudes of motion<br />

vectors, in a collaborative manner to detect the starting<br />

point of a play event within a video shot with 83% accuracy.<br />

The indexing and classification stage uses MPEG-7 motion<br />

and audio descriptors along with Mel Frequency Cepstrum<br />

Coefficients (MFCC) features to classify the events into 4<br />

categories using Fisher’s LDA. We obtain an indexing accuracy<br />

of 92.5% using the leave-one-out classification technique<br />

on a database of 200 video shots taken from 4 different<br />

games obtained from 4 different networks.<br />

1. INTRODUCTION<br />

The concept of On-Demand entertainment and programming<br />

is fast becoming a reality with the popularity of digital TV<br />

channels. Nearly every professional sports league and team<br />

in North America has a digital channel boasting On-Demand<br />

programming and statistics. But the reality is that it takes<br />

nearly three to four hours in post-production work to prepare<br />

the highlights for a game. For example, on NFL Sunday<br />

Ticket you get Highlights-On-Demand on Monday morning<br />

for the games played on Sunday. In order to minimize<br />

this delay, we need a system that can analyze the contents<br />

of the broadcast and derive the semantics from the input.<br />

These semantics can be made available to the users<br />

for querying in order to create a true On-Demand experience.<br />

Recently a lot of research has been conducted on automating<br />

the process of indexing and annotating the sports<br />

video streams. Nearly all the major sports have been used<br />

to test the indexing and retrieval systems. One of the major<br />

projects working in generating semantic sports video an-<br />

notations is the ASSAVID project. As detailed in [1], this<br />

project focuses on developing a system that can categorize<br />

different types of sports and provide users with an interface<br />

to query events in a particular sport.<br />

In [2], Miyauchi et al. used audio, textual and visual<br />

information to classify NFL video into events like touchdowns<br />

and field goals. In [3], Lazarescu et al. classified<br />

different types of formations within NFL games using the<br />

natural language commentary from the game, the geometrical<br />

information about the play and the domain knowledge.<br />

In [4], Nitta et al. used closed-caption text and audio-visual<br />

information to classify plays into 3 categories namely:<br />

scrimmage, FG/XP and K/P.<br />

All of the works mentioned above rely on domain knowledge<br />

to classify different high level concepts within American<br />

football. On the other hand, we propose a system that<br />

classifies recurring events of the game without using any domain<br />

knowledge. These recurring events are the most basic<br />

components of the game. By classifying these basic components<br />

first we can look for higher concepts contained within<br />

each of the basic events and thus generate a hierarchical<br />

graph of concepts which varies from low level to high level.<br />

In this work we focus on utilizing existing standard descriptors<br />

of MPEG-7 as the basic feature set. In [5], the authors<br />

have proposed applications for generating summary highlights<br />

in sports domain using MPEG-7 motion descriptor,<br />

but to our knowledge no one has used MPEG-7 audio and<br />

motion descriptors to index recurring events in the American<br />

football domain.<br />

Section 2 details the algorithm proposed for localization<br />

of play events within NFL video shots, along with<br />

an analysis of the performance of the algorithm. Section<br />

3 provides details on the feature set used for indexing and<br />

the classification scheme utilized. Section 4 presents the<br />

results of the classification scheme, and Section 5 provides<br />

concluding remarks and future directions.<br />

2. LOCALIZATION OF PLAY EVENT<br />

Sports have a very well-defined structure. They have a set<br />

of rules that must be followed in order for the game to be<br />



Figure 1. Motion vector magnitudes for various plays<br />

played properly. Many sports such as golf, baseball, bowling<br />

and American football have a requirement that the team<br />

or players must be in a distinctive position before each play.<br />

In golf, the player positions himself by the ball in order to<br />

hit it in a certain direction. Likewise in American football<br />

the two teams first line up face to face before the ball is<br />

snapped to begin the play. The common theme among all<br />

these sports is that before the play starts, the level of motion<br />

activity in the video is lower compared to when the play<br />

has started. This distinction in the motion activity is utilized<br />

in the proposed algorithm to segment play events from<br />

non-play events. Figure 1 shows the magnitude of motion<br />

vectors in different types of NFL plays.<br />

2.1. Proposed Play Event Detection Algorithm<br />

The primary objective of the algorithm is to detect the key<br />

frame that can be used as the starting point of the play event<br />

in the shot. The end point of the play event is not extracted,<br />

as in most American football video shots containing play<br />

events, the shot usually terminates at the end of the play.<br />

In order to extract the intensity of motion descriptor,<br />

MPEG-1 video motion vectors are used. Only the motion<br />

vectors from the P frames are analyzed in order to speed<br />

up the processing time. In MPEG-7 the motion activity descriptor<br />

represents the standard deviation of motion vector<br />

magnitudes within a frame. The intensity of motion activity<br />

descriptor along with the mean of the motion vector magnitudes<br />

is used collaboratively in the algorithm to detect the<br />

starting point of the play event. An analysis of 20 video<br />

shots selected from each category was conducted to estimate<br />

the thresholds for the mean and standard deviation of<br />

motion vectors. The following steps detail the algorithm:<br />

Step 1: Find a P frame with a mean value of 4 or higher.<br />
Step 2: Determine the gradient of the mean values within a<br />
window (3 or 4 adjacent frames).<br />
Step 3: If the gradients are all positive, mark the frame as a possible<br />
starting point; else go back to Step 1.<br />
Step 4: If the intensity of motion descriptor has a value of 2<br />
or higher, return the frame number as the starting point.<br />
Step 5: Otherwise, determine the gradient of the standard<br />
deviation values within the same window (3 or 4 adjacent frames).<br />
Step 6: If the gradients are all positive, return the frame number<br />
as the starting point; else go back to Step 1.<br />
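The six steps above can be sketched as follows. The per-P-frame means, standard deviations, and quantized intensity-of-motion values are assumed to be precomputed; the function and variable names are illustrative, not from the paper.

```python
def detect_play_start(means, stds, intensity, window=3):
    """Return the index of the first P frame satisfying Steps 1-6, or None.

    means/stds: per-P-frame motion-vector mean and standard deviation;
    intensity: quantized MPEG-7 intensity-of-motion value per P frame.
    """
    n = len(means)
    for i in range(n):
        # Step 1: find a P frame with a mean value of 4 or higher
        if means[i] < 4 or i + window >= n:
            continue
        # Steps 2-3: gradients of the mean within the window must all be positive
        if not all(means[j + 1] > means[j] for j in range(i, i + window)):
            continue
        # Step 4: accept if the intensity of motion descriptor is 2 or higher
        if intensity[i] >= 2:
            return i
        # Steps 5-6: otherwise require rising standard deviation in the window
        if all(stds[j + 1] > stds[j] for j in range(i, i + window)):
            return i
    return None
```
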

2.2. Play Event Detection Algorithm Evaluation<br />

The above algorithm was tested on the American football<br />

video shot database which consists of 200 video shots taken<br />

from 4 different games and 4 different networks. In order<br />

to measure the performance of the algorithm, we have to<br />

establish the ground truth about the starting point of the play<br />

event within each video shot. This was accomplished by<br />

having an observer manually index the frame number which<br />

best represented the start point of the play event.<br />

Results were compared by computing the delta between<br />
the ground-truth frame number and the frame number<br />
estimated by the algorithm. This delta then needed to be<br />
interpreted in the actual time domain; that is, we need to<br />
evaluate whether the algorithm estimates a starting point too<br />
early or only after a certain amount of delay.<br />

Since MPEG-1 video has a frame rate of 30 frames/sec,<br />

building a histogram whose bin size was 30 frames would<br />

give a general idea of how far apart the estimated frame numbers<br />
were from the ground truth in the actual time domain. Figure<br />
2 shows the histogram of the number of shots within<br />
each time unit. Negative time units represent early detection<br />
and positive time units represent delayed detection.<br />

From Figure 2 we can see that the algorithm detects the<br />

starting points of the play with 83% accuracy. That is, 166 of<br />
the 200 video shots in the database had their starting points<br />
detected within ±1 second of the original starting point.<br />

The accuracy of the algorithm can be increased to 86.5% by<br />

increasing the window size from 3 frames to 4 frames. But<br />

this change in window size has its own side effect. By increasing<br />

the window size we are looking for motion activity<br />

being sustained for a longer period of time, which means<br />

we will get more shots with delayed detection.<br />
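The evaluation just described can be sketched as a minimal illustration: shot deltas are binned into one-second (30-frame) units, with negative units meaning early detection and positive units delayed detection. The function name is a stand-in, not from the paper.

```python
from collections import Counter

def evaluate_detection(ground_truth, estimated, fps=30):
    """Bin start-frame deltas into whole-second units and report the
    fraction of shots detected within +/- one second."""
    deltas = [e - g for g, e in zip(ground_truth, estimated)]
    histogram = Counter(round(d / fps) for d in deltas)   # time-unit bins
    accuracy = sum(abs(d) <= fps for d in deltas) / len(deltas)
    return histogram, accuracy
```
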

3. INDEXING AND CLASSIFICATION<br />

One of the biggest application areas for MPEG-7 is multimedia<br />

indexing and retrieval. Since the introduction of the<br />
MPEG-7 standard, there has been significant research effort<br />
put into developing applications based on descriptors from<br />
MPEG-7, but to date there have been only a few applications<br />
that utilize MPEG-7 descriptors for sports video indexing<br />
and retrieval. The application we are proposing is a first<br />



Figure 2. Performance of the algorithm in actual time<br />

in the American football domain, which utilizes MPEG-7<br />

motion and audio descriptors along with MFCC features.<br />

In the American football domain, visual or motion features<br />
play a significantly dominant role in discriminating between<br />
different types of plays, as evident from Figure 1. Therefore<br />
first we evaluate the efficacy of using motion descriptors<br />

for an American football video indexing system and then<br />

we evaluate the changes in system performance by adding<br />

audio descriptors and MFCC features.<br />

3.1. MPEG-7 motion features<br />

According to the MPEG-7 description [6], the standard deviation<br />
of the magnitude of motion vectors forms the intensity<br />
of motion descriptor. The descriptor takes on values<br />
of 1 through 5, a low value meaning low intensity of<br />
motion. Experiments using 5 levels showed that<br />
most of the motion descriptors were quantized into 2 or 3<br />
levels. Thus, to provide better motion activity resolution, the<br />
descriptor was quantized into 12 levels. Similarly, according<br />
to the MPEG-7 description, the dominant direction descriptor is<br />

calculated by quantizing the angles of the motion vectors<br />

into 8 levels. In this work the same 8 quantization levels<br />

were used to define the dominant direction descriptor.<br />

A 2D feature map was created by combining the two<br />

motion activity descriptors. The motivation behind this was<br />

to create a feature set that can model both the intensity of<br />

motion and the direction of motion, thus discriminating between<br />

high intensity motion in upward direction versus high<br />

intensity motion in lateral direction. As can be seen from<br />

Figure 3, the feature map provides a unique representation<br />

of only 12 × 8 dimensions for both the intensity and direction<br />

of motion. In the feature map, blue colour corresponds<br />

to low values and red colour corresponds to high values.<br />
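The 12 × 8 feature map described above can be built as a simple 2-D histogram. The inputs are assumed to be already-quantized per-frame levels (0–11 for intensity, 0–7 for direction); the names here are illustrative.

```python
def motion_feature_map(intensity_levels, direction_levels,
                       n_intensity=12, n_direction=8):
    """Accumulate quantized (intensity, direction) pairs into a 12x8 map."""
    fmap = [[0] * n_direction for _ in range(n_intensity)]
    for i, d in zip(intensity_levels, direction_levels):
        fmap[i][d] += 1
    return fmap
```
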


Figure 3. Motion feature map<br />

3.2. MPEG-7 audio features<br />

The motivation for using audio descriptors is that<br />
most sports have a certain vocabulary associated<br />

with each event. Almost all the announcers will utilize some<br />

of the vocabulary to describe similar events. Therefore we<br />

wanted a compact representation of audio characteristics to<br />

describe the general tone and pitch of the announcer. The<br />

objective is to analyze the similarity in the spoken sound<br />

between similar events.<br />

We used 3 MPEG-7 basic spectral audio features, namely:<br />

Audio Spectrum Envelope (ASE), Audio Spectrum Centroid<br />

(ASC) and Audio Spectrum Flatness (ASF) to achieve<br />

our objective. The ASE descriptor represents the power<br />

spectrum of an audio signal and can be calculated by taking<br />

the Fourier transform (FFT) of the audio signal which<br />

is windowed using a Hamming window with an overlap of<br />

50% between adjacent windows.<br />

The ASC descriptor represents the center of gravity of<br />

the power spectrum. It is calculated by summing the frequency-weighted<br />
energy in each bin of the FFT spectrum and dividing it<br />
by the total energy in the frame, as shown below:<br />

ASC(l) = ( Σ_{k=0}^{K−1} k · |P(l,k)|² ) / ( Σ_{k=0}^{K−1} |P(l,k)|² ),   (1)<br />

where k is the frequency bin index. The descriptor shows<br />
which frequencies dominate the spectrum.<br />

The ASF descriptor represents the overall tonal component<br />

in the power spectrum of the audio signal. It is calculated<br />
by dividing the geometric mean of the power spectrum of the<br />
audio frame by its arithmetic mean, as<br />
shown by the equation:<br />

ASF(l) = ( Π_{k=0}^{K−1} |P(l,k)|² )^{1/N} / ( (1/N) Σ_{k=0}^{K−1} |P(l,k)|² ),   (2)<br />

where k is the frequency bin index and N is the size of the<br />
short-time Fourier transform window.<br />
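Equations (1) and (2) can be computed numerically as below. `power` stands for the per-bin power-spectrum values |P(l,k)|² of one frame; as a simplification, the flatness exponent here uses the number of bins in place of the window size N.

```python
import math

def audio_spectrum_centroid(power):
    """Equation (1): frequency-weighted power divided by total power."""
    total = sum(power)
    return sum(k * p for k, p in enumerate(power)) / total

def audio_spectrum_flatness(power):
    """Equation (2): geometric mean over arithmetic mean of the power
    spectrum; close to 1 for a flat (noise-like) spectrum."""
    n = len(power)
    geometric = math.exp(sum(math.log(p) for p in power) / n)
    arithmetic = sum(power) / n
    return geometric / arithmetic
```

A flat spectrum yields a flatness near 1, while a peaked (tonal) spectrum yields a value well below 1, which is what makes ASF useful as a tonality measure.
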



All the above descriptors were quantized into 10 levels,<br />

thus providing a feature set of 30 dimensions.<br />

3.3. MFCC features<br />

Most of the video shots contain a lot of<br />
crowd noise, and since we want to extract the perceived rhythm<br />
and sound of the spoken content, we needed a feature that<br />
can model human hearing and also works well under<br />
noisy conditions. MFCC has been used extensively in<br />
speech recognition systems as it emphasizes the frequencies<br />
that are more perceptible to the human ear.<br />

First the audio file is pre-processed in order to remove<br />

the silent segments. Then 13 MFCC coefficients are extracted<br />

for each segment. Each of the segments has 50%<br />
overlap, and thus there is a lot of redundancy between adjacent<br />

MFCC values. In order to reduce the dimension of the<br />

matrix, the MFCC values are passed to a feature reduction<br />

stage. The MFCC features are reduced to a 12 × 64 matrix.<br />
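The reduction step can be sketched as time-averaging the overlapping MFCC frames down to a fixed number of columns. The paper does not specify its reduction method, so the averaging below, and the function name, are assumptions made for illustration.

```python
def reduce_mfcc(mfcc_frames, out_frames=64):
    """Average a (T x n_coeff) list of MFCC vectors down to out_frames rows."""
    T = len(mfcc_frames)
    n_coeff = len(mfcc_frames[0])
    step = T / out_frames
    reduced = []
    for b in range(out_frames):
        start = int(b * step)
        stop = max(int((b + 1) * step), start + 1)   # at least one frame
        chunk = mfcc_frames[start:stop]
        reduced.append([sum(row[c] for row in chunk) / len(chunk)
                        for c in range(n_coeff)])
    return reduced
```
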

4. EXPERIMENTAL RESULTS<br />

In order to evaluate the efficacy of the feature set, we used<br />
a simple classification scheme, Fisher’s Linear Discriminant<br />
<strong>Analysis</strong> (LDA). In a specific sense, LDA also<br />
commonly refers to techniques in which a transformation<br />
is applied in order to maximize between-class separability and<br />
minimize within-class variability. LDA works on the feature<br />
set with no prior assumptions about the nature of the<br />
data set. It computes a weight vector w which, when<br />
multiplied by the input feature vector x, generates discriminant<br />
functions gi(x). For a C-class problem we define<br />
C discriminant functions g1(x)...gC(x). The feature vector<br />
x is assigned to the class whose discriminant function yields<br />
the largest value.<br />
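The decision rule just described amounts to evaluating C linear discriminant functions and taking the argmax. The weights below are illustrative stand-ins, not trained values.

```python
def classify(x, weights, biases):
    """Assign x to the class whose g_i(x) = w_i . x + b_i is largest."""
    scores = [sum(wi * xi for wi, xi in zip(w, x)) + b
              for w, b in zip(weights, biases)]
    return max(range(len(scores)), key=scores.__getitem__)
```
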

The test database consists of 200 video shots with durations<br />

varying from 5 seconds to about 25 seconds. In the<br />

database there are 88 pass plays, 67 run plays and 45 kicking<br />

plays. A total of 8 different teams were used to create<br />

the database from 4 different networks. This variety in the<br />

database ensured that the sample space of our work was diverse<br />

and included all major broadcasters.<br />

Table 1 shows the indexing results of using MPEG-7<br />

motion and audio descriptors along with MFCC features.<br />

5. CONCLUSIONS<br />

In this paper we have proposed a system with two main<br />

components. The first component finds the starting points<br />

of play events within a video shot. The second component<br />

is responsible for indexing and classification of events in<br />

the American football domain. Both the components of the<br />

system utilize MPEG-7 motion descriptors, while MPEG-7<br />


Play Events | MPEG-7 motion | MPEG-7 motion+audio | MPEG-7 motion+audio+MFCC<br />
Pass | 79.5% | 85.2% | 94.3%<br />
Run | 92.5% | 91.0% | 89.6%<br />
FG/XP | 87.5% | 87.5% | 93.8%<br />
K/P | 65.5% | 82.8% | 93.1%<br />
Overall | 82.5% | 87.0% | 92.5%<br />
Table 1. Classification Performance Summary<br />

audio and MFCC features are added to enhance the indexing<br />

capabilities of the system.<br />

Although there is no baseline to compare our results<br />
against, somewhat similar works reported in indexing and<br />
retrieval of American football events [2][3][4] have shown<br />
indexing accuracies of 84%, 81% and 84% respectively. In<br />
this work we obtained a classification accuracy of 82.5%<br />
using MPEG-7 motion features alone, while the above-<br />
mentioned works used multiple modalities. By using multiple<br />
modalities, our system is able to index the events into 4<br />
categories with 92.5% accuracy.<br />

6. REFERENCES<br />

[1] W.J. Christmas, B. Levienaise-Obadia, J. Kittler,<br />
K. Messer and D. Koubaroulis, “Generation of semantic<br />
cues for sports video annotation,” in Proc. of IEEE<br />
Intl. Conf. on Image Processing.<br />

[2] N. Babaguchi, S. Miyauchi, A. Hirano and T. Kitahashi,<br />
“Collaborative multimedia analysis for detecting semantical<br />
events from broadcasted sports video,” in<br />
Proc. of IEEE 16th Intl. Conf. on Pattern Recognition.<br />

[3] M. Lazarescu, S. Venkatesh, G. West and T. Caelli, “On<br />
the automated interpretation and indexing of American<br />
football,” in IEEE Intl. Conf. on Multimedia Computing<br />
and Systems.<br />

[4] N. Nitta, N. Babaguchi and T. Kitahashi, “Extracting<br />
actors, actions and events from sports video - a fundamental<br />
approach to story tracking,” in Proc. of IEEE<br />
Intl. Conf. on Pattern Recognition.<br />

[5] Z. Xiong, R. Radhakrishnan and A. Divakaran, “Generation<br />
of sports highlights using motion activity in combination<br />
with a common audio feature extraction framework,”<br />
in Proc. of IEEE Intl. Conf. on Image Processing.<br />

[6] B.S. Manjunath, P. Salembier and T. Sikora, Introduction<br />
to MPEG-7: Multimedia Content Description Interface,<br />
John Wiley and Sons, England, UK, 2002.<br />



2004 International Conference on <strong>Signal</strong> Processing &amp; Communications (SPCOM)<br />

AUDIO SIGNAL FEATURE EXTRACTION AND CLASSIFICATION USING<br />

LOCAL DISCRIMINANT BASES<br />

Karthikeyan Umapathy, Raveendra K. Rao<br />

Dept. of Electrical and Computer Engg.<br />

The <strong>University</strong> of Western Ontario<br />

London, ON, Canada N6A 5B9<br />

Email: kumapath@uwo.ca, rkrao@eng.uwo.ca<br />

ABSTRACT<br />

Automatic classification of audio signals is an interesting<br />
and challenging task. With the rapid growth of multimedia<br />
content over the Internet, intelligent content-based audio and<br />
video retrieval techniques are required to perform efficient<br />
searches over vast databases. Classification schemes form the<br />
basis of such content-based retrieval systems. In this paper<br />

we propose an audio classification scheme using Local Dis-<br />

criminant Bases (LDB) algorithm. The audio signals were<br />

decomposed using wavelet packets and the high discrimi-<br />

natory nodes were selected using the LDB algorithm. Two<br />

different dissimilarity measures were used to select the LDB<br />

nodes and to extract features from them. The features were<br />

fed to a Linear Discriminant <strong>Analysis</strong> based classifier for<br />

a six group (Rock, Classical, Country, Folk, Jazz and Pop)<br />

and a four group (Rock, Classical, Country and Folk) classification.<br />
Overall classification accuracies as high as 77%<br />
and 88% were achieved for the six and four group classifications<br />
respectively, using a database of 170 audio signals.<br />

1. INTRODUCTION<br />

Over the years many existing techniques [1-4] have addressed<br />
the problem of classification and content-based retrieval<br />

of audio signals. The general methodology of audio<br />

classification involves extracting discriminatory features<br />

from the audio data and feeding them to a pattern classifier.<br />

The features can be extracted either directly from the<br />

time domain or from a transformation domain depending<br />

upon the choice of signal analysis tool. Some of the audio<br />
features that have been successfully used for audio classification<br />
include mel-frequency cepstral coefficients (MFCC)<br />
[3], spectral similarity, timbral texture [3], band periodicity<br />
[2], zero crossing rate [2], entropy [5] and octaves [6].<br />

Some techniques generate a pattern from the features and use<br />
it for classification by the degree of correlation. Other<br />
techniques use the numerical values of the features with statistical<br />

classification packages.<br />

0-7803-8674-4/04/$20.00 ©2004 IEEE 457<br />

Sridhar Krishnan<br />

Dept. of Electrical and Computer Engg.<br />

<strong>Ryerson</strong> <strong>University</strong><br />

Toronto, ON, Canada M5B 2K3<br />

Email: krishnan @ee.ryerson.ca<br />

Audio signals are highly non-stationary in nature and<br />

the best way to analyze them is to use a joint time-frequency<br />

(TF) approach. The previous works [5,6] of the authors<br />

have demonstrated the success of the TF approach in audio classification.<br />
In [5], audio features such as entropy, centroid,<br />

centroid ratio, bandwidth, silence ratio, energy ratio, fre-<br />

quency location of minimum and maximum energy were<br />

extracted from the spectrogram of the audio signals. These<br />

features were fed to a Linear Discriminant <strong>Analysis</strong> (LDA)<br />

based classifier to perform a six group classification. An<br />

overall classification accuracy of 93% was reported with a<br />

database of 143 audio signals. In [6], the distribution values<br />

of the TF decomposition parameter 'octave' over 3 bands of<br />

frequencies were used as the audio features and a similar six<br />

group classification was performed with a database of 170<br />

audio signals. An overall classification accuracy of 97.6%<br />

was reported.<br />

In order to perform efficient TF analysis on the signals<br />

for classification purposes, it is essential to locate the sub-<br />

spaces on the TF plane that demonstrate high discrimination<br />

between different classes of the signals. Once the target sub-<br />

spaces are identified, it is easier to extract relevant features<br />

for classification. In the proposed work we use the Local Discriminant<br />
Bases (LDB) algorithm [7] with wavelet packet<br />

bases to identify these target subspaces in the TF plane to<br />

classify the audio signals. The optimal choice of LDBs de-<br />

pends on the nature of the dataset and the dissimilarity mea-<br />

sures used to distinguish between classes. A combination<br />

of multiple dissimilarity measures can be used to achieve<br />

high classification accuracies. Features were extracted from<br />

the basis vectors of the LDB nodes and fed to a LDA based<br />

classifier for a six (Rock, Classical, Country, Folk, Jazz and<br />

Pop) and four (Rock, Classical, Country and Folk) group<br />

classification. The paper is organized as follows: Section 2<br />

covers the methodology, comprising the LDB algorithm, LDB selection<br />

process, feature extraction and pattern classification. Re-<br />

sults and Discussions are covered in Section 3, and Conclu-<br />

sions in Section 4.<br />



2. METHODOLOGY<br />

2.1. Local Discriminant Bases Algorithm<br />

In the LDB [7] algorithm with wavelet packet bases, a set<br />
of training signals x_i^p for all P classes is decomposed into<br />
a full tree structure of order N. The indexes p and i represent<br />
the pth signal class and the ith signal in the training set of the<br />
pth class. We restrict our analysis to binary wavelet packet<br />
trees [8]. Let Ω_{0,0} be a vector space in R^{2^N} corresponding to<br />
the root node of the parent tree. Then at each level the vector<br />
space is split into two mutually orthogonal subspaces given<br />
by Ω_{j,k} = Ω_{j+1,2k} ⊕ Ω_{j+1,2k+1}, where j indicates the level<br />
of the tree and k represents the node index in level j, given<br />
by k = 0, 1, ..., 2^j − 1. This process repeats till level J,<br />
giving rise to 2^J mutually orthogonal subspaces. Each node<br />
(j, k) contains a set of basis vectors B_{j,k},<br />
where 2^N corresponds to the length of the signal. Then the<br />
signal x_i^p can be represented by a set of coefficients as:<br />
x_i^p = Σ_{k=0}^{2^J−1} Σ_l a_{J,k,l} B_{J,k,l}.<br />
Basically the signal x_i^p is decomposed into 2^J subspaces<br />
with a_{J,k,l} coefficients in each subspace. With the training<br />
signals decomposed into wavelet packet coefficients, we<br />
need to define a dissimilarity measure D in the vector<br />

space so as to identify those subspaces, which have larger<br />

statistical distance between classes. This dissimilarity measure<br />
is used in an iterative manner to prune the tree in such<br />
a way that a node is split only if the cumulative discriminative<br />
measure of the children nodes is greater than that of the parent<br />
node. The resulting tree contains the most significant<br />
LDBs, from which a set of K significant LDBs is selected<br />

to construct the final tree. The testing set signals are then<br />

expanded using this tree and features are extracted from the<br />

respective basis vectors for classification.<br />
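The binary subspace splitting above can be sketched directly: node (j, k) splits into children (j+1, 2k) and (j+1, 2k+1), so a full tree of depth J ends in 2^J mutually orthogonal subspaces. The helper names are illustrative.

```python
def children(j, k):
    """Children of wavelet-packet node (j, k) in a binary tree."""
    return (j + 1, 2 * k), (j + 1, 2 * k + 1)

def nodes_at_level(J):
    """All node indices at depth J of the full binary decomposition tree."""
    nodes = [(0, 0)]
    for _ in range(J):
        nodes = [c for n in nodes for c in children(*n)]
    return nodes
```
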

2.2. LDB selection process<br />

In the proposed method, we use a modified LDB approach.<br />

Instead of using a single dissimilarity measure, we use two<br />

dissimilarity measures (D1 and D2) to arrive at the final tree<br />

structure. Using multiple dissimilarity measures provides<br />

more feature dimensions for classification. Especially for<br />

complex data sets like music signals, a single dissimilarity<br />

measure may not be able to capture all the characteristic information<br />

about its class. Also instead of the selective splitting<br />

of the nodes, which basically helps in removing the redundancy<br />

in the LDB selection, we used all the nodes from the<br />

full decomposition tree. The redundancy within the final set<br />

of LDBs was later removed in the feature evaluation process.<br />

The goal is to identify those nodes (LDB) from the full<br />

wavelet packet tree which demonstrate high discriminatory<br />

values between all the classes for a given dissimilarity mea-<br />

sure D,. If there are say P classes then the dissimilarity<br />

measure was computed by taking 2 classes at a time i.e.<br />

PC2 combinations, where C stands for the mathematical<br />

operation of combinations. The nodes which show rela-<br />

tively higher discriminatory power compared to all the other<br />

nodes in each of the PC2 combinations were chosen as<br />

LDBs for that particular combination. The LDB nodes are<br />

then sorted by their discriminatory power and the first Q<br />

LDBs were chosen for further processing. This is repeated<br />

for T trials using different audio signals for each of the<br />

classes. All the Q x T LD3s for each of the PC2 com-<br />

bination over the T trials were analyzed for number of oc-<br />

currences. The first 2 highly occurring LDBs for each of<br />

the PCz combinations was chosen as the best LDBs for that<br />

particular combination of classes. So after all the trials we<br />

will have 2PCz LDBs, from which we choose the first 10<br />

highly occurring LDBs over all the combinations. At the<br />

end of this selection process we wil1 have 10 LDBs in the<br />

wavelet packet tree that demonstrate relatively high discrim-<br />

inatory behavior among all the combination of P classes.<br />

In other words, these nodes demonstrate high statistical distance<br />
between all the P classes.<br />
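The occurrence-counting selection above can be sketched with a simple voting scheme. The `trial_rankings` structure — a map from each pairwise class combination to its per-trial ranked node lists — is an assumption made for illustration.

```python
from collections import Counter

def select_ldbs(trial_rankings, Q=5, per_pair=2, final=10):
    """Vote for nodes: pool the top-Q nodes of every trial, keep the
    `per_pair` most frequent per class combination, then return the
    `final` most frequent nodes over all combinations."""
    pooled = []
    for trials in trial_rankings.values():
        counts = Counter(node for ranking in trials for node in ranking[:Q])
        pooled.extend(node for node, _ in counts.most_common(per_pair))
    return [node for node, _ in Counter(pooled).most_common(final)]
```
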

In this study the following values were chosen: P =<br />
6 and 4, Q = 5, and T = 10. We also tested the database<br />
with a few variations of wavelets (db, coif, sym) and observed<br />
the sym4 wavelet to provide better discrimination between the<br />

classes. Hence, the results presented in this study are based<br />

on the sym4 wavelet packet decompositions. As we also<br />

used two different dissimilarity measures in selecting the<br />

LDBs to enhance the classification accuracy, at the end of<br />

the LDB selection process we will have 2 x 10 LDBs using<br />

the two dissimilarity measures. These 20 LDBs can be used<br />

to construct a composite wavelet packet tree which is used<br />

to decompose the testing set and extract features as will be<br />

explained in Section 2.3.<br />

2.2.1. Dissimilarity measures<br />


The first dissimilarity measure D1 is the difference in the<br />

normalized energy between the corresponding nodes of the<br />

training signals from one of the PC2 combinations of classes.<br />

This gives the difference in the energy distribution of the<br />

signals on the TF plane. Audio signals exhibit different<br />

TF energy distribution patterns based on their composition.<br />

Hence this measure is expected to approximately reveal the<br />

different energy concentration locations on the TF plane for<br />

different types of audio signals.<br />

D1(j,k) = | E^p_{j,k} − E^q_{j,k} |,   (3)<br />
where E^p_{j,k} and E^q_{j,k} are the normalized energies of the corresponding<br />
nodes for the PC2 combination signals. Fig. 1<br />
shows a sample LDB tree obtained using the dissimilarity<br />
measure D1.<br />
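Measure D1 can be sketched as below, under the assumption that each node's energy is normalized by the signal's total coefficient energy before the class-wise difference is taken; the function names are illustrative.

```python
def normalized_node_energies(node_coeffs):
    """Map each node to its share of the signal's total coefficient energy."""
    energies = {n: sum(c * c for c in cs) for n, cs in node_coeffs.items()}
    total = sum(energies.values())
    return {n: e / total for n, e in energies.items()}

def d1(node_coeffs_p, node_coeffs_q):
    """Per-node difference in normalized energy between two class signals."""
    e_p = normalized_node_energies(node_coeffs_p)
    e_q = normalized_node_energies(node_coeffs_q)
    return {n: abs(e_p[n] - e_q[n]) for n in e_p}
```
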

The discriminant measure D2 is a measure estimating<br />
the randomness or non-stationarity of the basis vectors.<br />

It is computed as the set of variances along the segments of<br />

the basis vector coefficients. The ratio of this variance mea-<br />

sure between the signals from each of the PC2 combination<br />

of classes indicate the amount of deviation observed in the<br />

non-stationarity between the classes. One of the important<br />

characteristics of an audio signal is its time varying signal<br />

structures. This variability in time-varying signal structures<br />

can be approximated using the discriminant measure D2.<br />

where p and q are the index of the L segments obtained by<br />

segmenting the basis vectors at node (j, k) for one of the<br />

PC2 combination of classes. Fig. 2 shows a sample LDB<br />
tree obtained using the dissimilarity measure D2.<br />
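The segment-variance computation underlying D2 can be sketched as follows. The exact way the per-segment variances of the two classes are combined is an assumption based on the description above (a per-segment ratio is used here); the function names are illustrative.

```python
def segment_variances(coeffs, n_segments):
    """Variances over equal-length segments of a node's basis coefficients."""
    seg_len = len(coeffs) // n_segments
    variances = []
    for s in range(n_segments):
        seg = coeffs[s * seg_len:(s + 1) * seg_len]
        mean = sum(seg) / len(seg)
        variances.append(sum((c - mean) ** 2 for c in seg) / len(seg))
    return variances

def d2(coeffs_p, coeffs_q, n_segments):
    """Assumed form: ratio of per-segment variances between two classes."""
    vp = segment_variances(coeffs_p, n_segments)
    vq = segment_variances(coeffs_q, n_segments)
    return [p / q for p, q in zip(vp, vq)]
```
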

Fig. 1. A sample LDB tree obtained using the dissimilarity<br />
measure D1 and sym4 wavelet.<br />

2.3. Feature extraction<br />

An audio database consisting of 24 Rock, 35 Classical, 31<br />

Country, 21 Jazz, 34 Folk and 25 Pop signals (a total of 170<br />

signals) was used in this study. Each of the signals in this<br />
database was extracted from an individual music CD. All the<br />
samples were 5 seconds long, sampled at 44.1 kHz. After<br />
the LDBs were selected as described in the previous section,<br />
a composite wavelet packet tree was constructed using<br />

all the 20 LDBs. The signals from the audio database were<br />


Fig. 2. A sample LDB tree obtained using the dissimilarity<br />

measure D2 and sym4 wavelet.<br />

decomposed using this composite wavelet packet tree. The<br />

basis vectors from each of the LDB nodes of this wavelet<br />
packet tree can be directly used as features. However, considering<br />
the dimensions of the basis vectors, we extract the<br />
discriminatory values using the same dissimilarity measures<br />
(D1 and D2) on the LDB nodes and use them as features. So<br />
for each audio signal we will have 10 features using each of<br />
the dissimilarity measures. In total we will have 20 features<br />

for each signal. The combination of these 20 features was<br />
evaluated for significance in class separability. A<br />
wrapper approach was used to select the highly discriminative<br />
feature set. In the wrapper approach the features are either<br />
added or removed sequentially one by one and the classification<br />
accuracy is computed using the classifier. The set<br />
of features which provided the minimum classification error was<br />
chosen as the optimal feature set. The resulting set of<br />
features was fed to a pattern classifier as explained in<br />
the next section.<br />
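The wrapper selection can be sketched as a sequential forward search; `error_fn` is assumed to train and evaluate the classifier on a candidate feature subset and return its error.

```python
def forward_select(n_features, error_fn):
    """Add features one at a time while classification error keeps dropping."""
    selected, best_err = [], float("inf")
    remaining = list(range(n_features))
    while remaining:
        # try adding each remaining feature; keep the best candidate
        err, feat = min((error_fn(selected + [f]), f) for f in remaining)
        if err >= best_err:
            break                      # no improvement: stop
        selected.append(feat)
        remaining.remove(feat)
        best_err = err
    return selected, best_err
```
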


2.4. Pattern Classification<br />

The motivation for the pattern classification is to automat-<br />

ically group signals of same characteristics using the dis-<br />

criminatory features derived as explained in the previous<br />

section. Pattern classification was carried out using an LDA-<br />
based classifier. In LDA, the feature vectors derived as explained<br />
above were transformed into canonical discriminant<br />
functions [9] such as<br />
f = u1·b1 + u2·b2 + ... + un·bn + a,   (5)<br />
where {u} is the set of highly discriminative features, {b}<br />
and a are the coefficients and the constant respectively. Using<br />

the discriminant scores and the prior probability values of<br />




each group, the posterior probabilities of each sample oc-<br />

curring in each of the groups are computed. The sample is<br />

then assigned to the group with the highest posterior proba-<br />

bility.<br />

The classification accuracy was estimated using the leave-one-out<br />
method, which is known to provide a least-biased estimate<br />
[10]. In the leave-one-out method, one sample is excluded<br />

from the dataset and the classifier is trained with the remain-<br />

ing samples. Then the excluded signal is used as the test<br />

data and the classification accuracy is determined. This is<br />

repeated for all samples of the dataset. Since each signal<br />

is excluded from the training set in turn, the independence<br />

between the test and training sets is maintained.<br />
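The leave-one-out procedure can be sketched generically; a 1-nearest-neighbour rule stands in for the LDA classifier purely for illustration.

```python
def loo_accuracy(X, y, train_fn):
    """Hold each sample out in turn, train on the rest, score the hold-out."""
    correct = 0
    for i in range(len(X)):
        predict = train_fn(X[:i] + X[i + 1:], y[:i] + y[i + 1:])
        correct += predict(X[i]) == y[i]
    return correct / len(X)

def nearest_neighbour(X_train, y_train):
    """Toy classifier: label of the closest training sample."""
    def predict(x):
        dists = [sum((a - b) ** 2 for a, b in zip(row, x)) for row in X_train]
        return y_train[dists.index(min(dists))]
    return predict
```
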

3. RESULTS AND DISCUSSIONS<br />

All the signals from the audio database were decomposed<br />
using the LDB wavelet packet tree and features were extracted<br />
as explained in Sections 2.2 and 2.3. The<br />
features were then fed to the LDA-based classifier. The classification<br />
accuracies were computed and verified using the<br />
leave-one-out method. Table 1 shows the classification accuracies<br />
achieved for a six group classification. An overall<br />
classification accuracy of 77% using regular LDA and 65%<br />
using the leave-one-out method were achieved. From the table<br />

it can be observed that the Classical and Country classes<br />

were classified more accurately (94% and 84%) followed<br />

by the Rock and Folk (79% and 79%). We observe that<br />

the classes Jazz and Pop have significant overlap with other<br />

classes, showing a poor classification accuracy. Fig. 3 shows<br />

the scatter plot of the 6 group classification. The cluster-<br />

ing behaviour of the classes can be observed. Rock, Classi-<br />

cal, Country and Folk classes show distinct clusters whereas<br />

the Jazz and Pop overlap with the clusters of Classical and<br />

Country. It is hard to perceptually arrive at a clear boundary<br />

between music types. There always exists a natural overlap<br />

between similar types of music signals. However, the poor<br />

classification of Jazz and Pop in our case may be attributed<br />

to the insufficient and less discriminatory clues (features)<br />

used in this study. Fine tuning and adding more dissimilar-<br />

ity measures can improve the overall classification accuracy.<br />

As we observed significant overlap from the Jazz and<br />

Pop classes, we performed a second classification using only<br />

4 groups (124 signals), removing Jazz and Pop. This was<br />

done to assess the performance of the classifier with the re-<br />

maining 4 groups. As expected, the overall classification ac-<br />

curacy improved from 77% to 88% for the regular LDA and<br />

from 65% to 82% using the leave-one-out method as shown in Ta-<br />

ble 2. Fig. 4 shows a clearer clustering behavior of the 4<br />

classes. The results obtained are from our initial testing of<br />

the proposed technique. The authors' previous work [6] using<br />

a different TF approach has provided better classification<br />

accuracies, however with double the size of the reported<br />

Method: Original<br />

Gr | Ro | Cl | CO | Ja | Fo | Po | CA%<br />

Ro | 19 |  0 |  5 |  0 |  0 |  0 | 79.2<br />

Cl |  0 | 33 |  0 |  1 |  0 |  1 | 94.3<br />

Table 1. Six group classification results. Method: Origi-<br />

nal - Regular linear discriminant analysis, Cross - validated<br />

- Linear discriminant analysis with leave-one-out method,<br />

CA% - Classification accuracy rate, Gr-<strong>Group</strong>s, Ro-Rock,<br />

Cl-Classical, CO-Country, Fo-Folk, Ja-Jazz, Po-Pop.<br />


Fig. 3. Six groups scatter plot with the first two canonical<br />

discriminant functions<br />

features in this work. Also restricting the final significant<br />

LDBs to 10 from the set of 2PC2 controls the classification<br />

accuracy.<br />

4. CONCLUSIONS<br />

A novel LDB based audio classification scheme was pre-<br />

sented. High classification accuracies were achieved using<br />

the proposed methodology. Initial results suggest significant<br />

potential for LDB based audio classification. Simple dis-<br />

similarity measures like node energy and non-stationarity<br />

index performed well in identifying the discriminatory nodes<br />




Gr | Ro | Cl | CO | Fo | CA%<br />

CO |  – |  – | 28 |  – | 90.3  (Original)<br />

Cl |  0 | 32 |  3 |  0 | 91.4  (Cross-validated)<br />

CO |  1 |  4 | 26 |  0 | 83.9  (Cross-validated)<br />

Table 2. Four group classification results. Method: Origi-<br />

nal - Regular linear discriminant analysis, Cross - validated<br />

- Linear discriminant analysis with leave-one-out method,<br />

CA% - Classification accuracy rate, Gr-<strong>Group</strong>s, Ro-Rock,<br />

Cl-Classical, CO-Country, Fo-Folk.<br />


Fig. 4. Four groups scatter plot with the first two canonical<br />

discriminant functions<br />

between the music classes. Future work involves improving<br />

the LDB selection process, arriving at an optimal number of<br />

LDBs for a given classification problem, and including more<br />

dissimilarity measures for audio classification.<br />

5. ACKNOWLEDGEMENTS<br />

The authors thank NSERC for funding this<br />

project. The authors also acknowledge the contributions of<br />

Andre Chang, a research assistant in the <strong>Signal</strong> <strong>Analysis</strong><br />

and <strong>Research</strong> (<strong>SAR</strong>) group at <strong>Ryerson</strong> <strong>University</strong>, Toronto,<br />

Canada.<br />


6. REFERENCES<br />

[1] H. G. Kim, N. Moreau, and T. Sikora, “Audio clas-<br />

sification based on MPEG-7 spectral basis representa-<br />

tions,” IEEE Transactions on Circuits and Systems for<br />

Video Technology, vol. 14, no. 5, pp. 716-725, May<br />

2004.<br />

[2] Lie Lu and Hong-Jiang Zhang, “Content analysis for<br />

audio classification and segmentation,” IEEE Trans-<br />

actions on Speech and Audio Processing, vol. 10, no.<br />

7, pp. 504-516, Oct 2002.<br />

[3] George Tzanetakis and Perry Cook, “Music genre<br />

classification of audio signals,” IEEE Transactions on<br />

Speech and Audio Processing, vol. 10, no. 5, pp. 293-<br />

302, July 2002.<br />

[4] G. Guo and S. Z. Li, “Content-based audio classifica-<br />

tion and retrieval by support vector machines,” IEEE<br />

Transactions on Neural Networks, vol. 14, no. 1, pp.<br />

209-215, Jan 2003.<br />

[5] S. Esmaili, S. Krishnan, and K. Raahemifar, “Con-<br />

tent based audio classification and retrieval using joint<br />

time-frequency analysis,” in IEEE International Con-<br />

ference on Acoustics, Speech and Signal Processing<br />

(ICASSP), May 2004, pp. V-665-668.<br />

[6] K. Umapathy, S. Krishnan, and S. Jimaa, “Multi-<br />

group classification of audio signals using time-<br />

frequency parameters,” IEEE Transactions on Mul-<br />

timedia, in press.<br />

[7] N. Saito and R. R. Coifman, “Local discriminant<br />

bases and their applications,” Journal of Mathemat-<br />

ical Imaging and Vision, vol. 5, no. 4, pp. 337-358,<br />

1995.<br />

[8] Stephane Mallat, A Wavelet Tour of Signal Processing,<br />

Academic Press, San Diego, CA, 1998.<br />

[9] SPSS Inc., “SPSS advanced statistics user's guide,” in<br />

User manual, SPSS Inc., Chicago, IL, 1990.<br />

[10] K. Fukunaga, Introduction to Statistical Pattern<br />

Recognition, Academic Press, Inc., San Diego, CA,<br />

1990.<br />




A NOVEL ROBUST IMAGE WATERMARKING USING A CHIRP<br />

BASED TECHNIQUE<br />

Arunan Ramalingam and Sridhar Krishnan<br />

Department of Electrical and Computer Engineering,<br />

<strong>Ryerson</strong> <strong>University</strong>, Toronto, Ontario, Canada M5B 2K3<br />

Email: (aramalin)(krishnan) @ee.ryerson.ca<br />

Abstract<br />

In this study, we propose a new spread spectrum im-<br />

age watermarking algorithm that embeds linear chirps as<br />

watermark messages. The slopes of the chirps on the time-<br />

frequency (TF) plane represent watermark messages such<br />

that each slope corresponds to a different message. We<br />

extract the watermark message using a line detection al-<br />

gorithm based on the Hough-Radon transform (HRT). The<br />

HRT detects the directional elements that satisfy a paramet-<br />

ric constraint in the image of a TF plane. The proposed<br />

method not only detects the presence of the watermark, but also<br />

extracts the embedded watermark bits and ensures the mes-<br />

sage is received correctly. The robustness of the proposed<br />

scheme has been evaluated using common image processing<br />

techniques such as JPEG compression and image cropping,<br />

and we found the maximum bit error rate to be 19.03%,<br />

which is zero after postprocessing using HRT.<br />

Keywords: Image Watermarking, Spread Spectrum, Data<br />

Hiding, Hough-Radon Transform, Chirp Modulation.<br />

1. INTRODUCTION<br />

The huge success of the Internet allows for the trans-<br />

mission, wide distribution, and access of electronic data in<br />

an effortless manner. Content providers are faced with the<br />

challenge of how to protect their electronic data. One of<br />

the possible solutions in that area is digital watermarking, which<br />

is added to multimedia content by embedding an imper-<br />

ceptible and statistically undetectable signature. Thereby,<br />

multimedia data creators and distributors are able to prove<br />

ownership of intellectual property rights without preventing<br />

other individuals from copying the multimedia content itself. In<br />

this study, we propose a novel chirp based watermarking<br />

scheme [1] for images that embeds linear chirps as water-<br />

mark messages. Different chirp rates, i.e., slopes on the<br />

time-frequency (TF) plane, represent watermark messages<br />

such that each slope corresponds to a different message. The<br />

narrowband watermark messages are spread with a water-<br />

mark key (PN sequence) across a wider range of frequen-<br />

CCECE 2004 - CCGEI 2004, Niagara Falls, May/mai 2004<br />

0-7803-8253-6/04/$17.00 @ 2004 IEEE<br />


cies before embedding. The resulting wideband noise is<br />

added to the perceptually significant regions of the origi-<br />

nal image. We use the block-based discrete cosine trans-<br />

form (DCT) scheme for inserting the watermark. As a re-<br />

sult of image manipulations some message bits extracted by<br />

the detector may be in error, potentially resulting in the de-<br />

tection of the wrong watermark message. Our motivation<br />

for the proposed image watermarking algorithm is to detect<br />

the presence of the watermark, extract the embedded wa-<br />

termark message bits and decide on the watermark message<br />

even if some bits are received in error. As chirps are repre-<br />

sented as lines in a TF plane, a line detection algorithm such<br />

as the HRT has been applied to extract the watermark messages<br />

successfully.<br />
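A linear chirp whose TF slope carries the message, and its sign-quantized version, can be generated as sketched below. The parameter values and function names are illustrative assumptions, not the authors' code.

```python
import math

def linear_chirp(f0, f1, fs, n):
    """Sample a linear chirp whose instantaneous frequency sweeps from f0 to
    f1 Hz over n samples at sampling rate fs. The sweep rate (the slope of
    the line on the TF plane) is what encodes the message."""
    duration = n / fs
    rate = (f1 - f0) / duration  # chirp rate in Hz/s = TF-plane slope
    return [math.sin(2 * math.pi * (f0 * t + 0.5 * rate * t * t))
            for t in (i / fs for i in range(n))]

def quantize_sign(m):
    """Quantize chirp samples to +/-1 by sign, giving the bit sequence
    m^q that is actually embedded."""
    return [1 if v >= 0 else -1 for v in m]

# Illustrative values: 176 message bits, frequencies within [0, 500] Hz at fs = 1 kHz.
m = linear_chirp(f0=50, f1=400, fs=1000, n=176)
mq = quantize_sign(m)
```

Different (f0, f1) pairs give different slopes, and hence different watermark messages.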

2. WATERMARK EMBEDDING<br />

Let m be a normalized chirp function that represents the<br />

watermark message to be embedded into the original image.<br />

m takes continuous values in the interval [-1, 1], and needs<br />

to be quantized for the detection of each embedded bit. mq<br />

is the quantized version of m formed according to the sign<br />

of the sample values of m, taking values -1 and 1. Let m_k^q<br />

represent a watermark message bit to be embedded into the<br />

image. Each bit m_k^q is spread with a cyclically shifted version<br />

p_k of a binary PN sequence with a chip length of N and<br />

summed together to generate the wideband noise vector w:<br />

w = Σ_{k=0}^{M} m_k^q p_k, (1)<br />

where M is the number of watermark message bits in mq.<br />

There is always a trade-off between<br />

the embedded data size and the robustness of the algorithm: as<br />

the PN length decreases, the algorithm is able to add<br />

more bits into the host image but the detection of the hidden<br />

bits and resistance to different attacks are decreased. The<br />

wideband noise vector w formed is added to the image in<br />

perceptually significant regions to ensure robustness of the<br />

watermark against attacks. The length of w and hence the<br />

Authorized licensed use limited to: <strong>Ryerson</strong> <strong>University</strong> Library. Downloaded on July 7, 2009 at 10:55 from IEEE Xplore. Restrictions apply.<br />

93


number of watermark bits that can be embedded depends on<br />

the perceptual entropy of the image.<br />
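The spreading step of Eq. (1) — each bit modulates a cyclically shifted PN sequence and the results are summed — can be sketched as follows. The toy chip length is an illustrative assumption (the paper finds at least 10000 samples are needed in practice), as are all names.

```python
import random

def spread(bits, pn, chip_len):
    """Spread each message bit over a cyclically shifted PN sequence and
    sum the results, i.e. w = sum_k m_k * shift(pn, k), as in Eq. (1)."""
    w = [0] * chip_len
    for k, b in enumerate(bits):
        s = k % chip_len
        shifted = pn[s:] + pn[:s]  # cyclic shift of the PN sequence by k
        for i in range(chip_len):
            w[i] += b * shifted[i]
    return w

random.seed(0)
pn = [random.choice((-1, 1)) for _ in range(64)]  # toy chip length
w = spread([1, -1, 1], pn, 64)
```

The resulting wideband vector w is what gets added to the perceptually significant DCT coefficients of the host image.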

To embed the watermark in the image, we can utilize<br />

the models that describe the masking characteristics of the<br />

human visual system [2]. Among such models, we use<br />

the model based on the just noticeable difference (JND)<br />

paradigm [3]. A set of JNDs is associated with a particu-<br />

lar invertible transform T. Given that a multimedia signal<br />

is transformed using T, the JNDs provide an upper bound<br />

on the extent that each of the coefficients can be perturbed<br />

without causing perceptual changes to the signal quality.<br />

The set of signal- and transform-dependent JNDs can be de-<br />

rived using complex analytic models or through experimen-<br />

tation. The JND paradigm is widely used in image com-<br />

pression, and image watermarking applications. We use the<br />

JNDs to determine the perceptually significant regions and<br />

also to find the perceptual entropy of the image. In this work<br />

we use one such model based on the DCT.<br />

DCT Based Model<br />

We use the model proposed by Watson [4] that has been<br />

applied to JPEG coding. In this method, the original image<br />

is decomposed into non-overlapping 8x8 blocks, and the<br />

DCT is performed independently for every block of data.<br />

Let the original image pixels be represented as x_{i,j,b}, where<br />

Fig. 1. Watermark embedding scheme.<br />

i and j represent the pixel elements in block b, and X_{u,v,b}<br />

denotes the DCT coefficient for the basis function located<br />

at position u, v of block b. A frequency thresh-<br />

old value is derived for each DCT basis function, which in<br />

this case results in an 8x8 matrix of t_{u,v} threshold values.<br />

These threshold values are determined for various view-<br />

ing conditions by Peterson et al. [5]. The visual model<br />

we used is for a minimum viewing distance of four picture<br />

heights and a D65 monitor white point. Watson further re-<br />


fines this model by adding a luminance sensitivity and con-<br />

trast masking component. Luminance sensitivity threshold<br />

is estimated by the formula<br />

t^l_{u,v,b} = t_{u,v} (X_{0,0,b} / X̄_{0,0})^a, (2)<br />

where XO,O,b is the DC coefficient of the DCT for block b,<br />

X̄_{0,0} is the DC coefficient corresponding to the mean luminance<br />

of the display, and a is a parameter which controls the<br />

degree of luminance sensitivity. The authors in [5] suggest<br />

setting the value of a to 0.649. Given a DCT coefficient and<br />

a corresponding threshold value derived from the viewing<br />

conditions and local luminance masking, a contrast masking<br />

threshold is derived as<br />

t^c_{u,v,b} = max[ t^l_{u,v,b}, |X_{u,v,b}|^{w_{u,v}} (t^l_{u,v,b})^{1 - w_{u,v}} ], (3)<br />

where w_{u,v} is a number between zero and one, and is empir-<br />

ically derived as 0.7 by the authors in [5]. The watermark<br />

embedding scheme is based on the model proposed in [6].<br />

The watermark encoder for the DCT scheme is described as<br />

X*_{u,v,b} = X_{u,v,b} + t^c_{u,v,b} w_{u,v,b}  if X_{u,v,b} > t^c_{u,v,b};  otherwise X*_{u,v,b} = X_{u,v,b}, (4)<br />

where X_{u,v,b} refers to the DCT coefficients, X*_{u,v,b} refers to<br />

the watermarked DCT coefficients, w_{u,v,b} is obtained from<br />

the wideband noise vector w, and t^c_{u,v,b} is the computed<br />

JND calculated from the visual model described in Eq. (3).<br />

Fig. 1 shows the block diagram of the described watermark<br />

encoding scheme.<br />
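The thresholds of Eqs. (2) and (3) and the JND-limited embedding step can be sketched as below. The numeric values and function names are illustrative assumptions; the embedding rule follows the perceptually-significant-coefficient idea described in the text, not necessarily the exact form used in [6].

```python
def luminance_threshold(t_uv, dc_b, dc_mean, a=0.649):
    """Eq. (2): scale the frequency threshold t_{u,v} by the block's DC
    level relative to the display's mean luminance."""
    return t_uv * (dc_b / dc_mean) ** a

def contrast_threshold(t_l, coeff, w=0.7):
    """Eq. (3): contrast-masking threshold derived from the luminance
    threshold and the coefficient magnitude (w = 0.7 per [5])."""
    return max(t_l, (abs(coeff) ** w) * (t_l ** (1 - w)))

def embed(coeff, t_c, noise):
    """Perturb a DCT coefficient by at most its JND, and only when the
    coefficient is perceptually significant (larger than the JND)."""
    return coeff + t_c * noise if abs(coeff) > t_c else coeff

# Illustrative numbers: one coefficient of one 8x8 block.
t_l = luminance_threshold(t_uv=1.0, dc_b=120.0, dc_mean=128.0)
t_c = contrast_threshold(t_l, coeff=30.0)
x_marked = embed(30.0, t_c, noise=1)
```

Bounding the perturbation by the JND is what keeps the watermark imperceptible while placing it in the perceptually significant regions.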

3. WATERMARK DETECTION<br />

The received image may be different from the water-<br />

marked image due to some intentional or unintentional im-<br />

age processing operations such as lossy compression, shift-<br />

ing and downsampling. Fig. 2 shows the block diagram of<br />

the described watermark decoding scheme. The detection<br />

scheme for the DCT-based watermarking can be expressed<br />

as<br />

w̃_{u,v,b} = (X̃*_{u,v,b} − X_{u,v,b}) / t^c_{u,v,b}, (6)<br />

where X̃*_{u,v,b} are the coefficients of the received watermarked<br />

image, and w̃ is the received wideband noise vector. The re-<br />

ceived wideband noise vector can be expressed as<br />

Authorized licensed use limited to: <strong>Ryerson</strong> <strong>University</strong> Library. Downloaded on July 7, 2009 at 10:55 from IEEE Xplore. Restrictions apply.<br />

w̃ = w + n, (7)


where n is the distortion component resulting from hostile<br />

image manipulations and is modeled as a zero-mean ran-<br />

dom vector uncorrelated with the PN sequence. We use the<br />

watermark key, i.e., the appropriately circular shifted PN<br />

sequence pk to despread %, and integrate the resulting se-<br />

quence to generate a test statistic (%, pr). The sign of the<br />

expected value of the statistic depends only on the emhed-<br />

ded watermark hit mi. Hence the watermark hits can be<br />

estimated using the decision rule:<br />

m̂_k = sgn( ⟨w̃, p_k⟩ ).<br />

Fig. 4. Line detection using HRT.<br />
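The despreading correlation and sign decision described above can be sketched as follows; the tiny PN sequence and the added noise are illustrative assumptions.

```python
def despread_bit(w_recv, pn, shift):
    """Correlate the received wideband vector with the cyclically shifted
    PN sequence and decide the bit from the sign of the test statistic."""
    n = len(pn)
    s = shift % n
    shifted = pn[s:] + pn[:s]
    stat = sum(a * b for a, b in zip(w_recv, shifted))
    return 1 if stat >= 0 else -1

# Round trip: embed one bit, add noise uncorrelated with the key, recover it.
pn = [1, -1, 1, 1, -1, -1, 1, -1]
w = [-1 * c for c in pn]                            # bit m = -1 spread over pn
noisy = [c + 0.3 * ((-1) ** i) for i, c in enumerate(w)]
bit = despread_bit(noisy, pn, shift=0)
```

Because the distortion n is modeled as uncorrelated with the PN sequence, it averages out in the correlation and the bit's sign survives.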

(WV) and the Hough space of the linear chirp received at<br />

a bit error rate of 19.03%. The prominence of the global<br />

maximum in the HRT space provides an indication of the<br />

presence of chirps in the TFD, thereby leading to successful<br />

watermark detection.)<br />

4. RESULTS AND DISCUSSION<br />

We evaluated the proposed scheme using eight differ-<br />

ent images of size 512x512. The sampling frequency f_s<br />

of the watermarks equals 1 kHz. Hence the initial and final<br />

frequencies, f_{0b} and f_{1b}, of the linear chirps representing<br />

all watermark messages are constrained to [0, 500] Hz. We<br />

embed these chirps into the images for a chip length of N,<br />

which depends on the perceptual entropy of the image. We<br />

experimentally found that for reliable detection of chirp un-<br />

der various image processing attacks, the chip length should<br />

be at least 10000 samples. If the image can support more than<br />

10000 samples, then multiple chirps can be embedded and<br />

the payload can be increased. In our tests, we used a sin-<br />

gle watermark sequence having 176 message bits. To mea-<br />

sure the robustness of the watermarking algorithm, we per-<br />

formed the following difficult image manipulation tests: (i)<br />

JPEG Compression, (ii) JPEG Compression and Cropping<br />

(1/4 Original), (iii) JPEG Compression and Cropping (1/16<br />

Original). The JPEG compression is performed with dif-<br />

ferent quality factors Q; a higher value of Q indicates better image<br />

quality. These tests are performed on the watermarked im-<br />

ages to simulate the image processing attacks and the water-<br />

mark message bits are extracted as described in Section 3.<br />

During all these robustness tests, we assume that the image<br />

and the PN sequence are synchronized. Figs. 5 - 6 show the<br />

bit error rate (BER) obtained for JPEG compression with<br />

quality factors 60 and 20 respectively, and with watermark<br />





Fig. 5. BER (in %) for JPEG compression with quality<br />

factor 60.<br />

message length of 176 bits. The extracted bits are localized<br />

in the TF domain using WV. Although some of the bits are<br />

received in error, HRT is able to detect the presence of chirp<br />

and estimate the parameters of the chirp for all the simula-<br />

tion results reported in the study.<br />
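The BER reported in Figs. 5 and 6 is simply the fraction of extracted bits that differ from the embedded ones; a minimal sketch, with made-up bit vectors:

```python
def bit_error_rate(sent, received):
    """Fraction (in %) of message bits that differ between the embedded
    and the extracted watermark."""
    errors = sum(1 for s, r in zip(sent, received) if s != r)
    return 100.0 * errors / len(sent)

sent = [1, -1, 1, 1, -1, 1, -1, -1]
received = [1, -1, -1, 1, -1, 1, 1, -1]  # two bits flipped by the attack
ber = bit_error_rate(sent, received)      # -> 25.0
```

Even at a nonzero BER like the 19.03% worst case reported here, the HRT line-detection step can still recover the chirp's slope, and hence the message.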

5. CONCLUSION<br />

In this paper, we proposed a novel image watermark-<br />

ing algorithm that embeds linear chirps as watermark mes-<br />

sages. The watermark message is added to the perceptually<br />

significant regions of the image to ensure robustness of the<br />

watermark to common image processing attacks. The algo-<br />

rithm is able to extract the watermark message even if some<br />

of the bits received are in error. A line detection algorithm<br />

based on the HRT detects the slope of the watermark mes-<br />

sage in the image of the TF plane of the chirp signal. The<br />

HRT provides error correcting capability and can be effi-<br />

ciently implemented as it operates on small images of the<br />

TF plane. Our studies confirm the robustness of the algo-<br />

rithm to image compression and cropping attacks. We are<br />

currently working on expanding our robustness tests and de-<br />

veloping a complete analytical model for the algorithm.<br />

Acknowledgements<br />

We would like to acknowledge Micronet for their finan-<br />

cial support.<br />


Fig. 6. BER (in %) for JPEG compression with quality<br />

factor 20.<br />

References<br />

[1] S. Erkucuk, S. Krishnan and M. Zeytinoglu, “Ro-<br />

bust Audio Watermarking Using a Chirp Based Tech-<br />

nique,” IEEE Intl. Conf. on Multimedia and Expo, vol.<br />

2, pp. 513-516, 2002.<br />

[2] M. S. Sanders and E. J. McCormick, Human Factors<br />

in Engineering and Design, McGraw-Hill, New York,<br />

7th edition, 1993.<br />

[3] N. Jayant, J. Johnston, and R. Safranek, “<strong>Signal</strong> Com-<br />

pression Based Models of Human Perception,” Pro-<br />

ceedings of the IEEE, vol. 81, pp. 1385-1422, October<br />

1993.<br />

[4] A. B. Watson, “DCT quantization matrices visually op-<br />

timized for individual images,” Proc. SPIE Conf. Hu-<br />

man Vision, Visual Processing, and Digital Display,<br />

vol. 1913, pp. 202-216, February 1993.<br />

[5] H. A. Peterson, A. J. Ahumada and A. B. Watson,<br />

“Improved detection model for DCT coefficient quan-<br />

tization,” Proc. SPIE Conf. Human Vision, Visual Pro-<br />

cessing, and Digital Display, vol. 1913, pp. 191-201,<br />

February 1993.<br />

[6] C. I. Podilchuk and W. Zeng, “Image-Adaptive Water-<br />

marking Using Visual Models,” IEEE Journal on Se-<br />

lected Areas in Communications, vol. 16, pp. 525-<br />

539, May 1998.<br />

[7] R. M. Rangayyan and S. Krishnan, “Feature identifica-<br />

tion in the time-frequency plane by using the Hough-<br />

Radon transform,” Pattern Recognition, vol. 34,<br />

pp. 1147-1158, 2001.<br />




A Novel Way of Lossless Compression of Digital Mammograms<br />

Using Grammar Codes<br />

Xiaoli Li, Sridhar Krishnan and Ngok-Wah Ma<br />

Department of Electrical and Computer Engineering<br />

<strong>Ryerson</strong> <strong>University</strong>, Toronto, ONT M5B 2K3, CANADA<br />

Phone: 416-979-5000 ext.6086 Fax: 416-979-5280<br />

Abstract-Breast cancer is the most common cancer among women<br />

in Canada. Despite slight declines in mortality rates over the past<br />

decade for women with breast cancer, one in nine Canadian<br />

women will develop breast cancer in her lifetime; one in 25<br />

Canadian women will die from this disease. Digital mammograms<br />

(X-rays of the breast) may allow better cancer diagnosis and have<br />

the ability to be transmitted electronically around the world. The<br />

problem is that mammograms are large images with little<br />

correlation detail. Therefore, for a physician to diagnose diseases<br />

correctly even through the communication networks, gaining<br />

higher compression to save bandwidth without any data loss<br />

becomes a challenging issue. Among the traditional lossless<br />

compression algorithms such as Huffman, Lempel-Ziv and<br />

Arithmetic, Lempel-Ziv and Arithmetic source coding techniques<br />

have better performance than Huffman on digital mammograms.<br />

In order to achieve better compression ratios we investigate the<br />

newly developed Grammar-based source code for medical image<br />

compression such as mammograms. In this Grammar-based code,<br />

the original data (image) is first transformed into a context free<br />

grammar, from which the original data sequence can be fully<br />

reconstructed by performing parallel and recursive substitutions,<br />

and an arithmetic coding algorithm is then used to compress the<br />

context-free grammar or the corresponding sequence of parsed<br />

phrases. We tested the grammar-based coding technique on digital<br />

mammograms obtained from the Mammographic Image <strong>Analysis</strong><br />

Society (MIAS). The result shows the newly developed grammar<br />

code performs better than the traditional lossless coding schemes.<br />

In general, the grammar-based lossless compression algorithm<br />

seems to be a promising technique for teleradiology applications.<br />

Keywords—Arithmetic coding, grammar-based codes,<br />

mammography, compression ratio.<br />

I. INTRODUCTION<br />

In this paper, we investigate a novel lossless source coding<br />

technique called the grammar code for lossless compression of<br />

mammograms. Universal source coding theory aims at designing<br />

data compression algorithms, whose performance is<br />

asymptotically optimal for a class of sources.<br />

To put things into perspective, let us first review briefly,<br />

from the information-theoretic point of view, the existing<br />

universal lossless data compression algorithms. So far, the most<br />

widely used universal lossless compression algorithms are<br />

arithmetic coding algorithms, Lempel-Ziv algorithms, and their<br />

variants. Arithmetic coding algorithms and their variants are<br />

statistical model-based algorithms. To use an arithmetic coding<br />

algorithm to encode a data sequence, a statistical model is either<br />

built dynamically during the encoding process, or assumed to<br />

exist in advance. Several approaches have been proposed in the<br />

CCECE 2004- CCGEI 2004, Niagara Falls, Maylmai 2004<br />

0-7803-8253-6/04/$17.00 02004 IEEE<br />



literature to build the statistical model dynamically. Typically, in<br />

all these methods, the next symbol in the data sequence is<br />

predicted by a proper context and coded by the corresponding<br />

estimated conditional probability. Arithmetic coding algorithms<br />

and their variants are universal only with respect to the class of<br />

Markov sources with order less than some designed parameter<br />

value. Note that in arithmetic coding, the original data sequence is<br />

encoded letter by letter. In contrast, no statistical model is used in<br />

Lempel-Ziv algorithms and their variants. During the encoding<br />

process, the original data sequence is parsed into non-overlapping,<br />

variable-length phrases according to some kind of string matching<br />

mechanism, and then encoded phrase by phrase. Each parsed<br />

phrase is either distinct or replicated with the number of<br />

repetitions less than or equal to the size of the source alphabet.<br />

Phrases are encoded in terms of their positions in a dictionary or<br />

database. Lempel-Ziv algorithms are universal with respect to a<br />

class of sources which is broader than the class of Markov sources<br />

of bounded order; the incremental parsing Lempel-Ziv algorithm<br />

[5] is universal for the class of stationary, ergodic sources.<br />
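The incremental (LZ78-style) parsing described above — non-overlapping, variable-length phrases where each new phrase is essentially distinct — can be illustrated with a minimal sketch (illustrative only, not the exact algorithm of [5]):

```python
def lz78_parse(seq):
    """Incremental parsing: each new phrase is the shortest prefix of the
    remaining input not yet in the dictionary, so every parsed phrase
    (except possibly the last) is distinct."""
    dictionary, phrases, current = set(), [], ""
    for ch in seq:
        current += ch
        if current not in dictionary:
            dictionary.add(current)
            phrases.append(current)
            current = ""
    if current:  # a trailing phrase may repeat an existing entry
        phrases.append(current)
    return phrases

phrases = lz78_parse("ababbabb")  # -> ['a', 'b', 'ab', 'ba', 'bb']
```

The distinctness of the phrases is precisely why, as noted below, applying arithmetic coding on top of Lempel-Ziv phrases gains little, whereas grammar-based parsing allows unbounded phrase repetition.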

Other universal compression algorithms include the dynamic<br />

Huffman algorithm [6], the move-to-front coding scheme [7] [8]<br />

[9], and some two-stage compression algorithms with codebook<br />

transmission [10] [11]. These algorithms are either inferior to<br />

arithmetic coding algorithms and Lempel-Ziv algorithms, or too<br />

complicated to implement.<br />

The class of grammar-based codes is broad enough to<br />

include block codes, Lempel-Ziv types of codes, multilevel<br />

pattern matching (MPM) grammar-based codes [2], and other<br />

codes as special cases. It has been proved in [1] that if a grammar-<br />

based code transforms each data sequence into an irreducible<br />

context-free grammar, then the grammar-based code is universal<br />

for the class of stationary ergodic sources. (For the definition of<br />

grammar-based codes and irreducible context free grammars,<br />

please see Section II.) Each irreducible context-free grammar also<br />

gives rise to a nonoverlapping, variable-length parsing of the data<br />

sequence it represents. Unlike the parsing in Lempel-Ziv<br />

algorithms, however, there is no upper bound on the number of<br />

repetitions of each parsed phrase. More repetitions of each parsed<br />

phrase imply that now there is room for arithmetic coding, which<br />

operates on phrases instead of letters. (In Lempel-Ziv algorithms,<br />

there is not much gain from applying arithmetic coding to parsed<br />

phrases since each parsed phrase is either distinct or replicated<br />

with the number of repetitions less than or equal to the size of the<br />

source alphabet.)<br />

In Section II, we review how context-free grammars are<br />

used to represent a sequence x, and refer the reader to the articles that<br />

explain how the reduction rules are used for designing grammar<br />

transforms, and how to efficiently encode grammars. In Section<br />

III, we describe how we implemented the new algorithm for real<br />

cases and the compression performances of the grammar code and<br />

other traditional lossless compression techniques for<br />



mammographic images. We also discuss what the advantage and<br />

disadvantage of the new algorithm are and why it is a promising<br />

algorithm after surmounting a few problems.<br />

II. REVIEW OF THE NEW UNIVERSAL CONTEXT-FREE<br />

LOSSLESS DATA COMPRESSION ALGORITHM BASED<br />

ON A GREEDY CONTEXT-FREE SEQUENTIAL<br />

GRAMMAR TRANSFORM<br />

The purpose of this section is to briefly review the new<br />

grammar-based code we applied so that this paper is self-contained.<br />

For the detailed description of the grammar-based<br />

codes, please refer to [3].<br />

Let A be our source alphabet with cardinality greater than or<br />

equal to 2. Let A+ be the set of all finite strings of positive length<br />

from A. |x| denotes the length of x. To avoid possible confusion, a<br />

sequence from A is sometimes called an A-sequence. Let x ∈ A+<br />

be a sequence to be compressed.<br />

A grammar-based code has the structure shown in Fig. 1.<br />

The original data sequence x is first transformed into a context-<br />

free grammar (or simply a grammar) G from which x can be fully<br />

recovered, and then G is compressed indirectly by using a zero-<br />

order arithmetic code. Before bringing in the grammar transform,<br />

we begin with explaining how context-free grammars are used to<br />

represent sequences x in A+.<br />

Figure 1: Structure of a grammar-based code.<br />
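The grammar-transform stage of Fig. 1 can be illustrated with a simple pair-replacement (Re-Pair-style) transform plus the parallel substitution that recovers the original sequence. This is a stand-in for the greedy sequential transform of [3], not the authors' exact algorithm, and all names are assumptions.

```python
def pair_grammar(seq):
    """Build a context-free grammar for seq by repeatedly replacing the
    most frequent adjacent pair with a fresh variable (Re-Pair style).
    Repeated phrases map to the same variable, which is what later makes
    arithmetic coding over phrases effective."""
    rules, symbols, next_var = {}, list(seq), 0
    while True:
        counts = {}
        for a, b in zip(symbols, symbols[1:]):
            counts[(a, b)] = counts.get((a, b), 0) + 1
        pair = max(counts, key=counts.get, default=None)
        if pair is None or counts[pair] < 2:
            break
        var = f"s{next_var}"
        next_var += 1
        rules[var] = pair
        out, i = [], 0
        while i < len(symbols):  # replace non-overlapping occurrences
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(var)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        symbols = out
    return symbols, rules

def expand(symbols, rules):
    """Recursive substitution recovers the original sequence from G."""
    out = []
    for s in symbols:
        out.extend(expand(rules[s], rules) if s in rules else [s])
    return out

start, rules = pair_grammar("abababab")
```

Here "abababab" compresses to the start string [s1, s1] with rules s0 → ab and s1 → s0 s0, and `expand` reconstructs the input exactly.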

Fix a countable set S = {s0, s1, s2, ...} of symbols, disjoint from<br />

A. Symbols in S will be called variables; symbols in A will be<br />

called terminal symbols. For any j ≥ 1, let S(j) = {s0, s1, s2, ..., s_{j-1}}.<br />

For our purpose, a context-free grammar G is a mapping from S(j)<br />

to (S(j) ∪ A)+ for some j ≥ 1. The set S(j) will be called the variable<br />

set of G and, to be specific, the elements of S(j) shall sometimes be<br />

called G-variables. For the purpose of data compression, we<br />

are interested only in grammars G for which the parallel<br />

replacement procedure terminates after finitely many steps and<br />

every G-variable s_i (i < j)


III. IMPLEMENTATION<br />

As we have presented in section 11, the new lossless<br />

grammar-based compression code is accomplished by taking the<br />

following three steps:<br />

i) Define a size-on-demand variable set of G and ensure each G-<br />

variable is distinct from source symbols;<br />

ii) Convert the source sequence x into an irreducible context-free<br />

grammar by applying a greedy grammar transform which adopts<br />

reduction rules in some order [3];<br />

iii)Based on the grammar transform, use one of three universal<br />

lossless data compression algorithms which are sequential<br />

algorithm, improved sequential algorithm, and hierarchical<br />

algorithm, to compress the irreducible grammar. All these<br />

algorithms combine the power of arithmetic coding with that of<br />

string matching. We define the size |S| of S as the number of G-variables in S. In our work, we fixed the size |S| of S, then operated the irreducible grammar transform, and finally applied the hierarchical data compression algorithm [3]. The rest of this section mainly describes how we implemented the new lossless algorithm, and presents the compression performance of our implementation on mammographic images in three categories, from small size (35x5) and middle size (200x150) to large size (1024x1024). Each category has 30 images.<br />
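As an illustration of step ii), the following toy transform of ours greedily rewrites the most frequent adjacent pair as a new G-variable until no pair repeats or the variable budget is spent. It is a simplified stand-in for the irreducible grammar transform of [3], not the authors' implementation.

```python
from collections import Counter

def pair_grammar(seq, max_vars):
    """Greedy toy grammar transform: repeatedly rewrite the most
    frequent adjacent pair as a new G-variable (simplified stand-in
    for the irreducible grammar transform of [3])."""
    seq = list(seq)
    rules = {}
    for i in range(max_vars):
        counts = Counter(zip(seq, seq[1:]))
        if not counts:
            break
        pair, n = counts.most_common(1)[0]
        if n < 2:                 # no repeated pair left: irreducible
            break
        var = "s%d" % (i + 1)
        rules[var] = list(pair)
        out, j = [], 0
        while j < len(seq):       # replace every occurrence of the pair
            if j + 1 < len(seq) and (seq[j], seq[j + 1]) == pair:
                out.append(var)
                j += 2
            else:
                out.append(seq[j])
                j += 1
        seq = out
    rules["s0"] = seq             # start rule generates the rest
    return rules

G = pair_grammar("abababab", max_vars=35)
print(G)  # {'s1': ['a', 'b'], 's2': ['s1', 's1'], 's0': ['s2', 's2']}
```

The `max_vars` budget mirrors the fixed size |S| discussed above: once it is exhausted, the remaining sequence stays in the start rule uncompressed.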

To obtain a higher compression rate, we directly transformed the MIAS image text into a grammar G without converting each text into a binary stream. However, the elements of the image text are not identical in length compared to their binary forms. So that the image can be recovered successfully by decoding later on, we embedded a specific separator symbol between pixel values to distinguish every two neighbors before starting the grammar transform.<br />
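A minimal sketch of that separator idea (the separator character and helper names below are our own choices for illustration):

```python
SEP = "|"  # an illustrative separator symbol, distinct from all pixel digits

def embed(pixels):
    """Join variable-length pixel values with a separator so every
    two neighbors can be told apart before the grammar transform."""
    return SEP.join(str(p) for p in pixels)

def recover(text):
    """Inverse mapping, used after grammar decoding."""
    return [int(t) for t in text.split(SEP)]

row = [174, 175, 9, 180]
print(embed(row))                 # 174|175|9|180
assert recover(embed(row)) == row
```

Without the separator, the decoded text "1749" could mean the pixels (174, 9) or (17, 49); the separator makes the split unambiguous at the cost of one extra symbol per pixel.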

As noted in [1][3], the G-variables that represent distinct production rules are distinct. The size |S| of S depends on the image size once the new irreducible grammar transform is applied. Since each production rule's left member is a unique G-variable, its right member is drawn from (S(j) ∪ A)+, and string matching is used heavily by the new grammar transform, as many G-variables must be available as are needed. However, the number of visible symbols in the C language, in which we simulated the grammar code, is limited: at most 75 symbols are available. This limitation could be overcome in C, but only with more sophisticated programming. Therefore, we adopted two schemes. Although neither scheme matches the theory exactly, both helped us to study the new lossless grammar-based code in depth and verify its feasibility. One scheme allows reusing the 75 (or fewer) G-variables to encode the remaining data sequence; as a result, a complete image is represented by several parallel grammars, as shown in Figure 3. The other scheme uses only 75 G-variables to convert an image into an irreducible grammar in a single pass.<br />

In the first scheme, we used 75 and 35 G-variables, respectively, to encode the 30 middle-size (200x150) mammographic images, and used 35 G-variables to encode the 30 large-size (1024x1024) mammographic images. From the study, we found that the more variables we used, the more processing time we needed for converting a source image to an irreducible<br />


grammar G. For example, using 75 G-variables consumed 15 minutes and 37 seconds on a GNU/Linux workstation to encode a medium-size image, while using 35 G-variables took only 2 minutes and 4 seconds on average. In medical applications, and especially for mammograms, where real-time compression is not an issue, computation time can be sacrificed to some extent; for regular images, however, the computation time of the grammar code must be taken into account. Another observation from the study of scheme 1 is that its compression rate is better than the Huffman, Lempel-Ziv, and arithmetic algorithms in some cases, but not significantly so, because the scheme destroys the correlation of the source image as a whole. Figure 4 displays this conclusion. We also compared the compression rates of using 75 G-variables and 35 G-variables: they are 2.643:1 and 2.639:1, respectively. The compression gain of using 75 G-variables is very limited compared to using 35 G-variables, while 75 G-variables took much longer to process, as described above. Clearly, in scheme 1, using 35 G-variables is good enough for encoding source images. Therefore, to save time, we did not test the 1024x1024 mammographic images with 75 G-variables.<br />

In the second scheme, we compared the average compression rate of the grammar code over 30 small (35x5) mammographic images with the Huffman, Lempel-Ziv, and arithmetic algorithms. Its compression rate is much greater than that of the three traditional techniques, as displayed in Figure 5. While this scheme is impractical for the compression of large images, since 75 G-variables are not enough for that purpose, the result does demonstrate that if we let |S| be big enough to completely represent a large image, the compression performance will conform to the theory and to Result 2 described in Section II. However, we should be aware of the time consumption involved in processing large images with the grammar code.<br />

Figure 2: A sample of the mammographic images. (a) The original image; (b) the image after grammar decoding.<br />

[Figure content: grayscale pixel values, e.g. rows 174 175 173 173 174 / 177 180 182 180 175 / 176 176 174 173 175 ..., partitioned into several parallel grammars.]<br />

Figure 3: The image represented by several grammars.<br />

Authorized licensed use limited to: <strong>Ryerson</strong> <strong>University</strong> Library. Downloaded on July 7, 2009 at 10:54 from IEEE Xplore. Restrictions apply.<br />




Figure 4: The compression performance of the techniques over 30 1024x1024 digital mammographic images.<br />


Figure 5: The compression performance of the techniques over 30 35x5 digital mammographic images.<br />

For transmitting mammograms over a network, overcoming the variable-set requirement of the grammar-based code will provide high compression performance.<br />

IV. CONCLUSIONS<br />

For decades, researchers have kept looking for more effective lossless compression techniques for critical and large images, MIAS images for example, to be transmitted across the Internet without any data loss. The new lossless compression<br />




grammar-based code attracted our attention and prompted us to verify whether it is a promising code. Our simulation results show that it can achieve higher compression ratios for large images than the Huffman, Lempel-Ziv, and arithmetic algorithms.<br />
Assuming that the number of symbols available as variables of G is unbounded, we can compress large images exactly as the new universal lossless grammar-based code was originally designed. But this involves more complicated processing and long compression times; these two are the main obstacles to applying the grammar-based code in practice.<br />

REFERENCES<br />

[1] J. C. Kieffer and E.-H. Yang, "Grammar-based codes: A new class of universal lossless source codes," IEEE Trans. Inform. Theory, vol. 46, no. 3, pp. 737-754, May 2000.<br />
[2] J. C. Kieffer, E.-H. Yang, G. Nelson, and P. Cosman, "Universal lossless compression via multilevel pattern matching," IEEE Trans. Inform. Theory, vol. 46, pp. 1227-1245, July 2000.<br />
[3] E.-H. Yang and J. C. Kieffer, "Efficient universal lossless data compression algorithms based on a greedy sequential grammar transform - Part one: Without context models," IEEE Trans. Inform. Theory, vol. 46, pp. 755-777, May 2000.<br />
[4] N. Abramson, Information Theory and Coding. New York: McGraw-Hill, 1963.<br />
[5] J. Ziv and A. Lempel, "Compression of individual sequences via variable-rate coding," IEEE Trans. Inform. Theory, vol. IT-24, pp. 530-536, 1978.<br />
[6] R. G. Gallager, "Variations on a theme by Huffman," IEEE Trans. Inform. Theory, vol. IT-24, pp. 668-674, 1978.<br />
[7] J. Bentley, D. Sleator, R. Tarjan, and V. K. Wei, "A locally adaptive data compression scheme," Commun. Assoc. Comput. Mach., vol. 29, pp. 320-330, 1986.<br />
[8] P. Elias, "Interval and recency rank source coding: Two on-line adaptive variable length schemes," IEEE Trans. Inform. Theory, vol. IT-33, pp. 1-15, 1987.<br />
[9] B. Y. Ryabko, "Data compression by means of a 'book stack'," Probl. Inform. Transm., vol. 16, no. 4, pp. 16-21, 1980.<br />
[10] D. L. Neuhoff and P. C. Shields, "Simplistic universal coding," IEEE Trans. Inform. Theory, vol. 44, pp. 778-781, Mar. 1998.<br />
[11] D. S. Ornstein and P. C. Shields, "Universal almost sure data compression," Ann. Probab., vol. 18, pp. 441-452, 1990.<br />


CONTENT BASED AUDIO CLASSIFICATION AND RETRIEVAL USING JOINT<br />

TIME-FREQUENCY ANALYSIS<br />

S. Esmaili, S. Krishnan and K. Raahemifar<br />

Multimedia Information and <strong>Signal</strong> <strong>Analysis</strong> <strong>Research</strong> (MI<strong>SAR</strong>) Laboratories<br />

Department of Electrical and Computer Engineering<br />

<strong>Ryerson</strong> <strong>University</strong>, Toronto, Ontario, Canada<br />

e-mail: (sesmaili)(krishnan)(kraahemi)@ee.ryerson.ca<br />

ABSTRACT<br />

In this paper, we present an audio classification and retrieval technique<br />

that exploits the non-stationary behavior of music signals<br />

and extracts features that characterize their spectral change over<br />

time. Audio classification provides a solution to incorrect and inefficient<br />

manual labelling of audio files on computers by allowing<br />

users to extract music files based on content similarity rather than<br />

labels. In our technique, classification is performed using time-frequency analysis, and sounds are classified into six music groups consisting of rock, classical, country, folk, jazz and pop. For each 5-second<br />

music segment, the features that are extracted include entropy, centroid,<br />

centroid ratio, bandwidth, silence ratio, energy ratio, and<br />

location of minimum and maximum energy. Using a database<br />

of 143 signals, a set of 10 time-frequency features are extracted<br />

and an accuracy of classification of around 93% using regular linear<br />

discriminant analysis or 92.3% using leave one out method is<br />

achieved.<br />

1. INTRODUCTION<br />

With the abundance of personal computers, advances in high speed<br />

modems operating at 100 Mbps and GUI based peer-to-peer (P2P)<br />

file-sharing systems that make it simple for individuals without<br />

much computer knowledge to download their favorite music, there<br />

has been an increase of digitized music available on the Internet<br />

and on personal computers. As such, there is also a rising need<br />

to manage and efficiently search the large number of multimedia<br />

databases available online which is difficult using text searches<br />

alone. Current multimedia databases are indexed based on song<br />

title or artist name which requires manual entry and improper indexing<br />

could result in incorrect searches. A more effective content-based retrieval system analyzes audio signals, selects and extracts<br />

dominant perceptual features and classifies the music based<br />

on these features. Stronger features provide a higher degree of<br />

separation between classes and thereby a higher classification accuracy.<br />

The aim is to make music search engines as effective as<br />

text-based ones and this is examined further in this paper.<br />

In recent years, there have been many works on audio classification<br />

with various perceptual features and several classification<br />

algorithms. In one of the pioneer works done on audio classification<br />

and later commercialized into the “Muscle Fish” project, Wold<br />

et al [1] extracted an N dimensional vector consisting of several<br />

acoustical features such as loudness, pitch, brightness, bandwidth, and harmonicity from each sound. (Thanks to Micronet and NSERC for funding.) A Euclidean (Mahalanobis)<br />

distance is then calculated between the input sound feature vector<br />

and the existing models in the database. Using the nearest neighbor<br />

(NN) rule, the signal is grouped into the class with the minimal<br />

Euclidean distance.<br />

In a similar work to that of [1], Liu et al [2] extract 13 different<br />

audio features to separate audio clips into different scene classes<br />

such as advertisement, basketball, football, news and weather. Features<br />

consist of volume distribution, pitch contour, bandwidth, frequency<br />

centroid and energy. A neural network classifier with a<br />

one-class-in-one network (OCON) structure is used and an overall<br />

classification rate of 88% is achieved. Artificial neural networks<br />

(ANN) are effective in detecting complex nonlinear relationships<br />

while requiring little formal training. However, their process is<br />

computationally expensive and more importantly, the relation between<br />

the input and output variables is defined in a black box<br />

model that has no analytical basis. In terms of audio classification<br />

this means that it is difficult to deduce which acoustical features<br />

are significant in classifying each type of sound [1].<br />

In a different technique, Lu and Hankinson [3] used a rule-based heuristic classification method to classify an audio signal<br />

into speech, music and noise. For each feature, a threshold is set<br />

to determine the segment type and the feature set includes silence<br />

ratio, centroid, harmonicity and pitch. Since the feature threshold<br />

must change for different audio inputs, this type of classifier is<br />

tedious and not ideal. A classification rate of 75% for speech, and<br />

89% for music is reported.<br />

Lu et al [4] proposed support vector machines (SVMs) as an<br />

alternative to current classification methods. Using a kernel-based<br />

SVM increases the classification rate by separating nonlinear cases.<br />

Here, a nonlinear kernel function maps the data to a high dimensional<br />

feature space where the data is linearly separable. The authors<br />

use a combination of a rule-based classifier and a kernel<br />

based SVM to distinguish between 5 different audio classes including<br />

silence, music, background sound, pure speech and non-pure speech. Their feature set includes features similar to those<br />

reported in [1] and [5], such as MFCCs, zero-crossing rate (ZCR),<br />

short time energy (STE), sub-band powers, brightness, and bandwidth<br />

with some new features such as spectral flux (SF), band periodicity<br />

(BP), and noise-frame-ratio (NFR). An average classification<br />

accuracy of around 90% is achieved.<br />

In the majority of the previous work in this area, audio is examined<br />

in either the time or frequency domain where it is assumed<br />

that the signals are wide sense stationary. In reality, sounds are<br />

non-stationary and multi-component signals consisting of series<br />



of sinusoids with harmonically related frequencies. Our algorithm<br />

considers the short-time Fourier transform (STFT) of an audio signal<br />

to extract parameters that will be used to classify signals. Our<br />

retrieval technique is less computationally intensive than those that<br />

use ANN, SVM, or Hidden Markov Models (HMM). Also, the<br />

efficiency of features can be examined which is not feasible in<br />

ANNs. Note that while HMM can be used to examine spectral<br />

change over time, past works have shown that HMM needs to be<br />

coupled with external features such as Cepstral or perceptual features<br />

to be efficient [6]. Finally, our method also offers the added<br />

improvement that it is not specific to certain audio files and can<br />

be applied without adjusting the algorithm or thresholds such as in<br />

rule-based models.<br />

Our work on content-based audio classification is presented<br />

as follows. Section 2 presents the application of time-frequency<br />

analysis to feature selection and analysis for audio classification.<br />

In Section 3 we present our classification results for the system and<br />

our conclusions are provided in Section 4.<br />

2. METHODOLOGY<br />

2.1. Short-time Fourier transform (STFT) algorithm<br />

Since speech and audio signals have spectral characteristics that<br />

vary over time, they require a non-stationary signal model such<br />

as the STFT to describe them. Ultimately, we would like to imitate<br />

the capability of the ear and provide simultaneous information<br />

about time and frequency of the music. STFT uses a sliding window<br />

to compute the Fourier transform thereby providing an estimate<br />

of the “local frequency” at a given time. The STFT of a signal<br />

x[n] is given by,<br />

STFT(n, f) = Σ_{m=-∞}^{∞} x[n + m] w[m] e^{-j2πfm},   (1)<br />

where w[m] is the window function and the spectrogram of x is<br />

defined as SPEC(n, f) =|STFT(n, f)| 2 . For a given signal<br />

x, SPEC(n, f)∆n∆f represents the energy in the time interval<br />

[n, n +∆n] in the frequency band [f,f +∆f]. In STFT analysis,<br />

we can improve the frequency resolution by decreasing the<br />

spectral width ∆f at the expense of increasing the temporal width<br />

∆n (poor time resolution). Also the shape of the window w[n]<br />

is important as a window with a sharp cutoff will introduce artificial<br />

discontinuities. Hanning windows are mainly used in audio<br />

classification techniques as they reduce spectral leakage.<br />
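The spectrogram described above can be sketched in a few lines of numpy (our illustration; the paper gives no code). A Hanning window with 50% overlap is applied per frame before the DFT:

```python
import numpy as np

def spectrogram(x, win=1024):
    """SPEC(n, f) = |STFT(n, f)|^2 with a Hanning window and 50% overlap.
    Rows index time frames n, columns frequency bins f."""
    w = np.hanning(win)
    hop = win // 2                    # 50% overlap
    frames = [x[i:i + win] * w
              for i in range(0, len(x) - win + 1, hop)]
    stft = np.fft.rfft(np.asarray(frames), axis=1)
    return np.abs(stft) ** 2

fs = 44100
t = np.arange(fs) / fs                # 1 s test tone at 1 kHz
spec = spectrogram(np.sin(2 * np.pi * 1000 * t))
peak_bin = spec.mean(axis=0).argmax()
print(peak_bin)                       # bin nearest 1 kHz at ~43 Hz resolution
```

With a 1024-sample window at 44.1 kHz, each bin spans roughly 43 Hz, which is the frequency-resolution/time-resolution trade-off discussed above.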

2.2. Audio feature extraction<br />

The set of features extracted are critical as they need to be strong<br />

enough to clearly separate the classes of signals. This procedure<br />

requires perceptual features that model the human auditory system.<br />

Discriminating music from speech is less complex than between<br />

different classes of music. The latter may only require a<br />

small number of features such as zero crossing rate or energy envelope<br />

and since the spectral characteristics are not very similar,<br />

high accuracy rates are achieved.<br />

In this paper, we examine the similarities of 143 audio signals<br />

and classify them under six different genres. Each audio signal<br />

is 5 seconds, mono-channel, 16 bits per sample and sampled at<br />

44.1 kHz. The length of the audio samples was chosen to be 5<br />

seconds in relevance with the human neurological behavior which<br />

(1)<br />

<br />

102<br />

was examined by Perrot et al in [7]. They found that human beings<br />

require at least 3 second excerpts to identify different musical<br />

genres with a 70% accuracy rate while the accuracy decreases to<br />

53% for a 250 ms excerpt.<br />

We start by transforming our audio signal into a spectrogram<br />

with a window size of 1024 samples which corresponds to about<br />

23 ms at 44.1 kHz. This window size is similar to that used in<br />

[4] and [8]. A Hanning window with 50% overlap is used and the<br />

DFT is calculated in each window. The audio features extracted<br />

from the two-dimensional time-frequency distribution (TFD) are<br />

explained below.<br />

2.2.1. Entropy<br />

The entropy of a signal is a measure of its spectral distribution<br />

and portrays the noise-like or tone-like behavior of the signal. The<br />

entropy of a signal in time frame n can be calculated as:<br />

H(n) = - Σ_{f=0}^{Fm} P_f(TFD(n, f)) log2 P_f(TFD(n, f)),   (2)<br />
where<br />
P_f(TFD(n, f)) = TFD(n, f) / Σ_{f=0}^{Fm} TFD(n, f).   (3)<br />

Here, TFD(n, f) represents the energy of the signal at time frame<br />

n and frequency index f (it is equivalent to SPEC(n, f) defined<br />

in Section 2.1). Also, Fm refers to the maximum frequency.<br />
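The entropy feature translates directly into numpy (a sketch of ours; rows of `tfd` are time frames n, columns are frequency bins f):

```python
import numpy as np

def frame_entropy(tfd):
    """Spectral entropy (bits) of each time frame of a TFD."""
    p = tfd / tfd.sum(axis=1, keepdims=True)   # normalize each frame
    p = np.where(p > 0, p, 1.0)                # 0 * log2(0) -> 0
    return -(p * np.log2(p)).sum(axis=1)

flat = np.ones((4, 128))                # equiprobable bins -> max entropy
print(frame_entropy(flat))              # [7. 7. 7. 7.]  (log2 128)
tonal = np.zeros((1, 128))
tonal[0, 5] = 1.0                       # all energy in one bin
print(frame_entropy(tonal))             # [0.]
```

The two extremes mirror the noise-like versus tone-like behavior described above: a flat spectrum reaches the maximum log2 L bits, while a single spectral line has zero entropy.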

Consider the case where there are L number of frequency bins.<br />

Then the maximum entropy in time window n is log 2 L which occurs<br />

if the frequency bins are equiprobable. First, we examined<br />

the entropies of 3 different types of signals. These signals were<br />

analyzed using 128 frequency bins, implying that the maximum<br />

entropy is 7 bits. The first signal consisted of a single sine wave,<br />

at a sampling frequency of 1 kHz. In this case, the mean entropy<br />

was 1.24 bits and the standard deviation 5.636 × 10^-6. Next<br />

we considered the vowel “a” (a signal component with harmonic<br />

structure) and its entropy was calculated to be 2.84 bits with a standard<br />

deviation of 0.1. Finally, we considered white Gaussian noise<br />

and its mean entropy was 6.38 bits with a standard deviation of<br />

0.06. As we expected, the sine wave had the lowest entropy and a<br />

standard deviation of almost zero while white noise had the largest<br />

entropy (approaching maximum) with a larger standard deviation.<br />

From our database of music signals, we found that entropy<br />

was a dominant feature in classifying particularly rock or folk music.<br />

As shown in Figure 1a, rock signals possessed the highest<br />

entropy followed closely by folk music while classical, country,<br />

jazz and pop had low entropies. Figure 1b shows the distribution<br />

of entropy for rock music compared to classical. As can be seen,<br />

the entropy ranges for the two types of signals are quite different.<br />

In order to determine the strength of entropy from a different perspective,<br />

a receiver operating curve (ROC) was plotted. The ROC<br />

curve is a two dimensional measure of classification performance.<br />

The area under this curve measures discrimination, or the ability<br />

of a feature to correctly classify signals. An area of 1.0 represents<br />

a perfect test; where an area of 0.5 or less shows the feature is<br />

not useful in discrimination of that class. Rock, folk, jazz, classical,<br />

country and pop music had ROC areas of 0.933, 0.808, 0.644,<br />

0.337, 0.294, and 0.145 respectively. These results show that although<br />

entropy is a strong feature, further features are required to<br />

improve classification.<br />
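The ROC area for a single feature can be estimated by pairwise comparison of in-class and out-of-class feature values (our sketch; the entropy values below are made up for illustration, not taken from the paper's database):

```python
def roc_area(pos, neg):
    """Probability that a random in-class sample scores above a random
    out-of-class sample; this equals the area under the ROC curve."""
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

rock_entropy = [4.1, 4.3, 3.9, 4.4]          # illustrative values only
other_entropy = [2.1, 2.6, 3.0, 4.0, 1.9]
print(roc_area(rock_entropy, other_entropy))  # 0.95
```

An area near 1.0 means the feature almost always ranks the target class above the rest, matching the interpretation of the 0.933 area reported for rock.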

Authorized licensed use limited to: <strong>Ryerson</strong> <strong>University</strong> Library. Downloaded on July 7, 2009 at 11:09 from IEEE Xplore. Restrictions apply.



Fig. 1. Comparison of entropy values. (a) Results for different genres; (b) distribution for classical and rock.<br />

2.2.2. Energy ratio<br />

The rate of change in the spectral energy over time was measured as the mean ratio of the total energy in a frequency sub-band to that of the previous time window, E[ Σ_{f=f_lower}^{f_upper} TFD(n, f) / Σ_{f=f_lower}^{f_upper} TFD(n-1, f) ]. This was examined in three different sub-bands [0, 5 kHz], [5, 10 kHz], [10 kHz,<br />

Fm]. However, it was found empirically that the energy ratio in<br />

mid and high frequency bands did not improve the classification.<br />

This is probably because most energy activity in audio signals is<br />

in the low frequency band. Therefore, only the mean of energy in<br />

the low-band was used in our feature set.<br />
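A minimal numpy sketch of the low-band energy-ratio feature described above (the bin bound is our own estimate for a 1024-point window at 44.1 kHz, where 5 kHz falls near bin 116):

```python
import numpy as np

def low_band_energy_ratio(tfd, lo_bin=0, hi_bin=116):
    """Mean ratio of low-band energy in frame n to frame n-1."""
    band = tfd[:, lo_bin:hi_bin].sum(axis=1)   # sub-band energy per frame
    return float(np.mean(band[1:] / band[:-1]))

# Three frames whose total energy doubles each step -> mean ratio 2.0
tfd = np.vstack([np.full(512, 1.0), np.full(512, 2.0), np.full(512, 4.0)])
print(low_band_energy_ratio(tfd))  # 2.0
```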

The frequency location with the lowest energy component was<br />

also computed. Although an estimate of the mean can be calculated<br />

from the frequency domain, it was included in our feature set<br />

as it improved the classification rate by 5%. In fact, using the mean<br />

and standard deviation of the location of minimum energy provided<br />

100% classification rates for classifying country, folk and<br />

jazz music but low classification rates for the other three genres.<br />

When examining the histogram of the location of minimum energy<br />

for our database of signals (Figure 2), the frequency spread<br />

was smaller for country (21.4-21.5 kHz), folk (21.45-21.85 kHz),<br />

jazz (21.36-21.51 kHz), and wider for pop (18.1-21.5 kHz), classical (15.5-21.5 kHz) and rock (20-21.6 kHz).<br />

2.2.3. Brightness<br />

The brightness of a signal also referred to as its frequency centroid,<br />

shows the weighted midpoint of the energy distribution in a given<br />

frame. It is defined by:<br />

fi(n) = Σ_{f=0}^{Fm} f · TFD(n, f) / Σ_{f=0}^{Fm} TFD(n, f).   (4)<br />

The brightness feature could also be seen as the instantaneous<br />

mean frequency parameter, a typical non-stationary feature of a<br />

signal. The frequency centroid of the audio signal in the low frequency<br />

range (0-5KHz) is also examined as most of the frequency<br />

content of audio signals is concentrated in low frequency.<br />

In addition, the mean of centroid ratio to previous window is<br />

a useful feature as it measures the spectral change over time. We<br />

found that rock, folk, pop and country music signals had the largest<br />



Fig. 2. Distribution of location of minimum energy<br />

change in centroid frequency over time while classical and jazz<br />

signals had the lowest change. This is expected as classical and<br />

jazz music generally have less activity over time compared to the<br />

other 4 genres.<br />

2.2.4. Bandwidth<br />

Bandwidth is the magnitude-weighted average of the difference<br />

between the signal’s spectral components and centroid. It can be<br />

defined as:<br />

B(n) = sqrt( Σ_{f=0}^{Fm} (f - fi(n))^2 · TFD(n, f) / Σ_{f=0}^{Fm} TFD(n, f) ).   (5)<br />

Effectively, it shows the spectral shape and the spread of energy<br />

relative to the centroid, therefore it is also a non-stationary feature.<br />

For instance, a sine wave without noise has zero bandwidth.<br />
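The brightness and bandwidth features can be written out in numpy (our sketch; `tfd` rows are frames, `f` is the bin index, and we use the conventional squared deviation under the square root):

```python
import numpy as np

def brightness(tfd):
    """Frequency centroid per frame: energy-weighted mean bin."""
    f = np.arange(tfd.shape[1])
    return (f * tfd).sum(axis=1) / tfd.sum(axis=1)

def bandwidth(tfd):
    """Energy spread around the centroid per frame."""
    f = np.arange(tfd.shape[1])
    dev = (f - brightness(tfd)[:, None]) ** 2
    return np.sqrt((dev * tfd).sum(axis=1) / tfd.sum(axis=1))

tone = np.zeros((1, 64))
tone[0, 10] = 1.0                 # pure tone in bin 10
print(brightness(tone))           # [10.]
print(bandwidth(tone))            # [0.]  -- a noiseless sine has zero bandwidth
```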

2.2.5. Silence ratio<br />

Silence ratio is the number of silent time window frames with total<br />

energy less than 0.01. This threshold is set empirically. Note that<br />

this feature could also be extracted from the time domain.<br />

Bandwidth, brightness and silence ratio have been proven to<br />

be effective in previous audio classification papers including [1, 2]<br />

although an STFT approach showing the rate of change to previous<br />

windows has not been used.<br />

3. AUDIO CLASSIFICATION<br />

Using the above analysis, the 10 features extracted for each sample<br />

included mean and standard deviation of centroid frequency, mean<br />

centroid (low-frequency range), mean of centroid ratio to previous<br />

window, mean bandwidth, silence ratio, mean and standard deviation<br />

of the frequency location with the lowest energy, mean and<br />

standard deviation of entropy. Note that mean and variance of a<br />

feature are calculated over the entire time window. Once the features<br />

are extracted for the 143 audio signals, linear discriminant<br />

analysis (LDA) is then applied using SPSS software [9], to predict<br />

group classification of cases. This type of analysis tries to<br />




Fig. 3. All-groups scatter plot with the first two canonical discriminant<br />

functions<br />

find a linear combination of those extracted features that best separate<br />

the group of cases. To represent this linear combination, a<br />

discrimination function is formed using the extracted features as<br />

discrimination variables and can be expressed as:<br />


L = b1x1 + b2x2 + ... + b10x10 + c,   (6)<br />

where b1..b10 are the coefficients, c is a constant and x1..x10 are<br />

the values of the extracted features. This technique finds the first<br />

function that separates the groups as much as possible and then<br />

finds further functions that improve the separation and are uncorrelated<br />

to previous ones. The number of functions is determined<br />

by the number of predictors or features and the number of groups<br />

available.<br />

Using Fisher’s coefficients and prior probabilities of each group,<br />

a scatterplot (Figure 3) is created showing the discriminant scores<br />

of the cases on two discriminant functions. This plot shows the<br />

separation between different cases. Songs are categorized into six<br />

groups (rock, classical, country, folk, jazz and pop) and the confusion<br />

matrix depicted in Table 1 shows the classification performance.<br />

Using the original LDA, 93.0% of all original grouped<br />

cases are correctly classified with folk music having the lowest<br />

rate. A more accurate estimate is obtained through the cross-validated method, where a portion of the cases belongs to the learning sample and the remaining cases belong to the test sample. In the leave-one-out<br />

method used, each case is classified by the functions derived<br />

by all cases except that one. This method yields a 92.3%<br />

classification rate revealing the discrimination strength of our feature<br />

set.<br />
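The leave-one-out procedure described above can be sketched as follows; a nearest-centroid rule stands in for the LDA classifier, and the synthetic two-class data is an assumption for illustration.<br />

```python
import numpy as np

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (20, 5)), rng.normal(3, 1, (20, 5))])
y = np.array([0] * 20 + [1] * 20)

correct = 0
for i in range(len(X)):
    # Hold out case i; train on everything else (leave-one-out)
    mask = np.ones(len(X), dtype=bool)
    mask[i] = False
    Xtr, ytr = X[mask], y[mask]
    # Nearest-centroid stand-in for the discriminant classifier
    centroids = np.array([Xtr[ytr == k].mean(axis=0) for k in (0, 1)])
    pred = np.argmin(np.linalg.norm(centroids - X[i], axis=1))
    correct += int(pred == y[i])

loo_rate = correct / len(X)
print(f"leave-one-out rate: {loo_rate:.3f}")
```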

4. CONCLUSIONS<br />

In this paper, we examined a technique where features used to classify<br />

music signals are derived directly from the time-frequency domain.<br />

Using six different genres for classification, we have shown<br />

that high accuracy rates can be obtained using features that reflect<br />
the non-stationary properties of audio signals and depict<br />
their spectral, energy and entropy changes over time. In addition<br />

to the success rate, our algorithms have low computational complexity<br />

compared to other techniques and they offer versatility as<br />


Method Type RO CL CO FO JA PO CA%<br />

1. Original RO 14 0 0 2 0 0 87.5<br />

CL 0 30 0 0 0 1 96.8<br />

CO 0 0 15 0 0 1 93.8<br />

FO 2 0 1 27 1 1 84.4<br />

JA 0 0 0 1 15 0 93.8<br />

PO 0 0 0 0 0 32 100<br />

Overall 93.0<br />

2. Cross- RO 14 0 0 2 0 0 87.5<br />

Validated CL 0 30 0 0 0 1 96.8<br />

CO 0 0 15 0 0 1 93.8<br />

FO 2 0 1 26 1 2 81.3<br />

JA 0 0 0 1 15 0 93.8<br />

PO 0 0 0 0 0 32 100<br />

Overall 92.3<br />

Table 1. Classification results. Method: Original - linear discriminant<br />
analysis; Cross-validated - linear discriminant analysis with the<br />
leave-one-out method (RO-Rock, CL-Classical, CO-Country, FO-Folk, JA-Jazz,<br />
PO-Pop, CA% - classification accuracy rate)<br />

they can be applied to any audio signal without alteration. Further<br />

work will include optimization of window size in the TF domain<br />

as well as examining other classification methods such as minimum<br />

classification error (MCE) to improve classification rate for<br />

a larger database of signals.<br />

5. REFERENCES<br />

[1] E. Wold, T. Blum, D. Keislar, and J. Wheaton, “Content-based<br />

classification, search, and retrieval of audio,” IEEE Multimedia,<br />

pp. 27–36, 1996.<br />

[2] Z. Liu, J. Huang, Y. Wang, and T. Chen, “Audio feature extraction<br />

and analysis for scene classification,” in IEEE Workshop<br />

on Multimedia <strong>Signal</strong> Processing, June 1997, pp. 343–<br />

348.<br />

[3] G. Lu and T. Hankinson, “A technique towards automatic audio<br />

classification and retrieval,” in Fourth International Conference<br />

on <strong>Signal</strong> Processing, Beijing, China, October 1998,<br />

pp. 1142–1145.<br />

[4] L. Lu, H. Zhang, and S. Li, “Content-based audio classification<br />

and segmentation by using support vector machines,”<br />

ACM Multimedia Systems Journal, vol. 8, no. 6, pp. 482–<br />

492, March 2003.<br />

[5] J. Foote, “Content-based retrieval of music and audio,” in<br />

Multimedia Storage and Archiving Systems II, Proc. of SPIE,<br />

1997, pp. 138–147.<br />

[6] T. Zhang and C. Kuo, “Hierarchical classification of audio<br />

data for archiving and retrieving,” in Proc. ICASSP, March<br />

1999, pp. 3001–3004.<br />

[7] D. Perrot and R. O. Gjerdingen, “Scanning the dial: An exploration<br />

of factors in the identification of musical style,” Proceedings<br />

of the 1999 Society for Music Perception and Cognition,<br />

p. 88, 1999.<br />

[8] G. Tzanetakis and P. Cook, “Music genre classification of<br />

audio signals,” IEEE Transactions on Speech and Audio Processing,<br />

vol. 10, no. 5, pp. 293–302, July 2002.<br />

[9] SPSS Inc., “SPSS advanced statistics user’s guide,” in User<br />

manual, SPSS Inc., Chicago, IL, 1990.<br />

Authorized licensed use limited to: <strong>Ryerson</strong> <strong>University</strong> Library. Downloaded on July 7, 2009 at 11:09 from IEEE Xplore. Restrictions apply.


MODIFIED LOCAL DISCRIMINANT BASES AND ITS APPLICATIONS IN SIGNAL<br />

CLASSIFICATION<br />

Karthikeyan Umapathy<br />

Dept. of Electrical and Computer Engg.,<br />

The <strong>University</strong> of Western Ontario,<br />

London, ON N6A 5B8, Canada<br />

ABSTRACT<br />

One of the major challenges in classification problems based<br />

on the signal decomposition approach is to identify the right basis<br />

function and its derivatives that can provide optimal features to<br />

distinguish the classes. With the vast amount of available libraries<br />

of orthonormal bases, it is hard to select an optimal set of basis<br />

functions for a specific dataset. To address this problem, pruning<br />

algorithms based on certain selection criteria is needed. Local<br />

Discriminant Bases (LDB) algorithm is one such algorithm, which<br />

efficiently selects a set of significant basis functions from the library<br />

of orthonormal bases based on certain defined dissimilarity<br />

measure. The selection of this dissimilarity measure is critical as<br />

they indirectly contribute to the performance accuracy of the LDB<br />

algorithm. In this paper, we study the impact of the dissimilarity<br />

measures on the performance of the LDB algorithm with two classification<br />

examples. The two biomedical signal databases used are<br />

1. Vibroarthographic signals (VAG) - 89 signals with 51 normal<br />

and 38 abnormal, and 2. Pathological speech signals - 100 signals<br />

with 50 normal and 50 pathological. Classification accuracies<br />

of 76.4% with VAG database and 96% with pathological speech<br />

databases were obtained. This modified method of signal analysis<br />

using LDB has shown its powerfulness in analyzing non-stationary<br />

signals.<br />

1. INTRODUCTION<br />

The Local Discriminant Bases (LDB) [1] algorithm has recently been<br />
used successfully in many classification problems. The optimal<br />
choice of LDBs for a given dataset is driven by the nature of<br />
the dataset and the dissimilarity measures [2] used to distinguish<br />
between classes. The choice of the dissimilarity measure for a particular<br />
dataset depends on knowledge of the data, computational<br />
complexity, and the classification accuracy requirements. For example,<br />
probabilistic dissimilarity measures such as relative entropy<br />
need prior knowledge of the dataset distribution, whose accuracy<br />
depends on the size of the data; on the other hand, simple dissimilarity<br />
measures such as the Euclidean distance are suitable only for numeric<br />
data sets. A combination of multiple dissimilarity measures with<br />
varying complexity can be used to achieve high classification accuracies.<br />

In this paper we analyze two biomedical signal databases using<br />
the LDB algorithm with three different dissimilarity measures. The<br />
LDB algorithm is based on wavelet packet decompositions<br />

with three different wavelets, namely Daubechies (db4), Coiflet (cf4)<br />

Thanks to NSERC for funding this research work.<br />

Sridhar Krishnan<br />

Dept. of Electrical and Computer Engg.,<br />

<strong>Ryerson</strong> <strong>University</strong>,<br />

Toronto, ON M5B 2K3, Canada<br />

and Symlet (sy4) [3]. This gives us 9 different combinations for<br />

each of the databases. A two group (class1 and class2) classification<br />

was performed for the 9 combinations. A linear discriminant<br />
analysis (LDA) based classifier was used to compute the classification<br />

accuracies. The classification accuracies were verified<br />

using the leave-one-out method [4]. The paper is organized as<br />

follows: In Section 2 on Methodology, Local Discriminant Bases<br />

algorithm, dissimilarity measures, feature extraction and pattern<br />

classification are covered. Results and discussions are covered in<br />

Section 3, and Conclusions in Section 4.<br />

2. METHODOLOGY<br />

2.1. Local Discriminant Bases Algorithm<br />

In the LDB [1] algorithm with wavelet packet bases, a set of training<br />
signals x_i^c for all C classes is decomposed into a full tree<br />
structure of order N. We restrict our analysis to binary wavelet<br />
packet trees. Let Ω0,0 be a vector space in R^n corresponding to<br />
node 0 of the parent tree. Then at each level the vector space<br />
is split into two mutually orthogonal subspaces, given by Ωj,k =<br />
Ωj+1,2k ⊕ Ωj+1,2k+1, where j indicates the level of the tree and k<br />
represents the node index in level j, given by k = 0, ..., 2^j − 1.<br />

This process repeats until level J, giving rise to 2^J mutually<br />
orthogonal subspaces. Our goal is to select the set of subspaces<br />
that provides maximum discriminant information between<br />
the classes of signals. Each node k contains a set of basis vectors<br />

Bj,k = [wj,k,l], l = 0, ..., 2^(n0−j) − 1, where 2^n0 corresponds to the length of<br />
the signal. Then the signal xi can be represented by a set of coefficients<br />
c as:<br />

xi = Σj,k,l cj,k,l wj,k,l, (1)<br />

Basically the signal xi is decomposed into 2 J subspaces with<br />

cj,k,l coefficients in each subspace. With the training signals decomposed<br />

into wavelet packet coefficients we need to define a dissimilarity<br />

measure (Dn) in the vector space so as to identify those<br />

subspaces, which have larger statistical distance between classes.<br />

This dissimilarity measure is used in an iterative manner to prune<br />

the tree in such a way that only a node is split if the cumulative discriminative<br />

measure of the children nodes is greater than the parent<br />

node. The resulting tree contains the most significant LDBs,<br />

from which a set of K significant LDBs are selected to construct<br />

the final tree. The testing set signals are then expanded using this<br />

tree and features are extracted from the respective basis vectors for<br />

classification.<br />
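A minimal sketch of the wavelet packet machinery described above, using a hand-rolled Haar analysis step in place of the paper's db4/cf4/sy4 filters; the toy signals, their frequencies, and the depth J = 3 are illustrative assumptions.<br />

```python
import numpy as np

def haar_step(x):
    # One orthonormal Haar analysis step: approximation and detail halves
    a = (x[0::2] + x[1::2]) / np.sqrt(2)
    d = (x[0::2] - x[1::2]) / np.sqrt(2)
    return a, d

def wp_nodes(x, J):
    """Full binary wavelet packet tree: {(j, k): coefficients}."""
    nodes = {(0, 0): np.asarray(x, dtype=float)}
    for j in range(J):
        for k in range(2 ** j):
            a, d = haar_step(nodes[(j, k)])
            nodes[(j + 1, 2 * k)] = a        # subspace Ω(j+1, 2k)
            nodes[(j + 1, 2 * k + 1)] = d    # subspace Ω(j+1, 2k+1)
    return nodes

# A toy two-class training pair: low-frequency vs high-frequency signals
n = 64
t = np.arange(n)
class1 = np.sin(2 * np.pi * 2 * t / n)     # slow oscillation
class2 = np.sin(2 * np.pi * 14 * t / n)    # fast oscillation

J = 3
n1 = wp_nodes(class1, J)
n2 = wp_nodes(class2, J)

# Rank the 2**J leaf subspaces by an energy-difference dissimilarity
diff = {k: abs(np.sum(n1[(J, k)] ** 2) - np.sum(n2[(J, k)] ** 2))
        for k in range(2 ** J)}
best = max(diff, key=diff.get)
print("most discriminative leaf node:", (J, best))
```

Because each Haar step is orthonormal, the leaf energies sum to the signal energy, so the ranking compares complete energy maps of the two classes.<br />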



In the proposed method we use a similar approach with some<br />

modification. Instead of the selective splitting of the nodes, which<br />

basically helps in removing the redundancy in the LDB selection,<br />

we used all the nodes from the full decomposition tree and ranked<br />

them in decreasing order of their dissimilarity measure values between<br />

classes. The first 5 nodes that exhibit high dissimilarity<br />

measure values between the classes are selected for each trial.<br />

Among these nodes, based on the frequency of occurrence in all<br />

the trials, the 5 most frequently occurring significant LDBs are selected. The<br />

redundancy within these 5 LDBs is later removed in the feature<br />

evaluation process in the LDA classifier. This is basically done<br />

to reduce the computational complexity of the LDB algorithm implementation.<br />

The whole process is repeated for three different<br />

wavelets (db4, cf4 and sy4), and the wavelet that provides the maximum<br />
dissimilarity measures among all the tested wavelets is chosen<br />
as the best basis for expansion.<br />

2.2. Databases<br />

2.2.1. Vibroarthrographic (VAG) signals<br />

These are the vibration signals emitted from the human knee joints<br />

during an active movement of the leg. The VAG signals can be<br />

used to detect the early joint degeneration or knee defects that<br />

are reflected in knee movements. Extensive work [5] has been<br />

done using time-frequency approach in classifying these signals<br />

into multiple groups. Few important characteristics of the VAG<br />

signals which make them difficult to analyze are as follows: (i)<br />

Highly non-stationary in nature, (ii) Varying frequency dynamics,<br />

and (iii) Multi-component signal. The database consists of 89 signals<br />

with 51 normal and 38 abnormal signals. A normal and an<br />

abnormal VAG signal are shown in Fig. 1a.<br />

2.2.2. Pathological speech signals<br />

These are speech signals recorded from the pathological and normal<br />

talkers in a sound-proof booth at the Massachusetts Eye and<br />

Ear Infirmary. The normal talkers exhibited no abnormal vocal<br />

characteristics and indicated no history of voice disorders. All signals<br />

were sampled at 25 kHz. The signals were the first sentence<br />

of the rainbow passage, ’when the sunlight strikes rain drops in<br />

the air, they act like a prism and form a rainbow’, as spoken by<br />

the subjects. More details about the database and the classification<br />

problem can be found in the authors' previous work [6]. The database<br />

consists of 100 signals with 50 normal and 50 abnormal signals. A<br />

normal and pathological speech signal are shown in Fig. 1b.<br />

Fig. 1. An example of normal and abnormal/pathological signals for both the databases: (a) normal and abnormal VAG signals; (b) normal and pathological speech signals (amplitude in arbitrary units versus time samples).<br />


2.3. Dissimilarity measures<br />

In this study we used three different dissimilarity measures and<br />

performed a two group (class1 and class2) classification on the<br />

databases. In general, most biomedical signals can be characterized<br />
by one or more of the following: (i) their average energy<br />
distribution pattern over frequency bands, (ii) event-based temporal<br />
structures, (iii) periodicity, and (iv) the amount of randomness.<br />
These rationales were used in arriving at the following dissimilarity<br />
measures.<br />

The first dissimilarity measure D1 is the difference in the normalized<br />

energy between the corresponding nodes of the training<br />

signals from class1 and class2. This gives the difference in the<br />

energy distribution of the signals on the time-frequency plane.<br />

D1 = E¹j,k − E²j,k, (2)<br />

where E¹j,k and E²j,k are the normalized energies of the corresponding<br />
nodes for class1 and class2 signals.<br />

The second dissimilarity measure D2 is the correlation index<br />

between the basis vectors at corresponding nodes. This measure<br />

emphasizes those nodes that can detect the difference in the temporal<br />

characteristics of the signals between class1 and class2.<br />

D2 = ⟨Bj,k, Fj,k⟩, (3)<br />

where B and F are the corresponding basis vectors of class1 and<br />
class2 at node (j, k).<br />

The dissimilarity measure D3 estimates the randomness<br />
or non-stationarity of the basis vectors. It is computed<br />
from the set of variances along the segments of the basis vector coefficients.<br />
The ratio of this variance measure between the signals<br />
from class1 and class2 indicates the amount of deviation observed<br />
in the non-stationarity between the classes:<br />

D3 = var(var(p)j,k) / var(var(q)j,k), (4)<br />

where p and q index the L segments obtained by segmenting<br />
the basis vectors at node (j, k) for class1 and class2, respectively.<br />
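The three measures can be sketched directly on a node's coefficient vectors; the random stand-in vectors, their lengths, and the L = 8 segmentation below are illustrative assumptions.<br />

```python
import numpy as np

rng = np.random.default_rng(2)

# Stand-ins for the basis/coefficient vectors of one node (j, k)
B = rng.normal(0, 1.0, 64)   # class1 node vector
F = rng.normal(0, 2.0, 64)   # class2 node vector (larger energy/variance)

# D1: difference in normalized node energy (Eq. 2)
E1 = np.sum(B ** 2) / len(B)
E2 = np.sum(F ** 2) / len(F)
D1 = E1 - E2

# D2: correlation index (inner product) between the node vectors (Eq. 3)
D2 = float(B @ F)

# D3: ratio of variances of segment-wise variances (Eq. 4), L segments
L = 8

def segvar(v):
    # Variance of each of the L equal-length segments of v
    return np.var(v.reshape(L, -1), axis=1)

D3 = np.var(segvar(B)) / np.var(segvar(F))
print(D1, D2, D3)
```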

2.4. Feature extraction<br />

Once the LDB nodes for each of the three dissimilarity measures<br />

are identified using the training sets (in our study 10 randomly selected<br />

signals for each class were used to form the training set) as<br />

explained in Section 2.1, all the 89 VAG signals and the 100 pathological<br />

speech signals were decomposed using the corresponding<br />

sets of LDB tree structures. Figs. 2 and 3 show the sample<br />

LDB tree structure obtained for the VAG and pathological speech<br />

databases respectively.<br />

The basis vectors from each of the nodes (LDBs) can be directly<br />

used as feature vectors; however, considering the dimension<br />

of the basis vectors, we extract the same features from the basis<br />

vectors of LDBs using the dissimilarity measures (D1, D2, and<br />

D3) [1]. That is, from each of the LDB nodes of the corresponding<br />

tree structures, the normalized node energy, correlation index<br />

and the variance measure were calculated. In short, each signal in<br />

the database is used to compute 15 features, 5 from each dissimilarity<br />

measure. For the correlation index calculation, we use a<br />
randomly chosen normal signal as a template to correlate with the<br />

signals from respective test databases. The above procedure was<br />



Tree decomposition nodes: (0,0); (1,0), (1,1); (2,0), (2,1), (2,2), (2,3); (3,0), (3,1), (3,6), (3,7); (4,12), (4,13); (5,24), (5,25).<br />

Fig. 2. A sample LDB tree decomposition for VAG database (db4<br />

wavelet and D3 dissimilarity measure)<br />

Tree decomposition nodes: (0,0); (1,0), (1,1); (2,0), (2,1); (3,0), (3,1); (4,0), (4,1), (4,2), (4,3); (5,2), (5,3), (5,6), (5,7).<br />

Fig. 3. A sample LDB tree decomposition for pathological speech<br />

database (cf4 wavelet and D3 dissimilarity measure)<br />

repeated for all three wavelets. So, in total, for each wavelet, a<br />
set of 15 features was extracted from each of the signals in<br />
the test database.<br />

Figs. 4 and 5 demonstrate the feature space with the first two<br />

dominant features of the VAG and pathological speech database<br />

respectively. From the figures of the feature space plots, the discriminatory<br />

boundaries can be visualized between the signals of<br />

class1 and class2. These extracted features were then fed to a linear<br />

discriminant based classifier, as explained in the next section.<br />

2.5. Pattern Classification<br />

The motivation for the pattern classification is to automatically<br />

group signals of same characteristics using the discriminatory features<br />

derived as explained in the previous section. Pattern classification<br />

was carried out by linear discriminant analysis (LDA) technique<br />

using the SPSS software [7]. In discriminant analysis, the<br />

feature vector derived as explained above were transformed into<br />

canonical discriminant functions such as<br />

f = x1b1 + x2b2 + ... + x42b42 + a, (5)<br />

where {x} is the set of features, and {b} and a are the coefficients and constant, respectively, estimated and derived using Fisher's linear discriminant functions [7].<br />

Fig. 4. Feature space with the first two dominant features - VAG database, db4 wavelet.<br />

Fig. 5. Feature space with the first two dominant features - Pathological speech database, cf4 wavelet.<br />

Using the chi-square distances<br />

and the prior probabilities of each group, classification<br />
is performed to assign each sample to one of the groups.<br />
The classification accuracy was estimated using the leave-one-out<br />
method, which is known to provide a least-biased estimate [4]. In<br />
the leave-one-out method, one sample is excluded from the dataset<br />
and the classifier is trained with the remaining samples. The<br />
excluded signal is then used as the test data and the classification accuracy<br />
is determined. This is repeated for all samples of the dataset.<br />
Since each signal is excluded from the training set in turn, the independence<br />
between the test and training sets is maintained.<br />

3. RESULTS AND DISCUSSIONS<br />

All the signals from both the databases were decomposed using<br />

their corresponding LDB tree structures. Features were extracted<br />

as explained in Section 2.4 and fed to the LDA based classifier.<br />

Classification accuracies were computed for the 9 combinations<br />

of the wavelet and the dissimilarity measures as shown in Table<br />



Wavelet LDA type D1 D2 D3<br />

db4 Regular 65 64 67<br />

Cross.V 61 57 64<br />

cf4 Regular 70 61 61<br />

Cross.V 65 57 48<br />

sy4 Regular 67 63 57<br />

Cross.V 61 60 45<br />

Table 1. Classification table for VAG database. Regular - Normal<br />

LDA, Cross.V - Leave-one-out method LDA, Classification<br />

accuracies are in percentage (%)<br />

Wavelet LDA type D1 D2 D3<br />

db4 Regular 84 64 77<br />

Cross.V 84 60 72<br />

cf4 Regular 85 52 92<br />

Cross.V 84 37 91<br />

sy4 Regular 87 53 86<br />

Cross.V 84 32 84<br />

Table 2. Classification table for pathological speech database.<br />

Regular - Normal LDA, Cross.V - Leave-one-out method LDA,<br />

Classification accuracies are in percentage (%)<br />

1 and Table 2 for both the databases. It can be observed from<br />
Table 1 that, although there are small variations, on average<br />
all three dissimilarity measures perform equally well for the<br />
VAG database. However, from Table 2 for the pathological speech<br />
database, it can be seen that the dissimilarity measures D1 and<br />
D3 provide high classification accuracies, whereas D2 performs<br />
poorly. Overall, for the VAG database, we observe that the db4 wavelet<br />
in combination with all three dissimilarity measures provides<br />
the highest classification accuracy. Similarly, for the pathological<br />
speech database, we observe that the cf4 wavelet in combination with D1 and<br />
D3 provides the highest classification accuracy. Using these combinations,<br />
we computed the highest possible classification accuracies<br />
for both the databases, as shown in Table 3 and Table 4.<br />

For the VAG database, an overall classification accuracy of<br />
78.7% using regular LDA and 76.4% using the leave-one-out method<br />
was achieved. This is higher than the classification accuracy<br />
reported in [5]. For the pathological speech database, an overall classification<br />
accuracy of 97% using regular LDA and 96% using the leave-one-out<br />
method was achieved. This is higher than the<br />
classification accuracy reported in [6]. The above results demonstrate the<br />
performance optimization of the LDB algorithm using the right<br />
choice and combination of the dissimilarity measures to achieve<br />
high classification accuracies for non-stationary signal analysis.<br />

4. CONCLUSIONS<br />

The importance of the dissimilarity measure in the performance<br />

optimization of the LDB algorithm was discussed with two classification<br />

examples. Classification accuracies were analyzed for<br />

different combinations of wavelets and the dissimilarity measures.<br />

Improvement in the classification accuracies by using a combination<br />

of multiple dissimilarity measures was demonstrated. High<br />

classification accuracies were achieved for the databases under<br />

study, thus proving the success of the modified LDB in analyzing<br />
non-stationary signals. Future work involves automating<br />
the choice of dissimilarity measures based on the nature of the<br />
databases and applications.<br />

Method Groups Normal Abnormal Total<br />
Regular Normal 39 12 51<br />
Abnormal 7 31 38<br />
% Normal 76.5 23.5 100<br />
Abnormal 18.4 81.6 100<br />
Cross.V Normal 39 12 51<br />
Abnormal 9 29 38<br />
% Normal 76.5 23.5 100<br />
Abnormal 23.7 76.3 100<br />
Table 3. Highest classification accuracy achieved for the VAG database (db4 wavelet and selective combination of D1, D2 and D3). Regular - Normal LDA, Cross.V - Leave-one-out method LDA, % = percentage of classification<br />

Method Groups Normal Pathological Total<br />
Regular Normal 48 2 50<br />
Pathological 1 49 50<br />
% Normal 96 4 100<br />
Pathological 2 98 100<br />
Cross.V Normal 48 2 50<br />
Pathological 2 48 50<br />
% Normal 96 4 100<br />
Pathological 4 96 100<br />
Table 4. Highest classification accuracy achieved for the pathological speech database (cf4 wavelet and combined D1 and D3). Regular - Normal LDA, Cross.V - Leave-one-out method LDA, % = percentage of classification<br />

5. REFERENCES<br />

[1] N. Saito and R. R. Coifman, “Local discriminant bases and<br />

their applications,” Journal of Mathematical Imaging and Vision,<br />

vol. 5, no. 4, pp. 337–358, 1995.<br />

[2] Andrew Webb, Statistical Pattern Recognition, Wiley, West<br />

Sussex, England, 2002.<br />

[3] Stephane Mallat, A wavelet tour of signal processing, Academic<br />

press, San Diego, CA, 1998.<br />

[4] K. Fukunaga, Introduction to Statistical Pattern Recognition,<br />

Academic Press, Inc., San Diego, CA, 1990.<br />

[5] S. Krishnan, “Adaptive signal processing techniques for analysis<br />

of knee joint vibroarthrographic signals,” in Ph.D dissertation,<br />

<strong>University</strong> of Calgary, June 1999.<br />

[6] K. Umapathy, S. Krishnan, V. Parsa, and D. Jamieson, “Discrimination<br />

of pathological voices using an adaptive timefrequency<br />

approach,” in Proceedings of ICASSP 2002 IEEE<br />

International conference on Acoustics, Speech and <strong>Signal</strong><br />

Processing, Orlando, USA, May 2002, pp. IV 3853–3855.<br />

[7] SPSS Inc., “SPSS Advanced statistics user’s guide,” in User<br />

manual, SPSS Inc., Chicago, IL, 1990.<br />



RADIO OVER MULTIMODE FIBER FOR WIRELESS ACCESS<br />

Roland Yuen Xavier N. Fernando Sridhar Krishnan<br />

<strong>Ryerson</strong> <strong>University</strong>, Toronto, Ontario, Canada<br />

ryuen@ee.ryerson.ca, xavier@ieee.org, krishnan@ee.ryerson.ca<br />

Abstract<br />

A radio over fiber link is a promising technology for<br />
antenna remoting applications. Typically, the radio<br />
over fiber link employs a single mode fiber, but the<br />
signal power at the remote antenna is very small. The<br />
main reason is the large power loss in the E/O and O/E<br />
converters. However, the coupling efficiency of an E/O converter<br />
can be improved with multimode fiber (MMF),<br />
so we propose a ROF link that uses a vertical-cavity<br />
surface-emitting laser with a graded-index MMF to<br />
transport optical signals. A multimode fiber has a larger<br />
core radius than an SMF, which<br />
allows more optical power to be coupled into the fiber. With<br />
simple butt-coupling techniques, the coupling efficiency<br />
can be 90%, and this simplicity reduces the cost of<br />
the link. Normally, the MMF is used in short-distance<br />
digital applications with a bandwidth-distance product<br />
of about 500 MHz·km, so it is suited to local area picocells.<br />
Our approach is to transmit passband signals<br />
such as QPSK and FSK through the ROF link. Our<br />
simulation shows that a 900 MHz carrier can be transported<br />
through a link 1.22 km long. In this paper, we investigate<br />
the feasibility of using an MMF for antenna<br />
remoting in local area picocells and examine the trade-off<br />
between coupling efficiency and bandwidth.<br />

Keywords: Radio over fiber; multimode fiber; remote<br />
antenna; coupling efficiency.<br />

1. INTRODUCTION<br />

A radio over fiber (ROF) link is used in remote antenna<br />
applications to distribute signals for microcell<br />
or picocell base stations (BSs). In the remote antenna<br />
application, the downlink RF signals are distributed<br />
from a central base station (CBS) to many BSs, known<br />
as radio access points (RAPs), through fibers. The uplink<br />
signals received at the RAPs are sent back to the<br />
CBS for any signal processing. A RAP is much more<br />
cost effective to deploy than a normal BS because it<br />
mostly consists of simple devices: an<br />
E/O converter, an O/E converter, and an amplifier. The<br />
cost of signal processing in a CBS is shared among<br />

CCECE 2004 - CCGEI 2004, Niagara Falls, May/mai 2004<br />

0-7803-8253-6/04/$17.00 @ 2004 IEEE<br />


many RAPs. In addition to the lower cost advantage, a<br />
smaller cell size coverage reduces the near-far effect and<br />
relaxes the battery requirement on mobile receivers.<br />

Although the fiber is a reliable medium with low<br />
attenuation (0.5 dB/km at 1550 nm), a challenge still<br />
exists in the large loss due to E/O and O/E conversion [1].<br />
In this paper, we propose to employ multimode fiber<br />
(MMF) to increase the coupling efficiency, which reduces<br />
the E/O conversion loss. However, MMF has<br />
limited bandwidth, largely due to modal dispersion.<br />

In this paper, we discuss two topics: the<br />
downlink architecture of the ROF link, and the tradeoff<br />
between power and bandwidth in the remote antenna<br />
application.<br />


2. THE RADIO OVER FIBER LINK<br />

Figure 1: Radio over fiber link in remote antenna application<br />

The radio over fiber (ROF) links in the remote antenna<br />
application are illustrated in Figure 1. The central base<br />
station (CBS) and the radio access points (RAPs) are<br />
connected through two fibers, which transport the uplink<br />
and downlink signals. The RAPs act as remote<br />
antennas that receive and transmit signals to mobile<br />
users, whereas the CBS collects signals from the RAPs<br />
for processing and distributes signals to all the RAPs.<br />

The downlink of the ROF can be divided into an op-<br />

tical channel and a wireless channel denoted by ROF<br />

and Air respectively in Figure 2. When a signal s(t)<br />

goes through the optical channel, it is attenuated by<br />



a loss of Lopt. After the optical channel, the signal<br />
is boosted by a gain Gel, and later in the wireless channel it<br />
is further attenuated by a path loss Lel. Noise nopt(t)<br />
is added to the signal in the optical channel, and noise<br />
nel(t) is added in the wireless channel. Finally, the quality<br />
of the received signal r(t) is determined from the<br />
signal-to-noise ratio (SNR) at the mobile user.<br />

Figure 2: Downlink block diagram of radio over fiber<br />

link<br />

2.1 Optical Channel<br />

The optical channel of the ROF link that uses a<br />
multimode fiber (MMF) is illustrated in Figure 3. It<br />
consists of an optical source, a fiber, and a photodetector.<br />


Figure 3: The optical channel<br />

The signal s(t) from the CBS can be in any form,<br />
such as QPSK, 16-PSK, or FSK. In mobile communication,<br />
the signal usually has a bandwidth of less than 2 MHz.<br />
The signal directly modulates a laser, which is biased to<br />
minimize nonlinearity and clipping distortion. The biased<br />
signal is given as<br />

sbias(t) = [1 + m s(t)] (1)<br />

where m is the optical modulation index.<br />

The impulse response of a MMF can be generalized<br />
as a Gaussian response [2] with respect to optical power,<br />
given as<br />

hmmf(t) = (1 / (σ √(2π))) exp( −(t − T)² / (2σ²) ) (2)<br />

where T is the delay of the channel and σ is the standard<br />
deviation of the impulse response, which increases linearly<br />
with the link distance: the longer the link, the more<br />
apparent the modal dispersion effect.<br />
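The Gaussian model of Eq. (2) can be exercised numerically. The sketch below is illustrative only (the helper names and the 0.5 ns/km default are our assumptions, the default taken from the value used in Section 4): it evaluates the magnitude of the MMF frequency response implied by a Gaussian impulse response whose standard deviation grows linearly with distance, and the resulting 3-dB bandwidth.

```python
import math

def mmf_impulse_std(dist_km, sigma_ns_per_km=0.5):
    """Standard deviation of the Gaussian impulse response in seconds;
    it grows linearly with the link distance."""
    return sigma_ns_per_km * 1e-9 * dist_km

def mmf_response(f_hz, dist_km, sigma_ns_per_km=0.5):
    """|H(f)| for a unit-area Gaussian impulse response:
    H(f) = exp(-2 * (pi * sigma * f)^2)."""
    sigma = mmf_impulse_std(dist_km, sigma_ns_per_km)
    return math.exp(-2.0 * (math.pi * sigma * f_hz) ** 2)

def mmf_3db_bandwidth_hz(dist_km, sigma_ns_per_km=0.5):
    """Frequency at which the optical-power response falls to 1/2 (3 dB)."""
    sigma = mmf_impulse_std(dist_km, sigma_ns_per_km)
    return math.sqrt(math.log(2.0) / 2.0) / (math.pi * sigma)
```

Because σ scales linearly with distance, the 3-dB bandwidth halves when the link length doubles, which is the bandwidth-distance tradeoff discussed in Section 3.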


Besides the distortion from modal dispersion, there<br />
is noise in the optical channel, combined here into a<br />
single term nopt(t). The output photocurrent is<br />
given as<br />

i(t) = (ℜ Ps / Lopt) [1 + m s(t)] * hmmf(t) + nopt(t) (3)<br />

where Ps is the average optical power emitted from the<br />

laser diode, Lopt is the loss in the optical channel, and<br />

hmmf(t) is the impulse response of a MMF.<br />

The Lopt includes the losses from the E/O and O/E<br />
conversion, the fiber attenuation, the connectors, and the<br />
matching of the transmitter and the receiver. In [1],<br />
the electrical loss in dB is given as<br />

Lopt,dB = −20 log(ℜ Gm) + 10 log(Zin / Zout) + 2(2 lc + α d) (4)<br />

and in linear form,<br />

Lopt = 10^(Lopt,dB / 10), (5)<br />

where ℜ is the responsivity of the photodetector in<br />
mA/mW, Gm is the modulation gain of the optical<br />
source in mW/mA, Zin is the impedance of the laser,<br />
Zout is the impedance of the optical receiver, lc is the<br />
optical connector loss, α is the attenuation per km of the<br />
fiber, and d is the distance of the link in km. In the above<br />
expression, the E/O and O/E conversion loss, connector<br />
loss, and fiber attenuation are doubled because they are<br />
losses in the optical domain. The modulation gain<br />
Gm is the coupling efficiency that accounts for the Fresnel<br />
loss and the misalignment loss; it reflects the quality<br />
of the coupling technique.<br />
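Equation (4) and its linear form can be sketched directly in code. This is an illustrative helper (the function and parameter names are ours, not the paper's), following the convention above that optical-domain losses count twice in the electrical domain:

```python
import math

def optical_link_loss_db(resp_ma_per_mw, gm_mw_per_ma, z_in, z_out,
                         conn_loss_db, atten_db_per_km, dist_km):
    """Electrical loss of the optical channel, Eq. (4).  The doubled
    terms are optical-domain losses, counted twice after square-law
    photodetection."""
    conversion = -20.0 * math.log10(resp_ma_per_mw * gm_mw_per_ma)
    matching = 10.0 * math.log10(z_in / z_out)
    optical = 2.0 * (2.0 * conn_loss_db + atten_db_per_km * dist_km)
    return conversion + matching + optical

def db_to_linear(loss_db):
    """Eq. (5): convert the loss from dB to a linear ratio."""
    return 10.0 ** (loss_db / 10.0)
```

With the Section 4 values (ℜ = 0.75 mA/mW, Gm = 0.80 mW/mA, matched impedances, 2 dB connectors, 2.5 dB/km fiber), a 1 km link gives about 17.4 dB of electrical loss.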

2.2 Optical <strong>Signal</strong> to Noise Ratio<br />


To evaluate the performance of an optical link, the<br />

optical signal to noise ratio (OSNR) is needed. It is<br />

evaluated at the output of the optical receiver. The<br />

OSNR can be expressed as follows,<br />

OSNR = m² Ps² ⟨s²(t)⟩ / ( Lopt ⟨n²opt(t)⟩ ) (6)<br />

The nopt(t) is the noise induced in the optical channel,<br />
and it is assumed to be additive white Gaussian noise.<br />
The noise consists of the relative intensity noise nRIN(t)<br />
from the laser, the shot noise nshot(t) from the<br />
photodetector, and the thermal noise nth(t) from the<br />
receiver electronics. In [1], the total noise power of<br />
the signal is given as<br />

⟨n²opt(t)⟩ = ( RIN Ip² + 2 q Ip + 4 kB T / RL ) B (7)<br />

where RIN is the relative intensity noise parameter, Ip is<br />
the average photocurrent, q is the electron charge, kB is<br />
Boltzmann's constant, T is the temperature, RL is the<br />
load resistance, and B is the noise bandwidth.<br />


3. COMPARISON OF MULTIMODE<br />

FIBER AND SINGLE MODE FIBER<br />

To reduce the penalty from the E/O conversion,<br />
multimode fiber (MMF) is used. The proposed system<br />
combines a vertical-cavity surface-emitting laser (VCSEL)<br />
with a graded-index MMF. In [3], the authors found<br />
that the coupling efficiency into a graded-index MMF<br />
strongly depends on the active laser diameter, the index<br />
guiding of the laser, and the transverse mode emission<br />
spectrum of the laser. The coupling efficiency also<br />
depends on the coupling technique. However, better<br />
coupling efficiency comes with a tradeoff in the bandwidth<br />
of the radio over fiber link.<br />

Physically, the MMF has a larger core diameter of<br />
50-200 μm, compared to 8-12 μm for the single mode<br />
fiber (SMF). In addition, the MMF has a higher numerical<br />
aperture, in the range of 0.19-0.30. A higher numerical<br />
aperture means a larger acceptance angle, which allows<br />
more optical power to be coupled into the fiber. These<br />
physical characteristics of the SMF and MMF are found<br />
in [4]. Moreover, a typical VCSEL has an active diameter<br />
of 16-20 μm [3], which is smaller than the core diameter<br />
of a MMF but larger than that of a SMF. Thus, a MMF<br />
can physically better capture the optical power emitted<br />
from a laser.<br />

It has been reported in [3] that the coupling efficiency<br />
can exceed 90%. This is achieved by butt-coupling a<br />
graded-index MMF to a weakly index-guided,<br />
proton-implanted VCSEL. The typical coupling efficiency,<br />
however, lies in the 70%-80% range.<br />

In contrast, the SMF has a typical coupling efficiency<br />
in the 40%-70% range [5]. Various coupling techniques<br />
have been proposed and evaluated in terms of their<br />
coupling efficiency and fabrication complexity. They can<br />
be generalized into butt coupling, lens coupling, and<br />
pigtail coupling. Butt coupling is usually used for MMF:<br />
the fiber is placed as close to the laser facet as possible,<br />
which gives good coupling efficiency and increases the<br />
misalignment tolerance [6]. Moreover, this technique is<br />
relatively easy to fabricate. However, it is not suitable<br />
for SMF because of the small core diameter and low<br />
numerical aperture of the SMF. In practice, more complex<br />
techniques such as lens coupling and pigtail coupling are<br />
used with SMF. In lens coupling, one or more lenses are<br />
placed between the laser facet and the optical fiber [5];<br />
this improves the coupling efficiency to more than 50%.<br />
However, it is hard to fabricate a lens suitable for the<br />
SMF, so the pigtail coupling technique is used instead:<br />
the laser is first coupled to a MMF, and then the MMF<br />
to the SMF. The extra coupling stage introduces<br />
additional coupling loss, but it is easier to fabricate [7].<br />
From the discussion above, the MMF clearly offers better<br />
coupling efficiency at lower cost.<br />

On the other hand, the bandwidth of the MMF is<br />
significantly less than that of the SMF. It is widely<br />
reported that, for digital systems, SMF has a bandwidth<br />
in the GHz·km range while MMF has a bandwidth of<br />
about 500 MHz·km. However, the MMF is sufficient for<br />
local picocells with short distances and bit rates in the<br />
low Mbps range, as demonstrated in the next section.<br />


4. NUMERICAL DISCUSSION<br />

In this section, simulation of the downlink transmission<br />
from the central base station to the radio access point is<br />
discussed. A vertical-cavity surface-emitting laser and a<br />
graded-index multimode fiber (MMF) are assumed for<br />
the radio over fiber link. The laser operates at 850 nm<br />
and emits 1 mW of optical power. We assume the same<br />
butt-coupling technique as in [3]. We also assume a<br />
relatively large Gm = 0.80 mW/mA. The responsivity ℜ<br />
of the optical receiver is 0.75 mA/mW. The σ of the<br />
MMF impulse response (2) is 0.5 ns/km [2] and the delay<br />
T is 30. The fiber attenuation is 2.5 dB/km and the<br />
connector loss is 2 dB. The system is assumed to be<br />
perfectly matched, so there is no matching loss. Noise is<br />
added according to (7) for a bandwidth of 2 MHz, a<br />
relative intensity noise parameter RIN of −155 dB/Hz, a<br />
50 Ω load resistance, and a temperature of 278 K. The<br />
optical signal to noise ratio (OSNR) of the link is<br />
calculated according to (6).<br />

Figure 4 shows four OSNR curves as a function of the<br />
ROF link distance under various channel and signal<br />
conditions. The topmost curve is the OSNR of a SMF,<br />
and the remaining curves are the OSNR of a graded-index<br />
MMF at various carrier frequencies. The dispersion of<br />
the MMF has a significant impact on the OSNR of the<br />
link. With a carrier frequency of 900 MHz at a distance<br />
of 1 km, there is about a 30 dB loss in OSNR compared<br />
to a SMF, and as the link distance increases the loss<br />
grows even faster. However, the application considered is<br />
a short-haul link. For a carrier frequency of 900 MHz,<br />
the ROF link can still support up to 1.22 km with an<br />
OSNR better than 10 dB. For 1200 MHz and 1500 MHz<br />
carriers, the link supports 910 m and 740 m respectively.<br />
Figure 5 shows the same OSNR curves, but with the<br />
noise bandwidth increased to 10 MHz. All the OSNR<br />
curves decrease by about 7 dB, and the distance a link<br />
can support decreases accordingly.<br />
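The OSNR trends of Figures 4 and 5 can be reproduced qualitatively with a short script. This is a hedged sketch, not the authors' simulation: Eq. (4) supplies the channel loss, the Gaussian response of Eq. (2) is evaluated at the RF carrier, and a textbook RIN + shot + thermal model (our assumption for Eq. (7), each term scaling with the noise bandwidth B) supplies the noise; absolute values will differ from the paper's figures.

```python
import math

Q_E = 1.602e-19    # electron charge (C)
K_B = 1.381e-23    # Boltzmann constant (J/K)

def osnr_db(dist_km, carrier_hz, noise_bw_hz,
            ps_w=1e-3, m=0.5, resp=0.75, gm=0.80,
            conn_loss_db=2.0, atten_db_per_km=2.5, sigma_ns_per_km=0.5,
            rin_db_per_hz=-155.0, r_load=50.0, temp_k=278.0):
    """OSNR of the multimode downlink (dB), Eq. (6) with a dispersion
    roll-off factor; modulation index m is an assumed value."""
    # Eq. (4) with matched impedances (no matching loss)
    loss_db = (-20.0 * math.log10(resp * gm)
               + 2.0 * (2.0 * conn_loss_db + atten_db_per_km * dist_km))
    loss = 10.0 ** (loss_db / 10.0)
    # Gaussian MMF response of Eq. (2) evaluated at the RF carrier
    sigma = sigma_ns_per_km * 1e-9 * dist_km
    h = math.exp(-2.0 * (math.pi * sigma * carrier_hz) ** 2)
    # mean photocurrent (responsivity in mA/mW is numerically A/W)
    i_avg = resp * ps_w
    # Eq. (6) numerator with <s^2(t)> = 1/2, including dispersion roll-off
    signal = 0.5 * (m * i_avg * h) ** 2 / loss
    # assumed form of Eq. (7): RIN + shot + thermal, each scaling with B
    rin = 10.0 ** (rin_db_per_hz / 10.0) * i_avg ** 2
    shot = 2.0 * Q_E * i_avg
    thermal = 4.0 * K_B * temp_k / r_load
    noise = (rin + shot + thermal) * noise_bw_hz
    return 10.0 * math.log10(signal / noise)
```

One check is exact regardless of the assumed constants: raising the noise bandwidth from 2 MHz to 10 MHz lowers every curve by 10 log10(5) ≈ 7 dB, matching the shift between Figures 4 and 5.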




Figure 4: OSNR versus distance of multimode ROF<br />

links with 2 MHz noise bandwidth<br />


Figure 5: OSNR versus distance of multimode ROF<br />

links with 10 MHz noise bandwidth<br />


5. CONCLUSION<br />

In this paper, we have investigated a radio over fiber<br />
(ROF) link that employs a graded-index multimode fiber<br />
(MMF) with a vertical-cavity surface-emitting laser to<br />
increase the coupling efficiency. A coupling efficiency of<br />
90% can be achieved with butt coupling, which is<br />
relatively simple for the MMF. With complexity<br />
translating into system cost, this makes the proposed<br />
system attractive. However, there is a tradeoff in<br />
bandwidth. For a 900 MHz carrier, the ROF link<br />
distance is restricted to 1.22 km with an optical signal to<br />
noise ratio better than 10 dB. For the remote antenna<br />
application of local picocells, this configuration of ROF<br />
is sufficient.<br />

References<br />

[1] X. N. Fernando and A. Anpalagan, “On the design of<br />
optical fiber based wireless access systems...”,<br />
International Conference on Communications, 2004, to<br />
be presented.<br />

[2] K. Azadet, E. F. Haratsch, H. Kim, F. Saibi, J. H.<br />
Saunders, M. Shaffer, L. Song, and M.-L. Yu,<br />
“Equalization and FEC techniques for optical<br />
transceivers”, IEEE Journal of Solid-State Circuits,<br />
vol. 37, no. 3, pp. 317-327, March 2002.<br />

[3] J. Heinrich, E. Zeeb, and K. J. Ebeling, “Butt-coupling<br />
efficiency of VCSELs into multimode fibers”, IEEE<br />
Photonics Technology Letters, vol. 9, no. 12, pp.<br />
1555-1557, Dec. 1997.<br />

[4] G. Keiser, Optical Fiber Communications, McGraw-Hill,<br />
Boston, MA, 2000.<br />

[5] J. M. Senior, Optical Fiber Communications: Principles<br />
and Practice, Prentice Hall, second edition, 1992.<br />

[6] J. A. Hiltunen, K. Kautio, J.-T. Makinen, P. Karioja,<br />
and S. Kauppinen, “Passive multimode<br />
fiber-to-edge-emitting laser alignment based on a<br />
multilayer LTCC substrate”, in Proc. 52nd Electronic<br />
Components and Technology Conference, IEEE, May<br />
2002, pp. 815-820.<br />

[7] L. A. Reith and P. W. Shumate, “Coupling sensitivity<br />
of an edge-emitting LED to single-mode fiber”, Journal<br />
of Lightwave Technology, vol. LT-5, no. 1, pp. 29-34,<br />
January 1987.<br />


SUB-DICTIONARY SELECTION USING LOCAL DISCRIMINANT BASES<br />

ALGORITHM FOR SIGNAL CLASSIFICATION<br />

Karthikeyan Umapathy and Anindya Das<br />

Dept. of Electrical and Computer Eng.,<br />

The <strong>University</strong> of Western Ontario,<br />

London, Ontario, CANADA.<br />

Email: kumapath@uwo.ca<br />

Abstract<br />

In signal decompositions using over-complere. redundant timefrequency<br />

(TF) dictionaries, oren it is challenging to restrict the<br />

dictionary to a sub-dictionary tailored for specific applications.<br />

In the proposed technique we used a similar appmach as Local<br />

Discriminant Bases Algorithm (mB) to select optimal TF subdictionaries<br />

for signal classification applications. A novel timewidth<br />

versus frequency band mapping was generated for each of<br />

the signal class. These mappings of different classes were compared<br />

using a discriminant measure ro arrive at a sub-dicrionary.<br />

This sub-dictionary was then used for decomposing the testing<br />

set signals, followed by fearure exrraction and classification. Two<br />

highly non-stationary bio-medical databases I . Vibroarrhrographic<br />

signals (89 signals, 51 normal and 38 abnormal) 2. Pathological<br />

speech darabase (103 signals, 50 normal and 50 pathological)<br />

were rested. Classification accuracies as high as 74.2% and 92%<br />

wem achieved respectively. Due Io the sub-dictionary appmach,<br />

appmximately a 40% reduction in signal decomposition time was<br />

observed for the tested databases.<br />

Keywords: timerfrequenq, sub-dictionary, matching pursuit, lo-<br />

cal discriminant bases, discriminanr measure<br />

1. INTRODUCTION<br />

Time-frequency (TF) transformations have significantly<br />
contributed to the area of automatic signal classification.<br />
TF transforms help us understand signals better and<br />
thereby extract strong clues, or features, for<br />
classification. Even though the complete TF plane<br />
contains details about the signals, in classification<br />
applications it is often a small area, or pockets of areas,<br />
in the TF plane that actually exhibits the difference<br />
between the classes of signals. The success of a<br />
classification application depends on how well these<br />
target areas can be identified and analyzed in the TF<br />
plane. Once the target areas are identified, it is easier to<br />
zoom into them by performing time- and<br />
frequency-localized decompositions to extract relevant<br />
features for classification.<br />

Pruning algorithms such as the Local Discriminant<br />
Bases (LDB) algorithm [1] were introduced to identify<br />
the target subspaces in the TF plane that exhibit high<br />
discrimination values between signal classes. However,<br />
most of the existing LDB literature deals only with<br />
dictionaries of orthonormal bases (wavelet packets).<br />
Considering the various advantages [2] of using<br />
redundant dictionaries for flexible signal representations,<br />
the proposed technique uses an adaptive time-frequency<br />
transformation (ATFT) based on the matching pursuit<br />
algorithm. The nature of the ATFT based on<br />

CCECE 2004 - CCGEI 2004, Niagara Falls, May/mai 2004<br />
0-7803-8253-6/04/$17.00 © 2004 IEEE<br />

Sridhar Krishnan<br />

Dept. of Electrical and Computer Eng.,<br />

<strong>Ryerson</strong> <strong>University</strong>,<br />

Toronto, Ontario, CANADA.<br />

Email: krishnan@ee.ryerson.ca<br />

matching pursuit is different from the wavelet packet<br />
transform (unlike the wavelet/wavelet packet transform,<br />
the scale and frequency parameters are not related in<br />
ATFT); hence the LDB approach to identifying the<br />
target subspace has to be modified before it can be<br />
applied to the matching pursuit based ATFT<br />
decomposition.<br />

In this paper we demonstrate the process of selecting a<br />
sub-dictionary (subspace) from a redundant dictionary,<br />
using an approach similar to LDB, for classification<br />
applications. The selected sub-dictionaries were then<br />
used to decompose two biomedical signal databases onto<br />
a localized TF plane. Features were extracted and<br />
classification was performed. The paper is organized as<br />
follows: Section II covers the methodology, consisting of<br />
ATFT, LDB, feature extraction and pattern<br />
classification. Results and conclusions are given in<br />
Section III.<br />

2. METHODOLOGY<br />

2.1. Adaptive Time-frequency Transform<br />

The signal decomposition technique used in this work is<br />
based on the matching pursuit (MP) [2] algorithm. MP<br />
is a general framework for signal decomposition; the<br />
nature of the decomposition varies according to the<br />
dictionary of basis functions used. When a dictionary of<br />
TF functions is used, MP yields an adaptive<br />
time-frequency transformation [2]. In MP, any signal<br />
x(t) is decomposed into a linear combination of TF<br />
functions gγn(t) selected from a redundant dictionary of<br />
TF functions:<br />

x(t) = Σn an gγn(t) (1)<br />

where<br />

gγn(t) = (1/√sn) g((t − pn)/sn) exp(j2π(fn t + φn)) (2)<br />

and an are the expansion coefficients. The scale factor<br />
sn, also called the octave or time-width parameter, is<br />
used to control the width of the window function, and<br />
the parameter pn controls the temporal placement. The<br />
parameters fn and φn are the frequency and phase of<br />
the exponential function respectively. The signal x(t) is<br />
projected over a redundant dictionary of TF functions<br />
with all possible combinations of scalings, translations<br />
and modulations. The dictionary of TF functions can be<br />
suitably modified or selected based on the application at<br />
hand. In our technique, we use the Gabor dictionary<br />
(Gaussian functions), which has the best TF localization<br />
properties. At each iteration, the TF function best<br />
correlated with the local signal structures is selected<br />
from the dictionary. The remaining signal, called the<br />
residue, is further decomposed in the same way at each<br />
iteration, subdividing the signal into TF functions.<br />

Theoretically, when using a redundant dictionary, the<br />
decomposition parameters an, sn, fn, pn and φn can<br />
take any values within their ranges. However, in the<br />
practical discrete implementation used in this work, sn<br />
can vary in powers of 2 from 2¹ to 2¹⁴, fn can vary<br />
from 0 to Fs/2 (Fs is the sampling frequency), pn can<br />
vary from 0 to the signal size, and φn can vary from 0<br />
to 1. The possible values taken by these parameters can<br />
be restricted to construct a sub-dictionary; in Section 2.3<br />
we demonstrate how these parameters can be restricted<br />
to a localized area in the TF plane for classification<br />
applications.<br />
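The greedy selection loop of MP can be sketched in a few lines. The following is a toy illustration over a small, explicitly enumerated dictionary of real-valued Gabor atoms (the function names and the dictionary layout are our choices, not the faster discretized implementation used in the paper):

```python
import numpy as np

def gabor_atom(n, s, p, f, phi=0.0):
    """Real unit-norm Gabor atom: Gaussian window of time-width s,
    centred at sample p, modulated to normalised frequency f."""
    t = np.arange(n)
    g = np.exp(-np.pi * ((t - p) / s) ** 2) * np.cos(2 * np.pi * f * t + phi)
    return g / np.linalg.norm(g)

def matching_pursuit(x, atoms, n_iter=10):
    """Greedy MP: pick the atom best correlated with the residue,
    subtract its projection, and repeat on the new residue."""
    residue = np.asarray(x, dtype=float).copy()
    expansion = []
    for _ in range(n_iter):
        corr = atoms @ residue               # all inner products at once
        best = int(np.argmax(np.abs(corr)))
        a = float(corr[best])                # expansion coefficient a_n
        residue -= a * atoms[best]
        expansion.append((best, a))
    return expansion, residue
```

Restricting which (sn, fn) pairs are enumerated when the atoms array is built is exactly the sub-dictionary mechanism of Section 2.3.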

2.2. Local Discriminant Bases Algorithm<br />

In the LDB [1] algorithm (using wavelet packet bases), a<br />
set of training signals xi from all C classes is<br />
decomposed into a full tree structure of order N. We<br />
restrict our explanation to binary wavelet packet trees.<br />
Let Ω0,0 be a vector space in Rⁿ corresponding to node<br />
0 of the parent tree. At each level, the vector space is<br />
split into two mutually orthogonal subspaces, given by<br />
Ωj,k = Ωj+1,2k ⊕ Ωj+1,2k+1, where j indicates the level<br />
of the tree and k represents the node index in level j,<br />
given by k = 0, ..., 2^j − 1. This process repeats until<br />
level J, giving rise to 2^J mutually orthogonal<br />
subspaces. The goal is to select the set of subspaces that<br />
provides maximum discriminant information between the<br />
classes of signals. Each node k contains a set of basis<br />
vectors Bj,k = {wj,k,l}, l = 0, ..., 2^(n−j) − 1, where 2ⁿ<br />
corresponds to the length of the signal. The signals xi<br />
can then be represented by a set of coefficients c as:<br />

xi = Σj,k,l cj,k,l wj,k,l . (3)<br />

The time index of the signals xi has been dropped for<br />
notational convenience. Basically, the signals xi are<br />
decomposed into 2^J subspaces with cj,k,l coefficients in<br />
each subspace. The subspaces which exhibit high values<br />
of the discriminant measure D are then used to expand<br />
the testing set signals, and features are extracted for<br />
classification.<br />

Unlike the wavelet packet decomposition, in ATFT the<br />
scale and frequency parameters are not explicitly<br />
related. The LDB strategy of splitting a subspace (node)<br />
to obtain children subspaces (nodes) does not apply to<br />
ATFT. In ATFT, any scale can occur in combination<br />
with any frequency (restricted only by the uncertainty<br />
principle), giving it extreme flexibility to obtain any<br />
local TF resolution. So we have to adapt the LDB<br />
approach before it can be applied to ATFT.<br />

2.3. Sub-dictionary Selection Process<br />

As we will be performing a two-group classification on<br />
the given datasets, the following sub-dictionary selection<br />
process is explained for a two-group classification of<br />
signals (class A and class B). Coarse TF decompositions<br />
were performed on the training sets of both classes of<br />
signals. Coarse TF decompositions can be achieved by<br />
controlling the step size of the decomposition<br />
parameters. In the proposed technique, out of the<br />
possible sn values (2¹ to 2¹⁴), we force the<br />
decomposition to select only the scales s1 = 2², s2 = 2⁶,<br />
s3 = 2¹⁰ and s4 = 2¹⁴. Similarly, we group the fn<br />
parameters into the frequency bands f1 = 0 to Fs/8,<br />
f2 = Fs/8 to Fs/4, f3 = Fs/4 to 3Fs/8 and f4 = 3Fs/8<br />
to Fs/2. As we choose to completely cover the frequency<br />
range in 4 bands without gaps, we allow the<br />
decomposition to choose fn from the complete range 0<br />
to Fs/2; later, in processing the decomposition<br />
parameters, we group them into the four frequency<br />
bands. With all the training signals decomposed using<br />
the restricted sn values, we group the decomposition<br />
parameters in combinations of (s1, s2, s3, s4) and<br />
(f1, f2, f3, f4). In total we can thus group them into 16<br />
cells, as shown in Fig. 1. The cells in the respective<br />
time-width versus frequency band mappings are<br />
numbered A1 to A16 and B1 to B16.<br />

Here it should be noted that the time-width axis in<br />
Fig. 1 does not correspond to time but to scale<br />
(time-width). During the decomposition process, any<br />
scale parameter can occur at any time position, so<br />
arranging the scale parameters from low to high does<br />
not mean they occur in that order in real time. Once we<br />
obtain this time-width versus frequency band mapping<br />
for the training set of signals of both classes, we average<br />
the mappings to get an averaged time-width versus<br />
frequency band mapping for each class of signals.<br />

In order to identify the cells which demonstrate high<br />
discriminant values between the classes, we use an<br />
approach similar to LDB. We define a discriminant<br />
measure D which is used to compare the corresponding<br />
cells in the time-width versus frequency band mappings.<br />
Unlike LDB with orthonormal bases, where the set of<br />
basis functions and their variations are limited and<br />
fixed, in ATFT the variations can theoretically be<br />
limitless (restricted only by the uncertainty principle).<br />
In other words, the TF tiling (TF resolution) is fixed for<br />
a particular scale function of wavelets/wavelet packets,<br />
although their position in the TF plane can be altered<br />
(wavelet packets). In ATFT it is difficult to assign a<br />
fixed subspace shape or size based only on the scale<br />
parameter or the frequency parameter; hence we choose<br />
both scale and frequency to assign a subspace on the TF<br />
plane. However, this cannot be generalized, as the<br />
combinations of scale and frequency can be limitless<br />
(restricted only by the uncertainty principle) depending<br />
on the signal structures.<br />

In the proposed technique we use the normalized<br />
cumulative energy difference between the cells as the<br />
discriminant measure D, given by:<br />

D(EA, EB) = |EA − EB| (4)<br />

and<br />

E = (1/Es) Σi ai² (5)<br />

where E is the normalized cumulative energy of the TF<br />
functions in a cell, ai is the energy coefficient of the i-th<br />
TF function, the sum runs over the k TF functions<br />
grouped in a cell, and Es is the total decomposed energy<br />
of the signal.<br />

We compare these cumulative energies of the<br />
corresponding cells and compute D, and sort the cells in<br />
decreasing order of D. The cells which yield high values<br />
of D exhibit significant differences between the classes;<br />
this indicates that the target space for classification lies<br />
within these cells. Fig. 1 pictorially explains the way we<br />
compare the corresponding cells using D; as an example,<br />
it shows a possible outcome with 5 cells (cross-hatched)<br />
identified as the most discriminative cells between the<br />
classes. These 5<br />
tibed as the highly discriminative cells between classes. These 5



cells are chosen as the first five cells when sorted in<br />
decreasing order of their discriminant values. We then<br />
identify the range covering all these cells along both the<br />
time-width and frequency band axes. In the given<br />
example, shown with dotted lines in Fig. 1, we choose<br />
the frequency band axis ranging from 0 to 3Fs/8 and<br />
the time-width ranging from s1 to s2. Once these ranges<br />
are identified, we restrict the redundant dictionary and<br />
construct a sub-dictionary with these time-width and<br />
frequency ranges. The testing set signals are then<br />
decomposed using this sub-dictionary, enabling them to<br />
zoom into only the target space in the TF plane. In the<br />
decomposition of the testing set signals using the<br />
sub-dictionary, we allow the decomposition to proceed in<br />
fine steps of time-width and frequency. This targeted<br />
decomposition yields parameters that contain highly<br />
discriminatory information between the classes. Features<br />
are extracted from these decomposition parameters and<br />
fed to a Linear Discriminant <strong>Analysis</strong> (LDA)<br />
based classifier, as explained in subsequent sections.<br />
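The cell-energy mapping and the discriminant ranking can be sketched as follows. This is an illustrative implementation of Eqs. (4)-(5) as we read them (the binning helpers, argument names, and bin-edge conventions are our assumptions, not the authors' code):

```python
import numpy as np

def cell_energy_map(coeffs, log2_scales, freqs, scale_edges, freq_edges,
                    total_energy):
    """Normalised cumulative energy per (time-width, frequency-band)
    cell, Eq. (5): E_cell = sum(a_i^2) / E_signal over the TF
    functions whose parameters fall inside the cell."""
    s_bin = np.digitize(log2_scales, scale_edges)
    f_bin = np.digitize(freqs, freq_edges)
    emap = np.zeros((len(scale_edges) + 1, len(freq_edges) + 1))
    for a, sb, fb in zip(coeffs, s_bin, f_bin):
        emap[sb, fb] += a * a
    return emap / total_energy

def top_discriminant_cells(emap_a, emap_b, top=5):
    """Rank cells by D = |E_A - E_B| (Eq. (4)) and return the best."""
    d = np.abs(emap_a - emap_b)
    flat = np.argsort(d, axis=None)[::-1][:top]
    return [tuple(np.unravel_index(i, d.shape)) for i in flat]
```

The cells returned by top_discriminant_cells define the time-width and frequency ranges from which the restricted sub-dictionary is then built.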


Fig. 2. Sub-dictionary selection for VAG (a) and Pathological<br />

speech signals (b).<br />

2.4. Feature Extraction and Pattern Classification<br />

Fig. 1. Sub-dictionary selection process<br />

We use the following two highly non-stationary databases for<br />

testing our proposed technique: 1. vibroarthrographic<br />
(VAG) signals and 2. pathological speech signals.<br />
Vibroarthrographic signals are the signals emitted from<br />
the human knee joints during an active movement of the<br />
leg; more details of this database can be found in [3].<br />
The pathological speech database contains speech signals<br />
from both normal and pathological talkers; more details<br />
of this database can be found in [4].<br />

As explained in Section 2.3, we obtained sub-dictionaries<br />
for both the VAG and pathological speech databases. In<br />
performing the coarse TF decomposition on the training<br />
set, a faster version of the ATFT algorithm [5] was used<br />
with 2000 iterations. Ten randomly selected signals from<br />
each class, of both the VAG and pathological speech<br />
databases, were used as the training set. We used the<br />
first 3 most discriminating cells in arriving at the<br />
time-width and frequency ranges. Figs. 2(a) and 2(b)<br />
show the ranges (cross-hatched) obtained for time-width<br />
and frequency for the VAG and pathological speech<br />
databases respectively. For the VAG database, based on<br />
the chosen cells, the time-width varies from 2² to 2¹⁰<br />
and the frequency varies from 0 to Fs/4. For the<br />
pathological speech database, the time-width varies from<br />
2⁶ to 2¹⁴ and the frequency varies from 0 to Fs/8. All<br />
89 VAG signals and 100 pathological speech signals were<br />
decomposed using their corresponding sub-dictionaries<br />
with the regular ATFT algorithm, using fine steps of<br />
time-width within the range of the sub-dictionary. The<br />
iterations were limited to 1000 for both the VAG and<br />
pathological speech signals, as we are only interested in<br />
the discriminative subspace in the TF plane and do not<br />
require a complete decomposition of the signal. Due to<br />
the sub-dictionary approach, the decomposition times<br />
were reduced by approximately 40% compared to using<br />
a full-range redundant dictionary; the reduction in<br />
decomposition time depends on how small the<br />
sub-dictionary is. The decomposition parameters were<br />
analyzed and the following features were extracted.<br />
1. Number of TF functions (F1cn): this feature is the<br />
number of TF functions falling into each of the cells<br />
covering the same area as the highly discriminative cells<br />
that were used to construct the sub-dictionary.<br />



2. Cumulative energy of the cells (F2cn): we compute<br />
the cumulative energy contained in each of the cells that<br />
were used to compute F1cn. It should be noted here<br />
that, as we are using fine steps of time-width in the<br />
decomposition of the testing signals, we have a larger<br />
number of cells covering the same range of the<br />
sub-dictionary. For example, in arriving at the<br />
sub-dictionary of the VAG signals we identified the cells<br />
corresponding to the 3 time-widths 2², 2⁶ and 2¹⁰.<br />
However, in decomposing the testing signals we used<br />
fine steps of time-width, which means the time-width<br />
step size is reduced from 4 to 1, so we now have 9 cells<br />
covering the same time-width range.<br />

Both the above explained feature vectors (Flcn and F2c,)<br />

were evaluated for their discriminant power and only 9 out the total<br />

18 features (both feature vector put together) were selected for the<br />

purpose of classiPcation. This selected 9 features contained fea-<br />

tures from both feature vectors (FIG,, and FZc,). The motiva-<br />

tion for the pattern classiPcation is to automatically group signals<br />

of same characteristics using the discriminatory features derived.<br />

In LDA, the feature vectors derived as explained above were transformed into canonical discriminant functions of the form<br />

f = u1 b1 + u2 b2 + ... + u9 b9 + a,   (6)<br />

where {u} is the set of features, and {b} and a are the coefficients and the constant, respectively, estimated using Fisher's linear discriminant functions [6]. Using the chi-square distances and the prior probabilities of each group, classification is performed by assigning each sample to one of the groups. The classification accuracy was estimated using the leave-one-out method, which is known to provide a least-bias estimate [7].<br />
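The discriminant classification with leave-one-out validation might be sketched as follows. The paper used SPSS; this numpy stand-in implements a plain Fisher linear discriminant on synthetic features, so the data and group separation are illustrative assumptions:

```python
import numpy as np

def fisher_lda_fit(X, y):
    """Fisher's linear discriminant: f(x) = b.x + a, sign gives the group."""
    m0, m1 = X[y == 0].mean(axis=0), X[y == 1].mean(axis=0)
    Sw = np.cov(X[y == 0].T) + np.cov(X[y == 1].T)   # within-class scatter
    b = np.linalg.solve(Sw, m1 - m0)                 # the {b} coefficients
    a = -0.5 * b @ (m0 + m1)                         # constant: threshold midway
    return b, a

# Toy stand-in for the 9 selected F1/F2 features of two signal groups.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 1.0, (30, 9)),        # "normal" group
               rng.normal(2.0, 1.0, (30, 9))])       # "abnormal" group
y = np.array([0] * 30 + [1] * 30)

# Leave-one-out: train on all samples but one, classify the held-out sample.
hits = 0
for i in range(len(y)):
    mask = np.arange(len(y)) != i
    b, a = fisher_lda_fit(X[mask], y[mask])
    hits += int((X[i] @ b + a > 0) == (y[i] == 1))
print(hits / len(y))        # near 1.0 for these well-separated toy groups
```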

Table 1. Classification accuracy achieved for the VAG database. Regular = normal LDA, Cross.V = leave-one-out LDA, % = percentage of classification.<br />

Method | Groups | Normal | Abnormal | Total<br />
Regular | Normal | 35 | 16 | 51<br />
| Abnormal | 9 | 29 | 38<br />
% | Normal | 68.6 | 31.4 | 100<br />
| Abnormal | 23.7 | 76.3 | 100<br />
Cross.V % | Abnormal | - | 81.6 | -<br />

Table 2. Classification accuracy achieved for the pathological speech database. Regular = normal LDA, Cross.V = leave-one-out LDA, % = percentage of classification.<br />

Method | <strong>Group</strong>s | Normal | Pathological | Total<br />
Regular | Normal | 48 | 2 | 50<br />
| Pathological | 6 | 44 | 50<br />
% | Normal | 96 | 4 | 100<br />
| Pathological | 12 | 88 | 100<br />
Cross.V | Normal | 48 | 2 | 50<br />
| Pathological | 6 | 44 | 50<br />
% | Normal | 96 | 4 | 100<br />
| Pathological | 12 | 88 | 100<br />


- 2004 -<br />


3. RESULTS AND CONCLUSIONS<br />

The paper describes a novel way of constructing a target-specific sub-dictionary from a redundant dictionary for classification applications. High classification accuracies were achieved with approximately 40% reduction in decomposition time. Two highly non-stationary biomedical databases were used to demonstrate the performance of the proposed technique.<br />

Features were extracted as explained in Section 2.4 for all 89 VAG signals and the 100 pathological speech signals. They were fed to an LDA-based classifier using the SPSS software. Classification was performed and the results are given in Tables 1 and 2. For the VAG database, an overall classification accuracy of 74.2% using regular LDA and 70% using the leave-one-out method was achieved. This is higher than the classification accuracy reported in [3]. For the pathological speech database, an overall classification accuracy of 92% was achieved with both regular LDA and the leave-one-out method. This is higher than the classification accuracy reported in [4]. However, the classification accuracies for both databases are not higher than the authors' recent work with LDB-based classification (74.2% vs. 78.6% and 92% vs. 97%). The results obtained with the proposed technique can be justified considering the following facts: (1) the novelty involved in the sub-dictionary construction; (2) at the time of writing, only Gaussian basis functions had been tested with the databases; (3) the reduced decomposition times; and (4) the simple features. Future work involves refining the proposed technique to include more bases and optimizing the targeted decompositions to yield higher classification accuracies than those reported.<br />

Acknowledgements<br />

The authors thankfully acknowledge NSERC for funding this project. The authors also acknowledge the LastWave software package group.<br />

References<br />

[1] N. Saito and R. R. Coifman, "Local discriminant bases and their applications," Journal of Mathematical Imaging and Vision, vol. 5, no. 4, pp. 337-358, 1995.<br />

[2] S. G. Mallat and Z. Zhang, "Matching pursuits with time-frequency dictionaries," IEEE Trans. <strong>Signal</strong> Processing, vol. 41, no. 12, pp. 3397-3415, 1993.<br />

[3] S. Krishnan, "Adaptive signal processing techniques for analysis of knee joint vibroarthrographic signals," Ph.D. dissertation, <strong>University</strong> of Calgary, June 1999.<br />

[4] K. Umapathy, S. Krishnan, V. Parsa, and D. Jamieson, "Discrimination of pathological voices using an adaptive time-frequency approach," in Proceedings of ICASSP 2002, IEEE International Conference on Acoustics, Speech and Signal Processing, Orlando, USA, May 2002, pp. IV-3853-3855.<br />

[5] R. Gribonval, "Fast matching pursuit with a multiscale dictionary of Gaussian chirps," IEEE Transactions on <strong>Signal</strong> Processing, vol. 49, no. 5, May 2001.<br />

[6] SPSS Inc., "SPSS Advanced Statistics User's Guide," user manual, SPSS Inc., Chicago, IL, 1990.<br />

[7] K. Fukunaga, Introduction to Statistical Pattern Recognition, Academic Press, Inc., San Diego, CA, 1990.


Proceedings of the 25th Annual International Conference of the IEEE EMBS<br />

Cancun, Mexico, September 17-21, 2003<br />

Ultrasound Backscatter <strong>Signal</strong> Characterization and Classification Using<br />

Autoregressive Modeling and Machine Learning Algorithms<br />

Noushin R. Farnoud¹, Michael Kolios¹,²<br />

Co-author: Sridhar Krishnan¹<br />

¹Department of Electrical Engineering, <strong>Ryerson</strong> <strong>University</strong>, Toronto, Canada<br />

²Department of Mathematics, Physics and Computer Science, <strong>Ryerson</strong> <strong>University</strong>, Toronto, Canada<br />

Abstract- This research explores the possibility of monitoring apoptosis and classifying clusters of apoptotic cells based on the changes in ultrasound backscatter signals from the tissues. The backscatter from normal and apoptotic cells, acquired using a high frequency ultrasound instrument, is modeled through an autoregressive (AR) modeling technique. The proper model order is calculated by tracking the error criteria in the reconstruction of the original signal. The AR model coefficients, which are assumed to contain the main statistical features of the signal, are passed as the input to linear and nonlinear machine classifiers (Fisher Linear Discriminant, Conditional Gaussian Classifier, Naive Bayes Classifier and Neural Networks with nonlinear activation functions). In addition, an adaptive signal segmentation method (Least Squares Lattice Filter) is used to separate the data from layers of different cell types into stationary parts ready for modeling and classification.<br />

Keywords-Apoptosis, Ultrasound Backscatter<br />

I. INTRODUCTION<br />

High frequency ultrasound (US) has been shown to detect the structural changes cells and tissues undergo during cell death. <strong>Research</strong> has shown that the ultrasound backscatter signals from apoptotic¹ acute myeloid leukemia (AML) cells differ in intensity and frequency spectrum as a result of the change in size, spatial distribution and acoustic impedance of the scattering sources within the cell [1] (Fig. 1). Therefore, we assume that pulse-echo data from different cell types contain distinguishable statistical regularities. In this work we attempt to classify normal and apoptotic cancerous cells by tracking the statistics of the ultrasound backscatter signals from tissues, using the autoregressive (AR) method for time-series modeling of ultrasound signals.<br />

II. METHODOLOGY<br />

A. Autoregressive (AR) Modeling of US signals<br />

Biomedical signals contain large quantities of data. Moreover, these data usually contain redundancies which make processing and analysis more difficult. In such situations, signal modeling may help to remove the<br />

¹ Apoptosis is a genetically determined destruction of cells from within, due to activation of a stimulus or removal of a suppressing agent or stimulus.<br />

0-7803-7789-3/03/$17.00 ©2003 IEEE<br />


Fig. 1. (a) H&E² stains of normal cells; (b) H&E stains of apoptotic cells.<br />

irrelevant information carried by the signal, and simplifies classification and segmentation by using a reduced number of model parameters. Autoregressive (AR) modeling is widely used for speech and biomedical signal processing [2-4]. This model is linear and has been used successfully for high-resolution spectral estimation [5]. An AR model is defined by the difference equation:<br />

x(n) = -Σ_{k=1}^{p} a_k x(n-k) + e(n),   (1)<br />

where x(n) is a wide-sense stationary³ AR process, {a_k} represents the AR coefficients, e(n) is white Gaussian noise and p is the model order, which determines the error criterion. In Section C, we will present a way to estimate this error and reduce it by choosing the proper model order (p).<br />
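Eq. (1) can be illustrated with a toy AR(2) process whose coefficients are then re-estimated by least squares. The coefficient values and the simulated signal are illustrative assumptions, not the paper's ultrasound data:

```python
import numpy as np

# Simulate a toy AR(2) process per Eq. (1) and re-estimate {a_k} by
# least squares.
rng = np.random.default_rng(0)
a_true = np.array([-0.75, 0.5])       # x(n) = 0.75 x(n-1) - 0.5 x(n-2) + e(n)
N, p = 5000, 2
x = np.zeros(N)
e = rng.normal(0.0, 1.0, N)
for n in range(p, N):
    x[n] = -a_true @ x[n - p:n][::-1] + e[n]   # past samples, newest first

# Linear prediction problem: x(n) ~ -sum_k a_k x(n-k).
X = np.column_stack([x[p - k:N - k] for k in range(1, p + 1)])
a_hat = -np.linalg.lstsq(X, x[p:], rcond=None)[0]
print(np.round(a_hat, 2))             # close to a_true
```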

B. Data Acquisition<br />

AML cells were grown in suspension and exposed to the chemotherapeutic cisplatin to induce apoptosis. Pellets were made by swing-bucket centrifugation. Details on the biological procedure can be found elsewhere (Czarnota et al.) [6]. A 20 MHz f2.35 or 40 MHz f2 transducer (VisualSonics⁴) was used to image the pellets of normal and apoptotic cells. RF backscatter data were digitized at 500 MHz and stored for later analysis. In one experiment, layers of normal and apoptotic cells were created to emulate a clinical situation.<br />

C. Choosing the proper Model Order<br />

The modeling order (p) controls the error associated with the AR signal approximation. This parameter<br />

² Hematoxylin and Eosin.<br />

³ A stochastic process is called wide-sense stationary (WSS) if its mean is constant and its autocorrelation depends only on the time difference.<br />

⁴ www.visualsonics.com<br />



determines the number of previous samples used to model the original signal. A small model order ignores the main statistical properties of the original signal, while a large model order results in modeling the noise associated with the data, and over-fitting⁵ occurs. A very common method for estimating the proper model order is the Akaike Information Criterion (AIC) [7], although applying this method would be very difficult in our work due to the nature of US signals. Instead, we used the following parameters, based on the statistics of the reconstructed signal and its frequency content at different model orders, to determine the best modeling order.<br />

a) Ensemble Reconstruction Error<br />

The error (2) shows the total difference between the original and reconstructed signals in the frequency domain using the AR modeling technique:<br />

x̂(n) = -Σ_{k=1}^{p} a_k x(n-k),<br />

E = Σ_{n=1}^{N} |f(n) - f̂(n)|,   (2)<br />

where x̂(n) is the approximated signal based on AR modeling with order p, N is the total number of samples within an individual RF line, and f and f̂ represent the FFTs of the original and estimated signals, respectively.<br />
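A minimal sketch of the ensemble reconstruction error, assuming a least-squares AR fit and a synthetic narrowband signal in place of a real RF line:

```python
import numpy as np

def ar_fit(x, p):
    """Least-squares AR(p) coefficients for x(n) = -sum_k a_k x(n-k) + e(n)."""
    X = np.column_stack([x[p - k:len(x) - k] for k in range(1, p + 1)])
    return -np.linalg.lstsq(X, x[p:], rcond=None)[0]

def ensemble_error(x, p):
    """Summed absolute difference of the FFTs of x and its AR reconstruction."""
    a = ar_fit(x, p)
    x_hat = np.zeros_like(x)
    for n in range(p, len(x)):
        x_hat[n] = -a @ x[n - p:n][::-1]
    return np.abs(np.fft.fft(x) - np.fft.fft(x_hat)).sum()

# Toy narrowband signal standing in for an RF line (not real US data).
rng = np.random.default_rng(0)
n = np.arange(2048)
x = np.sin(0.2 * np.pi * n) + 0.1 * rng.normal(size=n.size)

errors = {p: ensemble_error(x, p) for p in (1, 5, 15, 40)}
print(errors[15] < errors[1])     # -> True: the error shrinks with model order
```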

b) Model Noise (error) Variance<br />

The AR process is the output of an all-pole filter driven by white noise e(n). This noise, which is also our modeling error, can be viewed as the output of the prediction error filter A(z), as shown in Fig. 2, where x(n) is the original signal and A(z) is the transfer function of the AR model.<br />

Fig. 2. Block diagram of the AR process (model error).<br />

Therefore we expect that, after estimating the AR coefficients of our model, if we apply a filter as shown in Fig. 2 with the estimated AR coefficients in A(z), the filter output e(n) will be white Gaussian noise. We can verify this by estimating the variance of the output of such a filter and its auto-correlation (which jumps to one at zero lag and remains approximately zero otherwise).<br />
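The whiteness check can be sketched by filtering a toy AR(2) signal through its prediction-error filter A(z) and inspecting the residual's normalized autocorrelation (synthetic data, not US signals):

```python
import numpy as np

# Filter a toy AR(2) signal through its prediction-error filter
# A(z) = 1 + a1 z^-1 + a2 z^-2 and check that the residual is white.
rng = np.random.default_rng(0)
a = np.array([-0.75, 0.5])
N = 5000
x = np.zeros(N)
for n in range(2, N):
    x[n] = -(a[0] * x[n - 1] + a[1] * x[n - 2]) + rng.normal()

e = np.convolve(x, np.r_[1.0, a])[:N]     # residual e(n) = A(z) x(n)
# Normalized autocorrelation at lags 0..3: ~1 at zero lag, ~0 elsewhere.
rho = [float(e[k:] @ e[:N - k]) / float(e @ e) for k in range(4)]
print(round(rho[0], 2))                   # -> 1.0
print(all(abs(r) < 0.1 for r in rho[1:])) # -> True
```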

D. <strong>Signal</strong> Segmentation<br />

The classification methods we discussed were based on<br />

US backscatter from pure apoptotic and normal cell pellets.<br />

⁵ When the model does well on training data but poorly on test data.<br />


In patient imaging, the data are acquired from tissues which contain different layers, or layers with different mixtures of normal and apoptotic cells. The probabilistic behavior of the backscattered US signal from these cells makes the signal non-stationary⁶. This non-stationarity is important from the point of view of AR modeling, as this method is applicable only if the signal is stationary⁷. Therefore we must use signal segmentation algorithms to break the signal acquired from tissues into stationary segments and classify each segment separately. Segmentation algorithms can be classified into fixed⁸ [8] and adaptive [2, 9-11]. Adaptive segmentation algorithms rely on tracking the statistical changes in the signal (such as mean and variance) to set a breaking boundary. We used this method for US signals due to its accuracy, modularity and ease of testing [2].<br />

E. Adaptive <strong>Signal</strong> Segmentation: Recursive Least Squares Lattice Filter (RLSL)<br />

In adaptive segmentation, the segment length changes dynamically according to the statistical changes in the signal. The main idea of using the RLSL filter is to achieve fast convergence by using forward and backward filters. The parameter which expresses the statistical change in the signal is called the convergence factor, γ_m(n). The convergence factor provides the connecting link between the different sets of a priori and a posteriori estimation errors in this algorithm and<br />

is defined by<br />

γ_m(n) = γ_{m-1}(n) - b_{m-1}²(n) / B_{m-1}(n),   (3)<br />

where m is the order of the lattice filter, γ_m(n) is the convergence factor at time sample n in the mth stage of the lattice, and b_{m-1}(n) and B_{m-1}(n) are the backward prediction error and its power at that stage [2].<br />
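A toy sketch of this recursion, assuming the textbook form gamma_m(n) = gamma_{m-1}(n) - b_{m-1}(n)^2 / B_{m-1}(n); the lattice errors below are synthetic numbers, not the output of a full RLSL implementation:

```python
# Conversion-factor recursion with synthetic backward errors.
def convergence_factor(b, B):
    """b[m], B[m]: backward prediction error and its power at stage m."""
    gamma = 1.0                     # gamma_0(n) = 1 by definition
    for bm, Bm in zip(b, B):
        gamma -= bm ** 2 / Bm
    return gamma

# Stationary data: backward errors small relative to their power.
print(round(convergence_factor(b=[0.1, 0.1], B=[1.0, 1.0]), 2))   # -> 0.98
# A statistical jump inflates the errors and gamma drops toward zero,
# which is what flags a segment boundary.
print(round(convergence_factor(b=[0.7, 0.5], B=[1.0, 1.0]), 2))   # -> 0.26
```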

IV. RESULTS<br />

a) Model Order Determination for Autoregressive (AR)<br />

Modeling of US signals<br />

Using the error criteria explained in Section C, we calculated the error associated with the frequency content of the reconstructed and original US signals, averaged over 30 normal and apoptotic sample RF lines, respectively (Fig. 3). Matlab (version 6.5) was used for all the calculations. Also, as explained in Section D, we found the variance of the<br />

⁶ The statistics of a non-stationary process vary with respect to any translation along the time axis.<br />

⁷ We have determined that US signals from normal and apoptotic cells are quasi-stationary.<br />

⁸ Fixed segmentation algorithms are widely used for speech signal processing.<br />



estimated noise generated as the output of a filter with the estimated AR coefficients in its transfer function and the original signal as its input. The result of averaging the variance of this noise over 30 samples is shown in Fig. 4. These graphs indicate that model order 15 (p=15) is a good choice of AR modeling order for high frequency US backscatter signals, as we do not see much improvement in the ensemble error (the ratio of the error between model orders 15 and 40 is 2.6, in comparison to 2.9e5 between model orders 1 and 15). Furthermore, the variance of the estimated model noise does not change dramatically after this model order. To verify this result, we modeled a US backscatter signal with order 15, reconstructed the signal with the estimated AR coefficients and found the auto-correlation of the model error⁹ (noise). As depicted in Fig. 5, this auto-correlation indicates the similarity of the estimated error to white noise. Therefore we used AR modeling with order 15 for US backscatter signals in the rest of this paper.<br />

Fig. 3: Average ensemble error between the FFTs of the estimated and original US signals (30 samples of normal and apoptotic signals).<br />

Fig. 4: Average variance of the estimated model noise based on the estimated AR coefficients (30 samples).<br />

⁹ This error was assumed to be the absolute difference between the original and reconstructed signals.<br />


Fig. 5: Auto-correlation of the estimated model error (noise).<br />

b) Ultrasound <strong>Signal</strong> Classification<br />

Algorithm | Normal Accuracy | Apoptotic Accuracy<br />
Conditional Gaussian Classifier¹⁰ | 40% | 46%<br />
Naive Bayes Classifier | 60% | 71%<br />
Fisher's Linear Discriminant | 98% | 64%<br />
Neural Network¹¹ (sigmoid activation) | 93.8% | 99%<br />
Neural Network (tanh activation) | 95.5% | 99%<br />

These results show the ability of neural networks with non-linear activation functions (in both hidden and output layers) to classify US signals from normal and apoptotic cells. We are still investigating the advantages and disadvantages of each approach.<br />

c) Ultrasound <strong>Signal</strong> Segmentation<br />

Fig. 5 shows the RLSL algorithm applied to a layered Normal-Apoptotic-Normal cell pellet, with the apoptotic layer located between samples 800 and 1500. As long as the input data are stationary, the convergence factor remains in the same range, but when it drops below a<br />

¹⁰ The priors for each class were set equally (p=0.5).<br />

¹¹ The network was trained using 50,000 iterations.<br />



threshold¹², it indicates a sudden change in the statistical properties of the signal, which is set as the segment boundary.<br />

Fig. 5. (a) Original signal from a 3-layer Normal-Apoptotic-Normal cell pellet. (b) Convergence factor as a parameter to detect the layer (stationarity) boundaries.<br />

These figures indicate that the RLSL algorithm can detect the sudden changes in the signal due to the different statistical properties of the normal and apoptotic layers, and can therefore adaptively find their corresponding boundaries in a US backscatter signal. While in Fig. 5(a) the difference is evident, in clinical situations it is anticipated that a small percentage of apoptotic cells will be surrounded by normal cells.<br />

V. CONCLUSION<br />

The best model order for the AR technique applied to US signals was found to be p=15. The accuracy of different classifiers was studied, and it was found that non-linear neural networks were the most successful in classification. Because actual clinical data from patients include US backscatter from layers and mixtures of cells, a method for<br />

¹² The threshold in this work is set by visual inspection (however, in the future it will be extracted from the signal based on its statistical properties).<br />


differentiating these layers was presented, which enables AR modeling to be applicable to US signals.<br />

ACKNOWLEDGMENT<br />

We thank Dr. Michael Sherar and the Ontario Cancer Institute of the Princess Margaret Hospital for their support, Anoja Giles for helping us with the biological work, and Dr. Gregory Czarnota for his scientific input. Noushin R. Farnoud would also like to thank Dr. Sam Roweis of the Computer Science Department of the <strong>University</strong> of Toronto for his help with the machine learning algorithms.<br />

REFERENCES<br />

[1] MC. Kolios, GJ. Czarnota, M. Lee, JW. Hunt, MD. Sherar, Ultrasonic spectral parameter characterization of apoptosis, Ultrasound Med. & Biol., 28(5): 589-597, May 2002.<br />

[2] S. Krishnan, Adaptive Filtering, Modeling, and Classification of Knee Joint Vibroarthrographic <strong>Signal</strong>s, Master's Thesis, <strong>University</strong> of Calgary, Alberta, Canada, 1996.<br />

[3] F. Towfiq, C.W. Barnes, E.J. Pisa, Tissue classification based on autoregressive models for ultrasound pulse echo data, Acta Electronica, 1984, (26): 95-110.<br />

[4] A. Nair, BD. Kuban, N. Obuchowski, DG. Vince, Assessing spectral algorithms to predict atherosclerotic plaque composition with normalized and raw intravascular ultrasound, Ultrasound in Med. & Biol., 27(10): 1319-1331, 2001.<br />

[5] M. Akay, Time Frequency and Wavelets in Biomedical <strong>Signal</strong> Processing, Piscataway, NJ: IEEE Press, 1998: 123-135.<br />

[6] GJ. Czarnota, MC. Kolios, J. Abraham, M. Portnoy, FP. Ottensmeyer, JW. Hunt, MD. Sherar, Ultrasound imaging of apoptosis: high-resolution non-invasive monitoring of programmed cell death in vitro, in situ and in vivo, Br J Cancer, 1999 Oct; 81(3): 520-527.<br />

[7] Y. Sakamoto, M. Ishiguro, G. Kitagawa, Akaike Information Criterion Statistics, D. Reidel Publishing Company, KTK Scientific Publishers, Tokyo, ISBN 90-277-2253-6, November 1986.<br />

[8] J.D. Markel, A.H. Gray, Jr., Linear Prediction of Speech, Springer-Verlag, New York, NY, 1976.<br />

[9] D. Michael, J. Houchin, Automatic EEG analysis: A segmentation procedure based on the autocorrelation function, Electroenceph. Clinical Neurophysiology, (46): 232-235, 1979.<br />

[10] G. Bodenstein, H.M. Praetorius, Feature extraction from the electroencephalogram by adaptive segmentation, Proceedings of the IEEE, 65(5): 642-652, May 1977.<br />

[11] H.M. Praetorius, G. Bodenstein, O.D. Creutzfeldt, Adaptive segmentation of EEG records: A new approach to automatic EEG analysis, Electroencephalography and Clinical Neurophysiology, vol. 42, pp. 84-91, 1977.<br />

[12] T. Mitchell, Machine Learning, McGraw Hill, 1997.<br />

[13] C.D. Nugent, J.A. Lopez, A.E. Smith, Prediction Models in Design of Neural Network based ECG classifiers, BMC Medical Informatics and Decision Making, 2001.<br />

[14] S. Chakrabarti, N. Bindal, Robust Radar Target Classifier Using Artificial Neural Networks, IEEE Transactions on Neural Networks, 6(3), May 1995.<br />

[15] D. Docampo, Intelligent Methods in <strong>Signal</strong> Processing and Communications, Birkhauser, Boston, 1997.<br />

[16] D.M. Skapura, Building Neural Networks: Algorithms, Applications, and Programming Techniques, ACM Press, 1998.<br />

[17] J.A. Freeman, D.M. Skapura, Building Neural Networks, ACM Press, 1998.


ROBUST AUDIO WATERMARKING USING A CHIRP BASED TECHNIQUE<br />

Serhat Erkucuk, Sridhar Krishnan and Mehmet Zeytinoglu<br />

Department of Electrical and Computer Engineering<br />

<strong>Ryerson</strong> <strong>University</strong>, Toronto, ON M5B 2K3 Canada<br />

e-mail: {serkucuk, krishnan, mzeytin}@ee.ryerson.ca<br />

ABSTRACT<br />

In this study, we propose a new spread spectrum audio wa-<br />

termarking algorithm that embeds linear chirps as water-<br />

mark messages. Different chirp rates, i.e., slopes on the<br />

time-frequency (TF) plane, represent watermark messages<br />

such that each slope corresponds to a different message.<br />

We extract the watermark message using a line detection al-<br />

gorithm based on the Hough-Radon transform (HRT). The<br />

HRT detects the directional elements that satisfy a paramet-<br />

ric constraint in the image of a TF plane. The proposed<br />

method not only detects the presence of watermark, but also<br />

extracts the embedded watermark bits and ensures the mes-<br />

sage is received correctly. The results show that the HRT de-<br />

tects the embedded watermark message even after common<br />

signal processing operations such as MPEG audio coding,<br />

resampling, lowpass filtering and amplitude re-scaling.<br />

1. INTRODUCTION<br />

In recent years, the digital format has become the standard<br />

for the representation of multimedia content. Today’s tech-<br />

nology allows the copying and redistribution of multime-<br />

dia content over the Internet at a very low or no cost. This<br />

has become a serious threat for multimedia content owners.<br />

Therefore, there is significant interest to protect copyright<br />

ownership of multimedia content (audio, image, and video).<br />

Watermarking is the process of embedding additional data<br />

into the host signal for copyright ownership. The embed-<br />

ded data characterizes the owner of the data and should be<br />

extracted to prove the ownership. Besides copyright protec-<br />

tion, watermarking may be used for data monitoring, finger-<br />

printing, and observing content manipulations. All water-<br />

marking techniques should satisfy a set of requirements [1].<br />

In particular, the embedded watermark should be: (i) imper-<br />

ceptible, (ii) undetectable to prevent unauthorized removal,<br />

(iii) resistant to all signal manipulations, and (iv) extractable<br />

to prove ownership. Before the proposed technique is made<br />

public, all the above requirements should be met.<br />

This work was supported by NSERC and Micronet.<br />

The watermarking literature describes two classes of watermarking schemes. The first class of techniques, called one-bit watermarks [2], only detects the presence of the watermark rather than extracting it [3, 4, 5]. The second class of techniques detects and extracts the embedded watermark message [6, 7, 8]. If b bits represent the embedded watermark message, the detector has the task of correctly identifying the watermark message from an alphabet of 2^b<br />

possible watermark messages. As a result of signal manipu-<br />

lations some message bits extracted by the detector may be<br />

in error potentially resulting in the detection of the wrong<br />

watermark message. Our motivation for the proposed au-<br />

dio watermarking algorithm is to detect the presence of the<br />

watermark, extract the embedded watermark message bits<br />

and decide on the watermark message even if some bits are<br />

received in error. To achieve this goal, we embed linear<br />

chirps as watermark messages. Different chirp rates, i.e.,<br />

slopes on the TF plane, represent watermark messages such<br />

that each slope corresponds to a different message. The nar-<br />

rowband watermark messages are spread with a watermark<br />

key (binary PN sequence) across a wider range of frequen-<br />

cies before embedding. The resulting wideband noise is<br />

perceptually shaped and added to the original signal. The<br />

original and watermarked signals exhibit no perceptual dif-<br />

ferences. At the receiver a line detection algorithm based<br />

on the Hough-Radon transform (HRT) detects the slope of<br />

the extracted chirp in the image of the TF plane, even at<br />

discontinuities corresponding to bit errors.<br />

2. WATERMARK EMBEDDING<br />

Let x = [x(0) x(1) ...]^T be the audio signal, which we first divide into N-sample long blocks. We use the notation x_k = [x(kN) ... x((k+1)N-1)]^T to represent the samples of the kth audio block. Let m be a normalized chirp function that represents the watermark message to be embedded into the original signal. m takes continuous values in the interval [-1, 1], and needs to be quantized for the detection of each embedded bit. m_q is the quantized version of m, formed according to the sign of the sample values of m and taking values -1 and 1. Let m_k represent a watermark<br />

0-7803-7965-9/03/$17.00 ©2003 IEEE, ICME 2003<br />



message bit to be embedded into the kth audio block. We embed one watermark bit into each block. Each bit is spread with a binary PN sequence p_k with a chip length of N to generate the wideband noise vector w_k. We need to perceptually shape w_k before adding it to each block. To make w_k imperceptible, the amplitude level of the wideband noise should be attenuated to 0.5 percent of the dynamic range of the host signal [9]. Let w'_k = [w'(kN) ... w'((k+1)N-1)]^T be the signal-dependent wideband noise generated from w_k such that<br />

w'_k(n) = α w_k(n) |x_k(n)|,   (1)<br />

where α is the embedding strength. The high frequency band of the wideband noise sequence w'_k will not be robust to compression and lowpass filtering. Therefore, we generate the frequency-limited noise w''_k by lowpass filtering w'_k with a cut-off frequency of 10 percent (i.e., 2.2 kHz) of the maximum audio frequency, which represents the part of the signal spectrum with significant energy content. After limiting the maximum frequency of the wideband noise to 2.2 kHz, we found that α = 0.3 (independent of the audio signal) is sufficient to achieve imperceptibility. This value of α differs from what is used in [9] because we embed a frequency-limited noise rather than a wideband noise. The resulting watermarked audio signal block y_k equals<br />

y_k = x_k + w''_k.   (2)<br />

We repeat the process for each block until we embed all the bits in m_q. The resulting watermarked signal y is perceptually the same as the original signal x. Figure 1 provides an overview of the proposed watermark embedding scheme.<br />

Fig. 1. Watermark embedding scheme.<br />
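The embedding steps above can be sketched as follows. The block length, chirp parameters and per-block PN generation are illustrative assumptions, and the lowpass stage that produces the frequency-limited noise is omitted for brevity:

```python
import numpy as np

# Toy spread-spectrum embedding of a quantized chirp message.
rng = np.random.default_rng(0)
N, K = 1024, 64                            # samples per block, number of blocks
x = rng.normal(0.0, 0.2, N * K)            # stand-in for the host audio signal

t = np.linspace(0.0, 1.0, K)
m = np.cos(2 * np.pi * (2 * t + 10 * t ** 2))   # linear chirp message in [-1, 1]
mq = np.sign(m)                            # quantized bits, +/- 1
alpha = 0.3                                # embedding strength

y = x.copy()
for k in range(K):
    pk = rng.choice([-1.0, 1.0], N)        # binary PN spreading sequence (the key)
    blk = slice(k * N, (k + 1) * N)
    w = mq[k] * pk                         # one message bit spread over the block
    y[blk] += alpha * w * np.abs(x[blk])   # cf. Eq. (1): signal-dependent shaping
print(np.max(np.abs(y - x)) < 1.0)         # -> True: the perturbation stays small
```

In practice the receiver must regenerate the same PN sequences p_k from the shared watermark key in order to despread each block.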

3. WATERMARK DETECTION<br />

Under ideal signal conditions, the received signal will be identical to the transmitted sequence y_k. In Section 4, we will relax this condition and investigate the proposed watermarking scheme under the assumption that the received signal differs from y_k as a result of various signal processing operations. Assuming ideal signal conditions and perfect synchronization of the signal and the PN sequence, we first lowpass filter the received signal y_k to 2.2 kHz. Let<br />

y''_k represent the output of the lowpass filter in the receiver. Since w''_k is band-limited to 2.2 kHz, we can express y''_k as:<br />

y''_k = x'_k + w''_k,   (3)<br />


where x''_k is the audio signal component at the output of the lowpass filter. We then use the watermark key, i.e., the PN sequence p_k, to despread y''_k and integrate the resulting sequence to recover m_k, the embedded message bit for that block. Let ⟨y''_k, p_k⟩ be the output of this integration operation, where ⟨·,·⟩ represents the inner product operation.<br />
<br />
Fig. 2. Watermark bit detection scheme.<br />

Under the assumption that x_k is a zero-mean sequence which is statistically independent of p_k, we can approximate the expected value of ⟨y''_k, p_k⟩ by the expression<br />

E{⟨y''_k, p_k⟩} ≈ αβ m_k Σ_{n=0}^{N−1} |s_k(n)|, (4)<br />

where β is a positive constant resulting from the filtering operations. Therefore, the extracted message bit m̂_k, which estimates m_k, can be based on the test statistic ⟨y''_k, p_k⟩ such that<br />
<br />
m̂_k = sgn(⟨y''_k, p_k⟩). (5)<br />
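Under these assumptions the per-block bit decision reduces to a despread-and-sign operation. A hypothetical sketch (reusing an FFT brick-wall `lowpass` as a stand-in for the receiver filter; names are illustrative, not the authors' code):

```python
import numpy as np

def lowpass(sig, cutoff_hz, fs):
    """Brick-wall FFT lowpass used as a stand-in receiver filter."""
    spec = np.fft.rfft(sig)
    freqs = np.fft.rfftfreq(len(sig), d=1.0 / fs)
    spec[freqs > cutoff_hz] = 0.0
    return np.fft.irfft(spec, n=len(sig))

def detect_bit(y_k, p_k, fs, cutoff_hz=2200.0):
    """Recover one block's message bit from the test statistic <y''_k, p_k>."""
    y2_k = lowpass(y_k, cutoff_hz, fs)   # limit to the watermark band
    stat = np.dot(y2_k, p_k)             # despread and integrate
    return 1 if stat >= 0 else -1
```

Because the host audio is roughly uncorrelated with the PN sequence, the correlation statistic is dominated by the embedded term and its sign recovers the bit.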

To achieve improved watermark extraction performance we postprocess the extracted message bits using a time-frequency distribution (TFD). After all message bits are extracted, we construct the TFD (spectrogram) of the extracted bit sequence. The TFD of a chirp watermark message is a line with variable slope. A line detection algorithm based on the HRT then detects the presence of the line and estimates its parameters. This second stage, consisting of the TFD and the HRT, functions as an error-correcting technique and significantly increases the robustness of the proposed watermarking scheme.<br />
<br />
Fig. 3. Detection of the watermark message.<br />

The HRT is an efficient tool to detect energy-varying directional chirps [10]. It treats the TFD as an image, where each pixel value corresponds to the energy present at a particular time and frequency [10]. The Radon transform (RT) computes the projections of different angles of an image (TFD) or two-dimensional data distribution f(u, v) measured as line integrals along ray paths [11]:<br />

Authorized licensed use limited to: <strong>Ryerson</strong> <strong>University</strong> Library. Downloaded on July 7, 2009 at 11:03 from IEEE Xplore. Restrictions apply.


R(p, θ) = ∫∫ f(u, v) δ(p − (u cos θ + v sin θ)) du dv, (6)<br />
<br />
where θ is the angle of the ray path of integration, p is the distance of the ray path from the center of the image and δ is the Dirac delta function. Equation (6) represents integration of f(u, v) along the line p = u cos θ + v sin θ.<br />
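A discrete version of this projection can be sketched directly from the definition. The following is a hypothetical brute-force accumulator, closer to a gray-level Hough vote than to a fast Radon implementation; function and variable names are illustrative:

```python
import numpy as np

def hough_radon(img, n_theta=90, n_rho=101):
    """Accumulate gray-level line integrals R(rho, theta) over an image.

    Each pixel votes with its intensity for every (rho, theta) line
    passing through it: rho = u*cos(theta) + v*sin(theta), with (u, v)
    measured from the image center.
    """
    h, w = img.shape
    thetas = np.linspace(0.0, np.pi, n_theta, endpoint=False)
    rho_max = 0.5 * np.hypot(h, w)
    acc = np.zeros((n_rho, n_theta))
    vs, us = np.mgrid[0:h, 0:w]
    u = us - w / 2.0   # horizontal offset from image center
    v = vs - h / 2.0   # vertical offset from image center
    for j, th in enumerate(thetas):
        rho = u * np.cos(th) + v * np.sin(th)
        idx = np.round((rho + rho_max) / (2 * rho_max) * (n_rho - 1)).astype(int)
        np.add.at(acc[:, j], idx.ravel(), img.ravel())
    return acc, thetas, np.linspace(-rho_max, rho_max, n_rho)
```

A straight bright line in the image concentrates its votes into a single (rho, theta) cell, so the accumulator peak gives the line's orientation and offset, which is the property the watermark detector relies on.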

The Hough transform (HT) is a pattern-recognition method that calculates the number of image points satisfying a parametric constraint [12]. The HT can be applied only to binary images. However, TFDs can be gray-level images with intensity levels corresponding to energy values. The HRT method is the combination of the HT and the RT. It has the advantage over the HT that it can be applied to gray-level images to detect energy-varying chirp components [10]. The HRT method can also detect lines with discontinuities. This characteristic allows the extraction of the watermark message even if some of the watermark bits are incorrectly detected.<br />
<br />
Fig. 4. Line detection using HRT.<br />

Once the points on the two-dimensional data distribution f(u, v) (in this case the probability density function of the TFD) that satisfy directional parametric constraints are found, the presence of the chirp, i.e., the watermark, is decided. If a watermark is present, the slope of the chirp determines the watermark message.<br />

4. ROBUSTNESS TESTS AND DISCUSSIONS<br />

We evaluated the proposed scheme using 5 different audio signals (S1, …, S5) sampled at 44.1 kHz. Due to the limited resolution of the spectrogram, watermark messages are modulated as linear chirp functions with initial and final frequencies in one of the 17 frequency bands of 30 Hz bandwidth in the [0-510] Hz range. This approach allowed us to use a message alphabet with 289 possible watermark messages. We embedded these messages into audio signals of 40 second duration for a chip length of 10000, and into audio signals of 20 second duration for a chip length of 5000. Hence, each audio signal contains 176 embedded message bits. To measure the robustness of the system, we performed the following tests: (i) T1: MP3 compression with bit rate<br />


128 kbps, (ii) T2: MP3 compression with bit rate 80 kbps, (iii) T3: lowpass filtering to 4 kHz, (iv) T4: resampling at different sampling rates (22.05 kHz and 11.025 kHz), and (v) T5: amplitude scaling. We use the notation T0 to refer to watermark detection without signal manipulation. Therefore, the results corresponding to T0 serve as a reference.<br />

After the watermark-embedded signal y goes through a signal manipulation process, the message bits are extracted using the detection scheme described in Section 3. During all the robustness tests, we assumed that the audio signal and the PN sequence are synchronized. Tables 1 and 2 show the bit error rate (BER) results expressed as a percentage of the total number of message bits (176) for the two chip lengths and for each signal manipulation operation. Extracted bits<br />

Table 1. BER (in percentage) for N = 10000.<br />

Audio Sample | T0/T4/T5 | T1 | T2 | T3<br />
S4 | 3.98 | 3.98 | 3.98 | 10.80<br />
S5 | 3.98 | 3.98 | 4.55 | 9.66<br />
<br />
Table 2. BER (in percentage) for N = 5000.<br />

are localized on the TF plane using a spectrogram generated by a fixed-window-length short-time Fourier transform. Although some bits are received in error (even in the case of no signal manipulation), the HRT correctly detected the slope of the chirp functions in the image of the TF plane and successfully extracted the embedded watermark messages, thus providing error-correction capability. In the simulations reported in this study we detected all the embedded watermark messages correctly. Figure 5 shows the TFDs of the message bits embedded with chip length 5000 and extracted after various signal manipulations for the audio signal S5 of 20 second duration.<br />

The definition of the watermark message as a linear chirp function limits the data payload. We can increase the data payload by using any of the following techniques: (1) embedding watermark messages in shorter signal segments; (2) selecting the initial and final frequencies for the watermark messages from a wider frequency band than the current [0-510] Hz band; (3) narrowing the 30 Hz decision bands using higher TF resolution. We can improve TF<br />



Fig. 5. TFDs of the embedded and extracted bits.<br />

resolution by using an adaptive TF representation of the signal based on longer windows for slowly varying functions and shorter windows for quickly varying functions. However, any of the above techniques can potentially degrade the detectability of the watermark. We are currently investigating the potential of these techniques for increasing the data payload and their impact on the performance of the proposed watermark detection scheme.<br />

To test the robustness of the HRT with respect to large discontinuities, we corrupted the signal by adding white Gaussian noise of 5 second duration starting at the 15 second mark of a 40 second long audio sample. During this interval the watermark bit detection scheme incorrectly detected a significant number of bits (BER = 50%). Yet, the HRT successfully detected the slope of the linear chirp at the discontinuity and extracted the message.<br />

The initial robustness tests for additive noise, multiple watermarks, and multiple attacks resulted in small BERs; these scenarios will be evaluated further.<br />

5. CONCLUSIONS<br />

In this paper, we proposed a new audio watermarking algorithm that extracts the watermark message even if some of the message bits are extracted in error. A line detection algorithm based on the HRT detects the slope of the watermark message in the image of the TF plane of the signal. The HRT provides error-correcting capability and can be efficiently implemented as it operates on small images of the TF plane. Initial studies confirm that the proposed algorithm achieves robustness with respect to common signal manipulations. The current implementation yields a modest data payload. However, the use of higher-resolution TFDs promises to increase the data payload while retaining all the desirable characteristics of the proposed watermarking algorithm. We are currently working on the synchronization problem and the expansion of the robustness tests.<br />

6. REFERENCES<br />

[1] M. Arnold, "Audio watermarking: Features, applications and algorithms," IEEE Intl. Conf. on Multimedia and Expo, vol. 2, pp. 1013-1016, 2000.<br />
[2] I.J. Cox, M.L. Miller and J.A. Bloom, Digital Watermarking, San Diego, Academic Press, 2002.<br />
[3] S. Lee and Y. Ho, "Digital audio watermarking in the cepstrum domain," IEEE Trans. on Consumer Electronics, vol. 46, no. 3, pp. 744-750, August 2000.<br />
[4] P. Bassia, I. Pitas and N. Nikolaidis, "Robust audio watermarking in the time domain," IEEE Trans. on Multimedia, vol. 3, no. 2, June 2001.<br />
[5] D. Kirovski and H. Malvar, "Spread-spectrum audio watermarking: Requirements, applications, and limitations," IEEE Fourth Workshop on Multimedia Signal Processing, pp. 219-224, October 2001.<br />
[6] M.D. Swanson, B. Zhu and A.H. Tewfik, "Current state of the art, challenges and future directions for audio watermarking," IEEE Intl. Conf. on Multimedia Computing and Systems, vol. 1, pp. 19-24, 1999.<br />
[7] W.N. Lie and L.C. Chang, "Robust high quality time-domain audio watermarking subject to psychoacoustic masking," IEEE Intl. Symp. on Circuits and Systems, vol. 2, pp. 45-48, 2001.<br />
[8] J.W. Seok and J.W. Hong, "Audio watermarking for copyright protection of digital audio data," Electronics Letters, vol. 37, no. 1, pp. 60-61, Jan. 2001.<br />
[9] W. Bender, D. Gruhl, N. Morimoto and A. Lu, "Techniques for data hiding," IBM Systems Journal, vol. 35, nos. 3 & 4, pp. 313-336, 1996.<br />
[10] R.M. Rangayyan and S. Krishnan, "Feature identification in the time-frequency plane by using the Hough-Radon transform," Pattern Recognition, vol. 34, pp. 1147-1158, 2001.<br />
[11] G.T. Herman, Image Reconstruction from Projections: The Fundamentals of Computerized Tomography, New York, Academic Press, 1980.<br />
[12] R.O. Duda and P.E. Hart, "Use of the Hough transform to detect lines and curves in pictures," Communications of the ACM, 15(1): 11-15, January 1972.<br />



TIME-FREQUENCY FILTERING OF INTERFERENCES IN SPREAD SPECTRUM<br />

COMMUNICATIONS<br />

ABSTRACT<br />

A novel technique to excise single- and multi-component chirp-like interferences in direct sequence spread spectrum communications is proposed. The received signal is decomposed into its time-frequency (TF) functions using an adaptive signal decomposition algorithm and the TF functions are mapped onto the TF plane. The TF plane is optimized and treated as an image, and the interference represented in the TF plane is detected using the Hough-Radon transform (HRT). Simulation results with synthetic models have shown successful performance for the excision of linear and non-linear chirp interferences. The method has shown immunity to both white noise and multiple interferences even under very low SNR conditions of -10 dB.<br />

Keywords: interference excision, spread-spectrum communications, adaptive signal decomposition, Hough-Radon transform, time-frequency filtering<br />

1. INTRODUCTION<br />

Serhat Erkucuk and Sridhar Krishnan<br />
<br />
Department of Electrical and Computer Engineering<br />
<br />
Ryerson University, Toronto, ON M5B 2K3, Canada<br />
<br />
e-mail: (serkucuk)(krishnan)@ee.ryerson.ca<br />

In spread spectrum communications, the message signal is modulated and spread over a wider bandwidth with a pseudo-noise (PN) code also known at the receiver, and is transmitted over the channel. The bandwidth increase of the transmitted signal yields a processing gain, defined as the ratio of the bandwidth of the transmitted signal to the bandwidth of the message signal. Although the processing gain provides a high degree of interference suppression, there is a trade-off between increasing the processing gain and the available frequency spectrum. In the case of a high interference-to-signal ratio (ISR), a spread spectrum system with a limited processing gain may not be able to suppress the interference. Therefore, excising the interference prior to despreading the received signal is necessary to increase the performance of the system [1].<br />

In this study, we will evaluate the proposed interference excision algorithm using the direct sequence spread spectrum (DSSS) system, one of the most widely used spread spectrum techniques [1]. In DSSS, m_k, the kth bit of the<br />
<br />
This work was supported by NSERC and Micronet.<br />

0-7803-7946-2/03/$17.00 ©2003 IEEE<br />

message signal m(t), is multiplied with a PN code p(t), where each message bit occurs every T_m seconds and the PN code bits every T_p seconds. The processing gain, i.e., the length of the PN code, is therefore L = T_m/T_p, where T_m >> T_p. During the transmission of the modulated signal, additive white Gaussian noise n(t) and interference i(t) are added to the signal in the channel, and the following signal is received:<br />
<br />
r(t) = m_k p(t) + n(t) + i(t). (1)<br />

At the receiver, the received signal r(t) is synchronized and correlated with the same PN code p(t) and the estimate of the message bit m̂_k is made:<br />
<br />
m̂_k = ⟨r(t), p(t)⟩ = m_k ⟨p(t), p(t)⟩ + ⟨n(t), p(t)⟩ + ⟨i(t), p(t)⟩. (2)<br />
<br />
As seen in the above equation, despreading of the received signal recovers the message signal, while spreading the noise and the interference. The decision is made on the polarity of m̂_k. If the ratio of the interference power to the signal power is so large that the processing gain cannot suppress the interference, then the estimate of the message bit, m̂_k, may be wrong. Therefore the interference should be suppressed prior to despreading.<br />
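The spread-despread cycle of Eqs. (1) and (2) can be illustrated with a toy baseband model. This is a hypothetical NumPy sketch with illustrative names; it uses rectangular chips and one bit per PN period, and is not the authors' simulation code:

```python
import numpy as np

def spread(bits, pn):
    """Spread each +/-1 message bit by the full PN chip sequence."""
    return np.concatenate([b * pn for b in bits])

def despread(received, pn):
    """Correlate each chip block with the PN code and decide on polarity (Eq. 2)."""
    L = len(pn)
    blocks = received.reshape(-1, L)
    stats = blocks @ pn            # <r, p> per message bit
    return np.where(stats >= 0, 1, -1)

rng = np.random.default_rng(0)
pn = rng.choice([-1.0, 1.0], size=128)       # processing gain L = 128
bits = np.array([1, -1, 1, 1, -1])
tx = spread(bits, pn)
noise = 0.5 * rng.standard_normal(tx.size)   # channel noise n(t)
interf = 2.0 * np.cos(2 * np.pi * 0.05 * np.arange(tx.size))  # narrowband i(t)
rx = tx + noise + interf
```

With this moderate interference level the correlator's signal term (of magnitude L) dominates the spread noise and interference terms, so the bits are recovered; a much stronger interference would overwhelm the processing gain, which is the case the excision algorithm targets.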

Some excision techniques such as adaptive notch filtering, decision-directed adaptive filtering, and analog-to-digital conversion techniques are commonly used to suppress narrowband interferences in DSSS [1]. However, if the interference has a narrow instantaneous bandwidth in a wideband frequency range, as chirps do, time-frequency (TF) methods perform well in localizing the interference [2]. Several techniques have been proposed to suppress the interference using time-frequency distributions (TFDs) of the signal [3, 4, 5]. TFDs localize any interference in both the time and frequency domains [2], and are ideally suited for interference excision. The commonly used TFDs suffer from a trade-off between TF resolution and cross-term suppression.<br />

Authorized licensed use limited to: Ryerson University Library. Downloaded on July 7, 2009 at 09:49 from IEEE Xplore. Restrictions apply.<br />
<br />
In this paper, we focus on a new excision technique based on constructing a positive TFD [6, 7, 8] of the received spread spectrum signal using an adaptive signal decomposition technique [9]. The block diagram of the proposed interference excision algorithm is shown in Figure 1. By decomposing a signal into components, the interaction between components can be avoided, and the TFD constructed by combining the TFDs of the individual components is free of cross-terms. Also, by using Gaussian functions as bases for decomposition, a high TF resolution of interference signals can be achieved. By treating the TF plane as an image, the interference patterns can be detected by using the image analysis technique of the Hough-Radon transform (HRT). Curves with mathematical equations can be easily detected by transforming the shapes in the TF image into the Hough domain (also known as the parametric domain) and searching for dominant peaks (maximum values). The co-ordinates of the dominant peaks provide the parameters of the shape. Interferences are reconstructed by suitably thresholding the corresponding energy values in the TF plane and subtracted from the received spread spectrum signal.<br />

The paper is organized as follows: in Section 2, the construction of the TF image is explained. The HRT theory for detection of chirps in the TF image is explained in Section 3. In Section 4, the performance of the proposed system is evaluated in terms of ISR, bit error rate (BER) and average chip error rate. The paper is concluded in Section 5.<br />

Figure 1: Interference excision algorithm.<br />

2. CONSTRUCTION OF TIME-FREQUENCY<br />

IMAGE<br />

The adaptive signal decomposition algorithm we use to decompose the signal into its TF functions is the matching pursuit (MP) algorithm [9]. In MP, the received signal r(t) is decomposed into a linear combination of TF functions g_γn(t) selected from an overcomplete dictionary of TF functions. The signal r(t) can be represented as<br />
<br />
r(t) = Σ_{n=0}^{∞} a_n g_γn(t), (3)<br />
<br />
where<br />
<br />
g_γn(t) = (1/√s_n) g((t − p_n)/s_n) exp[j(2π f_n t + φ_n)], (4)<br />
<br />
and a_n are the expansion coefficients. The scale factor s_n controls the width of the window function and p_n is the temporal placement coefficient. The parameters f_n and φ_n represent the frequency and the phase of the exponential function, respectively. The signal r(t) is projected onto an overcomplete dictionary of TF functions with all possible window sizes, frequencies and temporal placements. At each iteration, the best-correlated function is selected from the dictionary and the remainder of the signal, called the residue, is further decomposed using the same iteration procedure. For our application, we use the Gabor dictionary consisting of Gaussian functions. Gaussian functions satisfy the minimum time-bandwidth product and represent the signal on the TF plane with an optimal time-frequency resolution [2]. After M iterations, the signal r(t) can be represented as<br />

r(t) = Σ_{n=0}^{M−1} ⟨R^n r, g_γn(t)⟩ g_γn(t) + R^M r, (5)<br />
<br />
where R^n r represents the residue of the signal r(t) after n iterations. The first term in Eqn. 5 represents the first M Gaussian functions matching the signal best (we will refer to the first term as r'(t)) and the second term (referred to as r''(t)) represents the residue of the signal r(t). In order<br />

for the signal to be fully decomposed, the iteration process should continue until all the energy of the residue signal is consumed. In this study, we are interested in modeling the interferences with power higher than the power of the transmitted signal. The unmodeled part of the interference is suppressed by the processing gain. Therefore, to reduce the computational load, we stop the iterations when the power of the residue signal r''(t) becomes less than the expected power of the interference-free signal. After the signal decomposition is achieved, the TFD W(t, ω) may be constructed by taking the Wigner-Ville distribution (WVD) [2] of the Gaussian functions represented in r'(t):<br />

W(t, ω) = Σ_{n=0}^{M−1} |⟨R^n r, g_γn(t)⟩|² W_{gγn}(t, ω) + Σ_{n=0}^{M−1} Σ_{m=0, m≠n}^{M−1} ⟨R^n r, g_γn(t)⟩ ⟨R^m r, g_γm(t)⟩ W_{gγn, gγm}(t, ω), (6)<br />

where W_{gγn}(t, ω) is the WVD of the Gaussian window function. The double sum corresponds to the cross-terms of the WVD and should be rejected in order to obtain a cross-term free energy distribution of r'(t) in the TF plane [9]. Therefore the resulting TFD is given by the first term of W(t, ω), which we denote by W'(t, ω). W'(t, ω) is a positive and cross-term free distribution, but it does not satisfy the marginal properties<br />

∫ W'(t, ω) dω ≠ |r'(t)|² and ∫ W'(t, ω) dt ≠ |R'(ω)|², (7)<br />

required in order to be a proper TFD for feature identification applications. The TFD W'(t, ω) may be modified to satisfy the marginal requirements and still preserve its important properties. The cross-entropy minimization method can be used to optimize the TFD [8]. The resulting TFD is a true probability density function and can be used for feature identification [6]. Let us denote the optimized TFD by W''(t, ω).<br />

Authorized licensed use limited to: <strong>Ryerson</strong> <strong>University</strong> Library. Downloaded on July 7, 2009 at 09:49 from IEEE Xplore. Restrictions apply.


3. INTERFERENCE DETECTION: HOUGH AND RADON TRANSFORM<br />

The combined Hough and Radon transform (HRT) is an efficient tool to detect energy-varying directional chirps [10]. In the HRT, the TFD is treated as an image, where each pixel value corresponds to the energy present at a particular time and frequency. For convenience, we will refer to the gray-level image of the optimized TFD W''(t, ω) as f(u, v). The Radon transform (RT) computes the projections of different angles of an image or two-dimensional data distribution f(u, v) measured as line integrals along ray paths. The RT can be expressed as<br />
The RT can be expressed as<br />

R = 11: f(u, u)6(p - ( UCOS~ + UsinO)) dudu, (8)<br />

where 8 is the angle of the ray path of integration, p is the<br />

distance of the ray path from the center of the image and<br />

6 is the Dirac delta function. The equation represents integration<br />

of f (U, U) along the line p = U cos 8 + U sin 8.<br />

The Hough transform (HT) is a paitem-recognition method<br />

that calculates the number of image points that satisfy a<br />

parametric constraint (quadratic interferences are modeled<br />

as second order equations as in [lo]). The HT can be applied<br />

to binary images only. The advantage of the combined<br />

HRT over HT is that it can be applied to gray-level images<br />

where we can detect energy varying chirp components as<br />

well. Once the points on the two-dimensional data distribution<br />

f (U, U) that satisfy directional parametric constraints<br />

are found, we transform the parameters to the TF domain<br />

and threshold the energy values of the TF functions corresponding<br />

to the directional interference on the TF plane. As<br />

illustrated in Figure I, the estimate ofthe interference ;(t) is<br />

reconstructed and subtracted from the received spread spectrum<br />

signal.<br />

4. EXPERIMENTAL RESULTS<br />

In our simulations, we used 128 chips per message bit for spreading the message signal and assumed the channel to be non-dispersive. We considered synthetic linear, quadratic, and multiple (linear and quadratic) chirps as the interference sources. We initially evaluated the bit error rates (BERs) resulting from the presence of constant-amplitude linear and quadratic chirps that sweep the entire frequency band of the spread spectrum signal, for different ISRs in the range of [0, 50] dB. We assumed the SNR to be 10 dB for each case. When the ISR was below 10 dB, the system was able to despread the interference so that no bit errors occurred at the receiver. For ISRs in the range of [10, 50] dB, we suppressed the single and multiple interferences using the proposed excision algorithm before despreading. Multiple interferences included a linear and a quadratic chirp in the same TF domain. We recorded no bit errors after the excision of single and multiple interferences. We repeated the same process<br />


for different SNR values in the range of [-10, 10] dB and also recorded no bit errors. One of the main reasons for this is the accurate TF representation of interferences in the adaptive TF plane and their successful detection and filtering by the HRT. A similar observation was made by Bultan et al. in [11], where they represent linear interferences with good TF localization using adaptive chirplet decomposition. However, they do not report any results on the excision of quadratic and multiple interferences. Other TFD-based methods reported bit errors for similar excision conditions [3, 4, 5]. Since interferences with different power levels were successfully removed from the signal resulting in no bit errors, we evaluated our system by calculating the percentage of chips received in error for various SNR values. Figures 2 and 3 show the simulation results for ISR values of 40 and 5 dB, respectively. The first ISR value is chosen as 40 dB because the system gives around 50% BER (the case when the system cannot reject any part of the interference) when there is no excision.<br />

Figure 2: Probability of chips in error for ISR=40 dB.<br />

The second ISR value is chosen as 5 dB, where the system can reject the interference without pre-processing prior to despreading. In some of the proposed systems, excision of an interference with low power degrades the performance of the system [3], whereas our system substantially improves the chip error rate. For illustration purposes, TFDs of (i) the SS signal with a single interference (ISR = 5 dB), (ii) the detected interference, and (iii) the interference-suppressed SS signal are shown in Figure 4. The experimental results show that the proposed technique can be successfully used for excision of single and multiple-component chirp-like interferences using adaptive TFDs and the HRT, whereas Bultan et al. [11] focus only on suppression of linear chirps and Amin uses different kernels for different interferences [3].<br />



Figure 3: Probability of chips in error for ISR=5 dB.<br />

Figure 4: TFDs of (i) SS signal with a linear interference (chirp), (ii) estimate of the interference, (iii) interference-filtered SS signal.<br />

5. CONCLUSIONS<br />

A new technique is introduced for the excision of frequency-modulated interferences in spread spectrum communications. The localization of the interference is provided by an adaptive signal decomposition algorithm using Gaussian functions as bases and rejecting the cross WVDs of the Gaussian functions. Therefore, single and multiple time-varying FM interferences are represented with good resolution on the TF plane. The interference is then detected using a line detection algorithm, the HRT. The estimated interference is subtracted from the signal prior to despreading. The simulation results for the proposed algorithm showed no bit errors after suppressing the interference for different ISR values, even under very low SNR conditions of -10 dB. The performance of the system is evaluated by calculating the received chips in error before and after interference suppression. The proposed technique improves the performance of the system by reducing the number of chips received in error after excision of the interference in both cases, when the ISR is low or high. This technique can be used for any kind of chirp-like interference suppression with high accuracy.<br />

6. REFERENCES<br />

[1] J.D. Laster and J.H. Reed, "Interference rejection in digital wireless communications," IEEE Signal Processing Mag., pp. 37-62, May 1997.<br />
[2] L. Cohen, "Time-frequency distributions - A review," Proc. IEEE, vol. 77, pp. 941-981, 1989.<br />
[3] M.G. Amin, "Interference mitigation in spread spectrum communication systems using time-frequency distributions," IEEE Trans. Signal Processing, vol. 45, no. 1, pp. 90-101, Jan. 1997.<br />
[4] S. Barbarossa and A. Scaglione, "Adaptive time-varying cancellation of wideband interferences in spread-spectrum communications based on time-frequency distributions," IEEE Trans. Signal Processing, vol. 47, no. 4, pp. 957-965, Apr. 1999.<br />
[5] X. Ouyang and M.G. Amin, "Short-time Fourier transform receiver for nonstationary interference excision in direct sequence spread spectrum communications," IEEE Trans. Signal Processing, vol. 49, no. 4, pp. 851-863, Apr. 2001.<br />
[6] S. Krishnan, "Adaptive Signal Processing Techniques for Analysis of Knee Joint Vibroarthrographic Signals," Ph.D. Thesis, University of Calgary, June 1999.<br />
[7] L. Cohen and T. Posch, "Positive time-frequency distribution functions," IEEE Trans. Acoust., Speech, Signal Processing, vol. ASSP-33, no. 1, pp. 31-38, 1985.<br />
[8] P.J. Loughlin, J.W. Pitton and L.E. Atlas, "Construction of positive time-frequency distributions," IEEE Trans. Signal Processing, vol. 42, no. 10, pp. 2697-2705, Oct. 1994.<br />
[9] S.G. Mallat and Z. Zhang, "Matching pursuits with time-frequency dictionaries," IEEE Trans. Signal Processing, 41(12): 3397-3415, 1993.<br />
[10] R.M. Rangayyan and S. Krishnan, "Feature identification in the time-frequency plane by using the Hough-Radon transform," Pattern Recognition, vol. 34, pp. 1147-1158, 2001.<br />
[11] A. Bultan and A.N. Akansu, "A novel time-frequency exciser in spread spectrum communications for chirp-like interference," Proc. ICASSP-1998, pp. 3265-3268, 1998.<br />

Authorized licensed use limited to: <strong>Ryerson</strong> <strong>University</strong> Library. Downloaded on July 7, 2009 at 09:49 from IEEE Xplore. Restrictions apply.


A GENERAL PERCEPTUAL TOOL FOR EVALUATION OF AUDIO CODECS<br />

Karthikeyan Umapathy<br />

Dept. of Electrical and Computer Eng.,<br />

The <strong>University</strong> of Western Ontario,<br />

London, Ontario, CANADA.<br />

Email: kumapath@uwo.ca<br />

Abstract<br />

Subjective evaluation forms an important part of any<br />
research work where the feedback and perception of the general<br />
public or a trained set of specialists are mandatory. Many<br />
audio and video coding techniques have emerged to tackle<br />
the bandwidth problems imposed by the Internet with data<br />
compression schemes of either lossless or perceptually<br />
lossless quality. In order to evaluate the performance of<br />
these techniques, a Mean Opinion Score (MOS) test has to<br />
be performed with a wide variety of subjects. In this paper we<br />
present a MOS tool developed to evaluate audio codecs<br />
in both controlled and uncontrolled listening environments.<br />
The technique is based on the International Telecommunication<br />
Union - Radiocommunication sector (ITU-R) standard.<br />
This novel approach of performing distributed listening<br />
tests in uncontrolled environments will help researchers<br />
collect substantial feedback and perform statistical analysis<br />
of an audio codec's performance in an efficient manner,<br />
particularly for Internet-driven applications. Results of perceptual<br />
evaluation of 8 sample audio files of different varieties<br />
with an adaptive time-frequency transform (ATFT)<br />
audio codec indicate the ease, independence, and effectiveness<br />
of performing MOS studies with the proposed<br />
technique.<br />

Keywords: mean opinion score (MOS), subjective evalua-<br />

tion, multimedia, audio coding, listening experiments.<br />

1. INTRODUCTION<br />

Subjective evaluation of audio quality is needed to assess<br />
the performance of audio codecs. Even though there are<br />
objective measures such as signal-to-noise ratio (SNR), total<br />
harmonic distortion (THD), and noise-to-mask ratio [1],<br />
they do not give a true evaluation of an audio codec,<br />
particularly one that uses a lossy scheme, as many<br />
existing well-known audio codecs do. For example, in a<br />
perceptual coder SNR is lost, yet the audio quality is<br />
claimed to be perceptually distortionless. In this case the<br />
SNR measure may not give a correct evaluation of the<br />
coder's performance.<br />

Thanks to Minanet and NSERC organizations for funding this project.<br />

CCECE 2003 - CCGEI 2003, Montreal, May/mai 2003<br />

0-7803-7781-8/03/$17.00 © 2003 IEEE<br />

Sridhar Krishnan and Garabet Sinanian<br />

Dept. of Electrical and Computer Eng.,<br />

<strong>Ryerson</strong> <strong>University</strong>,<br />

Toronto, Ontario, CANADA.<br />

Email: (krishnan)(gsinania)@ee.ryerson.ca<br />

The proposed technique uses the subjective evaluation<br />
method recommended by the International Telecommunication<br />
Union - Radiocommunication sector (ITU-R) standards,<br />
called "double blind triple stimulus with hidden<br />
reference" [1]. In this method listeners are provided<br />
with three choices A, B and C for each sample under test. A<br />
is the reference/original signal; B and C are assigned to be<br />
either the reference/original signal or the compressed signal<br />
under test. The selection of the reference or compressed signal<br />
for B and C is completely randomized. Figure 1 explains<br />
the choices A, B and C. For each sample audio signal, subjects<br />
listen to reference signal A, and compare B and C with<br />
A. After each comparison of B with A, and C with A,<br />
they grade the quality of the B and C signals with respect<br />
to A on 5 levels as shown in Table 1. Both the listener and<br />
the test administrator are unaware of which signals B and C<br />
contain, so the test is double-blind, and since<br />
three stimuli are provided it is called the double-blind triple<br />
stimulus method.<br />

Fig. 1. Block diagram explaining MOS choices A, B, and<br />
C.<br />

Audio Quality   | Level of Distortion<br />
Excellent       | Imperceptible<br />
Good            | Just perceptible but not annoying<br />
Fair            | Perceptible and slightly annoying<br />
Poor            | Annoying but not objectionable<br />
Unsatisfactory  | Very annoying and objectionable<br />

Table 1. Description of the ratings used in the Mean Opinion<br />
Score.<br />
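The trial setup above can be sketched as follows; this is a hypothetical illustration of the double-blind triple stimulus assignment and the Table 1 grades, not the authors' actual tool (helper names and file names are made up).<br />

```python
import random

# Table 1 grades: 5 (Imperceptible) down to 1 (Very annoying).
GRADES = {
    5: "Imperceptible",
    4: "Just perceptible but not annoying",
    3: "Perceptible and slightly annoying",
    2: "Annoying but not objectionable",
    1: "Very annoying and objectionable",
}

def make_trial(reference, coded, rng=random):
    """A is always the reference; B and C carry the reference and the
    coded signal in a randomized, hidden order (hidden reference)."""
    pair = [("ref", reference), ("coded", coded)]
    rng.shuffle(pair)
    return {"A": ("ref", reference), "B": pair[0], "C": pair[1]}

trial = make_trial("orig.wav", "atft.wav", rng=random.Random(0))
# Exactly one of B and C is the hidden reference.
hidden = [k for k in ("B", "C") if trial[k][0] == "ref"]
assert len(hidden) == 1
```

Because neither the listener nor the person running the trial sees which of B and C holds the coded signal, grading both against A implements the double-blind comparison described above.<br />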



In this paper, we propose a subjective evaluation scheme<br />

in line with the above explained "double blind triple stim-<br />

ulus with hidden reference" for an adaptive time-frequency<br />

transform (ATFT) based audio codec. The paper is organized<br />
as follows: Section 2 briefly introduces the ATFT coder and<br />
describes the measurement procedure, and Section 3 covers<br />
results and discussions.<br />

2. METHODOLOGY<br />

2.1. Adaptive Time-frequency Transform<br />

(ATFT) codec<br />

The ATFT audio codec is based on the matching pursuit<br />
(MP) [2] algorithm, where any signal $z(t)$ is decomposed<br />
into a linear combination of TF functions $g_{\gamma_n}(t)$ selected from<br />
a redundant dictionary of TF functions:<br />

$z(t) = \sum_{n=0}^{\infty} a_n \, g_{\gamma_n}(t)$,  (1)<br />

where<br />

$g_{\gamma_n}(t) = \frac{1}{\sqrt{s_n}} \, g\!\left(\frac{t - p_n}{s_n}\right) \exp\left(j(2\pi f_n t + \phi_n)\right)$,  (2)<br />

and $a_n$ are the expansion coefficients. The scale factor $s_n$, also<br />
called the octave parameter, is used to control the width<br />
of the window function, and the parameter $p_n$ controls the<br />
temporal placement. The parameters $f_n$ and $\phi_n$ are the frequency<br />
and phase of the exponential function respectively.<br />

The signal $z(t)$ is projected over a redundant dictionary<br />
of TF functions with all possible combinations of scalings,<br />
translations and modulations. The dictionary of TF functions<br />
can be suitably modified or selected based on<br />
the application at hand. In our technique, we use the<br />
Gabor dictionary (Gaussian functions), which has the best<br />
TF localization properties [3]. At each iteration, the TF<br />
function best correlated with the local signal structures is<br />
selected from the dictionary. The remaining signal, called the<br />
residue, is further decomposed in the same way at each iteration,<br />
subdividing it into TF functions. After M iterations, the<br />
signal $z(t)$ can be expressed as<br />

$z(t) = \sum_{n=0}^{M-1} \langle R^n z, g_{\gamma_n} \rangle \, g_{\gamma_n}(t) + R^M z(t)$,  (3)<br />

where the first part of $z(t)$ is the decomposed TF functions<br />
up to M iterations, and the second part is the residue, which<br />
will be decomposed in subsequent iterations. This process<br />
is repeated until the total energy of the signal is decomposed.<br />

The decomposition parameters ($s_n$, $p_n$, $f_n$, $\phi_n$ and<br />
$a_n$) are further processed by applying energy thresholding<br />
and perceptual filtering followed by quantization to obtain<br />
a compact representation of the audio signal. More details<br />
on the ATFT audio coding technique can be found in some of<br />
our earlier works [4, 5]. The overall block diagram of the<br />
ATFT codec is shown in Figure 4. Two versions (standard<br />
and fast) of the MP-algorithm-based ATFT codec were evaluated<br />
using the proposed subjective evaluation technique. A<br />
separate subjective evaluation of the perceptual model and<br />
quantization stage of the ATFT codec is also included.<br />
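The iterative decomposition of Eqs. (1) and (3) can be illustrated with a minimal matching-pursuit sketch; the toy Gabor-like dictionary and all parameter values below are illustrative assumptions, not the codec's actual dictionary.<br />

```python
import numpy as np

# Toy Gabor-like atom: Gaussian window at position u, scale s,
# modulated by a cosine at normalized frequency f, unit-normalized.
def gabor_atom(N, s, u, f):
    t = np.arange(N)
    g = np.exp(-np.pi * ((t - u) / s) ** 2) * np.cos(2 * np.pi * f * t)
    return g / np.linalg.norm(g)

def matching_pursuit(x, atoms, M):
    """At each iteration, subtract the best-correlated atom from the
    residue: R^{n+1}x = R^n x - <R^n x, g> g (cf. Eq. (3))."""
    residue = x.astype(float).copy()
    coeffs = []
    for _ in range(M):
        corr = [atoms[k] @ residue for k in range(len(atoms))]
        k = int(np.argmax(np.abs(corr)))
        a = corr[k]
        residue -= a * atoms[k]
        coeffs.append((k, a))
    return coeffs, residue

N = 128
dictionary = [gabor_atom(N, s, u, f)
              for s in (8, 32) for u in (32, 96) for f in (0.05, 0.2)]
x = 2.0 * dictionary[3] + 0.5 * dictionary[6]
coeffs, res = matching_pursuit(x, dictionary, M=2)
# The residue energy decreases with each iteration.
assert np.linalg.norm(res) < np.linalg.norm(x)
```

The real codec searches over all scalings, translations and modulations of a Gaussian window; this sketch only enumerates a handful of atoms to show the greedy selection step.<br />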

2.2. Measurement procedure<br />

Evaluation of any audio coding technique involves performing<br />
subjective evaluation of the compressed audio quality.<br />
The standard procedure to obtain quantitative and qualitative<br />
data about a coder's performance is to perform<br />
mean opinion score (MOS) studies.<br />

2.2.1. MOS in controlled environment. The experimental<br />
setup consisted of a Pentium III PC with Windows 2000.<br />
Eight sample stereo signals were played through a Creative<br />
Sound Blaster card to a professional high-quality headset<br />
(Sennheiser) with a fixed volume output. A graphical user<br />
interface (GUI) was designed as shown in Figure 2. Three<br />
stimuli A, B, and C are provided as explained in Section<br />
1. Listeners are allowed to do the tests by themselves under<br />
the supervision of the research team. Ratings are automatically<br />
recorded in a report file as the listener proceeds through<br />
all 8 stereo samples. The listener is allowed to<br />
advance to the next sample only after he/she grades the current<br />
sample. Twenty listeners (randomly selected) who gave<br />
consent participated in the MOS studies. Once<br />
testing was finished for all the subjects, the average MOS<br />
scores were computed for each sample. Table 2 shows the<br />
average MOS values obtained for the 8 signals. Figure 3<br />
shows the distribution of the MOS scores for each of the<br />
eight sample signals. It can be observed from Table 2<br />
that classical-like audio signals such as Harp, Harpsichord,<br />
Piano, and Tubularbell received higher MOS scores compared<br />
to the rock-like signals (Acdc, Deflep) and signals with voice<br />
segments (Enya, Visit).<br />


Table 2. Average MOS for 20 listeners.<br />
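The per-sample averaging behind Table 2 is a simple arithmetic mean over listeners; a small sketch with made-up grades (the ratings below are illustrative, not the study's data):<br />

```python
# Grades are on the 1-5 scale of Table 1; one list per sample,
# one entry per listener (values invented for illustration).
ratings = {
    "Harp": [5, 5, 4, 5],
    "Acdc": [3, 4, 3, 3],
}

def mean_opinion_score(grades):
    # MOS for one sample = arithmetic mean of all listener grades.
    return sum(grades) / len(grades)

mos = {name: mean_opinion_score(g) for name, g in ratings.items()}
assert mos["Harp"] == 4.75
```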



Fig. 2. Snapshot of the GUI used for MOS studies<br />


Fig. 3. Distribution of the MOS scores for 20 listeners<br />

In order to evaluate the performance of the developed<br />

perceptual model and the quantization stage of the ATFT<br />

codec, a second MOS study was conducted with 5 listeners.<br />

The procedure was repeated but the choices A, B and C were<br />

assigned as shown in Figure 4. The output of the TF decom-<br />

position (TF modeling stage) forms the input to the percep-<br />

tual filtering module hence the reference A was assigned to<br />

the reconstructed signal at the output of the TF modeling<br />

stage. Similarly choice B was assigned to the reconstructed<br />

signal at the output of the perceptual filtering module and C<br />

to the reconstructed signal at the output of the quantization<br />

stage. Listeners were asked to rate the choices B (percep-<br />

tual filtering output) and C (quantization stage output) with<br />

the reference A (TF modeling output) on a scale of 1 to 5 as<br />

explained in Section 1.<br />

The results were averaged over the five listeners and are given<br />
in Table 3. From Table 3 it can be observed that, on average,<br />
MOS scores of 4.6 and 4.3 were achieved for the perceptual<br />
filtering stage and the quantization stage respectively. The<br />
MOS scores indicate that the perceptual filtering technique<br />
proposed in the ATFT codec performs exceedingly well<br />
on the eight sample signals, and the noise introduced in the<br />
process of quantization affects the output quality minimally.<br />

Sample        PFO   QO<br />
Deflep<br />
Enya          4.2<br />
Harp          4.2<br />
Harpsichord<br />
Piano<br />
Visit<br />
Average<br />

Table 3. Average MOS for PFO and QO. PFO - Perceptually<br />
filtered output; QO - Quantization output.<br />

The whole ATFT audio coding process was also evaluated<br />
with a faster version of the MP algorithm [6]. The faster<br />
version of the MP technique is based upon selecting a set of<br />
best correlated TF functions at each iteration, as opposed to<br />
the one function selected per iteration in the standard<br />
MP. MOS were obtained by testing with 9 subjects, and<br />
the results are given in Table 4.<br />

Sample        Average MOS<br />
Acdc          3.8<br />
Deflep        3.2<br />
Enya          3.7<br />
Harpsichord<br />
Tubularbell   3.8<br />
Visit         2.9<br />

Table 4. Average MOS for 9 listeners of the ATFT codec<br />
with the faster algorithm.<br />

2.2.2. MOS in uncontrolled environment. As most<br />
audio compression formats are aimed at use over the Internet,<br />
it is essential to evaluate the audio codec performance in an<br />
uncontrolled environment using the Internet. The distributed<br />
MOS will give the true performance rating of the audio<br />
quality in terms of acceptance level in an average Internet<br />
listener environment. Many variables are involved in this<br />
MOS approach, such as the quality of the audio hardware,<br />
the audio playback software and the playback volume. However,<br />
the average MOS results will justify the suitability of a<br />
media format over the Internet, as it is tested in a more flexible<br />
environment with a variety of Internet listeners.<br />

A web-based MOS, as shown in Figure 5, was developed<br />
using standard web design tools. Similar to the<br />
standard MOS procedure, a consent form is displayed<br />
when the listener visits the main web page. After accepting<br />
the conditions, the web page is redirected to the actual<br />

MOS testing page. Listeners are provided with three stimuli<br />

Fig. 4. Overall block diagram of the ATFT codec (wideband<br />
audio input, TF modeling, thresholding, perceptual masking<br />
and channel processing stages).<br />

Fig. 5. Snapshot of the GUI used for web based MOS studies.<br />

A, B and C as explained in Section 1. The interactive<br />

web page contains a form to receive the name of the listener<br />

and his rating of each sample. A validation is performed<br />

such that only after entering the name and rating the audio<br />

sample, the listener will be able to navigate to the next sam-<br />

ple. When the listeners select the stimuli, the audio samples<br />

are played using the media players associated with their re-<br />

spective web browsers. At this point in time, all our testing<br />
on the web uses the *.au format. This means the<br />
ATFT audio codec output is converted to the *.au format<br />
for MOS purposes. The standard *.wav and *.au formats,<br />
at a 44.1 kHz sampling rate with 16-bit resolution, are<br />
considered nearly lossless and a gold standard for CD-quality<br />
music. Hence converting the ATFT codec output to<br />
one of these formats should not affect the quality or the MOS<br />
obtained. Once all the samples are rated, the results are<br />
appended to a data file on the main server with the username.<br />
Scripts written in Perl handle the processing of data and<br />
the redirection of web pages.<br />
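The paper's server-side scripts are written in Perl; the following Python sketch shows the equivalent validate-and-append step (the file name and record layout are assumptions, not the actual scripts):<br />

```python
import os
import tempfile

def record_rating(path, username, sample, grade):
    """Append one rating only if the form fields are complete and the
    grade is on the 1-5 MOS scale (mirrors the page's validation)."""
    if not username or not sample:
        raise ValueError("name and sample are required")
    if grade not in (1, 2, 3, 4, 5):
        raise ValueError("grade must be 1-5")
    with open(path, "a") as f:
        f.write(f"{username}\t{sample}\t{grade}\n")

# One line is appended per validated rating.
path = os.path.join(tempfile.mkdtemp(), "mos_results.dat")
record_rating(path, "listener01", "Harp", 5)
record_rating(path, "listener01", "Enya", 4)
with open(path) as f:
    assert len(f.readlines()) == 2
```

Rejecting incomplete submissions before appending is what lets the listener advance only after rating the current sample.<br />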

3. RESULTS AND CONCLUSIONS<br />

The MOS study on the ATFT codec was performed on eight<br />
stereo sample signals in the following modes: 1. Controlled<br />
(a. with the standard ATFT algorithm, b. with the fast ATFT<br />
algorithm, and c. evaluation of the perceptual and quantization<br />
stages with the standard ATFT algorithm); 2. Uncontrolled<br />
(web-based MOS).<br />

Tables 2, 3 and 4 and Figure 3 show the significance<br />
of the proposed MOS study in the evaluation of<br />
audio coders. A broad and clear understanding of the output<br />
audio quality of the codec can be obtained with respect<br />
to (1) the types of signal on which the codec performs well or poorly,<br />
(2) the speed of the algorithm versus output quality and (3)<br />
block-based evaluation of individual parts of the coder. Detailed<br />
subjective testing using the web-based MOS will be<br />
performed to obtain statistically significant results in evaluating<br />
coder performances.<br />

The advantages of web-based MOS studies, such as the<br />
ease of recruiting subjects with diverse music backgrounds,<br />
effectiveness in data/feedback collection, machine and environmental<br />
flexibility, and the ubiquitous availability of personal<br />
computers, will make it an attractive tool for evaluating<br />
the performance of next-generation media compression<br />
techniques over the Internet.<br />

References<br />

[1] Thomas Ryden, "Using listening tests to assess audio codecs,"<br />
in Collected Papers on Digital Audio Bit-Rate Reduction,<br />
AES, 1996, pp. 115-125.<br />

[2] Stephane Mallat, A Wavelet Tour of <strong>Signal</strong> Processing,<br />
Academic Press, San Diego, CA, 1998.<br />

[3] L. Cohen, "Time-frequency distributions - a review,"<br />
Proceedings of the IEEE, vol. 77, no. 7, pp. 941-981, 1989.<br />

[4] Karthikeyan Umapathy and Sridhar Krishnan, "Joint time-frequency<br />
coding of audio signals," in 5th WSES International<br />
Multiconference on CSCC (Circuits, Systems, Communications<br />
and Computers), Crete, Greece, July 2001, pp. 32-36.<br />

[5] Karthikeyan Umapathy and Sridhar Krishnan, "Low bit-rate<br />
coding of wideband audio signals," in Proceedings of the IASTED<br />
International Conference - SPPRA (<strong>Signal</strong> Processing, Pattern<br />
Recognition and Applications), Rhodes, Greece, July 2001,<br />
pp. 101-105.<br />

[6] R. Gribonval, "Fast matching pursuit with multiscale dictionary<br />
of Gaussian chirps," IEEE Transactions on <strong>Signal</strong> Processing,<br />
vol. 49, no. 5, May 2001.<br />



Non-Stationary Noise Cancellation in Infrared Wireless Receivers<br />

Sridhar Krishnan, Xavier Fernando and Hongbo Sun<br />

Department of Electrical and Computer Engineering, <strong>Ryerson</strong> <strong>University</strong>,<br />

Toronto, Ontario, Canada<br />

(krishnan)(fernando)(hsun)@ee.ryerson.ca<br />

Abstract<br />

Infrared is attracting much attention for indoor<br />
wireless access due to its enormous bandwidth,<br />
inherent privacy and low cost. Intensity modulated,<br />
directly detected infrared schemes do not experience<br />
multipath fading. However, ambient noise due to<br />
artificial lighting has been the major concern in<br />
indoor infrared wireless systems. Conventionally,<br />
static or low frequency noise due to conventional light<br />
sources is removed using optical high pass filters.<br />
Nonetheless, interference from fluorescent lights<br />
equipped with electronic ballasts has periodic<br />
interference components up to 1 MHz and cannot be<br />
filtered easily. In this paper, soft DSP filters are<br />
proposed to cancel the harmonics, ambient noise, and<br />
uncorrelated signal structures. Non-stationary noise is<br />
cancelled with an adaptive denoising filter, and a<br />
comb filter cancels interference from the electronic<br />
ballasts. Adaptive soft filters have the advantage that<br />
they can easily be updated to track variations in<br />
noise characteristics. Simulation results show<br />
promising improvement in noise cancellation even<br />
under very low and varied SNR and noise source<br />
conditions.<br />

Keywords: infrared wireless, denoising, non-stationary<br />
signals, comb filters, adaptive filters, wavelet-packets.<br />

1. INTRODUCTION<br />

Wireless communications has entered a new<br />
phase. With each added application, the demand for<br />
real-time, wideband wireless services increases. The<br />
overcrowded radio spectrum is simply unable to cope<br />
with all the demand. Infrared, on the other hand,<br />
is a promising new medium for wireless applications,<br />
especially indoors. Considering the fact that the<br />
need for wideband multimedia-type access is much<br />
higher indoors than outdoors, infrared is an<br />
excellent choice. It has abundant untapped bandwidth<br />
that is freely available. Optical wireless techniques<br />
enjoy increased focus worldwide. The Wi-Fi (IEEE<br />
802.11) standard specifies infrared as a physical layer<br />
option.<br />

Optical energy is inherently confined within a room<br />

cavity resulting in inherent privacy. Therefore, the<br />

same infrared wavelength can be used in adjacent<br />

CCECE 2003 - CCGEI 2003, Montréal, May/mai 2003<br />

0-7803-7781-8/03/$17.00 © 2003 IEEE<br />

rooms, allowing device and frequency reusability.<br />

Furthermore, with Intensity Modulated Directly<br />
Detected (IM/DD) optical schemes, there is no<br />
multipath fading. Fading may degrade the signal<br />
strength by up to 30 dB in comparable radio systems.<br />
However, ambient noise due to artificial lighting has<br />
been the major concern in indoor infrared wireless<br />
systems [1]. This background light induces a white<br />
Gaussian shot noise that is 20 to 40 dB stronger than the<br />
signal-induced shot noise. Furthermore, modern<br />
fluorescent lights with electronic ballasts generate<br />
switching noise up to 1 MHz, which introduces a much<br />
more serious impairment. At times, the receiver thermal<br />
noise becomes dominant. The time-varying wireless<br />
channel determines the weights of the other noise sources. As<br />
a result, infrared wireless receivers experience high<br />
levels of non-stationary noise.<br />

The objective of this paper is to develop signal<br />

processing algorithms to combat the high power non-<br />

stationary noise. Adaptive filters based on the least<br />

mean squares (LMS) and wavelet-packet based filters<br />

effectively cancel the noise in a non-stationary<br />

environment. A higher order comb filter cancels the<br />

periodic noise from the electronic ballast.<br />

2. NOISE AT INFRARED RECEIVERS<br />

Infrared receivers face the challenge of a variety<br />
of noise sources, the details of which are shown in<br />
Fig. 1. There will be thermal noise from the electronic<br />
devices. This can be modeled as white Gaussian noise,<br />
and is relatively easy to tackle.<br />

Indoor infrared transmission systems are affected by<br />
interference induced by natural and artificial ambient<br />
light. The noise is directly proportional to the amount<br />
of light incident on the photo-detector and is therefore a<br />
function of the average optical power. The shot noise is<br />
due to the mean received infrared power and the ambient<br />
light. However, the ambient-light-induced shot<br />
noise typically has a power 20 to 40 dB greater<br />
than the signal shot noise [2]. Therefore, the signal-induced<br />
shot noise can be neglected. The ambient-induced<br />
shot noise can be considered Gaussian and<br />
nearly white.<br />



Natural ambient light noise is caused by sunlight. It<br />

can be considered steady with slow intensity variations<br />

in time. Artificial ambient light comes from several<br />

light sources: incandescent lamps, fluorescent lamps<br />

driven by conventional ballasts and fluorescent lamps<br />

geared by electronic ballasts. The use of fixed optical<br />

filters reduces out of band ambient light noise. The<br />

steady background irradiance produced by natural and<br />
artificial light sources is usually characterized by a<br />
direct current induced at the receiver photodiode that is<br />
directly proportional to the irradiance. This current is<br />
referred to as the background noise current.<br />

The interfering signal produced by incandescent lamps<br />
is an almost perfect sinusoid with a frequency of 100<br />
Hz. In addition to the 100 Hz sinusoid, only the first<br />
harmonics (up to 2 kHz) carry a significant amount of<br />
energy, and for frequencies higher than 800 Hz, all<br />
components are 60 dB below the fundamental. So an<br />
electrical high pass filter can eliminate this<br />
interference without much signal degradation.<br />
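The high-pass step can be sketched with a simple one-pole digital high-pass filter; the sample rate, pole location and test signals below are illustrative assumptions, not the system's actual parameters.<br />

```python
import numpy as np

# One-pole high-pass: H(z) = (1 - z^-1) / (1 - alpha z^-1).
# With alpha = 0.9 at fs = 1 MHz the cutoff is ~16 kHz, so the slow
# 100 Hz lamp interference is strongly attenuated while a fast data
# signal passes.
def highpass(x, alpha=0.9):
    y = np.zeros_like(x)
    for n in range(1, len(x)):
        y[n] = x[n] - x[n - 1] + alpha * y[n - 1]
    return y

fs = 1_000_000                                   # assumed sample rate
t = np.arange(4096) / fs
lamp = np.sin(2 * np.pi * 100 * t)               # 100 Hz interference
data = np.sign(np.sin(2 * np.pi * 50_000 * t))   # 50 kHz square wave
out = highpass(lamp + data)
# By linearity, the lamp component surviving in `out` is highpass(lamp),
# which is small compared to the unit-amplitude interference.
residual_lamp = out - highpass(data)
assert np.max(np.abs(residual_lamp)) < 0.05
```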

Fluorescent lamps equipped with conventional<br />
ballasts, driven at a power-line frequency of 50 or 60 Hz,<br />
induce interference at harmonics up to 50 kHz. This<br />
can also be eliminated by a careful choice of modulation<br />
scheme, to ensure there are no low frequency<br />
components, and through electrical high pass filtering.<br />

For fluorescent lamps equipped with electronic<br />
ballasts, the ballast modulation frequency itself is 37.5<br />
kHz. Therefore, interference harmonics extending up<br />
to 1 MHz are introduced. These cannot be easily<br />
filtered. In the case of interference overlapping the signal<br />
spectrum, sophisticated digital signal processing<br />
algorithms need to be developed, and this is the focus<br />
of this paper.<br />

3. METHODOLOGY<br />

The spectrum produced by an electronic-ballast-driven<br />

lamp consists of low and high frequency regions. The<br />

low frequency region resembles the spectrum of a<br />

conventional fluorescent lamp while the high<br />

frequency region is attributable to the electronic<br />

ballast. A deterministic expression that models the<br />

interfering signal at the output of the photodiode is given in [2],<br />
where R is the photodiode responsivity (A/W), Pm is<br />
the mean optical power of the interfering signal, K1 =<br />
5.9, K2 = 2.1, {b}, {a} and {d} are constants, and the<br />
frequency corresponding to the lamp type is fh = 37.5<br />
kHz.<br />

Fig. 1. Block diagram of the noise removal technique at<br />
infrared wireless receivers (noise sources: fluorescent lamps<br />
with conventional and electronic ballasts, thermal white<br />
noise, and shot noise due to the sun).<br />

The even harmonics of 37.5 kHz correspond to<br />
75 kHz, 150 kHz, 225 kHz, 300 kHz, 375 kHz, 450<br />
kHz, 525 kHz, 600 kHz, 675 kHz, 750 kHz, and 825<br />
kHz.<br />
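A comb filter tuned to the 37.5 kHz fundamental nulls all of these harmonics at once. The sketch below assumes a sample rate chosen so the fundamental period is an integer number of samples (an illustrative choice, not the receiver's actual rate).<br />

```python
import numpy as np

# FIR comb: y[n] = x[n] - x[n-D]. With D samples spanning exactly one
# 37.5 kHz period, the filter has nulls at 37.5 kHz and every harmonic.
def comb_filter(x, D):
    y = x.copy()
    y[D:] -= x[:-D]
    return y

fs = 600_000                 # assumed: fs / 37500 = 16 is an integer
D = fs // 37500
t = np.arange(2048) / fs
# Synthetic periodic ballast interference: several harmonics of 37.5 kHz.
ballast = sum(np.sin(2 * np.pi * k * 37500 * t) for k in (1, 2, 4, 8))
y = comb_filter(ballast, D)
# After the first D samples, the periodic interference is cancelled.
assert np.max(np.abs(y[D:])) < 1e-9
```

A higher-order comb (cascading this stage) narrows the notches so less of the data spectrum around each harmonic is removed.<br />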

If the low frequency and high frequency ambient<br />
signal model matches the practical case, we can use<br />
this model and apply the adaptive noise cancellation<br />
method to eliminate this noise, because the noise is<br />
uncorrelated with our desired signal. The adaptive noise<br />
cancellation method can also eliminate white Gaussian<br />
noise (the thermal noise model). The advantage of<br />
using the adaptive noise cancellation method is that we<br />
can improve the SNR if the reference signal is<br />
correlated with the noise contained in the primary signal<br />
but uncorrelated with the desired signal. The limitation of<br />
adaptive filtering is that if the reference signal is<br />
completely uncorrelated with both the signal and noise<br />
components of the primary signal, the adaptive noise<br />
canceller has no effect on the primary signal, and the<br />
output signal-to-noise ratio remains unchanged. We<br />
used the well-known LMS algorithm [3] to remove<br />
the thermal noise from the signal. The ease of<br />
implementation of the LMS algorithm is achieved at<br />
the expense of convergence rate. To accelerate the<br />
convergence rate of the LMS algorithm, the step-size<br />
parameter was selected as a function of the eigenvalues<br />
of the autocorrelation matrix of the input signal<br />
(which is dependent on the instantaneous energy of the<br />
signal).<br />
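The LMS canceller described above can be sketched as follows; the step size, filter length and simulated noise channel are illustrative assumptions, not the paper's implementation.<br />

```python
import numpy as np

# Adaptive noise cancellation: the reference input is correlated with
# the noise in the primary input but uncorrelated with the desired
# signal, so the error output converges to the cleaned signal.
rng = np.random.default_rng(0)
N, L, mu = 20_000, 8, 0.01

desired = np.sign(np.sin(2 * np.pi * 0.01 * np.arange(N)))  # data signal
ref = rng.standard_normal(N)                 # noise reference channel
h = np.array([0.6, -0.3, 0.2, 0.1, 0.0, 0.05, 0.0, 0.02])
noise = np.convolve(ref, h)[:N]              # noise reaching the receiver
primary = desired + noise

w = np.zeros(L)
out = np.zeros(N)
for n in range(L, N):
    x = ref[n - L + 1:n + 1][::-1]           # reference tap vector
    e = primary[n] - w @ x                   # error = cleaned output
    w += mu * e * x                          # LMS weight update
    out[n] = e

# After convergence the output is much closer to the desired signal.
tail = slice(N // 2, N)
before = np.mean((primary[tail] - desired[tail]) ** 2)
after = np.mean((out[tail] - desired[tail]) ** 2)
assert after < before / 5
```

A normalized step size (dividing mu by the tap vector's instantaneous energy) is one simple way to realize the energy-dependent step-size selection mentioned above.<br />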

In cases where the reference channel is not available, or<br />
where the reference signal is uncorrelated with the noise in<br />
the primary channel, signal decomposition<br />
techniques can be a better alternative. In cases where<br />

signal and noise spectra overlap, fixed filtering or<br />
adaptive filtering of the noise may not be the best<br />
approach. In such a situation, noise filtering using<br />
mathematical decomposition techniques might be the<br />
best alternative; such methods are commonly<br />
known as de-noising techniques. The de-noising<br />
approach has been successfully applied to data such<br />
as knee sounds [4] and ultrasound signals [5]. The<br />
problem of enhancing signals degraded by<br />
uncorrelated additive noise, when the noisy signal<br />
alone is available, has received much attention in<br />
recent years [6-10]. The main problem arises when the<br />
de-noising filters cannot distinguish between noise and<br />
noise-like important signal components, and remove<br />
both, thereby decreasing the intelligibility of the signal.<br />
Among the mathematical transformation techniques, a<br />
time-frequency (TF) decomposition technique might<br />
be a suitable choice since it exploits the simultaneous<br />
overlap in the time and frequency domains, and filters the<br />
noise accordingly.<br />

The complexity of the structures present at the infrared<br />
wireless receiver requires the development of adaptive,<br />
low-level representations in order to provide a<br />
meaningful analysis. In Fourier analysis, the sine and<br />
cosine basis functions are not suitable for capturing<br />
subtle changes in signals because of their inability to<br />
localize time information. A signal<br />
representation using basis functions that can capture<br />
both temporal and spectral information is more<br />
useful. <strong>Signal</strong> representations such as wavelets and<br />
wavelet packets can provide this information. The<br />
signal decomposition techniques considered in<br />
this paper are wavelet-packets; they are briefly<br />
described in the subsequent sections.<br />

In wavelets, any signal can be decomposed into<br />

components with good time and scale properties.<br />

Wavelets have the advantage to express any signal<br />

with a fewer coefficients. The design of basis functions<br />

must be optimized, so that the number of non-zero<br />

coefficients will be minimum and the input signal is<br />

approximated by projecting it over M basis functions<br />

selected adaptively. It is represented as follows:<br />

x(t) = Σ_{m ∈ I_M} ⟨x, g_m⟩ g_m<br />

where x(t) is the signal to be decomposed, and ⟨x, g_m⟩<br />

denotes the inner product between the signal and the<br />

basis functions g_m. The basis functions are obtained by<br />

shifting and scaling a prototype<br />

function called the mother wavelet, given by:<br />

ψ_{s,u}(t) = (1/√s) ψ((t − u)/s)<br />

CCECE 2003 - CCGEI 2003, Montréal, May/mai 2003<br />

0-7803-7781-8/03/$17.00 © 2003 IEEE<br />


where s is the scale parameter, and u is the translation<br />

parameter. Wavelet analysis uses long time intervals for<br />

low frequency detailed analysis and short time<br />

intervals for high frequency information. That offers<br />

good frequency resolution at low frequencies and good<br />

time resolution at high frequencies.<br />

The main difference between wavelet and<br />

wavelet packet analysis is that the latter allows an<br />

adjustable resolution of frequencies through filter bank<br />

decomposition. It splits the whole spectrum into two<br />

equal bands at different levels, obtaining a general tree<br />

structure that is called the wavelet packet expansion.<br />

Basis functions are generated with an<br />

algorithm that uses quadrature mirror filter (QMF)<br />

banks, and divides the spectrum as a tree with multiple<br />

branches. Wavelet packets allow searching for the<br />

optimum decomposition of the binary tree by looking for<br />

the branch with the best entropy criterion of the input<br />

signal [7]. Once the wavelet or the wavelet packet<br />

decomposition of the signal is achieved, the next step<br />

is thresholding the resulting coefficients. This can be<br />

done in two ways, hard and soft thresholding.<br />

If w_m denotes the wavelet/wavelet packet<br />

coefficients, then hard thresholding [7] is applied as:<br />

ŵ_m = w_m if |w_m| ≥ T, and ŵ_m = 0 otherwise,<br />

where T is the selected threshold.<br />

To avoid the effect of certain de-noising filters<br />

that remove the sharp features of the signals along with<br />

important components, soft thresholding discards<br />

terms with a small or insignificant contribution to the<br />

information.<br />

Soft thresholding is performed as:<br />

ŵ_m = sgn(w_m)(|w_m| − T) if |w_m| ≥ T, and ŵ_m = 0 otherwise.<br />

Different methods are used for selecting the<br />

best threshold T and also rescaling the coefficients<br />

according to the noise level.<br />
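The hard and soft rules above reduce to two one-line operations on the coefficient array. The sketch below is illustrative (the coefficient values and the threshold are made up, and the wavelet/wavelet-packet coefficients are assumed to have been computed already):

```python
import numpy as np

def hard_threshold(w, T):
    """Hard thresholding: keep w_m unchanged when |w_m| >= T, else zero it."""
    return np.where(np.abs(w) >= T, w, 0.0)

def soft_threshold(w, T):
    """Soft thresholding: zero small terms AND shrink survivors toward zero by T."""
    return np.sign(w) * np.maximum(np.abs(w) - T, 0.0)

# Toy coefficient vector (illustrative values, not from the paper's data)
w = np.array([0.05, -0.3, 1.2, -0.8, 0.01])
print(hard_threshold(w, 0.5))   # small terms zeroed; 1.2 and -0.8 kept as-is
print(soft_threshold(w, 0.5))   # survivors are additionally shrunk by 0.5
```

Soft thresholding avoids the discontinuity at |w_m| = T, which is why it better preserves sharp signal features after reconstruction.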

4. RESULTS and CONCLUSIONS<br />

As described in Section 3, three noise removal<br />

experiments were performed: (1) removal of high<br />

Authorized licensed use limited to: Ryerson University Library. Downloaded on July 7, 2009 at 10:36 from IEEE Xplore. Restrictions apply.


frequency periodic interference due to fluorescent<br />

lamps equipped with electronic ballasts (2) removal of<br />

ambient noise of random nature with adaptive<br />

algorithms such as the LMS, and (3) automatic<br />

denoising of uncorrelated structures in a received<br />

signal by using the wavelet-packet technique.<br />

Removal of High Frequency Periodic Interference:<br />

Periodic interference in the signal due to electronic<br />

ballasts has even harmonics extending up to 1 MHz.<br />

In Fig. 2 a synthetic signal with periodic interference is<br />

shown. As the periodic interference is represented as<br />

spectral peaks in the signal's spectrum, a series of<br />

notch filters have to be designed to attenuate these<br />

spectral peaks. As spectral attenuation can be achieved<br />

by placing zeros on the unit circle or close to the unit<br />

circle in the z-plane at the exact frequency points, a<br />

linear phase finite impulse response (FIR) filter was<br />

determined to be the best option. Detailed analysis for<br />

the right filter type revealed that a 26th-order FIR filter<br />

could totally suppress the harmonic interference<br />

present in the signal caused by electronic ballasts.<br />

The magnitude and phase responses of the filter are<br />

shown in Fig. 3. It can easily be seen that the filter<br />

has linear phase response, and the magnitude response<br />

clearly represents the comb filter characteristics. The<br />

output of the filter is shown in Fig. 4, and it is evident<br />

that FIR comb filtering satisfactorily reduces the<br />

harmonics due to electronic ballast interference.<br />
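A linear-phase FIR comb with zeros on the unit circle can be sketched, in its simplest form, as y[n] = x[n] − x[n−N], which places N zeros evenly at f_k = k·fs/N. The sampling rate below and the alignment of the nulls with the ballast harmonics are illustrative assumptions, not the paper's actual design:

```python
import numpy as np

fs = 1.0e6   # assumed sampling rate, purely illustrative
N = 26       # comb order: one spectral null every fs/N

# Impulse response of y[n] = x[n] - x[n-N]: a linear-phase FIR comb whose
# N zeros sit evenly on the unit circle at f_k = k*fs/N, k = 0..N-1.
h = np.zeros(N + 1)
h[0], h[N] = 1.0, -1.0

n = np.arange(2000)
interferer = np.cos(2 * np.pi * (fs / N) * n / fs)   # harmonic sitting on a null
filtered = np.convolve(interferer, h, mode="valid")

print(np.max(np.abs(filtered)))   # essentially zero: the harmonic is notched out
```

Any interference component falling exactly on one of the N nulls is cancelled; components between nulls (including the desired signal) pass with only the comb's ripple.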

Removal of Ambient Noise:<br />

In this study, an infrared wireless system was modeled<br />

with a signal of interest, and noise was added at<br />

different power levels. The desired response in the<br />

training stage of the filter was assumed to be a delayed<br />

version of the clean signal free of ambient noise. A<br />

12th-order transversal filter was trained with the LMS<br />

adaptive filter algorithm. The step size parameter in<br />

the LMS that governs the convergence rate of the<br />

algorithm was selected in an adaptive manner, in such<br />

a way that the step size is inversely proportional to the<br />

instantaneous energy of the signal. It was found that<br />

the step size selected in this manner provides an<br />

optimal convergence suitable in an infrared wireless<br />

communications environment. Fig. 5 shows the<br />

original and the denoised signal by using this<br />

approach. It can be seen in panel B that a clear<br />

signal is obtained, but the convergence of the LMS has<br />

caused some transient-like disturbances in the filtered<br />

output. The transient-like disturbance was minimized<br />

by selecting the step size based on the instantaneous<br />

energy of the signal of interest.<br />
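Choosing the step size inversely proportional to the instantaneous input energy is the normalized-LMS (NLMS) idea. The sketch below illustrates it on a toy system-identification setup; the training signal, noise level, base step size and the "unknown" system are assumptions for illustration, not the paper's experiment:

```python
import numpy as np

rng = np.random.default_rng(0)
M = 12                # transversal filter order, as in the text
mu0, eps = 0.5, 1e-6  # base step size; eps guards the division

# Toy stand-in for the training stage: the desired response is a filtered
# version of the reference input plus a little measurement noise.
x = rng.standard_normal(5000)
h_true = 0.3 * rng.standard_normal(M)                 # assumed "unknown" system
d = np.convolve(x, h_true)[:x.size] + 0.05 * rng.standard_normal(x.size)

w = np.zeros(M)
err = np.zeros(x.size)
for n in range(M, x.size):
    u = x[n - M + 1:n + 1][::-1]        # current tap vector [x[n], ..., x[n-M+1]]
    e = d[n] - w @ u
    # Step size inversely proportional to the instantaneous input energy (NLMS)
    w += (mu0 / (eps + u @ u)) * e * u
    err[n] = e

print(np.mean(err[M:512] ** 2), np.mean(err[-500:] ** 2))   # error power shrinks
```

Normalizing by `u @ u` keeps the effective step small when the input is energetic and larger when it is weak, which is what stabilizes convergence under non-stationary power levels.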


Removal of Uncorrelated Signal/Noise Structures:<br />

In case of noise spectra overlapping with the signal<br />

spectrum, and where the reference channel is not<br />

available, or if the reference signal is uncorrelated with<br />

the noise in the primary channel, then signal<br />

decomposition techniques could be a better alternative.<br />

As explained in Section 3, wavelet-packet techniques<br />

are promising tools for removal of structures that are<br />

not correlated to the signal of interest. Wavelet-packet<br />

analysis picks the best basis functions by using the entropy<br />

optimization criterion. In this study, Coiflet,<br />

Daubechies, Haar and Symlet wavelets were tried as<br />

choices with soft thresholding criteria, and among<br />

them Daubechies (db6) seemed to outperform the other<br />

commonly used wavelets in terms of removing<br />

structures that are not relevant in an infrared wireless<br />

receiver system. Fig. 6 shows the original and the<br />

denoised signals, and it can clearly be seen that<br />

wavelet-packet has performed extremely well in<br />

removing irrelevant components from the signal of<br />

interest.<br />

Fig. 2: Fluorescent lamp geared by electronic ballasts.<br />

Fig. 3: Magnitude and phase response of FIR Comb<br />

filter of order 26.


Fig. 4: Output of the Comb filter.<br />


Fig. 5: Original ambient noise and LMS<br />

filtered output.<br />


Fig. 6: Original and wavelet-packet denoised signals.<br />


References<br />

[1] R. Narasimhan, M. D. Audeh and J. M. Kahn. Effect of<br />

electronic-ballast fluorescent lighting on wireless<br />

infrared links, IEE Proc.-Optoelectron., Vol. 143, No.<br />

6, December 1996.<br />

[2] A. J. C. Moreira, R. T. Valadas and A. M. de Oliveira<br />

Duarte. Optical interference produced by<br />

artificial light, Wireless Networks, Vol. 3, 1997, pp.<br />

131-140.<br />

[3] S. Haykin. Adaptive Filter Theory, Prentice Hall,<br />

New Jersey, 2002.<br />

[4] S. Krishnan and R. Rangayyan. Automatic de-<br />

noising of knee joint vibration signals using adaptive<br />

time-frequency representations, Medical and<br />

Biological Engineering and Computing, Vol. 38, No. 1,<br />

pp. 2-8, January 2000.<br />

[5] S. Johnston, A. Diaz and S. Doctor. De-noising of<br />

ultrasonic signals backscattered from coarse-<br />

grained materials: wavelet processing and<br />

maximum-entropy reconstruction, 67th Annual<br />

Meeting of the Southeastern Section of the American<br />

Physical Society.<br />

[6] X. Xie and J. Kuang. A noise canceller for mobile<br />

communications utilizing time-frequency analysis,<br />

Fourth Asia-Pacific Conference on Optoelectronics and<br />

Communications, Vol. 1, pp. 504-507, October 1999.<br />

[7] D. Donoho. Nonlinear wavelet methods for<br />

recovery of signals, densities, and spectra from<br />

indirect and noisy data, Proceedings of Symposia in<br />

Applied Mathematics, pp. 173-205, 1993.<br />

[8] M. Bahoura and J. Rouat. Wavelet speech<br />

enhancement based on the Teager energy operator,<br />

IEEE Signal Processing Letters, Vol. 8, No. 1, January 2001.<br />

[9] N. Virag. Single channel speech enhancement<br />

based on masking properties of the human auditory<br />

system, IEEE Transactions on Speech and Audio<br />

Processing, Vol. 7, Issue 2, March 1999.<br />

[10] L. Arslan, A. McCree and V. Viswanathan. New<br />

methods for adaptive noise suppression, Proceedings<br />

of the International Conference on Acoustics, Speech<br />

and Signal Processing, Vol. 1, pp. 812-815, Detroit,<br />

USA, May 1995.<br />



Adaptive denoising at Infrared wireless receivers<br />

Xavier N. Fernando, Sridhar Krishnan, Hongbo Sun and Kamyar Kazemi-Moud<br />

Department of Electrical and Computer Engineering, Ryerson University<br />

Toronto, ON, M5B 2K3, Canada<br />

(fernando@ee.ryerson.ca)<br />

ABSTRACT<br />

This paper proposes an innovative approach for noise cancellation at infrared (IR) wireless receivers. Ambient noise due<br />

to artificial lighting and the sun has been a major concern in infrared systems. The background-induced shot noise<br />

typically has a power 20 to 40 dB higher than the signal-induced shot noise and varies with time. Due to these<br />

changing conditions, infrared wireless receivers experience high levels of non-stationary noise. The objective of the work<br />

mentioned in this paper is to develop digital signal processing algorithms at the infrared wireless system to combat high<br />

power non-stationary noise. The noisy signal is decomposed, using a joint time-frequency representation such as<br />

wavelets and wavelet packets, into transform domain coefficients, and the lower order coefficients are removed by<br />

applying a threshold. The denoised version is obtained by reconstructing the signal with the remaining coefficients. In this<br />

paper, we evaluate different wavelet methods for denoising at an infrared wireless receiver. Simulation results indicate<br />

that if the noise is uncorrelated with the signal and the channel model is unavailable, the wavelet denoising method with<br />

different wavelet analyzing functions improves the signal-to-noise ratio (SNR) from 4 dB to 7.8 dB.<br />

Keywords: optical wireless, infrared, receiver, noise, wavelet transform, denoising<br />

1. INTRODUCTION<br />

The emerging technologies like mobile portable computing and multimedia terminals at living and work environments<br />

are the main forces driving companies, scientists and researchers to progress in the challenging field of wireless local<br />

area networks (WLAN). The need for higher speed and wider bandwidth in data communication networks is gradually<br />

shifting the transmission medium from electrical to optical. Wireless infrared LANs are an important part of indoor transmission<br />

systems and enable high bit-rate data transfer over short distances [1]. Infrared systems occupy no radio frequency<br />

(RF) spectrum and they can be used where electromagnetic interference is critical. The infrared spectral region offers a<br />

large, virtually unlimited, bandwidth that is unregulated worldwide. Since infrared communications are confined to<br />

rooms, there is no interference between communication systems operating in different rooms, which results in secure<br />

communications. In contrast to RF transmission systems, the light is reflected diffusely on the wall surfaces of the rooms<br />

and the channel estimation will be a non-trivial subject for infrared systems. A non-directed wireless optical<br />

communication system can be either line-of-sight (LOS) or diffuse. A LOS system is designed under the assumption that<br />

the LOS path between transmitter and receiver is unobstructed. A diffuse system is defined as one which does not rely<br />

upon the LOS path, but instead relies on reflections from a large diffusive reflector such as the ceiling. In both cases, an<br />

optical signal in transit from transmitter to receiver undergoes temporal dispersion due to reflections from walls and<br />

other reflectors; the intersymbol interference (ISI) that results is an impediment to communication at high speeds. Single<br />

diffuse infrared links can operate with bit rates as high as 100 Mb/s [2]. Since it is possible to operate at least one<br />

infrared link in every room of a building without interference, the potential capacity of an infrared-based network is<br />

extremely high. The propagation characteristics of diffuse infrared signals resemble those of radio signals. The measured<br />

received power at different positions using a photodetector much smaller than the light wavelength will result in<br />

multipath fading like fluctuations in received power. In the real diffuse infrared systems, however, the detector size is<br />

much larger than the wavelength, so that the multipath fading like power fluctuations are averaged out effectively. While<br />

multipath propagation does not lead to fading, it causes temporal dispersion. The tail caused by higher order taps of the<br />

indoor channel impulse response induces ISI in high bit-rate communications.<br />

Infrared Technology and Applications XXIX, Bjørn F. Andresen, Gabor F. Fulop, Editors,<br />

Proceedings of SPIE Vol. 5074 (2003) © 2003 SPIE · 0277-786X/03/$15.00<br />



Indoor infrared transmission suffers from a number of impairments, the most important ones being shot noise<br />

from the ambient light and restricted symbol rate due to multipath dispersion. Noise plays a severe role in the<br />

performance of wireless infrared networks. Background illumination has two distinct effects in the performance of<br />

optical receivers; one is noise due to the steady and invariant irradiance from undesired light sources, which results in<br />

shot noise at the photodetector, the other one is interference generated by high frequency components of some light<br />

sources. Typically, natural and artificial ambient light contribute to high levels of shot noise in a photodetector which<br />

degrades the performance of the transmission system. For data-rates up to 10 Mbps, the major degrading factor of the<br />

infrared communication systems is the shot noise induced in the receiver due to ambient light. Unfortunately, ambient<br />

light sources (sunlight and artificial light) also radiate in the same spectral wavelengths used by infrared transducers.<br />

Thus shot noise presents a strong spatial and temporal dependence. Several advanced techniques for the design of nondirected<br />

wireless infrared communication systems have already been proposed in order to minimize these signal-to-<br />

noise ratio (SNR) fluctuation effects. These ambient light levels to a significant degree determine the optical power<br />

required for reliable transmission. The shot noise induced by ambient light may vary over several decades during a day<br />

in a typical indoor environment.<br />

The interfering signal from the fluorescent light is periodic and deterministic. The spectrum of fluorescent<br />

lights driven by electronic ballasts may extend up to frequencies around 1MHz interference of which will cause serious<br />

degradation at infrared receivers even after high-pass electrical filtering [3-5].<br />

The objective of the work mentioned in this paper is to develop a digital signal processing algorithm at the<br />

infrared wireless system to combat uncorrelated noise without a reference channel model. In Section 2, we introduce and<br />

classify different noise sources at the infrared receivers [3]. Section 3 will focus on the definition of wavelet transform<br />

and analyzing functions which will be used in Section 4 to introduce a new methodology for noise cancellation. The new<br />

wavelet-based denoising technique and the results of wavelet denoising are discussed in Section 4. The conclusions are<br />

provided in Section 5.<br />

2. NOISE AT THE RECEIVERS<br />

Noise in the infrared optical receivers is a critical parameter of performance analysis. There are different sources of noise<br />

that contribute to overall performance of the wireless network link. Thermal noise of the photodetector is dominant for<br />

weak steady background illumination. Thermal noise is critically dependent on the front-end design of the receiver (e.g.<br />

preamplifier). Shot noise is induced by the quantum nature of photons randomly arriving at the photodetector. It is<br />

proportional to the average received optical power. Natural and artificial background light may come from different light<br />

sources. Different background noise source contributions are from sun, incandescent lamps, fluorescent lamps with<br />

conventional ballasts and electronic ballasts. The slow variations in intensity of the light coming from the Sun make it a<br />

strong source of shot noise. The spectrum of natural light coming from the Sun on a sunny day is spread over the entire<br />

responsivity curve of the PIN photodetector, resulting in a steady background noise current on the order of a mA, stronger<br />

than a well artificially illuminated room. Shot noise is larger under directional lamps and near windows exposed to<br />

sunlight. Furthermore, it can vary drastically during a normal day with the position of the sun and with the indoor<br />

lighting conditions. Due to the temporal variation and directional nature of both signal and noise, the SNR at the receiver<br />

can vary significantly.<br />

Artificial light sources also contribute to shot noise as well as interference at the infrared receiver. Incandescent<br />

lamp interference is periodic with a frequency of 100 Hz. Its spectrum has frequency components up to 2 kHz.<br />

Harmonics at frequencies higher than 800 Hz do not carry a significant amount of energy and they are 60 dB<br />

below the fundamental harmonic.<br />

In case of the incandescent lamps the amplitude of the interference is one tenth of the current generated by the<br />

slow variations of intensity. <strong>Research</strong>ers have already extracted an experimental interference model for typical<br />

incandescent lamps [3].<br />



Fluorescent lamps equipped with conventional ballasts, driven at power-line frequencies of 50 or 60 Hz, induce<br />

interference at harmonics up to 20 kHz. This interference is periodic with a frequency of 50 Hz and its harmonics are 50<br />

dB below the 100 Hz component for frequencies higher than 5 kHz. The interference amplitude in this case is 2 to 6 times<br />

lower than the shot noise current. An interference model for the fluorescent lamps driven by conventional ballasts has also<br />

been extracted experimentally [3].<br />

The fluorescent lamps with electronic ballasts have higher power efficiencies and use the same concept as<br />

switching power supplies. Interference generated by fluorescent lamps with electronic ballasts has a lower amplitude<br />

compared to other types of ambient light, but its spectrum is very broad and has frequency harmonics up to 1 MHz. The<br />

spectrum produced by an electronic-ballast-driven lamp consists of low and high frequency regions. The low frequency<br />

region resembles the spectrum of a conventional fluorescent lamp while the high frequency region is attributable to the<br />

electronic ballast. These two components of the spectrum have been modeled using the same experimental approaches as<br />

the other noise sources [3].<br />

In these model equations the relative amplitude and phase of the harmonics can be easily identified. For<br />

different classes of lamps, all the average parameters for the interference models can be easily identified [3]. Several schemes<br />

have been proposed in order to reduce the power penalty induced by ambient artificial light sources in an indoor infrared<br />

wireless system [5-7].<br />

3. WAVELET TRANSFORM<br />

Wavelets are functions that satisfy certain mathematical requirements and are used in representing data or other<br />

functions. The wavelet analysis procedure is to adopt a wavelet prototype function, called an "analyzing wavelet". The<br />

wavelet transform has become a powerful tool for signal analysis and is widely used in many applications, including<br />

signal detection and denoising.<br />

The complexity of structures present at infrared wireless receivers requires the development of an adaptive,<br />

low-level representation in order to provide a meaningful analysis of the system. In Fourier analysis, the basis functions are sine<br />

and cosine, which are not suitable for capturing the subtle changes of received signals at the infrared receivers because of<br />

their inability to localize the temporal information.<br />

The wavelet transformation is a time-frequency decomposition technique and with the choice of smooth multiresolution<br />

wavelet analyzing functions that use long time intervals for capturing low frequency information of the<br />

desired signal and short time intervals for high frequency information, one can have a joint temporal and spectral<br />

representation of that signal. Temporal analysis is performed with a contracted, high-frequency version of the prototype<br />

wavelet, while frequency analysis is performed with a dilated, low-frequency version of the prototype wavelet. Because<br />

the original signal or function can be represented in terms of a wavelet expansion (using linear combination of the<br />

coefficients and the wavelet basis functions), data operations can be performed using just the corresponding wavelet<br />

coefficients. In wavelet transformation, any signal can be decomposed into components with good time and scale<br />

properties. Wavelets have the advantage of expressing any signal with fewer coefficients [9].<br />

The basis functions are obtained by shifting and modulating the amplitude of the “analyzing wavelet”. The<br />

design of basis functions must be optimized, so that the number of non-zero coefficients will be minimal and the input<br />

signal is approximated by projecting it over the basis functions selected adaptively. In wavelet-based denoising, the<br />

noisy signal is decomposed into transform domain coefficients, and the lower order coefficients are removed by applying<br />

a threshold. If we assume that Ψ(t) is the analyzing wavelet function then the continuous multi-resolution wavelet frame<br />

transform, F[m,n], of a signal f(t) is defined as:<br />

F[m, n] = ⟨Ψ_{m,n}(t), f(t)⟩ = ∫_{−∞}^{+∞} Ψ_{m,n}(t) · f(t) dt<br />


The inverse wavelet transform is defined as<br />

f(t) = Σ_{m∈ℑ} Σ_{n∈ℑ} F[m, n] · Ψ_{m,n}(t)<br />

where m and n belong to the set ℑ of integer numbers defining each wavelet basis function, Ψ_{m,n}(t), in the two-<br />

dimensional wavelet space.<br />

The main difference between wavelet and wavelet packet analysis is that the latter allows an adjustable<br />

resolution of frequencies through filter bank decomposition. Filter banks split the whole spectrum into two equal bands<br />

at different frequency levels, obtaining a general tree structure that is called the wavelet packet expansion. Wavelet<br />

packet allows searching the optimum decomposition of the tree looking for the branch with the best entropy criterion of<br />

the input signal [7].<br />
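One stage of such a filter-bank split can be sketched with the Haar QMF pair; applying the split recursively to both outputs grows the wavelet-packet binary tree described above. The filter pair and signal here are illustrative:

```python
import numpy as np

# Haar QMF pair: lowpass h and highpass g split the band into two equal halves.
h = np.array([1.0, 1.0]) / np.sqrt(2)
g = np.array([1.0, -1.0]) / np.sqrt(2)

def qmf_split(x):
    """One analysis stage: filter with the QMF pair, keep every second sample."""
    lo = np.convolve(x, h)[1::2]   # lower half of the spectrum
    hi = np.convolve(x, g)[1::2]   # upper half of the spectrum
    return lo, hi

def qmf_merge(lo, hi):
    """Matching synthesis stage: perfect reconstruction for the Haar pair."""
    x = np.empty(2 * lo.size)
    x[0::2] = (lo - hi) / np.sqrt(2)
    x[1::2] = (lo + hi) / np.sqrt(2)
    return x

x = np.array([4.0, 6.0, 10.0, 12.0, 8.0, 6.0, 5.0, 5.0])
lo, hi = qmf_split(x)
lo2, hi2 = qmf_split(lo)     # recursing on BOTH outputs grows the packet tree
print(np.allclose(qmf_merge(lo, hi), x))   # True: the split is lossless
```

In a full wavelet-packet implementation, an entropy measure evaluated at each node of this tree decides which splits to keep, yielding the best basis of [7].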

<strong>Research</strong>ers in related engineering and applied mathematics areas have developed many different wavelet<br />

transform systems each with specific properties. The difference between these wavelet transforms is mainly their<br />

analyzing functions and the way that they are developed. There are two major classes of wavelet transform systems. One<br />

class consists of orthogonal wavelets and the other one consists of biorthogonal wavelets. Other wavelet transform<br />

systems, not included in the two main categories, have generally limited applications [8].<br />

4. NOISE CANCELATION METHOD<br />

In order to cancel the effect of uncorrelated Gaussian noise in the indoor infrared wireless channel, we introduce the<br />

wavelet transform applied to the signal in the electrical domain. Figure 1 shows the schematic diagram of the wireless<br />

infrared link and the receiver with the wavelet transform denoising block.<br />


Figure 1 – Schematic of the wavelet based denoising wireless infrared link<br />



In this system, the high-pass electrical filter will reduce the interference induced by incandescent light and<br />

fluorescent light driven by conventional ballasts. The comb filter block will cancel the high frequency interference from the<br />

fluorescent lamps driven by electronic ballasts [11]. In the wavelet denoising block, the received signal is<br />

transformed using a pre-defined analyzing function. Once the wavelet decomposition of the signal is achieved, the next<br />

step is thresholding. The thresholder block will remove the coefficients of the signal which have a smaller absolute value than<br />

a predefined threshold. Different methods can be used to determine the threshold level that results in performance<br />

improvement, in addition to rescaling the coefficients to the noise level. If w_m denotes the wavelet coefficients of the<br />

decomposed signal and A the threshold level, then hard thresholding can be described mathematically as:<br />

ŵ_m = w_m if |w_m| ≥ A, and ŵ_m = 0 if |w_m| < A.<br />

In order to avoid the denoising effect of certain filters that remove the sharp features of the signals, soft thresholding will<br />

discard the coefficients with a small and insignificant contribution to the information, and can be performed as:<br />

ŵ_m = Sgn(w_m)(|w_m| − A) if |w_m| ≥ A, and ŵ_m = 0 if |w_m| < A,<br />

where Sgn(.) is the signum function.<br />

The remaining wavelet coefficients produce the denoised signal, which will be demodulated and decoded. The<br />

aim is to alleviate the shot noise generated by incandescent light and the thermal noise from the receiver electronics by this<br />

denoising block. For simulation, the denoising algorithm is applied to a pulse train with a frequency of 10 kHz that passes<br />

through an infrared channel that contributes additive Gaussian noise with an SNR of 4 dB. The data signal with additive white<br />

Gaussian noise and its spectrum are shown in Figures 2-a and 2-b respectively.<br />


Figure 2: The received signal passed over an additive white Gaussian noise channel (a) and its spectrum (b).<br />



The simulations have been done using different wavelet analyzing functions and the resulting SNR<br />

improvements are summarized in Table 1. The “SNR improvement” is defined as the SNR after denoising<br />

minus the SNR before denoising. The orthogonal wavelet transforms used in the simulation were the Haar, Daubechies,<br />

Coiflet and Symlet transforms and the discrete Meyer wavelet transform.<br />

Figure 3 shows the original received noisy signal (above) and denoised version of the same signal after<br />

applying discrete Meyer’s wavelet transform (below). SNR improvement of the denoised signal in this case is 3.8 dB. In<br />

the thresholding block the wavelet coefficients obtained from signal decomposition that are lower than the threshold<br />

level are discarded. Figure 4 shows the original Gaussian noise of the channel (above) and the temporal representation of<br />

the discarded coefficients (below).<br />
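The whole decompose–threshold–reconstruct loop and the SNR-improvement measure can be sketched with a single-level Haar stage. The paper uses deeper wavelet decompositions; the pulse train, noise level, random seed and universal-threshold choice below are assumptions for illustration only:

```python
import numpy as np

def snr_db(clean, noisy):
    """SNR in dB of `noisy` measured against the `clean` reference."""
    noise = noisy - clean
    return 10 * np.log10(np.sum(clean ** 2) / np.sum(noise ** 2))

rng = np.random.default_rng(1)
t = np.arange(1024)
clean = np.sign(np.sin(2 * np.pi * t / 64))          # toy pulse train
noisy = clean + 0.63 * rng.standard_normal(t.size)   # roughly 4 dB input SNR

# One-level Haar analysis: approximation (lo) and detail (hi) bands
lo = (noisy[0::2] + noisy[1::2]) / np.sqrt(2)
hi = (noisy[1::2] - noisy[0::2]) / np.sqrt(2)

# Soft-threshold the detail band; universal threshold, noise level assumed known
A = 0.63 * np.sqrt(2 * np.log(hi.size))
hi = np.sign(hi) * np.maximum(np.abs(hi) - A, 0.0)

# Synthesis from the surviving coefficients
den = np.empty_like(noisy)
den[0::2] = (lo - hi) / np.sqrt(2)
den[1::2] = (lo + hi) / np.sqrt(2)

improvement = snr_db(clean, den) - snr_db(clean, noisy)
print(round(improvement, 2))    # positive: denoising raised the SNR
```

Most of the noise energy in the detail band falls below the threshold and is discarded, while the pulse train survives mainly in the approximation band, so the output SNR rises.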

Waveform SNR improvement<br />

‘Haar’ 2.3279<br />

‘db’ 3.4801<br />

‘sym’ 3.4522<br />

‘coif’ 3.5583<br />

‘bior’ 3.7485<br />

‘dmey’ 3.8281<br />

Table 1: SNR improvement of wavelet denoising method using different analyzing functions<br />

Figure 3: The original noisy 10 KHz pulse train (above) and the denoised version using the discrete Meyer’s transform (below)<br />



Figure 4: The original Gaussian noise (above) and the reconstruction of wavelet coefficients discarded by thresholder (below)<br />

Figure 5: The original noisy 10 KHz pulse train (above) and the denoised version using the Haar transform (below)<br />



Figure 6: The original Gaussian noise (above) and the reconstruction of wavelet coefficients discarded by thresholder (below)<br />

Figure 5 shows the denoised version of the received signal using the Haar wavelet transform<br />

(below). The signal reconstructed from the coefficients discarded in the thresholder is shown in Figure 6 (below). By using<br />

the Haar wavelet transform, an SNR improvement of 2.3 dB has been achieved.<br />

Haar wavelet analyzing has sharp edges compared to the Meyer’s wavelet mother function which is smoother<br />

and this results to the loss of signal information over those sharp edges therefore a lower SNR improvement. Overall the<br />

use of the wavelet deoinsing method with any of the analyzing functions results to a SNR improvement of approximately<br />

3 to 4 dB which means a signal twice more powerful than the noisy one. This improvement can be achieved for a noise<br />

which is uncorrelated with the information signal, and where a reference channel for noise is not available.<br />
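The decompose-threshold-reconstruct chain described above can be sketched with a single-level Haar transform. This is a minimal illustration, not the paper's implementation: the pulse train, noise level, and threshold rule below are assumptions.<br />

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed test signal: a square pulse train in additive white Gaussian noise.
n = 1024
t = np.arange(n)
clean = np.where((t // 64) % 2 == 0, 1.0, -1.0)
noisy = clean + 0.5 * rng.standard_normal(n)

# One level of the orthonormal Haar transform: approximation + detail bands.
approx = (noisy[0::2] + noisy[1::2]) / np.sqrt(2.0)
detail = (noisy[0::2] - noisy[1::2]) / np.sqrt(2.0)

# Thresholding block: discard detail coefficients that fall below the threshold.
thr = 3.0 * np.median(np.abs(detail))
detail = np.where(np.abs(detail) > thr, detail, 0.0)

# Reconstruct the signal from the remaining coefficients (inverse Haar).
denoised = np.empty(n)
denoised[0::2] = (approx + detail) / np.sqrt(2.0)
denoised[1::2] = (approx - detail) / np.sqrt(2.0)

def snr_db(ref, sig):
    """SNR of sig against the clean reference, in dB (+3 dB ~ double the power)."""
    return 10.0 * np.log10(np.sum(ref**2) / np.sum((sig - ref)**2))

improvement = snr_db(clean, denoised) - snr_db(clean, noisy)
```

The positive `improvement` here plays the role of the SNR gains listed in Table 1; deeper decompositions with smoother analyzing functions are what push the gain toward the reported 3 to 4 dB.<br />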

5. CONCLUSIONS<br />

The different noise contributions at infrared wireless receivers have been reviewed. A new denoising method for uncorrelated noise in wireless infrared receivers was introduced using the wavelet transform. In this method, the denoised version is obtained by reconstructing the signal from the coefficients remaining after thresholding. We evaluated the Coiflet, Daubechies, Haar, Symmlet, biorthogonal, and Meyer wavelet analyzing functions for denoising at an infrared wireless receiver. Overall, using the wavelet transform with any of the analyzing functions in the simulation resulted in an SNR improvement of approximately 3 to 4 dB for an input SNR of 4 dB. If the power density function of the noise, which is uncorrelated with the information signal, is known while the reference channel model is unknown, the use of self-defined adaptive wavelet analyzing functions can improve the SNR of a received signal whose spectrum overlaps with that of the noise.<br />

A comparison of the SNR improvement for the different wavelet analyzing functions has been carried out. The results indicate that smoother analyzing functions preserve more signal information and hence yield a higher SNR improvement. However, since the overall SNR improvement of the wavelet decomposition method lies between 3 and 4 dB across the different wavelets, we suggest using wavelets that are easier to implement on digital signal processor (DSP) chips and have efficient computation time, in order to satisfy the speed constraints of the electronics used in the lightwave system.<br />







Authorized licensed use limited to: <strong>Ryerson</strong> <strong>University</strong> Library. Downloaded on July 7, 2009 at 10:46 from IEEE Xplore. Restrictions apply.




Fixed Block-based Lossless Compression of Digital Mammograms<br />

Marwan Y. Al-Saiegh and Sridhar Krishnan<br />

Dept. of Electrical and Computer Engineering,<br />

<strong>Ryerson</strong> Polytechnic <strong>University</strong>, Toronto, ON M5B 2K3, CANADA.<br />

Email: malsaie@ee.ryerson.ca, krishnan@ee.ryerson.ca<br />

Abstract: Breast cancer is a leading cause of death among women in Canada. Computer-aided diagnosis of mammograms (X-ray films of breast tissue) is a non-invasive and inexpensive way of diagnosing breast cancer. The objective of this project is to investigate image compression schemes for faithful transmission and reproduction of digital mammography data over a communication link. A fixed block-based (FBB) near-lossless compression scheme for mammograms has been developed which runs in conjunction with traditional compression schemes such as Huffman coding and LZW (Lempel-Ziv Welch) coding. The algorithm codes blocks of pixels within the image that contain the same intensity value (the odds of having blocks of the same pixel values in a mammography image are very high), thus reducing the size of the image substantially while encoding the image at the same time. The proposed compression scheme was applied to 44 mammograms (22 benign and 22 malignant), and provided a compression ratio of 1.7:1. When Huffman coding and LZW coding were used in conjunction with the FBB compression scheme, the compression ratio was 3.81:1 for Huffman and 5:1 for LZW coding. The proposed FBB lossless compression technique seems to be promising for teleradiology applications.<br />

1 Introduction<br />

Breast cancer is one of the leading causes of death among women worldwide. In the U.S. alone in 2000, more than 40,000 women died of breast cancer. Therefore, early diagnosis is extremely important to reduce the mortality rate [1]. American Cancer Society guidelines for women aged 40-50 advocate screening every 1-2 years, with frequency based on the patient's risk factors. This procedure would result in some 20 million mammograms per year. Archiving and retaining these data for at least three years will be expensive and difficult, requiring sophisticated data compression techniques [2]. Screening of mammograms in rural clinics is also a growing concern: owing to the scarcity of radiologists, a subject may have to wait for weeks to get her diagnosis result. The delay in producing the results is mainly due to infrequent visits of radiologists to rural clinics, and the non-availability of an efficient communication link through which a mammographic image could be faithfully transmitted to a city clinic. Teleradiology of digital mammograms could significantly alleviate this problem, and may facilitate early diagnosis and reduce the incidence of this killer disease. The above facts warrant an efficient data compression scheme for mammograms.<br />

Figure 1: Block diagram of the FBB technique using Huffman and Lempel-Ziv Welch coding<br />

Physicians and radiologists are reluctant to consider a technique that would discard even a small amount of information from a mammographic image. By exploiting the redundancy, or correlation, of pixels in an image, a data compression technique can be designed to compress an image efficiently. Current compression techniques are based on transform coding, such as the discrete wavelet transform and the discrete cosine transform [3]. Although transform coding techniques have claimed a compression ratio of 10:1, they are lossy compression schemes and require extensive receiver operating characteristic (ROC) studies of the compressed images. The proposed near-lossless compression scheme is shown in block diagram form in Fig. 1. The proposed technique is "minimally" lossy and does not require any ROC studies to evaluate compressed images. The paper is organized as follows: Section 2 covers the fixed block-based (FBB) compression scheme; Huffman coding and LZW coding are briefly covered in Section 3; Section 4 covers results and discussion; and the paper is concluded in Section 5.<br />



Figure 2: Block diagram of the FBB compression<br />


2 Fixed block-based (FBB) compression scheme<br />

The fixed block-based (FBB) compression scheme takes advantage of pixel correlation while scanning the image from left to right. It is known that adjacent pixels in a mammographic image are highly correlated. Adjacent pixels can therefore be combined to reduce redundancy, and that is what the proposed FBB algorithm is based upon. The FBB compression scheme reads the pixels one at a time and stores them in a two-dimensional array (e.g. 448 x 448). The histogram of the mammogram is used to identify pixel values that do not appear in the image. This procedure is essential because it provides two redundant pixel values that are used as keys throughout the algorithm to avoid overlapping and to represent blocks of zeros in the output file.<br />

2.1 Algorithm of the compression scheme<br />

The proposed FBB compression scheme is shown in block diagram form in Fig. 2. The steps involved are:<br />

1. If the difference between the current pixel x[i][j] and each of the surrounding pixels x[i][j+1], x[i+1][j+1], and x[i+1][j] is 0, 1, or -1, then go to step 3. Otherwise go to step 2.<br />

2. Write the current pixel x[i][j] to the output file, move the sliding window to the next column of the two-dimensional array, and go back to step 1.<br />

3. If the current pixel is not zero, write (-1)*current pixel to the output file. If the current pixel is 0, write the second 'key' pixel to the output file. Go to step 4.<br />

4. Replace the block of pixels in the two-dimensional array by the first 'key' pixel, to avoid overlapping when the algorithm is implemented, and go to step 5.<br />

5. Force the sliding window to skip one column, and go back to step 1 (i.e. instead of sliding the window from column two, start from column three).<br />

It is important to realize that the sliding-window approach can be improved by making the window size bigger, to absorb more pixels when needed; the tradeoff is that more 'key' pixels must then be chosen, one for every window size, to distinguish between the different window sizes. Since 0 cannot be positive or negative, a 'key' pixel is also needed for the 0 pixel: whenever the sliding window algorithm finds a block of 0 pixels, the second 'key' pixel is written to the output file.<br />

Figure 3: Block diagram of the FBB decompression<br />



2.2 Algorithm of the decompression scheme<br />

The FBB decompression algorithm is shown in block diagram form in Fig. 3. While performing decompression, the compressed file is read from the standard input and stored in a one-dimensional array. A temporary two-dimensional array is used to reconstruct the image. The temporary array is initialized with the first 'key' value. The algorithm is as follows:<br />


1. If the current pixel in the one-dimensional array, x[i], is negative, write its value in positive form, x[i]*(-1), as a block (i.e. a 4x4 matrix) to the temporary two-dimensional array and go to step 4.<br />

2. If the current pixel value is the same as the second 'key' value, write a block of 0's (i.e. a 4x4 matrix) to the temporary two-dimensional array and go to step 4.<br />

3. If the current pixel in the one-dimensional array is positive, write the current pixel value to the temporary array and go to step 5.<br />

4. Skip the following column in the temporary two-dimensional array to account for the block of pixels, and go to step 5.<br />

5. Increment the index into the one-dimensional array by one and go back to step 1.<br />

Once the decompression algorithm is completed, the temporary two-dimensional array is written to an output file (i.e. a decompressed version of the original file).<br />

3 Huffman coding and Lempel-Ziv-Welch coding<br />

After performing FBB compression of the mammogram image, it is further compressed using popular lossless compression schemes such as Huffman coding and LZW coding.<br />

3.1 Huffman coding<br />

Huffman codes belong to a family of codes with variable codeword length, which means that the individual symbols that make up a message are represented (encoded) with bit sequences of distinct lengths [4]. This helps to decrease the amount of redundancy in message data. The reduction of redundancy by Huffman codes is based on the fact that distinct symbols have distinct probabilities of incidence. Symbols with higher probabilities of incidence are coded with shorter code words, while symbols with lower probabilities are coded with longer code words.<br />
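A minimal sketch of this construction, using a generic heap-based Huffman builder rather than the paper's implementation:<br />

```python
import heapq
from collections import Counter

def huffman_codes(symbols):
    """Return a prefix-free code {symbol: bitstring}; frequent symbols get short codes."""
    freq = Counter(symbols)
    if len(freq) == 1:  # degenerate single-symbol input
        return {next(iter(freq)): "0"}
    # Heap entries: [weight, tiebreaker, {symbol: partial codeword}].
    heap = [[w, i, {s: ""}] for i, (s, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    count = len(heap)
    while len(heap) > 1:
        lo = heapq.heappop(heap)  # pop the two least probable subtrees...
        hi = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in lo[2].items()}
        merged.update({s: "1" + c for s, c in hi[2].items()})
        heapq.heappush(heap, [lo[0] + hi[0], count, merged])  # ...merge, reinsert
        count += 1
    return heap[0][2]
```

For the input "aaaabbc", the most frequent symbol 'a' receives the shortest codeword, and no codeword is a prefix of another, so the bitstream decodes unambiguously.<br />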

3.2 Lempel-Ziv-Welch coding<br />

The LZW algorithm relies on the re-occurrence of byte sequences (strings) in its input [5]. It maintains a table mapping input strings to their associated output codes. The goal of LZW compression is to replace repeating input strings with n-bit codes. This is done by generating a string table on the fly, which is a mapping between pixel values and compression codes. The string table is built by the encoder as it processes the data, and, due to the encoding method, the decoder can reconstruct the string table as it processes the compressed data. This differs from other compression algorithms, such as Huffman coding, where the lookup table needs to be included with the compressed data.<br />

LZW exploits the fact that many groupings of pixels are common in images: it goes through the image data and tries to encode as large a grouping of pixels as possible with an encoding from the string table, placing unrecognized groupings into the string table so they can be compressed on later occurrences. For an image with n-bit pixel values, it uses compression codes that are n+1 bits or larger. While a smaller compression code helps gain larger amounts of compression, the size of the compression code limits the size of the string table.<br />
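The on-the-fly table construction can be sketched as follows. This is a generic byte-oriented LZW encoder/decoder pair, not the paper's pixel implementation; codes are kept as plain integers rather than packed n+1-bit fields.<br />

```python
def lzw_compress(data):
    """Replace repeating byte strings with integer codes, building the table on the fly."""
    table = {bytes([i]): i for i in range(256)}
    next_code, w, out = 256, b"", []
    for b in data:
        wc = w + bytes([b])
        if wc in table:
            w = wc                   # grow the current string while it is known
        else:
            out.append(table[w])     # emit the code for the longest known prefix
            table[wc] = next_code    # learn the new string
            next_code += 1
            w = bytes([b])
    if w:
        out.append(table[w])
    return out

def lzw_decompress(codes):
    """Rebuild the string table while decoding; no table is transmitted."""
    table = {i: bytes([i]) for i in range(256)}
    next_code = 256
    w = table[codes[0]]
    out = [w]
    for c in codes[1:]:
        entry = table[c] if c in table else w + w[:1]  # KwKwK corner case
        out.append(entry)
        table[next_code] = w + entry[:1]
        next_code += 1
        w = entry
    return b"".join(out)
```

Note how the decoder recreates each table entry one step behind the encoder, which is why no lookup table has to accompany the compressed data.<br />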

4 Results and discussion<br />

The proposed FBB compression scheme was tested on the MiniMammographic database of 44 images from the Mammographic Image Analysis Society (MIAS). The MIAS is an organisation of UK research groups interested in the understanding of mammograms. Films taken from the UK National Breast Screening Programme have been digitized to a 50 micron pixel edge with a Joyce-Loebl scanning microdensitometer, a device linear in the optical density range 0 to 3.2 and representing each pixel with an 8-bit word. The database also includes the radiologist's 'truth' markings on the locations of any abnormalities that may be present.<br />

The total numbers of benign and malignant images in the database are 22 and 22 respectively. A benign mammographic image is shown in Fig. 4, and its compressed version is shown in Fig. 5. Perceptually there is no difference in image quality between the original image and its compressed version. Fig. 6 illustrates a malignant image; the compressed image is shown in Fig. 7. In this case, too, there is no difference between the original and the compressed images.<br />

Table 1 shows the advantage of using FBB in conjunction with Huffman coding and LZW coding. The mean compression ratio of benign and malignant images using the FBB scheme alone was approximately 1.7:1. The mean<br />



Figure 4: Benign mammogram before FBB compression<br />

Figure 5: Same benign image as in Fig. 4 after FBB compression<br />




Figure 6: Malignant mammogram before FBB compression<br />

Figure 7: Same malignant image in Fig. 6 after FBB compression<br />





Table 1: Compression ratios of benign and malignant images. Legend: CR = compression ratio<br />

compression ratio of benign and malignant images using FBB with Huffman coding was approximately 3.81:1. The mean compression ratio of benign and malignant images using FBB with LZW coding was approximately 5:1.<br />

Figure 8: Bar graph for different compression schemes applied to benign images<br />

The two bar graphs illustrate the usefulness of combining FBB with other standard compression schemes such as Huffman coding and LZW coding. The first bar graph, in Fig. 8, is for benign images, and the second, in Fig. 9, is for malignant images. The y-axis in each bar graph denotes the mean file size of the benign or malignant images in bytes, while the x-axis denotes the scheme applied to those images (e.g. Huffman with FBB).<br />

5 Conclusion<br />

In this paper a novel method of compressing mammographic images is proposed. The scheme is based upon FBB scanning of the pixels of an image. FBB codes blocks of pixels within the image that contain the same value (the odds of having blocks of the same pixel values in a mammography image are very high), thus reducing the size of the image substantially while encoding the image at the same time. The FBB compression scheme alone provided a compression ratio of 1.7:1. When Huffman coding and LZW coding were used in conjunction with the FBB compression scheme, the compression ratio was 3.8:1 for Huffman and 5:1 for LZW coding. The proposed FBB lossless compression technique seems to be promising for teleradiology applications. Future work involves investigation of the compression scheme for transmission of mammography data over the internet protocol.<br />

Figure 9: Bar graph for different compression schemes applied to malignant images<br />

Acknowledgment<br />

We would like to acknowledge the Mammographic Image<br />

<strong>Analysis</strong> Society (MIAS) for granting us permission to use<br />

their database. We would also like to acknowledge <strong>Ryerson</strong><br />

<strong>University</strong> and NSERC for providing financial support.<br />

References<br />

[1] S. A. Feig, "Decreased breast cancer mortality through mammographic screening: results of clinical trials," Radiology, vol. 167, pp. 659-665, 1988.<br />

[2] H. A. Frazer, "Computerized diagnosis comes to mammography," Diagnostic Imaging, pp. 91-95, June 1991.<br />

[3] Z. Yang, M. Kallergi, R. A. DeVore, B. Lucier, W. Qian, R. A. Clark, and L. P. Clarke, "Effect of wavelet bases on compressing digital mammograms," IEEE Engineering in Medicine and Biology Magazine, vol. 14, no. 5, pp. 570-577, Sep/Oct 1995.<br />

[4] D. A. Huffman, "A method for the construction of minimum redundancy codes," Proc. IRE, vol. 40, pp. 1098-1101, 1952.<br />

[5] J. Ziv and A. Lempel, "Compression of individual sequences via variable-rate coding," IEEE Trans. Information Theory, vol. IT-24, pp. 530-536, 1978.<br />




Instantaneous mean frequency estimation using adaptive time-frequency distributions<br />

Sridhar Krishnan<br />

Dept. of Electrical and Computer Engineering,<br />

<strong>Ryerson</strong> Polytechnic <strong>University</strong>, Toronto, ON M5B 2K3, CANADA.<br />

Email: krishnan@ee.ryerson.ca<br />

Abstract: Analysis of non-stationary signals is a challenging task. True non-stationary signal analysis involves monitoring the frequency changes of the signal over time (i.e., monitoring the instantaneous frequency (IF) changes). The IF of a signal is traditionally obtained by taking the first derivative of the phase of the signal with respect to time. This poses some difficulties because the derivative of the phase of the signal may take negative values, thus misleading the interpretation of instantaneous frequency. In this paper, a novel approach to extract the IF of a signal from its adaptive time-frequency distribution is proposed. The adaptive time-frequency distribution of a signal is obtained by decomposing the signal into components with good time-frequency localization and combining the Wigner distributions of the components. The adaptive time-frequency distribution thus obtained is free of cross-terms and is a positive time-frequency distribution with good time and frequency localization. The IF may be obtained as the first central moment of the adaptive time-frequency distribution. The proposed method of IF estimation is very powerful for applications with low SNR. The proposed technique was tested with synthetic signals of known IF dynamics, and the method successfully extracted the IF of the signals.<br />

Keywords: instantaneous frequency, non-stationary signals, positive time-frequency distributions, matching pursuit, average frequency.<br />

1 Introduction<br />

The instantaneous frequency (IF) of a signal is a parameter of practical importance in areas such as seismic, radar, sonar, communications, and biomedical applications. In all these applications the IF describes some physical phenomenon associated with the signal. Like most other signal processing concepts, the IF was originally used to describe FM modulation in communications. In a typical radar application, the IF aids in the detection, tracking, and imaging of targets whose radial velocities change with time. When the radial velocity is not constant, the radar's Doppler-induced frequency has a nonstationary spectrum, which can be tracked by IF estimation techniques. Also, in biomedical signal analysis, the IF is used in studying electroencephalogram (EEG) signals to monitor key neural activities of the brain.<br />

The importance of the IF concept arises from the fact that in most applications a signal processing engineer is confronted with the task of processing signals whose spectral characteristics (in particular, the frequencies of the spectral peaks) vary with time. Such signals are often referred to as non-stationary signals. A chirp signal is a simple example of a non-stationary signal, in which the frequency of the sinusoid changes linearly with time. It is theoretically difficult to describe the IF of a signal since most signals are multicomponent, and it is difficult to define a unique parameter for each time instant. Also, since frequency is usually defined as the number of oscillations or vibrations undergone in a unit time period, the association of the words "instantaneous" and "frequency" is still controversial.<br />

Several authors have tried to define the IF of a signal. In this paper the IF is defined by using an adaptive time-frequency distribution (TFD). The paper is organized as follows: a brief review of the topic of IF is given in Section 2; the proposed adaptive TFD technique is described in Section 3; results with synthetic and real-world signals are discussed in Section 4; and the paper is concluded in Section 5.<br />

2 Review<br />


The classical definition of the IF of a signal [1] is<br />

f_i(t) = (1/(2*pi)) * d(phi(t))/dt,<br />

where phi(t) is the phase of the analytic signal. Ville formulated a joint TFD of the signal energy called the Wigner-Ville distribution (WVD), and defined the IF as the first central moment (average frequency) of the WVD,<br />

f_i(t) = ( integral of f*W(t,f) df ) / ( integral of W(t,f) df ).<br />


Most Cohen’s class TFDs derived from the WVD yield the IF by a correct first-moment calculation, but this is often computationally expensive and is adversely affected by noise.<br />
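The first-moment estimator can be illustrated with a spectrogram, itself a simple positive TFD, standing in for the adaptive TFD; the chirp, sampling rate, and window sizes below are assumptions made for the sketch:<br />

```python
import numpy as np

fs = 1000.0                               # assumed sampling rate, Hz
t = np.arange(0.0, 1.0, 1.0 / fs)

# Assumed test signal: linear chirp whose true IF sweeps 50 -> 250 Hz.
x = np.cos(2.0 * np.pi * (50.0 * t + 100.0 * t**2))

win, hop = 128, 64
freqs = np.fft.rfftfreq(win, 1.0 / fs)
if_est = []
for start in range(0, len(x) - win, hop):
    seg = x[start:start + win] * np.hanning(win)
    power = np.abs(np.fft.rfft(seg))**2   # one time slice of a positive TFD
    # First moment of the slice = mean frequency at this time instant.
    if_est.append(np.sum(freqs * power) / np.sum(power))
```

For this chirp the estimates track the true IF, 50 + 200t Hz, rising from about 60 Hz in the first frame toward 230 Hz in the last; the estimator never leaves the spectral range of the signal, which is the property the positive-TFD formulation is after.<br />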

Most TFDs, such as the WVD, provide high signal energy concentration in time and frequency, so it is tempting to use them to measure the spread of frequencies with time. Unfortunately, the spread of the IF of the WVD is only positive for certain types of signals. Even when the spread is positive, some negative distribution values may appear in the calculation, and thus its usefulness is questionable. From the literature it appears that there are still many unresolved issues regarding the IF of a signal (a detailed review of the fundamentals of IF is available in [2]). It has been shown that the usual way of interpreting the IF as the average frequency at each time brings out unexpected results with Cohen's class of bilinear TFDs. If the IF is interpreted as the average frequency, then the IF need not be a frequency that appears in the spectrum of the signal. If the IF is interpreted as the derivative of the phase, then the IF can extend beyond the spectral range of the signal. It has recently been reported that estimating the IF of a signal using a positive TFD brings out a meaningful interpretation of the IF of the signal [3]. The motivation behind this paper is to adaptively construct a positive TFD suitable for estimating the IF of a signal.<br />

3 Adaptive Time-Frequency Distributions<br />

The purpose of this paper is to explore the best available TFD for estimating the IF of a signal. For simple applications, Cohen's-class TFDs or model-based TFDs may be applied. It is widely accepted that, in the case of complex signals with multiple frequency components, there is no definite TFD that will satisfy all the criteria and still give optimal performance. The purpose of this section is to construct TFDs according to the application at hand, i.e., to tailor the TFD to the properties of the signal being analyzed. It is appropriate to call such TFDs adaptive TFDs. In the present work, the concept of adaptive TFDs is based on signal decomposition.<br />

In practice, no TFD may satisfy all the requirements. In the method proposed in this section, the TFDs are modified, by using constraints, to satisfy certain specified criteria. It is assumed that the given signal is somehow decomposed into components of a specified mathematical representation. By knowing the components of a signal, the interaction between them can be established and used to remove or prevent cross-terms. This avoids the main drawback associated with Cohen's-class TFDs; numerous efforts have been directed at developing kernels to overcome the cross-term problem [4, 5, 6].<br />

The key to the successful design of adaptive TFDs lies in the selection of the decomposition algorithm. The components obtained from a decomposition algorithm depend largely on the type of basis functions used. For example, the basis function of the Fourier transform decomposes the signal into tonal (sinusoidal) components, and the basis function of the wavelet transform decomposes the signal into components with good time and scale properties. For TF representation, it is beneficial if the signal is decomposed using basis functions with good TF properties. The components obtained by decomposing a signal using such basis functions may be termed TF atoms. An algorithm that can decompose a signal into TF atoms is the MP algorithm described in the next section.<br />

3.1 Matching Pursuit<br />

The MP algorithm decomposes the given signal using basis functions that have excellent TF properties. The MP algorithm selects the decomposition vectors depending upon the signal properties. The vectors are selected from a family of waveforms called a dictionary. The signal x(t) is projected on to a dictionary of TF atoms obtained by scaling, translating, and modulating a window function g(t):<br />

x(t) = \sum_{n=0}^{\infty} a_n g_{\gamma_n}(t),   (3)<br />

where<br />

g_{\gamma_n}(t) = \frac{1}{\sqrt{s_n}}\, g\!\left(\frac{t - p_n}{s_n}\right) \exp[j(2\pi f_n t + \phi_n)],   (4)<br />

and a_n are the expansion coefficients. The scale factor s_n is used to control the width of the window function, and the parameter p_n controls temporal placement. The factor 1/\sqrt{s_n} normalizes g_{\gamma_n}(t) to unit norm. The parameters f_n and \phi_n are the frequency and phase of the exponential function, respectively. \gamma_n represents the set of parameters (s_n, p_n, f_n, \phi_n).<br />

In the present work, the window is a Gaussian function, i.e., g(t) = 2^{1/4} \exp(-\pi t^2); the TF atoms are then called Gabor atoms, and they provide the optimal TF resolution in the TF plane.<br />

In practice, the algorithm works as follows. The signal is iteratively projected on to a Gabor function dictionary. The first projection decomposes the signal into two parts:<br />

x(t) = \langle x, g_{\gamma_0} \rangle g_{\gamma_0}(t) + R^1 x(t),   (5)<br />

where \langle x, g_{\gamma_0} \rangle denotes the inner product (projection) of x(t) with the first TF atom g_{\gamma_0}(t). The term R^1 x(t) is the residue after approximating x(t) in the direction of g_{\gamma_0}(t). This process is continued by projecting the residue on to the subsequent functions in the dictionary; after M iterations,<br />

x(t) = \sum_{n=0}^{M-1} \langle R^n x, g_{\gamma_n} \rangle g_{\gamma_n}(t) + R^M x(t),   (6)<br />

with R^0 x(t) = x(t). There are two ways of stopping the iterative process: one is to use a pre-specified limiting number M of TF atoms, and the other is to check the energy of the residue R^M x(t). A very high value of M and a near-zero threshold on the residual energy will decompose the signal completely, at the expense of increased computational complexity.<br />
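The greedy projection loop of Eqs. 5-6 can be sketched as follows. This is an illustrative NumPy implementation using real-valued Gabor atoms and a small hand-built dictionary; the paper's atoms are complex with an explicit phase \phi_n, and practical dictionaries are far larger:

```python
import numpy as np

def gabor_atom(N, s, p, f):
    # real Gabor atom: Gaussian window of width s, centred at sample p,
    # modulated to frequency f (cycles/sample), normalized to unit energy
    t = np.arange(N)
    g = np.exp(-np.pi*((t - p)/s)**2) * np.cos(2*np.pi*f*(t - p))
    return g / np.linalg.norm(g)

def matching_pursuit(x, atoms, M):
    # greedy loop of Eqs. 5-6: at each step project the current residue
    # on every atom and subtract the best-matching one
    residue = x.astype(float).copy()
    picks = []
    for _ in range(M):
        coeffs = atoms @ residue            # <R^n x, g_gamma> for all atoms
        k = int(np.argmax(np.abs(coeffs)))
        picks.append((k, coeffs[k]))
        residue -= coeffs[k] * atoms[k]
    return picks, residue

np.random.seed(0)
N = 256
atoms = np.array([gabor_atom(N, s, p, f)
                  for s in (8, 16, 32)
                  for p in range(0, N, 16)
                  for f in (0.05, 0.10, 0.20)])
# a test signal: one dictionary atom plus a little noise
x = 2.0*gabor_atom(N, 16, 128, 0.10) + 0.01*np.random.randn(N)
picks, residue = matching_pursuit(x, atoms, M=5)
```

On this signal the first pick recovers the embedded atom with a coefficient close to 2, and the residual energy drops to roughly the noise floor after a few iterations.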

3.2 Matching Pursuit TFD<br />

A signal decomposition-based TFD may be obtained by taking the WVD of the TF atoms in Eq. 6, and is given as [7]:<br />

W x(t,\omega) = \sum_{n=0}^{M-1} |\langle R^n x, g_{\gamma_n} \rangle|^2 W_{g_{\gamma_n}}(t,\omega) + \sum_{n=0}^{M-1} \sum_{m=0,\, m \neq n}^{M-1} \langle R^n x, g_{\gamma_n} \rangle \langle R^m x, g_{\gamma_m} \rangle^{*}\, W_{[g_{\gamma_n}, g_{\gamma_m}]}(t,\omega),   (7)<br />

where W_{g_{\gamma_n}}(t,\omega) is the WVD of the Gaussian window function. The double sum corresponds to the cross-terms of the WVD, indicated by W_{[g_{\gamma_n}, g_{\gamma_m}]}(t,\omega), and should be rejected in order to obtain a cross-term-free energy distribution of x(t) in the TF plane. Thus only the first term is retained, and the resulting TFD is given by<br />

W'(t,\omega) = \sum_{n=0}^{M-1} |\langle R^n x, g_{\gamma_n} \rangle|^2 W_{g_{\gamma_n}}(t,\omega).   (8)<br />

This cross-term-free TFD, also known as the matching pursuit TFD (MPTFD), has very good readability and is appropriate for the analysis of nonstationary, multicomponent signals. The extraction of coherent structures makes MP an attractive tool for TF representation of signals with unknown SNR.<br />
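Eq. 8 can be evaluated without numerically computing any WVD, because the WVD of a unit-energy Gaussian atom has the closed form W(t,f) = 2 exp{-2\pi[((t-p)/s)^2 + s^2 (f - f_n)^2]} (Mallat and Zhang [7]). A sketch, assuming each MP pick is stored as its parameters and coefficient (the pick values below are illustrative):

```python
import numpy as np

def atom_wvd(t, f, s, p, fc):
    # closed-form WVD of a unit-energy Gaussian (Gabor) atom with scale s,
    # time centre p, and centre frequency fc (cycles/sample)
    T, F = np.meshgrid(t, f, indexing="ij")
    return 2.0 * np.exp(-2.0*np.pi*(((T - p)/s)**2 + (s*(F - fc))**2))

def mptfd(picks, t, f):
    # Eq. (8): cross-term-free distribution as the |coefficient|^2-weighted
    # sum of the atoms' own WVDs; the cross-terms of Eq. (7) are discarded
    D = np.zeros((len(t), len(f)))
    for (s, p, fc), a in picks:
        D += (abs(a)**2) * atom_wvd(t, f, s, p, fc)
    return D

# two hypothetical MP picks: ((scale, position, frequency), coefficient)
picks = [((16.0, 100.0, 0.10), 2.0), ((32.0, 300.0, 0.25), 1.0)]
t = np.arange(512.0)
f = np.linspace(0.0, 0.5, 128)
D = mptfd(picks, t, f)    # positive everywhere, atoms well localized
```

Because each summand is a positive Gaussian bump, the resulting distribution is nonnegative and peaks at the TF locations of the extracted atoms, which is what gives the MPTFD its readability.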

3.3 Minimum Cross-Entropy Optimization of the MPTFD<br />

One of the drawbacks of the MPTFD is that it does not satisfy the marginal properties. If a TFD is positive and satisfies the marginals, it may be considered to be a proper TFD for the extraction of time-varying frequency parameters such as the IF. This is because positivity coupled with correct marginals ensures that the TFD is a true probability density function, and the parameters extracted are meaningful [8]. The MPTFD may be modified to satisfy the marginal requirements and still preserve its other important characteristics. One way to optimize the MPTFD is by using the cross-entropy minimization method [9, 10]. Cross-entropy minimization is a general method of inference about an unknown probability density when there exists a prior estimate of the density and new information in the form of constraints on expected values is available. If the optimized MPTFD or OMP TFD (an unknown probability density function) is denoted by M(t,\omega), then it should satisfy the marginals<br />

\int M(t,\omega)\, d\omega = |x(t)|^2 = m(t),   (9)<br />

and<br />

\int M(t,\omega)\, dt = |X(\omega)|^2 = m(\omega),   (10)<br />

where X(\omega) is the Fourier transform of x(t).<br />

Eqs. 9 and 10 may be treated as constraint equations (new information) for optimization. Now, M(t,\omega) may be obtained from W'(t,\omega) (a prior estimate of the density) by minimizing the cross-entropy between them, given by<br />

S(M, W') = \int\!\!\int M(t,\omega) \log \frac{M(t,\omega)}{W'(t,\omega)}\, dt\, d\omega.   (11)<br />

As we are interested only in the marginals, the OMP TFD may be written as [10]:<br />

M(t,\omega) = W'(t,\omega) \exp\{-(\alpha_0(t) + \beta_0(\omega))\},   (12)<br />

where the \alpha's and \beta's are Lagrange multipliers which may be determined using the constraint equations. An iterative algorithm to obtain the Lagrange multipliers and solve for M(t,\omega) is presented next.<br />

At the first iteration, we define<br />

M^1(t,\omega) = W'(t,\omega) \exp(-\alpha_0(t)).   (13)<br />

As the marginals are to be satisfied, the time marginal constraint has to be imposed in order to solve for \alpha_0(t). By imposing the time marginal constraint given by Eq. 9 on Eq. 13, we obtain<br />

\exp(-\alpha_0(t)) = \frac{m(t)}{m'(t)},   (14)<br />

where m(t) is the desired time marginal and m'(t) is the time marginal estimated from W'(t,\omega). Now, Eq. 13 can be rewritten as<br />

M^1(t,\omega) = W'(t,\omega)\, \frac{m(t)}{m'(t)}.   (15)<br />
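On a discretized TFD, the alternating marginal-imposition steps (the time step above, followed by the analogous frequency step) amount to iterative proportional fitting of a positive matrix. A sketch (illustrative NumPy; the random prior stands in for an MPTFD, and the two marginals are assumed to carry equal total energy):

```python
import numpy as np

def fit_marginals(W, m_t, m_f, n_iter=200):
    # alternately impose the time and frequency marginals on a positive
    # prior TFD; each pass multiplies by m/m', i.e. the exp(-multiplier)
    # factor of the cross-entropy solution, and the result stays positive
    M = W.copy()
    for _ in range(n_iter):
        M *= (m_t / M.sum(axis=1))[:, None]   # time-marginal step
        M *= (m_f / M.sum(axis=0))[None, :]   # frequency-marginal step
    return M

rng = np.random.default_rng(1)
W = rng.random((8, 6)) + 0.1        # stand-in for a positive prior MPTFD
m_t = rng.random(8) + 0.1
m_t /= m_t.sum()                    # desired time marginal m(t)
m_f = rng.random(6) + 0.1
m_f /= m_f.sum()                    # desired frequency marginal m(w)
M = fit_marginals(W, m_t, m_f)
```

For a strictly positive prior the iteration converges quickly, and the fitted distribution matches both marginals while remaining positive, consistent with the cross-entropy argument in the text.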


At this point, M^1(t,\omega) is a modified MPTFD with the desired time marginal; however, it need not necessarily have the desired frequency marginal m(\omega). In order to obtain the desired frequency marginal, the following equation has to be solved:<br />

M^2(t,\omega) = M^1(t,\omega) \exp(-\beta_0(\omega)).   (16)<br />

Note that the TFD obtained after the first iteration, M^1(t,\omega), is used as the incoming estimate in Eq. 16. By imposing the frequency marginal constraint given by Eq. 10 on Eq. 16, we obtain<br />

\exp(-\beta_0(\omega)) = \frac{m(\omega)}{m^1(\omega)},   (17)<br />

where m(\omega) is the desired frequency marginal, and m^1(\omega) is the frequency marginal estimated from M^1(t,\omega). Now, Eq. 16 can be rewritten as<br />

M^2(t,\omega) = M^1(t,\omega)\, \frac{m(\omega)}{m^1(\omega)}.   (18)<br />

By incorporating the desired frequency marginal constraint, the M^1(t,\omega) TFD is altered and need not necessarily retain the desired time marginal. Successive iterations can overcome this problem and bring the estimated TFD closer to M(t,\omega). This follows from the fact that the cross-entropy between the desired TFD and the estimated TFD decreases with the number of iterations [10].<br />

As the iterative procedure is started with a positive distribution W'(t,\omega), the TFD at the nth iteration, M^n(t,\omega), is guaranteed to be a positive distribution. Such a class of distributions belongs to the Cohen-Posch class of positive distributions [8]. The OMP TFDs may also be regarded as adaptive TFDs because they are constructed on the basis of the properties of the signal being analyzed.<br />

A method for constructing a positive distribution using the spectrogram as a priori knowledge was developed by Loughlin et al. [11]. The major drawback of using the spectrogram as a priori knowledge is the loss of TF resolution; this effect may be minimized by taking multiple spectrograms with different sizes of analysis windows as initial estimates of the desired distribution. The method proposed in this section starts with the MPTFD, avoids the need for multiple spectrograms as initial estimates, and produces a high-resolution TFD tailored to the signal properties. The OMP TFD may be used to derive higher moments by estimating the higher-order Lagrange multipliers. Such measures are not necessary in the present work, and are beyond the scope of this paper.<br />

The IF of a signal can be computed as the first moment of TFD(t,\omega) along each time slice, given by<br />

\mathrm{IF}(t) = \frac{\sum_{\omega} \omega\, \mathrm{TFD}(t,\omega)}{\sum_{\omega} \mathrm{TFD}(t,\omega)}.   (19)<br />

The IF characterizes the frequency dynamics of the signal.<br />

4 Results<br />

The proposed method of extracting the IF of a signal was applied to a synthetic signal with known IF, and to a real-world example of a knee-joint sound signal.<br />

4.1 Synthetic <strong>Signal</strong><br />

The synthetic signal "syn1" is composed of nonoverlapping chirp, transient, and sinusoidal FM components, and is shown in Fig. 1. The frequency behavior of the signal is shown in Fig. 2. "syn1" is an example of a monocomponent signal with linear and nonlinear frequency dynamics. To simulate noisy signal conditions, "syn1" was corrupted by adding random noise to an SNR of 10 dB.<br />

Figure 1: Monocomponent, nonstationary, synthetic signal "syn1" consisting of a chirp, an impulse, and a sinusoidal FM component (SNR = 10 dB).<br />

Figure 2: Ideal TFD depicting the frequency laws of the signal "syn1" in Fig. 1.<br />

The MP method has given a clear picture of the IF representation: the three simulated components are perfectly localized in the TFD shown in Fig. 3. This is because the OMP TFD provides an adaptive representation of the signal components, and because each high-energy component is analyzed by the TF representation independently of its bandwidth and duration. The good localization of transients produced by MP is due to the good TF localization properties of the basis functions, whereas with other techniques, such as Fourier and wavelet analysis, the transient information gets diluted across the whole basis, and the collection of basis functions is not as large as that in the MP dictionary.<br />

Figure 3: OMP TFD of the signal "syn1" in Fig. 1.<br />

4.2 Real World Example<br />

The proposed technique was applied to real-world signals, namely, the knee sound signals. Due to the differences in<br />

the cartilage surface between normal and abnormal knees, sound signals with different IFs are produced [12]. Fig. 4 shows the knee sound signal of a normal subject. The IF of the same signal is shown in Fig. 5. Automatic classification of the sound signals, using the IF as a feature for pattern classification, has produced good results in screening abnormal knees from normal knees.<br />
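On a discretized TFD, the first-moment IF extraction of Eq. 19 is a one-line computation. The sketch below (illustrative NumPy; a synthetic Gaussian ridge stands in for a real OMP TFD) recovers a known IF law:

```python
import numpy as np

def if_from_tfd(tfd, freqs):
    # Eq. (19): first moment of the TFD along each time slice
    return (tfd @ freqs) / tfd.sum(axis=1)

freqs = np.linspace(0.0, 0.5, 101)       # normalized frequency grid
true_if = np.linspace(0.10, 0.40, 50)    # a slowly rising IF law, one per slice
# synthetic positive TFD: a narrow Gaussian ridge following true_if
tfd = np.exp(-((freqs[None, :] - true_if[:, None]) / 0.02)**2)
est_if = if_from_tfd(tfd, freqs)
```

Because the stand-in TFD is positive and well concentrated, the first moment of each slice lands on the ridge centre, mirroring how the positive OMP TFD makes the extracted IF meaningful.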

Figure 4: Knee sound signal of a normal subject.<br />

Figure 5: IF estimated from the OMP TFD of the normal<br />

knee sound signal in Fig. 4.<br />

5 Conclusion<br />


A novel method of extracting the IF of a signal is proposed in this paper. The extraction of the IF is based on constructing an adaptive TFD and extracting the IF as the first central moment for each time slice. The method was tested on synthetic signals with known IF, and the results were found to be satisfactory even for low-SNR cases.<br />

Acknowledgment<br />

We would like to acknowledge Micronet and NSERC for providing financial support.<br />

References<br />

[1] J. Carson and T. Fry. Variable frequency electric circuit theory with application to the theory of frequency modulation. Bell System Technical Journal, 16:513-540, 1937.<br />
[2] B. Boashash. Estimating and interpreting the instantaneous frequency of a signal - Part 1: Fundamentals. Proc. IEEE, 80(4):519-538, April 1992.<br />
[3] P. J. Loughlin. Comments on the interpretation of instantaneous frequency. IEEE <strong>Signal</strong> Processing Letters, 4(5):123-125, May 1997.<br />
[4] H. I. Choi and W. J. Williams. Improved time-frequency representation of multicomponent signals using exponential kernels. IEEE Trans. Acoustics, Speech, and <strong>Signal</strong> Processing, 37(6):862-871, 1989.<br />
[5] Z. Guo, L. G. Durand, and H. C. Lee. The time-frequency distributions of nonstationary signals based on a Bessel kernel. IEEE Trans. <strong>Signal</strong> Processing, 42:1700-1707, 1994.<br />
[6] R. G. Baraniuk and D. L. Jones. <strong>Signal</strong>-dependent time-frequency representation: optimal kernel design. IEEE Trans. <strong>Signal</strong> Processing, 41:1589-1602, 1993.<br />
[7] S. G. Mallat and Z. Zhang. Matching pursuit with time-frequency dictionaries. IEEE Trans. <strong>Signal</strong> Processing, 41(12):3397-3415, 1993.<br />
[8] L. Cohen and T. Posch. Positive time-frequency distribution functions. IEEE Trans. Acoustics, Speech, and <strong>Signal</strong> Processing, 33:31-38, 1985.<br />
[9] J. Shore and R. Johnson. Axiomatic derivation of the principle of maximum entropy and the principle of minimum cross-entropy. IEEE Trans. Information Theory, 26(1):26-37, 1980.<br />
[10] J. Shore and R. Johnson. Properties of cross-entropy minimization. IEEE Trans. Information Theory, 27(4):472-482, 1981.<br />
[11] P. Loughlin, J. Pitton, and L. Atlas. Construction of positive time-frequency distributions. IEEE Trans. <strong>Signal</strong> Processing, 42:2697-2705, 1994.<br />
[12] S. Krishnan, R. M. Rangayyan, G. D. Bell, and C. B. Frank. Adaptive time-frequency analysis of knee joint vibroarthrographic signals for non-invasive screening of articular cartilage pathology. IEEE Transactions on Biomedical Engineering, in press, 2000.<br />



Proceedings of the 22nd Annual EMBS International Conference, July 23-28, 2000, Chicago IL.<br />

Sonification of Knee-joint Vibration <strong>Signal</strong>s<br />

Sridhar Krishnan(1), Rangaraj M. Rangayyan(2,3), G. Douglas Bell(2,3,4), and Cyril B. Frank(2,3,4)<br />
(1) Dept. of Electrical and Computer Engineering, <strong>Ryerson</strong> Polytechnic <strong>University</strong>,<br />
Toronto, Ontario, M5B 2K3, CANADA. (Email: krishnan@ee.ryerson.ca)<br />
(2) Dept. of Electrical and Computer Engineering, (3) Dept. of Surgery, (4) Sport Medicine Centre,<br />
<strong>University</strong> of Calgary, Calgary, Alberta, T2N 1N4, CANADA. (Email: ranga@enel.ucalgary.ca)<br />

Abstract: Sounds generated due to rubbing of knee-joint surfaces may be a potential tool for non-invasively assessing articular cartilage degeneration. In this paper, an attempt is made to perform computer-assisted auscultation of knee joints by auditory display (AD) of the vibration signals (also known as vibroarthrographic or VAG signals) emitted during active movement of the leg. The AD technique is based on a sonification algorithm, in which the instantaneous mean frequency and envelope of the VAG signals were used in improving the potential diagnostic quality of VAG signals. Auditory classification experiments were performed by two orthopedic surgeons with a database of 37 VAG signals that includes 19 normal and 18 abnormal cases. Sensitivities of 31% and 83% were obtained with direct playback and the sonification method, respectively.<br />

1 Introduction<br />

Auscultation, the method of examining functions and con-<br />

ditions of physiological systems by listening to the sounds<br />

they produce, is one of the ancient modes of diagnosis.<br />

The first use of vibration or acoustic emission as a diag-<br />

nostic aid for bone and joint disease is found in Laennec's<br />

treatise on mediate auscultation, as cited by Mollan et<br />

al. [l]. Laennec was able to diagnose fractures by aus-<br />

cultating crepitus caused by the moving broken ends of<br />

bone. As quoted by Mollan et al. [l], Heuter, in 1885,<br />

used a myodermato-osteophone (a type of stethoscope) to<br />

localize loose bodies in human knee joints. In 1929, Wal-<br />

ters reported on auscultation of 1600 joints and detected<br />

certain sounds before symptoms become apparent [2]; he<br />

suggested that the sounds might be an early sign of arthri-<br />

tis.<br />

After 1933, most of the works reported on knee-joint sounds have been on objective analysis of the sound or vibration signals (also known as vibroarthrographic or VAG signals) for noninvasive diagnosis of knee-joint pathology<br />

0-7803-6465-1/00/$10.00 02000 IEEE<br />

[3, 4, 5, 6, 7]. Although auscultation of knee joints using stethoscopes is occasionally practised by clinicians, there is no published evidence of their diagnostic value. Also, no study has been reported on computer-aided auscultation of knee-joint sounds. This paper proposes methods for computer-aided auscultation of knee-joint sounds based on an auditory display (AD) technique.<br />

2 Data Acquisition<br />

Each subject sat on a rigid table in a relaxed position with the leg being tested freely suspended in air. The VAG signal was detected at the mid-patella position of the knee by using vibration sensors (accelerometers) as the subject swung the leg over an approximate angle range of 135° → 0° → 135° in 4 s. Informed consent was obtained from every subject. The experimental protocol was approved by the Conjoint Health <strong>Research</strong> Ethics Board of the <strong>University</strong> of Calgary.<br />

The VAG signal was prefiltered (10 Hz to 1 kHz) and amplified before being digitized at a sampling rate of 2 kHz. The details of data acquisition may be found in Krishnan et al. [7]. The database consists of 37 signals (19 normal and 18 abnormal). The abnormal signals were collected from symptomatic patients scheduled to undergo arthroscopy, and there was no restriction imposed on the type of pathology.<br />

3 Sonification<br />

AD may be defined as the aural representation of a stream of data. The field of AD is emerging, and has recently drawn attention in the areas of geophysics, biomedical engineering, speech signal analysis, image analysis, aids for the handicapped, and computer graphics [8]. AD has to be performed in such a manner as to take advantage of the psychoacoustics of the human ear. The AD technique considered in the present work is a sonification technique. In sonification, features extracted from the data are used<br />




to control a sound synthesizer. The sound signal generated does not bear a direct relationship to the data being analyzed. A simple example of a sonification technique is the mapping of parameters derived from a data stream to AD parameters such as pitch and loudness.<br />

4 Motivation for AD of VAG<br />

Prior to graphical recording and analysis of VAG signals, auscultation of knee joints was the only noninvasive method available to distinguish normal joints from degenerative joints. Significant success has been claimed by several researchers using the auscultation technique [1]. However, classification of knee joints by auscultation is a highly subjective technique. Further, a significant portion of VAG signal energy lies below the threshold of auditory perception of the human ear in terms of frequency and/or intensity. The situation may be ameliorated by developing AD methods for computer-aided auscultation of knee-joint vibrations. The main motivating factors in applying AD techniques to VAG are:<br />

- It has been established through objective signal analysis of VAG that sounds generated by abnormal knees are distinctive and different from those produced by normal knees [3, 4, 5, 6, 9]. Sounds of diagnostic value may be made prominent by applying suitable AD techniques to VAG.<br />

- AD of VAG obtained using vibration sensors may facilitate relatively noise-free and localized auditory analysis when compared to direct auditory analysis of acoustic sensor data.<br />

The work described in this paper hypothesizes that auditory analysis of VAG data may aid an orthopedic surgeon in making diagnostic inferences. In the next section, a sonification technique for AD of VAG data is developed. This study is the first attempt to listen to knee sounds detected by vibration sensors.<br />

5 Sonification of VAG <strong>Signal</strong>s<br />

In an effort to facilitate AD of only the important characteristics of VAG signals, a sonification algorithm is proposed. The sonification algorithm involves amplitude modulation (AM) and frequency modulation (FM). The instantaneous mean frequency (IMF) FP(t) is an important parameter in characterizing multicomponent nonstationary signals such as VAG [7]. The IMF of a signal can be extracted from a positive time-frequency distribution (TFD) of the signal [7]. The FM part of the sonified signal is obtained by frequency modulating a sinusoidal waveform with the IMF of the signal. The auditory characteristics of the FM part alone will be tonal, which could quickly<br />

cause boredom and fatigue. To obviate this problem, an AM part a(t) is obtained as the absolute value of the analytic version of the VAG signal. The AM part provides an envelope to the signal and contributes to the frequency deviation (bandwidth) about the IMF.<br />

For the sake of illustration, plots of an abnormal VAG signal with chondromalacia patella (a type of cartilage pathology), and the processed versions of the signal, are presented. Fig. 1 shows the original VAG; the spectrogram (a joint time-frequency representation) of the signal is shown in Fig. 2. The related entities of the sonified versions of the signal are shown in Figs. 3 to 5. The envelope and the IMF of the signal are shown in Figs. 3 and 4, respectively. The spectrogram shown in Fig. 5 clearly illustrates the envelope-IMF behavior of the sonified signal with a time-scale factor of two.<br />

Figure 1: An abnormal VAG signal of a patient with chondromalacia patella.<br />

Figure 2: Spectrogram of the VAG signal in Fig. 1.<br />
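A minimal sketch of the AM/FM synthesis described above (illustrative NumPy; the envelope and IMF below are synthetic stand-ins for the features a(t) and FP(t) extracted from a real VAG signal):

```python
import numpy as np

fs = 2000.0                                   # sampling rate of the VAG data
t = np.arange(0, 2.0, 1.0/fs)

# stand-ins for the two extracted features of a VAG signal:
env = np.abs(np.sin(2*np.pi*0.5*t))           # a(t): envelope (AM part)
imf = 200.0 + 100.0*np.sin(2*np.pi*0.25*t)    # FP(t): inst. mean frequency, Hz

# FM part: integrate FP(t) to obtain a continuous phase, then apply the
# envelope as amplitude modulation
phase = 2*np.pi*np.cumsum(imf)/fs
sonified = env * np.cos(phase)
```

Because the phase is obtained by integrating FP(t), it is continuous by construction and no phase unwrapping is needed, which also makes the synthesis tolerant of noisy IMF estimates.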

Figure 3: Envelope of the VAG signal in Fig. 1.<br />

Figure 4: IMF of the VAG signal in Fig. 1 estimated using its positive TFD.<br />

Figure 5: Spectrogram of the IMF-based sonified version of the VAG signal in Fig. 1. A time-scale factor of 2 was used. Note that the figure window has been divided into two parts to show the time-scale expansion.<br />

The advantages of the IMF-based sonification method are:<br />

- It helps in the auditory analysis of a multicomponent nonstationary signal in terms of its main features, FP(t) and a(t).<br />

- FP(t) takes high values for transients and noise. However, by making use of the envelope (intensity) information, noise can be made less audible than transients.<br />

- Integration of FP(t) ensures a continuous phase, and the method does not require any phase unwrapping.<br />

- Integration of FP(t) makes the method insensitive to noisy FP(t) estimates.<br />

The IMF-based method has the following disadvantages:<br />

- In the case of a noisy signal, FP(t) will have an almost uniform waveform, and does not provide much information unless the envelope can contribute some information. In the present study, this problem is overcome by processing denoised versions of the VAG signals [10].<br />

- The method may not be applicable to information-rich signals such as speech: the formant structure of voiced speech cannot be represented by the relatively simple IMF.<br />

6 Experiments and Results<br />

Auditory analysis of VAG signals was performed by two<br />

orthopedic surgeons (GDB and CBF) with significant ex-<br />

perience in knee-joint auscultation and arthroscopy. The<br />

experiment was conducted in two stages: In the fist stage,<br />

familiarization and training were provided through the re-<br />

sults of application of the IMF-based sonification methods<br />

to a speech signal and four VAG signals (two normals and<br />

two abnormals). In the second stage, the methods were<br />

tested with the database of 37 VAG signals.<br />

From the initial evaluation (first stage), GDB selected<br />

the two-times time-scaled IMF-based sonification method<br />

for the test (second) stage. The purpose of the classifi-<br />

cation experiment in the test stage was to determine the<br />

diagnostic improvement provided by the processed sounds<br />

when compared to direct playback. The test stage in-<br />

cluded auditory classification experiments performed with<br />

the same database three times: twice by GDB with a time<br />

gap of 45 days between the repeat experiments and once<br />

by CBF. The direct playback of VAG signals provided a<br />

- sensitivity of 31% and a specificity of 74%, whereas au-<br />

ral analysis of the sonified signals provided a sensitivity of<br />

83% and a specificity of 32%.<br />

The results suggest that computer-aided auscultation<br />

of VAG signals may be a potential tool for improved diag-<br />

nosis of knee-joint cartilage pathology. The specificity and<br />

sensitivity may be increased with more auditory training.<br />
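For reference, the reported rates follow the usual definitions of sensitivity and specificity. The confusion counts below are hypothetical (the paper does not give them); they were chosen only to illustrate how rates close to the reported sonification figures (83% sensitivity, 32% specificity) can arise from the 18 abnormal and 19 normal cases:

```python
def sensitivity_specificity(tp, fn, tn, fp):
    # sensitivity = detected abnormal / all abnormal (true-positive rate)
    # specificity = detected normal / all normal (true-negative rate)
    return tp / (tp + fn), tn / (tn + fp)

# hypothetical counts for one sonified-playback reading:
# 15 of 18 abnormal and 6 of 19 normal cases correctly labelled
sens, spec = sensitivity_specificity(tp=15, fn=3, tn=6, fp=13)
```

With these illustrative counts, sens is 15/18 (about 83%) and spec is 6/19 (about 32%), matching the order of magnitude of the reported sonification results.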


Acknowledgements<br />

We gratefully acknowledge support from the Alberta Heritage Foundation for Medical <strong>Research</strong> and the Faculty of Engineering, <strong>Ryerson</strong> Polytechnic <strong>University</strong>.<br />

References<br />


[1] R. A. B. Mollan, G. C. McCullagh, and R. I. Wilson. A critical appraisal of auscultation of human joints. Clinical Orthopaedics and Related <strong>Research</strong>, 170:231-237, 1982.<br />
[2] C. F. Walters. The value of joint auscultation. Lancet, 1:920-921, 1929.<br />
[3] M. L. Chu, I. A. Gradisar, and R. Mostardi. A noninvasive electroacoustical evaluation technique of cartilage damage in pathological knee joints. Medical and Biological Engineering and Computing, 16:437-442, 1978.<br />
[4] Y. Nagata. Joint-sounds in gonoarthrosis - clinical application of phonoarthrography for the knees. Journal of UOEH, 10(1):47-58, 1988.<br />
[5] N. P. Reddy, B. M. Rothschild, M. Mandal, V. Gupta, and S. Suryanarayanan. Noninvasive acceleration measurements to characterize knee arthritis and chondromalacia. Annals of Biomedical Engineering, 23:78-84, 1995.<br />
[6] R. M. Rangayyan, S. Krishnan, G. D. Bell, C. B. Frank, and K. O. Ladly. Parametric representation and screening of knee joint vibroarthrographic signals. IEEE Trans. Biomedical Engineering, 44(11):1068-1074, Nov. 1997.<br />
[7] S. Krishnan. Adaptive signal processing techniques for analysis of knee joint vibroarthrographic signals. Ph.D. dissertation, <strong>University</strong> of Calgary, Calgary, AB, Canada, June 1999.<br />
[8] G. Kramer. An introduction to auditory display. In G. Kramer, editor, Auditory Display: Sonification, Audification, and Auditory Interfaces, pages 1-78. Addison Wesley, Reading, MA, 1994.<br />
[9] S. Krishnan, R. M. Rangayyan, G. D. Bell, and C. B. Frank. Adaptive time-frequency analysis of knee joint vibroarthrographic signals for non-invasive screening of articular cartilage pathology. IEEE Transactions on Biomedical Engineering, in press, 2000.<br />
[10] S. Krishnan and R. M. Rangayyan. Automatic denoising of knee joint vibration signals using adaptive time-frequency representations. Medical and Biological Engineering and Computing, in press, 2000.<br />


Proceedings of the 1999 IEEE Canadian Conference on Electrical and Computer Engineering<br />
Shaw Conference Center, Edmonton, Alberta, Canada, May 9-12, 1999<br />

Denoising Knee Joint Vibration <strong>Signal</strong>s Using Adaptive<br />

Time-Frequency Representations<br />

Sridhar Krishnan and Rangaraj M. Rangayyan<br />

Dept. of Electrical and Computer Engineering, University of Calgary,<br />
2500 University Drive NW, Calgary, Alberta T2N 1N4, CANADA.<br />
Email: (krishnan)(ranga)@enel.ucalgary.ca<br />

Abstract - A novel denoising method for improv-<br />

ing the signal-to-noise ratio (SNR) of knee joint vibra-<br />

tion signals (also known as vibroarthrographic or VAG<br />

signals) is proposed. The denoising methods consid-<br />

ered are based on signal decomposition techniques such<br />

as wavelets, wavelet packets, and the matching pur-<br />

suit method. Performance evaluation with synthetic<br />

signals simulated with characteristics expected of VAG<br />

signals indicated good denoising results with the match-<br />

ing pursuit method. Nonstationary signal features ex-<br />

tracted and identified from time-frequency distributions<br />

of denoised VAG signals have shown good potential in<br />

screening for articular cartilage pathology.<br />

Keywords: denoising, time-frequency distributions,<br />

matching pursuit, knee joint sounds, vibroarthrography.<br />

I. INTRODUCTION<br />

Vibration signals sensed using an accelerometer at<br />

the mid-patella position of the knee joint during normal<br />

leg movement could be used to develop a non-invasive<br />

tool for monitoring and screening of articular cartilage<br />

pathology. The knee joint vibration signals are referred<br />

to as vibroarthrographic or VAG signals.<br />

VAG signals have the following important charac-<br />

teristics:<br />

They are nonstationary and multicomponent in na-<br />

ture.<br />

Although the accelerometer placed at the mid-patella position has excellent immunity to background noise, random noise is expected to combine with the VAG signal during leg movement and data acquisition.<br />

There is no underlying model available as yet for<br />

VAG signal generation from which the signal-to-<br />

noise ratio (SNR) could be determined a priori.<br />

In order to analyze VAG signals and to extract<br />

discriminant features, nonstationary and multicompo-<br />
nent signal analysis tools such as time-frequency dis-<br />

tributions (TFDs) could be used. TFDs give the sig-<br />

nal energy distribution at different time instants and<br />

frequencies. The features extracted from a TFD will<br />

contain the combined time-frequency (TF) dynamics of<br />

the given signal as opposed to features along either the<br />

0-7803-5579-2/99/$10.00 © 1999 IEEE<br />

time or the frequency axis alone as provided by con-<br />

ventional techniques. However, TFD features may be<br />

biased due to the presence of random noise. Because<br />
of its random behavior and wide frequency range, a noise<br />

process is localizable neither in time nor in frequency,<br />

and appears all over the TF plane.<br />

Filtering of noise from VAG signals may help in<br />

extracting and identifying significant TF features use-<br />

ful in screening applications. In circumstances where<br />

the SNR of a signal is not known a priori, optimal lin-<br />

ear filtering techniques such as Wiener filtering may<br />

not be the best solution. In such cases, approaches<br />

based on signal decomposition using orthogonal or<br />

non-orthogonal bases may be an interesting alterna-<br />

tive. This paper is a first attempt to automatically<br />

denoise VAG signals using signal decomposition. The<br />

commonly-used denoising methods such as wavelets and<br />

wavelet packets are compared with an adaptive TF de-<br />

composition method such as matching pursuit with a<br />

Gabor dictionary.<br />

In Section II the methodology is described. Section<br />
III presents the results and discussion on the perfor-<br />

mance of the denoising methods studied with synthetic<br />

and real VAG signals. The paper is concluded in Sec-<br />

tion IV with a brief summary.<br />

II. METHODS<br />

The Wiener filter is an optimal filter for removing<br />

Gaussian random noise provided the noise statistics are<br />

available a priori. In real-world situations, signals ac-<br />

quired from an unknown system may have an unknown<br />

SNR. In cases where the SNR is not known a priori, sig-<br />

nal decomposition using an appropriate basis may help<br />

in extracting the coherent structures of a signal with<br />

respect to the basis dictionary. In the following sec-<br />

tions, methods for linear and nonlinear approximation<br />

or decomposition of signals are briefly described.<br />

A. Linear Approximation<br />

In linear approximation, the given signal is projected over M orthogonal basis vectors that are chosen a priori. Linear approximation of a discrete signal x(n) may be written as<br />
x̂(n) = Σ_{m=0}^{M-1} ⟨x, g_m⟩ g_m(n),   (1)<br />
where ⟨x, g_m⟩ denotes the inner product of x(n) with the orthogonal basis vectors g_m that are selected a priori. It has been shown that an optimal linear approximation is provided by the Karhunen-Loève basis [1]. The approximation may be improved by choosing the M orthogonal basis vectors depending on the properties of the given signal rather than selecting them beforehand. The selection of signal-adaptive basis functions leads to the concept of nonlinear decomposition.<br />
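The projection in Eq. 1 can be sketched in a few lines of Python. The snippet below is illustrative only and is not from the paper: it uses an orthonormal DCT-II basis as one concrete choice of basis vectors g_m fixed a priori, projects a signal onto the first M of them, and measures the squared approximation error.

```python
import math

def dct_basis(N):
    """Orthonormal DCT-II basis vectors g_m, m = 0..N-1."""
    basis = []
    for m in range(N):
        scale = math.sqrt(1.0 / N) if m == 0 else math.sqrt(2.0 / N)
        basis.append([scale * math.cos(math.pi * (n + 0.5) * m / N)
                      for n in range(N)])
    return basis

def linear_approx(x, basis, M):
    """Project x onto the first M basis vectors chosen a priori (cf. Eq. 1)."""
    xhat = [0.0] * len(x)
    for m in range(M):
        g = basis[m]
        coeff = sum(xi * gi for xi, gi in zip(x, g))  # inner product <x, g_m>
        for n in range(len(x)):
            xhat[n] += coeff * g[n]
    return xhat

N = 64
B = dct_basis(N)
# a signal built from basis vectors 2 and 5, so the error is predictable
x = [3.0 * B[2][n] + 2.0 * B[5][n] for n in range(N)]
err8 = sum((a - b) ** 2 for a, b in zip(x, linear_approx(x, B, 8)))
err4 = sum((a - b) ** 2 for a, b in zip(x, linear_approx(x, B, 4)))
```

Because the basis is orthonormal, the error of the M-term approximation is exactly the energy of the discarded coefficients, which makes the linear scheme easy to analyze but blind to where a particular signal concentrates its energy.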

B. Nonlinear Approximation<br />

In the case of nonlinear approximation, the given signal is approximated with M vectors selected adaptively. The nonlinear decomposition of a signal x(n) may be written as<br />
x̂(n) = Σ_{m∈I_M} ⟨x, g_m⟩ g_m(n),   (2)<br />
where I_M denotes a group of basis functions from a dictionary that provides the first M inner product values ⟨x, g_m⟩ arranged in decreasing order. The M vectors in I_M are the basis vectors that correlate best with x(n), and may be interpreted as the main features of x(n). One such possible approximation is the wavelet transform, where the basis vectors are obtained by dilating and translating a prototype function (also known as a wavelet), given by<br />
ψ_{s,u}(t) = (1/√s) ψ((t-u)/s),   (3)<br />
where s denotes the dilation parameter and u is the translation parameter. Nonlinear decomposition based on wavelets outperforms linear decomposition because the former is equivalent to the construction of an irregular sampling grid adapted to the local sharpness of the signal variations. Efficient denoising may be performed using wavelets by approximating the signal with a small number of non-zero wavelet coefficients; thresholding of the wavelet coefficients may be hard or soft [2].<br />
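As a rough sketch of denoising by wavelet-coefficient thresholding, the following fragment is illustrative only: it uses the simple Haar wavelet and the common median-based universal threshold, not the wavelet or threshold choices made later in this paper.

```python
import math, random

def haar_dwt(x):
    """One level of the Haar wavelet transform: approximation and detail."""
    a = [(x[2*i] + x[2*i+1]) / math.sqrt(2) for i in range(len(x) // 2)]
    d = [(x[2*i] - x[2*i+1]) / math.sqrt(2) for i in range(len(x) // 2)]
    return a, d

def haar_idwt(a, d):
    """Invert one level of the Haar transform."""
    x = []
    for ai, di in zip(a, d):
        x.append((ai + di) / math.sqrt(2))
        x.append((ai - di) / math.sqrt(2))
    return x

def soft(coeffs, t):
    """Soft threshold: shrink every coefficient toward zero by t."""
    return [math.copysign(max(abs(v) - t, 0.0), v) for v in coeffs]

def denoise(x, levels=4):
    """Decompose, soft-threshold the detail coefficients, reconstruct."""
    a, details = list(x), []
    for _ in range(levels):
        a, d = haar_dwt(a)
        details.append(d)
    # noise level from the median of the finest details; universal threshold
    sigma = sorted(abs(v) for v in details[0])[len(details[0]) // 2] / 0.6745
    t = sigma * math.sqrt(2.0 * math.log(len(x)))
    details = [soft(d, t) for d in details]
    for d in reversed(details):
        a = haar_idwt(a, d)
    return a

random.seed(7)
N = 1024
clean = [math.sin(2 * math.pi * 4 * n / N) for n in range(N)]
noisy = [c + random.gauss(0.0, 0.3) for c in clean]
den = denoise(noisy)
mse_noisy = sum((a - b) ** 2 for a, b in zip(noisy, clean)) / N
mse_den = sum((a - b) ** 2 for a, b in zip(den, clean)) / N
```

A hard threshold would keep surviving coefficients unchanged instead of shrinking them; both variants approximate the signal with a small number of non-zero coefficients, as described above.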

To further optimize nonlinear signal approxima-<br />

tion, one could adaptively choose the basis depending<br />

upon the given signal. This approach of selecting the<br />

“best” basis among a dictionary of bases by minimiz-<br />

ing a cost function or entropy is known as the method<br />

of wavelet packets (WP) [3]. The WP approach uses<br />

a large family of orthogonal bases that include differ-<br />

ent types of local TF functions (also known as TF<br />

atoms). The bases are computed using a quadrature<br />


mirror filter-bank algorithm. WP decomposes the sig-<br />

nal into TF atoms that are adapted to the TF structures<br />

present in the signal. A denoised version of a signal may<br />

be obtained by soft thresholding or hard thresholding<br />

the WP coefficients.<br />

Another way to optimize a TF decomposition is<br />

by using non-orthogonal basis functions. An example<br />

of such a decomposition is the matching pursuit (MP)<br />

algorithm [4]. In this case, the non-orthogonal basis<br />

functions are Gaussian functions with good time and<br />

frequency localization characteristics. In MP, the signal<br />

is first projected onto the dictionary, and the Gabor<br />

TF atom with the highest correlation with the signal<br />

is selected. The residue of the signal is then projected<br />

onto the dictionary, and the component with the highest<br />

correlation is selected. The decay parameter, denoted by λ(m) and given by<br />
λ(m) = ‖R^m x‖² / ‖x‖²,   (4)<br />
may be used as the stopping criterion of the decomposition process. In Eq. 4, ‖R^m x‖² denotes the residual energy level at the mth iteration. The decomposition<br />

is continued until the decay parameter does not reduce<br />

any further. At this stage, the selected components represent coherent structures and the residue represents<br />

incoherent structures in the signal with respect to the<br />

dictionary. The residue may be assumed to be due to<br />

random noise, since it does not show any TF localiza-<br />

tion.<br />
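A minimal sketch of the matching pursuit iteration with a Gabor-style dictionary and an energy-decay stopping rule follows. The dictionary grid, atom count, and tolerance here are illustrative assumptions, not the settings used in the paper.

```python
import math

def gabor_atom(N, s, u, f):
    """Unit-norm real Gabor atom: a Gaussian window of width s at
    position u, modulated to normalized frequency f."""
    g = [math.exp(-math.pi * ((n - u) / s) ** 2) *
         math.cos(2.0 * math.pi * f * (n - u)) for n in range(N)]
    norm = math.sqrt(sum(v * v for v in g))
    return [v / norm for v in g]

def matching_pursuit(x, dictionary, max_atoms=10, decay_tol=1e-3):
    """Greedy MP: subtract the best-correlated atom from the residue at
    each step; stop when the residual energy stops decaying."""
    residue = list(x)
    approx = [0.0] * len(x)
    prev_e = sum(v * v for v in residue)
    for _ in range(max_atoms):
        best_c, best_g = 0.0, None
        for g in dictionary:                       # project residue on atoms
            c = sum(r * gi for r, gi in zip(residue, g))
            if abs(c) > abs(best_c):
                best_c, best_g = c, g
        for n in range(len(x)):
            approx[n] += best_c * best_g[n]
            residue[n] -= best_c * best_g[n]
        e = sum(v * v for v in residue)
        if prev_e - e < decay_tol * prev_e:        # decay has flattened out
            break
        prev_e = e
    return approx, residue

N = 256
dictionary = [gabor_atom(N, s, u, k / 30.0)
              for s in (8, 32, 128)
              for u in range(0, N, 16)
              for k in range(16)]
# two well-separated Gabor components; MP should recover both
x = [2.0 * a + 1.5 * b for a, b in
     zip(gabor_atom(N, 32, 64, 0.1), gabor_atom(N, 32, 192, 0.3))]
approx, residue = matching_pursuit(x, dictionary)
res_energy = sum(v * v for v in residue)
```

On this toy signal, built from two atoms of the dictionary itself, the greedy loop recovers both components in two passes and leaves an essentially energy-free residue, mirroring the coherent/incoherent split described above.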

III. RESULTS AND DISCUSSION<br />

Before denoising VAG signals for feature extrac-<br />

tion, the best available denoising method was selected<br />

on the basis of performance with synthetic signals sim-<br />

ulated with characteristics similar to those expected of<br />

VAG signals. The synthetic signal used for illustra-<br />

tion in the present paper includes a linear frequency-<br />

modulated (FM) componen t, a nonlinear FM compo-<br />

nent, and a transient. The synthetic signal possesses<br />

the multicomponent and nonstationary characteristics<br />

typical of VAG signals. The reason to use FM components in synthetic signals in the present study is that dominant pole analysis of VAG signals has indicated time-varying frequency characteristics [5]. Transients<br />

may depict joint clicks produced during movement of<br />

the knee. Random noise at different levels was added<br />

to the synthetic signal to simulate good and poor signal<br />

recording conditions.<br />

To evaluate the performance of the denoising methods chosen for the present study, the normalized root mean squared (NRMS) error measure was used. NRMS is given by<br />
NRMS = [ Σ_{n=1}^{N} (s(n) - d(n))² / Σ_{n=1}^{N} s²(n) ]^{1/2},   (5)<br />


Fig. 1. Multicomponent, nonstationary, synthetic signal composed with a linear FM component, a nonlinear FM component, and a transient.<br />

Fig. 2. The synthetic signal in Fig. 1 with noise added (SNR = 0 dB).<br />

where s(n) is the original signal without noise, d(n) is<br />

the denoised signal, and N is the number of samples<br />

in the signal. A small NRMS measure indicates good<br />

denoising performance.<br />
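In code, the measure reads as follows. The exact normalization in the printed equation was lost in scanning, so this sketch assumes the common form in which the residual energy is divided by the energy of the noise-free signal s(n):

```python
import math

def nrms(s, d):
    """NRMS error: energy of the residual s(n) - d(n), normalized by the
    energy of the noise-free signal s(n), then square-rooted."""
    num = sum((si - di) ** 2 for si, di in zip(s, d))
    den = sum(si ** 2 for si in s)
    return math.sqrt(num / den)

s = [1.0, 2.0, 3.0, 4.0]
perfect = nrms(s, s)            # perfect denoising gives zero error
all_zero = nrms(s, [0.0] * 4)   # losing the whole signal gives an error of one
```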

A. Results with Synthetic <strong>Signal</strong>s<br />

The denoising methods were applied to the syn-<br />

thetic signal with two levels of Gaussian random noise<br />

added. The noise levels were such that the resulting<br />

signals had an SNR of 10 dB and 0 dB. The symmlet 4<br />

wavelet [1] was used for wavelet-based denoising. A soft<br />


Fig. 3. Wavelet-based denoised version of the noisy signal in<br />

Fig. 2.<br />


Fig. 4. Wavelet packet-based denoised version of the noisy signal<br />

in Fig. 2.<br />


Fig. 5. Matching pursuit-based denoised version of the noisy<br />

signal in Fig. 2.<br />




Fig. 6. Comparison of the NRMS error values of the denoised<br />

versions of the synthetic signal with SNR = 10 dB.<br />

threshold was applied to the wavelet coefficients; coef-<br />

ficients that did not pass the soft threshold test were<br />

discarded. In case of the WP method, the “best” basis<br />

was selected on the basis of the Schur concavity cost<br />

function [3], and the denoised version was obtained by<br />

soft thresholding the WP coefficients. Gaussian func-<br />

tions were used for the MP method. Gaussian functions<br />

provide the optimal TF resolution and satisfy the equality condition of the uncertainty principle. The threshold<br />

for denoising was based on the decay parameter as given<br />

by Eq. 4.<br />

Fig. 1 shows the original synthetic signal, and<br />

Fig. 2 shows the signal with noise added to an SNR<br />

of 0 dB. The denoised versions of the signals using the<br />

wavelets method, the WP method, and the MP method<br />

are shown in Figs. 3, 4, and 5, respectively. Visual com-<br />

parison indicates that the MP-denoised result has pre-<br />

served most of the important characteristics, especially<br />

the transient component.<br />

Fig. 6 shows a bar graph comparing the NRMS<br />

values of the results of the three denoising methods ap-<br />

plied to the synthetic signal with an SNR of 10 dB. From<br />
Fig. 6 it is evident that adaptive denoising using<br />

MP provides good denoising for a moderately high SNR<br />

case. The case of the low SNR of 0 dB was simulated<br />

to depict very poor signal recording conditions (not ex-<br />

pected in VAG signals). From the bar chart in Fig. 7 we<br />

can deduce that the MP technique has again provided<br />

the best denoising result (lowest NRMS value) of the<br />

three methods studied.<br />

It is worthwhile to mention that the denoising re-<br />

sults with wavelets and WP are highly dependent on<br />

the selection of the threshold value for the coefficients.<br />

In the case of MP, the decay parameter is a more ob-<br />

jective measure. Fig. 8 shows the reduction of the de-<br />

cay parameter with the number of TF atoms used for<br />



Fig. 7. Comparison of the NRMS error values of the denoised<br />
versions of the synthetic signal with SNR = 0 dB.<br />


Fig. 8. Plot of the decay parameter versus the number of TF<br />

atoms for the synthetic signal with SNR = 10 dB and SNR<br />

= 0 dB.<br />

the synthetic signal with SNR = 10 dB and SNR = 0<br />

dB. It is clearly evident that, in denoising the signal<br />

with SNR = 0 dB, the MP method has been able to<br />

extract fewer coherent structures as compared to the<br />

10 dB case. This result indicates that the higher level<br />

of noise has destroyed some of the low-energy coherent<br />

structures in the 0 dB version of the signal.<br />

The WP method may give better results if the<br />

threshold is selected in an optimal manner. The per-<br />

formance of the WP method for denoising cannot be<br />

appreciated much in the present application, since for<br />

highly nonstationary signals such as the synthetic sig-<br />

nals shown, the WP method produces a mismatch be-<br />

tween the “best” orthogonal basis and many local signal<br />

components. On the contrary, MP is a “greedy” algo-<br />

rithm that locally optimizes the choice of the wavelet<br />

packet function for the signal residue at each stage. The<br />




Fig. 9. Abnormal VAG signal of a subject with cartilage pathol-<br />

ogy.<br />


Fig. 10. MP-denoised version of the VAG signal in Fig. 9.<br />

Fig. 11. Difference between the original VAG signal in Fig. 9 and<br />

the MP-denoised version in Fig. 10.<br />


Fig. 12. TFD of the abnormal VAG signal in Fig. 9 computed<br />

using the spectrogram.<br />


Fig. 13. TFD of the MP-denoised VAG signal in Fig. 10 com-<br />

puted using the spectrogram.<br />

good optimization property of MP is achieved at the ex-<br />

pense of increased computational load as a result of the<br />

greedy approach. Also, in the case of a multicompo-<br />

nent signal where different types of energy structures<br />

are located at different times but in the same frequency<br />

interval, there is no WP basis that is well adapted to all<br />

of them. WP-based decomposition using an orthogonal<br />

basis lacks translation invariance and is thus difficult<br />

to use for pattern recognition. MP is a translation-<br />

invariant method if a translation-invariant dictionary<br />

such as a Gabor dictionary is used. Based on these ob-<br />

servations, the MP technique was selected for denoising<br />

VAG signals.<br />

B. Results with VAG signals<br />

The MP technique was applied to 90 VAG signals<br />

(51 normal and 39 abnormal). TFDs were constructed<br />



using the denoised signals. As an illustration, an abnormal<br />

VAG signal of a subject with cartilage pathology<br />

is shown in Fig. 9, its MP-denoised version is shown in<br />

Fig. 10, and the difference between the original and the<br />

denoised versions of the abnormal VAG signal is shown<br />

in Fig. 11. From Fig. 11, we could observe that a signif-<br />

icant amount of random noise has been removed from<br />

the original signal by the MP-denoising method.<br />

The TFD computed using the spectrogram of the<br />

original signal in Fig. 9 is shown in Fig. 12. The TFD of<br />

the MP-denoised version of the same VAG signal com-<br />

puted using the spectrogram is shown in Fig. 13. The<br />

spectrograms of the original and the denoised VAG sig-<br />

nals were computed using the same short-time Fourier<br />

transform parameters. Tonal and FM components are<br />

clearly seen in the TFD of the denoised VAG signal,<br />

thus facilitating enhanced feature identification. In-<br />

stantaneous features based on energy and frequency pa-<br />

rameters were computed as marginal values of the TFDs<br />

[6] of the MP-denoised VAG signals; pattern analysis of<br />

the features indicated screening accuracy of up to 70%.<br />
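A spectrogram of this kind is simply a windowed DFT, and marginal-style features reduce each frame or frequency bin to a single number. The sketch below is illustrative only (window length, hop, and the chirp test signal are assumptions, not the STFT parameters used for Figs. 12 and 13); it tracks the dominant frequency bin per frame of a linear up-chirp.

```python
import cmath, math

def spectrogram(x, win_len=128, hop=64):
    """Squared-magnitude short-time DFT: one row per time frame."""
    win = [0.5 - 0.5 * math.cos(2 * math.pi * n / (win_len - 1))
           for n in range(win_len)]              # Hann window
    frames = []
    for start in range(0, len(x) - win_len + 1, hop):
        seg = [x[start + n] * win[n] for n in range(win_len)]
        row = []
        for k in range(win_len // 2 + 1):        # one-sided spectrum
            z = sum(seg[n] * cmath.exp(-2j * math.pi * k * n / win_len)
                    for n in range(win_len))
            row.append(abs(z) ** 2)
        frames.append(row)
    return frames

# linear up-chirp: instantaneous frequency rises from 0.05 to 0.25
N = 2048
f0, f1 = 0.05, 0.25
x = [math.cos(2 * math.pi * (f0 * n + (f1 - f0) * n * n / (2 * N)))
     for n in range(N)]
S = spectrogram(x)
inst_energy = [sum(row) for row in S]            # time marginal per frame
peaks = [row.index(max(row)) for row in S]       # dominant bin per frame
```

The rising sequence of peak bins is a crude instantaneous-frequency feature in the spirit of the marginal-based parameters mentioned above.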

IV. CONCLUSION<br />

A novel approach to denoise VAG signals for en-<br />

hanced feature extraction and identification was pro-<br />

posed. The denoising methods considered were based<br />

on nonlinear decomposition of signals. The MP method<br />

of denoising is more promising for application to non-<br />

stationary signals such as VAG than the commonly-<br />

used wavelet-based denoising and WP-based denoising<br />

techniques. The wavelet techniques are best adapted<br />

to global signal properties, whereas the MP method is<br />

based on local optimization. Nonstationary signal fea-<br />

tures extracted from the TFDs of MP-denoised VAG<br />

signals have shown good potential for screening normal<br />

knees from abnormal knees.<br />

Acknowledgements: We gratefully acknowledge support<br />

from the Alberta Heritage Foundation for Medical Re-<br />

search and the Natural Sciences and Engineering Re-<br />

search Council of Canada.<br />

REFERENCES<br />

[1] S. Mallat. A wavelet tour of signal processing. Academic Press, San Diego, CA, 1998.<br />
[2] D. Donoho. Unconditional bases are optimal bases for data compression and for statistical estimation. Journal of Appl. and Comput. Harmonic Analysis, 1:100-115, 1993.<br />
[3] M. V. Wickerhauser. Adapted wavelet analysis from theory to software. IEEE Press, Piscataway, NJ, 1994.<br />
[4] S. G. Mallat and Z. Zhang. Matching pursuits with time-frequency dictionaries. IEEE Trans. on Signal Processing, 41(12):3397-3415, 1993.<br />
[5] R. M. Rangayyan, S. Krishnan, G. D. Bell, C. B. Frank, and K. O. Ladly. Parametric representation and screening of knee joint vibroarthrographic signals. IEEE Trans. on Biomedical Engineering, 44(11):1068-1074, Nov. 1997.<br />
[6] S. Krishnan, R. M. Rangayyan, G. D. Bell, C. B. Frank, and K. O. Ladly. Time-frequency signal feature extraction and screening of knee joint vibroarthrographic signals using the matching pursuit method. CD-ROM Proceedings, 19th Annual International Conference of the IEEE Engineering in Medicine and Biology Society, Chicago, IL, October 1997.<br />



Comparative Analysis of the Performance of the Time-Frequency<br />
Distributions with Knee Joint Vibroarthrographic Signals<br />

Rangaraj M. Rangayyan and Sridhar Krishnan<br />

Dept. of Electrical and Computer Engineering, University of Calgary,<br />
2500 University Drive NW, Calgary, Alberta T2N 1N4, CANADA.<br />
Email: (ranga)(krishnan)@enel.ucalgary.ca<br />

Abstract - Vibroarthrographic (VAG) signals emitted by human knee joints can be used to develop a non-invasive diagnostic tool to detect articular cartilage degeneration. VAG signals are nonstationary and multicomponent in nature; time-frequency distributions (TFDs) provide powerful means to analyze such signals. The objective of this paper is to determine the TFD suitable for identification and extraction of VAG signal features of clinical significance. The TFDs considered are: the autoregressive (AR) model-based TFD; the reassigned smoothed pseudo-Wigner-Ville (RSPWV) distribution; and a TFD based on signal decomposition using the matching pursuit (MP) algorithm. As the true TFD of a VAG signal is not known, the results of the TFDs were compared based on the expected characteristics using synthetic signals. The MP TFD shows good potential in analyzing multicomponent signals with low signal-to-noise ratio when compared to the AR model-based TFD and the RSPWV method. The TFD techniques were also tested on VAG signals with additional information provided by auscultation and arthroscopy. The results indicate that the MP TFD is the best available TFD to analyze VAG signals.<br />

I. INTRODUCTION<br />

Vibroarthrography (VAG), the recording of human knee joint vibration/acoustic signals during active movement of the leg, can be used as a non-invasive diagnostic tool to detect articular cartilage degeneration. The currently used "gold standard" for assessment of cartilage surface degeneration is arthroscopy, where the cartilage surface is inspected and palpated with a cable. The disadvantage with arthroscopy is that it cannot be applied to patients whose knees are in a highly degenerated state due to osteoarthritis, ligamentous instability, meniscectomy, or patellectomy. The drawbacks with arthroscopy and the limitations of imaging techniques have motivated researchers to look for tools such as VAG. In our work, the VAG signal is detected at the mid-patella position on the surface of the knee as the leg is swung over the angle range of 135° → 0° → 135° in a time period of 4 s. The signals are filtered to the range 10 Hz to 1 kHz and amplified before sampling at a rate of 2 kHz. The cartilage surfaces of a normal knee are smooth and slippery. Vibrations generated due to friction between articulating surfaces of degenerated cartilage are expected to be different in amplitude and frequency from those of normal knees. The important characteristics of VAG signals are listed below.<br />

0-7803-5073-1/98/$10.00 © 1998 IEEE<br />


The VAG signal is expected to be a multicomponent<br />

signal due to the possibility that during movement of<br />

the knee, the rubbing of the femoral condyle on the<br />

patella surface provides multiple sources of vibration,<br />

and also due to the possibility that the signal from a<br />

single source can propagate through different channels<br />

of tissue to the mid-patella position, thus giving rise to<br />

multiple energy components at different frequencies for<br />

a given time.<br />

VAG signals are nonstationary due to the fact that the<br />

quality of joint surfaces coming into contact may not be<br />

the same from one angular position (point of time) to<br />

another during articulation of the joint.<br />

Due to the differences in cartilage structures in normal<br />

and abnormal knees, VAG signals with different<br />

frequency law components may be generated. Identification<br />

of such frequency dynamics may help in classification<br />

of normal and abnormal knees.<br />

Our previous approaches tackled the nonstationarity of VAG<br />

signals by adaptively segmenting the signals into stationary<br />

components; each segment was parametrically represented<br />

using a separate set of autoregressive coefficients,<br />

dominant poles, or cepstral coefficients. Dominant poles<br />

(poles corresponding to dominant spectral peaks in the signal)<br />

of each segment have provided good discriminant information<br />

for classifying signals into normal and abnormal<br />

groups [1], validating the assumption that the frequency dynamics<br />

of normal VAG signals differ from those of abnormal<br />

signals. A major drawback of the segmentation-based technique<br />

lies in associating the clinical information obtained<br />

during arthroscopy or auscultation with the segments of a<br />

signal. This problem can be overcome by using nonstationary<br />

signal analysis tools such as time-frequency distributions<br />

(TFDs). TFDs reveal frequency and temporal information<br />

simultaneously, and are particularly attractive for analysis<br />

of multicomponent signals, depiction of frequency laws, and<br />

noise suppression. The purpose of this work is to identify<br />

the best available TFD for objective identification and extraction<br />

of TF structures in VAG signals.<br />

II. TIME-FREQUENCY DISTRIBUTIONS<br />

The right TFD would be one t


signals are: 1) model-based TFD, 2) Cohen's class TFDs,<br />

and 3) TFD based on decomposition of signals.<br />

A. Autoregressive Model-based TFD<br />

In the model-based TFD, the autoregressive (AR) model<br />

coefficients of t,he signal segments are used in estimating the<br />

power spectral density of each segment. In our work, the<br />

model coefficients were estimated using the Burg method.<br />

Fixed segment length was used for synthetic signals, and<br />

adaptive segment length was used for real VAG signals.<br />

B. Reassigned Smoothed Pseudo Wigner- Ville Distribution<br />

The Wigner-Ville distribution (WVD) is the most pop-<br />

ular TFD of Cohen's class. The main drawback with the<br />

WVD is that, in the case of multicomponent signals, cross-<br />

terms are generated in the TFD. Cross-terms can be min-<br />

imized by using two-dimensional low-pass filtering in the<br />

ambiguity domain, and the smoothed version of the WVD<br />

can be obtained. In this paper, the most commonly used<br />

smoothed version of the WVD, namely the smoothed pseudo<br />

Wigner-Ville distribution (SPWVD), is considered. The SP-<br />

WVD reduces cross-terms significantly. The extent of reduc-<br />

tion in cross-terms depends upon the type of signal being an-<br />

alyzed. In our applications with synthetic and VAG signals,<br />

the smoothing windows used are Gaussian functions.<br />

The smoothing windows suppress cross-terms in the<br />

WVD but smear localized components, leading to less ac-<br />

curate TF localization of signal components as compared to<br />

the WVD. Recently, a reassignment method has been pro-<br />

posed by Auger and Flandrin [2] to improve TF localization<br />

in smoothed TFDs such as SPWVDs.<br />

In the reassignment method, the window is moved from<br />

the geometric center (t, ω) to the energy center (t̂, ω̂) of the<br />

TFD. The reassigned SPWVD (RSPWVD) improves the TF<br />

localization of smeared components and provides good read-<br />

ability in the TFD.<br />

C. Matching Pursuit<br />

The TFD generated by the matching pursuit (MP)<br />

method is based on signal decomposition. The MP algo-<br />

rithm selects the decomposition vectors depending upon the<br />

signal properties. The vectors are selected from a family of<br />

waveforms called a dictionary. The signal x(t) is projected<br />
on to a dictionary of Gabor atoms obtained by scaling, trans-<br />
lating, and modulating a Gaussian window function g(t):<br />
$x(t) = \sum_{n=0}^{\infty} a_n \, g_{\gamma_n}(t), \qquad g_{\gamma_n}(t) = \frac{1}{\sqrt{s_n}} \, g\!\left(\frac{t - p_n}{s_n}\right) \exp\left[j(2\pi f_n t + \phi_n)\right],$<br />
and $a_n$ are the expansion coefficients. The scale factor $s_n$<br />
is used to control the width of the window function, and<br />
the parameter $p_n$ controls temporal placement. $1/\sqrt{s_n}$ is a<br />
normalizing factor which keeps the norm of $g_{\gamma_n}(t)$ equal to 1.<br />
The parameters $f_n$ and $\phi_n$ are the frequency and phase of the<br />
exponential function, respectively. In our application, the<br />
window is a Gaussian function, i.e., $g(t) = 2^{1/4}\exp(-\pi t^2)$;<br />
the TF atoms are then called Gabor functions.<br />

In practice, the algorithm works as follows. The signal<br />
is iteratively projected on to a Gabor function dictionary.<br />
The first projection decomposes the signal into two parts:<br />
$x(t) = \langle x, g_{\gamma_0} \rangle \, g_{\gamma_0}(t) + R^1 x(t),$<br />
where $\langle x, g_{\gamma_0} \rangle$ denotes the inner product (projection) of x(t)<br />
with the first TF atom $g_{\gamma_0}(t)$. The term $R^1 x(t)$ is the residue<br />
after approximating x(t) in the direction of $g_{\gamma_0}(t)$. This process<br />
is continued by projecting the residue on to the subsequent<br />
functions in the dictionary, and after M iterations<br />
$x(t) = \sum_{n=0}^{M-1} \langle R^n x, g_{\gamma_n} \rangle \, g_{\gamma_n}(t) + R^M x(t),$<br />
with $R^0 x(t) = x(t)$. There are two ways of stopping the iterative<br />
process: one is to use a pre-specified limiting number<br />

M of the TF atoms, and the other is to check the energy of<br />

the residue $R^M x(t)$. A very high value of M and a zero value<br />
for the residual energy will decompose the signal completely<br />
at the expense of increased computational complexity.<br />

In this work, M was limited to 1000 atoms and the resid-<br />

ual energy limit was set to be 0.5% of the total energy. For<br />

VAG signals, the maximum octave length given by $\log_2 N$<br />
(where N is the number of samples) was set to 11 due to the<br />
nonstationary nature of the signal. Also, in MP analysis,<br />
only coherent structures [3] of the signals can be extracted:<br />
the residual components that do not have a high correlation<br />
with the vectors in the dictionary are rejected.<br />
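The MP iteration described above can be sketched as follows; the greedy selection and the residual-energy stopping rule follow the text, while the Gabor dictionary grid and all constants in the usage below are illustrative choices of ours:<br />

```python
import numpy as np

def gabor_atom(N, s, p, f):
    """Unit-norm real Gabor atom: Gaussian of scale s at position p, frequency f."""
    n = np.arange(N)
    g = np.exp(-np.pi * ((n - p) / s) ** 2) * np.cos(2 * np.pi * f * (n - p))
    return g / np.linalg.norm(g)

def matching_pursuit(x, D, max_atoms=1000, energy_frac=0.005):
    """Greedy MP: repeatedly subtract the best-matching dictionary row (atom)."""
    residue = np.asarray(x, dtype=float).copy()
    total = np.dot(residue, residue)
    picks = []
    for _ in range(max_atoms):
        proj = D @ residue                    # inner products with every atom
        i = int(np.argmax(np.abs(proj)))
        picks.append((i, proj[i]))            # (atom index, expansion coefficient)
        residue = residue - proj[i] * D[i]
        if np.dot(residue, residue) <= energy_frac * total:
            break
    return picks, residue
```

With a dictionary built over a coarse grid of scales, positions, and frequencies, a signal composed of dictionary atoms is recovered in a few iterations, and the residue carries only what the dictionary cannot represent coherently.<br />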

The Wigner distribution of x(t) based on the TF atoms<br />
is given as [3]:<br />
$W x(t,\omega) = \sum_{n=0}^{M-1} |\langle R^n x, g_{\gamma_n} \rangle|^2 \, W g_{\gamma_n}(t,\omega) + \sum_{n=0}^{M-1} \sum_{\substack{m=0 \\ m \neq n}}^{M-1} \langle R^n x, g_{\gamma_n} \rangle \langle R^m x, g_{\gamma_m} \rangle^{*} \, W[g_{\gamma_n}, g_{\gamma_m}](t,\omega),$<br />
where $W g_{\gamma_n}(t,\omega)$ is the Wigner transform of the Gaussian<br />
window function. The double sum corresponds to<br />
the cross-terms of the Wigner distribution, indicated by<br />
$W[g_{\gamma_n}, g_{\gamma_m}](t,\omega)$, and should be rejected in order to obtain<br />
a cross-term-free energy distribution of x(t) in the TF plane.<br />
Thus only the first term is computed, and the resulting TFD<br />
is given by<br />
$W'(t,\omega) = \sum_{n=0}^{M-1} |\langle R^n x, g_{\gamma_n} \rangle|^2 \, W g_{\gamma_n}(t,\omega).$<br />
The cross-term-free TFD $W'(t,\omega)$ has very good readability<br />

and is appropriate for multicomponent signal analysis. The<br />

extraction of coherent structures makes MP an attractive<br />

tool for TF representation of signals with unknown SNR.
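Since the Wigner transform of a Gaussian atom is itself a 2-D Gaussian centred at the atom's time-frequency location, the cross-term-free TFD $W'(t,\omega)$ can be rendered directly from the decomposition parameters. A sketch, where the grid resolution and the closed-form Gaussian spread are our own choices:<br />

```python
import numpy as np

def mp_tfd(atoms, N, n_freq=128):
    """Cross-term-free TFD: sum of Wigner transforms of Gaussian (Gabor) atoms.

    atoms: iterable of (coeff, s, p, f) -- expansion coefficient, scale,
    position (samples), and normalized frequency (cycles/sample)."""
    t = np.arange(N)[None, :]                        # time axis (columns)
    fgrid = np.linspace(0.0, 0.5, n_freq)[:, None]   # frequency axis (rows)
    W = np.zeros((n_freq, N))
    for a, s, p, f in atoms:
        # Wigner transform of a scaled Gaussian atom: 2-D Gaussian at (p, f),
        # wide in time for large s and correspondingly narrow in frequency
        W += a ** 2 * np.exp(-2 * np.pi * ((t - p) / s) ** 2
                             - 2 * np.pi * (s * (fgrid - f)) ** 2)
    return W
```

Because the terms are squared magnitudes of Gaussians, the resulting distribution is nonnegative everywhere, which is what makes it usable for the marginal-based feature extraction discussed later.<br />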


Fig. 1. (a) VAG signal of a normal subject. Grinding sound was heard during auscultation at an angle range of 50° → 0° (3400 to 4000 samples) during extension of the knee. au: acceleration units.<br />

Fig. 2. TFDs of the signal in Figure 1. (a) AR model-based TFD. X’s<br />

denote dominant poles. (b) RSPWV distribution. (c) MP TFD.<br />


III. RESULTS<br />

A. Results with Synthetic Signals<br />

Before applying the TFDs to real VAG signals, the<br />

TFDs were evaluated with synthetic signals. For exam-<br />

ple, one of the synthetic signals “syn” was composed with<br />

overlapping chirp, impulse, and sinusoidal FM components.<br />

The signal “syn” is a nonstationary signal since its spectrum<br />

varies with time. Transients such as an impulse represent<br />

clicks heard during knee movement. Chirp and sinusoidal<br />

FM components are examples of linear and nonlinear fre-<br />

quency dynamics; their physiological relevance needs to be<br />

studied. As VAG classification experiments based on dom-<br />

inant poles (adaptive pole or spectral peak tracking) have<br />

provided very good accuracy [1], we believe it is worthwhile<br />

to study VAG signals in terms of their frequency dynamics<br />

with improved time tracking.<br />

To simulate noisy signal conditions the synthetic signals<br />

were corrupted by adding Gaussian noise to an SNR of 10<br />

dB, and to simulate worse signal recording conditions the<br />

synthetic signals were corrupted by adding Gaussian noise<br />

to an SNR of 0 dB.<br />

The results obtained with synthetic signals are pre-<br />

sented below in a summarized version for the sake of brevity.<br />

The synthetic signals were segmented into fixed segments,<br />

and each segment was AR modeled using the Burg lattice<br />

method; a model order of 15, determined empirically, was<br />

used for each segment. The advantage of this method as<br />

compared with other segment-based methods such as the<br />

spectrogram is that the model presumes that the signal out-<br />

side the segment is nonzero as opposed to the spectrogram,<br />

where the signal outside the window is assumed to be zero.<br />

The TFD was free of cross-terms with reasonable TF local-<br />

ization, a,nd the TFD did not include the impulse component;<br />

this is because a transient of very short duration cannot be<br />

modeled by prediction inherent in the AR model. In the<br />

case of low SNR of 0 dB, AR modeling failed to give good<br />

spectral estimates.<br />

Although cross-terms were suppressed in the RSP-<br />

WVDs, the TFDs generated by RSPWVD had negative val-<br />

ues, and may not be suitable for feature extraction as an<br />

accurate estimate of the mean frequency or spread for each<br />

time instant cannot be reliably obtained. The method of<br />

reassignment improved the localization of the components<br />

significantly, but the problem of negative distribution values<br />

exists. In the case of the lower SNR of 0 dB, it was hard to<br />

distinguish the components of interest from cross-terms.<br />

The MP method gave a clear picture of the TF rep-<br />

resentation; the three simulated components were perfectly<br />

localized in the TFDs. This is because the MP TFD pro-<br />

vides adaptive representation of signal components, and due<br />

to the possibility that each high-energy component is ana-<br />

lyzed by the TF representation independent of its bandwidth<br />

and duration. The poor localization of transients by other<br />

techniques such as Fourier and wavelets is due to the fact<br />

that the transient information gets diluted across the whole<br />

basis, and the collection of basis functions is not as large as that in the MP dictionary.<br />

At the lower SNR, the MP TFD was better than those<br />

obtained using the other techniques. The MP TFD could be<br />

made more readable by extracting only the coherent struc-<br />

tures of the signal. The MP technique has the facility to<br />

include automatic denoising of the signal, which is useful in<br />

situations where the SNR of the signal is not known.<br />

B. Results with VAG signals<br />

The TFD methods were tested on ten VAG signals.<br />

For computing the AR model-based TFDs, the signals were<br />

adaptively segmented into quasi-stationary segments using<br />

the recursive least-squares lattice (RLSL) algorithm [4]. The<br />

segments were AR modeled using the Burg-lattice algorithm<br />

and the model order used was 40 [4]. For the sake of illustra-<br />

tion, the VAG signal (“vag1”) of a normal subject is shown<br />

in Figure l(a). Grinding sound was heard during auscul-<br />

tation at an angle range of 50° → 0° (approximately in the<br />

range of 2500 to 4000 samples) for this subject. From the AR<br />

model-based TFD of the signal “vag1” in Figure 2(a), we can<br />

observe that the TF representation is cross-term-free. The<br />

grinding sound is shown as a high-frequency activity. The<br />

localization of the component corresponding to the grinding<br />
sound is coarse, and the precise angle (or time) at which the<br />
sound was heard cannot be readily determined. Because of<br />
the coarse estimation of components, the AR model-based<br />
TFD may not be appropriate for instantaneous parameter<br />
extraction. The most dominant poles, indicated by the ‘X’<br />
marks superimposed on the AR model-based TFD in Figure<br />
2(a), indicate the dominant spectral peaks in the signal. As<br />
the dominant poles are selected on a segment-by-segment<br />
basis, they are also not suitable for instantaneous parameter tracking.<br />

Figure 2(b) shows the TFD obtained using the RSPWV<br />

method. The TFD is obviously not readable except for the<br />

component corresponding to the “grinding” sound. Further,<br />

the TFD has negative values due to cross-terms. The neg-<br />

ative values may mislead parameter calculation, and hence<br />

the RSPWVD may not be appropriate for feature extraction.<br />

The MP TFD is shown in Figure 2(c). The TFD was<br />

constructed using the coherent structures of the signal only,<br />

and the number of TF atoms was 441. The TFD has clearly<br />

represented the “grinding” sound with very good localization.<br />
The TFD obtained is a positive distribution and is free<br />

of cross-terms, and is suitable for feature extraction.<br />

C. Classification Experiments<br />

A database of 90 VAG signals was compiled, including<br />

51 normal and 39 abnormal signals. Although the MP TFD<br />

does not satisfy true marginal properties, time-varying pa-<br />

rameters with discriminant information can be computed as<br />

marginal values of an MP TFD. The time-varying parame-<br />

ters that were extracted from the MP TFD were:<br />

1. Energy Parameter: the mean of $W'(t,\omega)$ along each time<br />
slice, which gives a measure of energy variation with time.<br />
2. Energy Spread Parameter: the standard deviation of<br />
$W'(t,\omega)$ along each time slice.<br />
3. Frequency Parameter: the first moment along each time<br />
slice, as given by the expression<br />
$IMF(t) = \dfrac{\sum_{\omega} \omega \, W'(t,\omega)}{\sum_{\omega} W'(t,\omega)}.$<br />
4. Frequency Spread Parameter: the second central moment<br />
along each time slice, as given by<br />
$IMFS(t) = \dfrac{\sum_{\omega} \left[\omega - IMF(t)\right]^2 W'(t,\omega)}{\sum_{\omega} W'(t,\omega)}. \qquad (7)$<br />

The mean and standard deviation of the parameters over<br />

the duration of each signal were computed, and each VAG<br />

signal was represented by a set of eight features. Statistical<br />

pattern classification based on stepwise logistic regression<br />

analysis [5] of the features of the 90 VAG signals as nor-<br />

mal/abnormal was achieved with an accuracy of 74.4%. The<br />

frequency parameter significantly contributed towards accu-<br />

rate classification of VAG signals. This gives motivation to<br />

search for linear and nonlinear frequency components in the<br />

TF plane of VAG signals.<br />

IV. CONCLUSION<br />

Segmentation-based analysis of VAG signals has limita-<br />

tions in correlating the estimated angle range of pathology as<br />

observed during arthroscopy with the segments of the signal.<br />

The problem of segmentation can be avoided by using non-<br />

stationary signal analysis tools such as TFDs. It is difficult<br />

to interpret VAG TFDs, and even harder to train clinicians<br />
in interpreting TFDs. Therefore, TFDs should be selected so<br />
as to facilitate objective feature identification and extraction.<br />

Analysis of the performance of the different TFDs shows that<br />

the MP TFD is the most suitable TFD for VAG signal anal-<br />

ysis. Preliminary results with 90 VAG signals suggest that<br />

the parameters extracted from the MP-based TFD provide<br />

good discriminant information. Compared with our previ-<br />

ous methods, the proposed method does not need any joint<br />

angle and clinical information, and shows good potential for<br />

noninvasive diagnosis of articular cartilage pathology.<br />

Acknowledgements: We gratefully acknowledge support from<br />

the Alberta Heritage Foundation for Medical Research.<br />

REFERENCES<br />

[1] R.M. Rangayyan, S. Krishnan, G.D. Bell, C.B. Frank, and K.O. Ladly. Parametric representation and screening of knee joint vibroarthrographic signals. IEEE Transactions on Biomedical Engineering, 44(11):1068-1074, Nov. 1997.<br />
[2] F. Auger and P. Flandrin. Improving the readability of time-frequency and time-scale representations by the reassignment method. IEEE Transactions on Signal Processing, 43(5):1068-1089, May 1995.<br />
[3] S.G. Mallat and Z. Zhang. Matching pursuit with time-frequency dictionaries. IEEE Transactions on Signal Processing, 41(12):3397-3415, 1993.<br />
[4] S. Krishnan, R.M. Rangayyan, G.D. Bell, C.B. Frank, and K.O. Ladly. Adaptive filtering, modelling, and classification of knee joint vibroarthrographic signals for non-invasive diagnosis of articular cartilage pathology. Medical and Biological Engineering and Computing, 35:677-684, Nov. 1997.<br />
[5] A.A. Afifi and S.P. Azen. Statistical Analysis: A Computer Oriented Approach. Academic Press, New York, NY, 2nd edition, 1979.<br />


DETECTION OF NONLINEAR FREQUENCY-MODULATED<br />

COMPONENTS IN THE TIME-FREQUENCY PLANE USING AN<br />

ARRAY OF ACCUMULATORS<br />

Sridhar Krishnan and Rangaraj M. Rangayyan<br />

Dept. of Electrical and Computer Engineering, University of Calgary,<br />
2500 University Drive NW, Calgary, Alberta T2N 1N4, CANADA.<br />

Email: (krishnan) (ranga)@enel.ucalgary.ca<br />

Abstract - We propose a novel approach to detect<br />

nonlinear frequency-modulated (FM) components such<br />

as sinusoidal and hyperbolic FM components in multi-<br />

component, nonstationary signals in the time-frequency<br />

(TF) plane. The approach, based upon the use of an<br />

array of accumulators, can be used to detect nonlinear<br />

FM components with varying energy in low signal-to-<br />

noise ratio environments.<br />

I. INTRODUCTION<br />

Instantaneous frequency (IF) is an important pa-<br />

rameter in characterizing the nonstationary behavior<br />

of a signal. IF could be frequency modulated (FM)<br />

as a linear component (e.g. chirp) or as a nonlinear<br />

component (e.g. quadratic FM) with time. Detection<br />

of linear and nonlinear FM components in a nonsta-<br />

tionary signal has been studied extensively by using<br />

time-frequency (TF) representations [1], [2] and poly-<br />

nomial phase transforms (PPT) [3]. In PPT, the FM<br />

component is detected by estimating the phase coeffi-<br />

cients of the given complex signal. The disadvantages<br />

with PPT are that it can only be applied to signals<br />

whose amplitude variations are slower than their phase<br />

variations, and that reliable estimates of the phase co-<br />

efficients are not guaranteed under low signal-to-noise<br />

ratio (SNR) conditions. Barbarossa and Lemoine es-<br />

timated nonlinear FM parameters by using the reas-<br />

signed, smoothed, pseudo- Wigner-Ville representmation<br />

and the Hough transform. Although the method is at-<br />

tractive, accurate estimation of FM parameters is not<br />

possible in the presence of cross-terms.<br />

In our work, the nonlinear frequency parameters of<br />

a signal are estimated via its TF representation. The<br />

TF representation is treated as an image, where each<br />

pixel corresponds to the energy present at a particular<br />

time and frequency.<br />

II. TIME-FREQUENCY DISTRIBUTIONS<br />

The main conditions under which a TF distribution<br />

(TFD) can be treated as an image are:<br />

The TFD should be positive.<br />
The TFD should satisfy the marginal properties.<br />
Cross-terms should be negligible in order to avoid a false search.<br />
0-7803-5073-1/98/$10.00 © 1998 IEEE<br />

The widely used Cohen’s class TFDs do not satisfy the<br />

above requirements as the kernel used is functionally<br />

independent of the signal. TFDs based on linear combinations<br />

of the Wigner distributions of TF atoms, as<br />

given by a decomposition algorithm such as matching<br />

pursuit [4], are positive distributions and are cross-term<br />

free; however, they do not satisfy the marginal properties.<br />

TFDs that are positive and satisfy the marginals<br />

do exist, and one can obtain an infinite number of them<br />

for any signal. Such TFDs are nonlinear functions of the<br />

signal; the kernels for these TFDs are generally signal-dependent,<br />

and are known as Cohen-Posch TFDs [5].<br />

Accordingly, while the Cohen-Posch TFDs can, in theory,<br />

be obtained from Cohen’s general formulation, the<br />

signal-dependence of the kernel, coupled with its possible<br />

unbounded nature, calls for alternative formulations<br />

for practical implementation of the Cohen-Posch<br />

TFDs. One formulation which is particularly tractable<br />

and readily demonstrates the positivity and marginal<br />

conditions is:<br />

$P(t,\omega) = P(t)\,P(\omega)\,Q\big(u(t), v(\omega)\big), \qquad (1)$<br />
where $P(t) = |s(t)|^2$ and $P(\omega) = |S(\omega)|^2$ are the<br />
marginal densities, with s(t) being the given time-<br />
domain signal and $S(\omega)$ the Fourier transform of the<br />
signal, and Q(u, v) is any positive function of the variables<br />
(u, v) over $0 \le (u, v) \le 1$, normalized to one:<br />
$\int_0^1 Q(u,v)\,du = \int_0^1 Q(u,v)\,dv = 1. \qquad (2)$<br />
In Eq. (1), we have<br />
$u = u(t) = \int_{-\infty}^{t} P(t')\,dt', \qquad v = v(\omega) = \int_{-\infty}^{\omega} P(\omega')\,d\omega'. \qquad (3)$<br />
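The construction of Eqs. (1)-(3) is easy to check numerically. The sketch below uses the simplest admissible kernel, $Q \equiv 1$, which satisfies Eq. (2) trivially; any positive Q with unit line integrals, e.g. $Q(u,v) = 1 + \epsilon \cos 2\pi(u - v)$, works the same way (the discretization is our own):<br />

```python
import numpy as np

def positive_tfd(s, Q=lambda u, v: np.ones_like(u * v)):
    """Discrete Cohen-Posch-style positive TFD: P(t,w) = P(t) P(w) Q(u(t), v(w))."""
    s = np.asarray(s, dtype=complex)
    Pt = np.abs(s) ** 2
    Pt = Pt / Pt.sum()                         # normalized time marginal P(t)
    Pw = np.abs(np.fft.fft(s)) ** 2
    Pw = Pw / Pw.sum()                         # normalized frequency marginal P(w)
    u = np.cumsum(Pt)                          # u(t) of Eq. (3), discretized
    v = np.cumsum(Pw)                          # v(w) of Eq. (3), discretized
    return Pt[:, None] * Pw[None, :] * Q(u[:, None], v[None, :])
```

With $Q \equiv 1$ the distribution is trivially positive and reproduces both marginals exactly; the MCE algorithm of Loughlin et al. can be viewed as choosing a far more structured Q so that the distribution also concentrates along the signal's instantaneous frequency.<br />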

It is obvious that the density $P(t,\omega)$ is positive, and<br />
straightforward to show that the marginals are satisfied.<br />
In addition to being positive and yielding the correct<br />
marginals, the Cohen-Posch TFDs are also shift-<br />

invariant and scale-invariant. An algorithm to effi-<br />

ciently compute the Cohen-Posch TFDs has been pro-<br />

posed by Loughlin et al. [6]. The algorithm is based<br />

on minimum cross-entropy (MCE) optimization of the<br />

density functions.<br />

III. TFD ANALYSIS USING AN ARRAY OF ACCUMULATORS<br />

The approach of the present work is to apply the<br />

Hough transform based on an array of accumulators<br />

to positive TFDs obtained using the MCE method to<br />

detect energy-varying quadratic FM components such<br />

as sinusoidal and hyperbolic FM components with un-<br />

known parameters. Sinusoidal and hyperbolic FM sig-<br />

nals are common in synthetic aperture radar, multi-<br />

path communication channels, helicopter recognition,<br />

and sonar.<br />

The detection algorithm is based upon the use of<br />

an array of accumulators; the dimensionality of the ar-<br />

ray depends upon the number of parameters to be es-<br />

timated. The TFD is treated as an image with gray<br />

values corresponding to the normalized energy values<br />

of the components.<br />

Let us first consider the procedure for detecting a<br />

sinusoidal FM component in the TF plane. In practice,<br />

a sinusoidal FM component may occur at any location<br />

in the TF plane, and hence a generalized expression of<br />

a sine wave is considered: $f_k = A + m \sin(2\pi f_0 k + \phi)$,<br />
where A is the frequency shift in the TF plane, $f_k$ is the<br />
frequency at time k, $f_0$ is the number of cycles of the<br />
sinusoidal FM, $\phi$ is the phase shift, and m is the am-<br />

plitude. In practice, a sinusoidal FM component may<br />

not have a constant amplitude. In order to minimize<br />

the effect of amplitude variations, an attempt is made<br />

to make the waveform continuous by using edge-linking<br />

techniques based upon a gradient method (i.e., by ap-<br />

plying an image processing algorithm to the TFD).<br />

The algorithm works as follows:<br />

1. Each parameter is bounded by a minimum and<br />

a maximum value. For each point $(f_k, k)$ in the<br />
TF plane carrying a nonzero value, we let the pa-<br />
rameters m, $f_0$, and $\phi$ equal each of the allowed<br />
(quantized or binned) values and solve for the<br />
corresponding A using the equation $A = f_k - m \sin(2\pi f_0 k + \phi)$.<br />
The parameter A is rounded to the nearest allowed quantized value.<br />

2. If the choice of the parameters results in a nonzero<br />
value for A, we increment the corresponding cell of the<br />
four-dimensional accumulator array (initialized to zero)<br />
and add the energy value of the point to the cell. It is obvious<br />

that sinusoidal FM components will correspond to<br />

high-intensity hypersurfaces in the Hough param-<br />

eter domain.<br />

3. A threshold is then applied to the total energy<br />

value and the number of points. This facilitates the<br />

detection of energy-varying sinusoidal FM compo-<br />

nents of significant duration.<br />

4. The mean and standard deviation of the array indices<br />
are computed using those accumulator cells<br />
whose values have passed the threshold test. A<br />
high value for the standard deviation of the parameters<br />
indicates the presence of multiple sinusoidal<br />
FM components in the TF plane.<br />
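Steps 1-4 above can be sketched directly as a brute-force accumulator. The parameter grids, the row-to-frequency convention, and the synthetic test track used below are deliberately tiny illustrative choices, not the settings of the paper:<br />

```python
import numpy as np

def detect_sin_fm(tfd, m_vals, f0_vals, phi_vals, A_vals):
    """Accumulator-array (Hough-style) detection of f_k = A + m sin(2 pi f0 k + phi).

    tfd: nonnegative TF image whose row r is taken to represent normalized
    frequency r / (2 * n_rows). Returns the 4-D accumulator over (m, f0, phi, A)."""
    n_rows = tfd.shape[0]
    acc = np.zeros((len(m_vals), len(f0_vals), len(phi_vals), len(A_vals)))
    rows, cols = np.nonzero(tfd)
    for r, k in zip(rows, cols):
        fk, e = r / (2.0 * n_rows), tfd[r, k]
        for i, m in enumerate(m_vals):
            for j, f0 in enumerate(f0_vals):
                for l, phi in enumerate(phi_vals):
                    A = fk - m * np.sin(2 * np.pi * f0 * k + phi)
                    a = int(np.argmin(np.abs(A_vals - A)))
                    acc[i, j, l, a] += e       # accumulate the point's energy
    return acc
```

A true sinusoidal FM track concentrates all of its energy in a single accumulator cell, while wrong parameter combinations scatter it across many A bins; thresholding the accumulator then implements steps 3-4.<br />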

For the detection of hyperbolic FM, the parametric<br />
equation $f_k = A + b/k$ is considered, where b is a constant<br />
related to the time shift. The two parameters A<br />
and b can provide a generalized representation of any<br />
hyperbolic FM phenomenon. The estimation procedure<br />
is similar to that for sinusoidal FM; however, the dimension<br />
of the array is two (A and b) instead of four.<br />

IV. RESULTS<br />

The proposed method was tested on synthetic sig-<br />

nals containing sine and hyperbolic FM components.<br />

The TFDs were obtained using the MCE algorithm with<br />

narrowband and wideband spectrograms as the a priori<br />
estimate. The TFDs were of high TF resolution and<br />

free of cross-terms.<br />

Figure 1(a) shows a nonstationary signal with a<br />

sinusoidal FM component with SNR = 0 dB. The TFD<br />

obtained using the MCE method is shown in Figure<br />

1(b). The Hough parameter domain (which cannot be<br />

displayed due to its dimensionality of four) indicated<br />

the presence of a sinusoidal FM component, and the<br />

estimation of the parameters A, $f_0$, $\phi$, and m of<br />

the sinusoidal FM was accurate. The threshold for the<br />

accumulator cells was set to be 3% of the total energy<br />

value of the signal.<br />

Figure 2(a) shows a nonstationary signal with a hy-<br />

perbolic FM component embedded in white noise at an<br />

SNR of 0 dB. Figure 2(b) shows the TFD of the signal<br />

computed using the MCE method. The Hough param-<br />

eter space is shown in Figure 2(c) with the co-ordinates<br />

corresponding to A and b. The threshold was set to 9%<br />

of the total energy value of the signal. From the Hough<br />

parameter space we can infer that the highest-intensity<br />
point corresponds to A = 23 and b = 507, the<br />
parameters of the hyperbolic FM.<br />

A multicomponent nonstationary signal consisting<br />

of a sinusoidal FM component and a hyperbolic FM<br />

component corrupted by random noise to an SNR of<br />

0 dB is shown in Figure 3(a). The MCE-based TFD<br />

is shown in Figure 3(b). The sinusoidal FM detection<br />




Fig. 1. (a) Nonstationary signal with a sinusoidal FM component<br />

at an SNR of 0 dB. (b) TFD of the signal in (a) computed<br />

using the MCE method.<br />

method successfully indicated the presence of a sinu-<br />

soidal FM component, and the hyperbolic FM detec-<br />

tion method indicated the presence of a hyperbolic FM<br />

component. The Hough parameter space of hyperbolic<br />

FM detection is shown in Figure 3(c).<br />

V. CONCLUSION<br />

The proposed method successfully detected non-<br />

linear FM components, and the parameters estimated<br />
were accurate within ±10% of their actual values even<br />

at a low SNR of -5 dB. The nonlinear FM components<br />

were not detected at an SNR of -10 dB. Better esti-<br />

mates of the parameters under low-SNR conditions can<br />

be achieved by increasing the number of quantization<br />

levels for each parameter and by denoising the signals<br />

before computing their TFDs. A difficulty with the pro-<br />

posed method lies in the selection of a suitable thresh-<br />

old (lower thresholds have to be selected for low SNR).<br />


Fig. 2. (a) Nonstationary signal with a hyperbolic FM compo-<br />

nent at an SNR of 0 dB. (b) TFD of the signal in (a) computed<br />

using the MCE method. (c) Hough parameter space.<br />

The performance of the method needs to be evaluated<br />

in comparison with the existing methods under differ-<br />

ent SNR conditions. The method can be extended to<br />

detect any pattern (signature) in the TFD provided the<br />

pattern can be expressed in a parametric form.<br />

The proposed method may find application in the<br />

detection of the presence of nonlinear FM components<br />

in biomedical signals such as knee joint sound signals,<br />

and facilitate screening of normal and abnormal signals.<br />

Acknowledgements: We gratefully acknowledge support<br />
from the Natural Sciences and Engineering Research<br />
Council of Canada (NSERC) and the Alberta Heritage<br />
Foundation for Medical Research (AHFMR).<br />
Fig. 3. (a) Nonstationary signal with a sinusoidal FM component and a hyperbolic FM component at an SNR of 0 dB. (b) TFD of the signal in (a) computed using the MCE method. (c) Hough parameter space of hyperbolic FM detection.<br />
REFERENCES<br />
[1] S. Barbarossa. Analysis of multicomponent LFM signals by combined Wigner-Hough transform. IEEE Transactions on Signal Processing, 43(6):1511-1515, June 1995.<br />
[2] S. Barbarossa and O. Lemoine. Analysis of nonlinear FM signals by pattern recognition of their time-frequency representation. IEEE Signal Processing Letters, 3(4):112-115, April 1996.<br />
[3] S. Peleg and B. Friedlander. Multicomponent signal analysis using the polynomial-phase transform. IEEE Transactions on Aerospace and Electronic Systems, 32(1):378-387, January 1996.<br />
[4] S.G. Mallat and Z. Zhang. Matching pursuit with time-frequency dictionaries. IEEE Transactions on Signal Processing, 41(12):3397-3415, 1993.<br />
[5] L. Cohen and T. Posch. Positive time-frequency distribution functions. IEEE Transactions on Acoustics, Speech, and Signal Processing, 33:31-38, 1985.<br />
[6] P. Loughlin, J. Pitton, and L. Atlas. Construction of positive time-frequency distributions. IEEE Transactions on Signal Processing, 42:2697-2705, 1994.<br />



Proceedings - 19th International Conference - IEEE/EMBS, Oct. 30 - Nov. 2, 1997, Chicago, IL, USA<br />

TIME-FREQUENCY SIGNAL FEATURE EXTRACTION AND<br />

SCREENING OF KNEE JOINT VIBROARTHROGRAPHIC<br />

SIGNALS USING THE MATCHING PURSUIT METHOD<br />

Sridhar Krishnan¹, Rangaraj M. Rangayyan¹,², G. Douglas Bell¹,²,³, Cyril B. Frank<br />
¹Dept. of Electrical and Computer Engineering, ²Dept. of Surgery, ³Sport Medicine Centre<br />
The University of Calgary, Alberta, T2N 1N4, CANADA. (Email: ranga@enel.ucalgary.ca)<br />

Abstract - Nonstationary features of knee joint vi-<br />

broarthrographic (VAG) signals were extracted from<br />

their time-frequency distributions (TFDs) obtained us-<br />

ing the matching pursuit method. Features computed<br />

as marginal calculations of the TFDs were instantaneous<br />

energy, instantaneous energy spread, instantaneous mean<br />

frequency, and instantaneous mean frequency spread.<br />

The features carry information about the combined time-<br />

frequency dynamics of the signals. The mean and stan-<br />

dard deviation of the features were also computed, and<br />

each VAG signal was represented by a set of just eight<br />

parameters. The method was tested on 37 VAG signals<br />

(19 normal and 18 abnormal) with no restriction on the<br />

type of articular cartilage pathology. Discriminant analy-<br />

sis of the parameters showed an accuracy of 89.5% at the<br />

training stage and 77.8% at the test stage. Compared<br />

to our previous methods, the proposed method does not<br />

need any joint angle and clinical information, and shows<br />

good potential for noninvasive diagnosis and monitoring<br />

of articular cartilage pathology.<br />

Keywords: Vibroarthrography, Knee sounds, Time-<br />

frequency analysis, Articular cartilage, Matching pursuit.<br />

I. INTRODUCTION<br />

Knee joint vibration or sound signals, also known as<br />

vibroarthrographic (VAG) signals, emitted during active<br />

movement of the leg are expected to be associated with<br />

pathological conditions of the articular cartilage. VAG<br />

signal analysis could lead to a clinical tool for diagno-<br />

sis and monitoring of true articular cartilage pathology<br />

such as chondromalacia of the patella. A variety of VAG<br />

signal analysis techniques have been proposed in the lit-<br />

erature [1], [2], [3], [4], [5]. All of the previous methods<br />

used standard signal processing techniques based on the<br />

Fourier transform or autoregressive modeling, by assum-<br />

ing the signal to be either stationary or by segmenting<br />

the signal into quasi-stationary parts.<br />

In the present work, the nonstationarity of VAG sig-<br />

nals is taken into consideration, which arises due to the<br />

fact that different joint surfaces come in contact during<br />

movement, and the nature and quality of the joint sur-<br />

faces coming in contact may not be the same from one posi-<br />

tion to the next. Hence both intra- and inter-subject vari-<br />

(0-7803-4262-3/97/$10.00 © 1997 IEEE)<br />


ability of signal characteristics are expected. Although<br />

our previous approaches [4], [5] addressed nonstationar-<br />

ity to some extent by using robust adaptive segmentation<br />

algorithms, there was a difficulty in labeling individual<br />

segments as normal or abnormal. This is because an<br />

accurate estimation of the joint angle corresponding to<br />

pathology as observed during arthroscopy could not be<br />

achieved. The problem could be completely avoided by<br />

using nonstationary signal analysis tools such as time-<br />

frequency (TF) and wavelet transforms. The objective<br />

of our current work is to extract and identify relevant<br />

features in the TF plane which could discriminate abnor-<br />

mal knees from normal knees based solely on VAG signal<br />

features.<br />

II. METHODS<br />

A. Data Acquisition<br />

Each subject sat on a rigid table in a relaxed position<br />

with his/her leg freely suspended in air. The VAG signal<br />

was recorded at the mid-patella position of the knee as<br />

the subject swung his/her leg over an approximate an-<br />

gle range of 135° → 0° → 135° in 4 s. The signal was<br />

prefiltered and amplified before digitizing at a sampling<br />

rate of 2 kHz. A database of 37 signals was prepared, in-<br />

cluding 18 signals of symptomatic patients scheduled to<br />

undergo arthroscopy. There was no restriction imposed<br />

on the type of pathology, and the abnormal signals in-<br />

cluded chondromalacia of different grades at the patella,<br />

meniscal tear, tibial chondromalacia, and anterior cruci-<br />

ate ligament injuries.<br />

B. Time-Frequency Analysis<br />

Features of VAG signals were extracted from their<br />

time-frequency distributions (TFDs) obtained using the<br />

matching pursuit (MP) method [6]. In MP analysis,<br />

the given signal is decomposed into a linear expansion<br />

of waveforms, known as TF atoms, selected from a re-<br />

dundant dictionary of functions. The TF atoms in the<br />

dictionary are generated by scaling, translating, and fre-<br />

quency modulating a normalized window function g7(t).<br />

194<br />

Authorized licensed use limited to: <strong>Ryerson</strong> <strong>University</strong> Library. Downloaded on July 6, 2009 at 16:11 from IEEE Xplore. Restrictions apply.


Proceedings - 19th International Conference - IEEE/EMBS Oct. 30 - Nov. 2, 1997 Chicago, IL. USA<br />

The MP method represents a signal $x(t)$ as:<br />

$x(t) = \sum_{n=0}^{M-1} a_n g_{\gamma_n}(t)$, (1)<br />

where<br />

$g_{\gamma_n}(t) = \frac{1}{\sqrt{s_n}}\, g\!\left(\frac{t - p_n}{s_n}\right) \exp[j(2\pi f_n t + \phi_n)]$, (2)<br />

and $a_n$ are the expansion coefficients. The scale factor $s_n$ is used to control the width of the envelope of $g_{\gamma_n}(t)$, and the parameter $p_n$ controls the temporal placement. $\frac{1}{\sqrt{s_n}}$ is a normalizing factor, which keeps the norm of $g_{\gamma_n}(t)$ equal to 1. The parameters $f_n$ and $\phi_n$ are the frequency and phase of the exponential function, respectively. In our application, the envelope function is a Gaussian function, i.e., $g(t) = 2^{1/4} \exp(-\pi t^2)$; the TF atoms are then called Gabor functions.<br />

In practice, the algorithm works as follows. First, the signal is projected onto a Gabor function dictionary. The projection decomposes the signal into two parts:<br />

$x(t) = \langle x, g_{\gamma_0}\rangle g_{\gamma_0}(t) + R^1 x(t)$, (3)<br />

where $\langle x, g_{\gamma_0}\rangle$ denotes the inner product (projection) of $x(t)$ with the first TF atom $g_{\gamma_0}(t)$. The second term $R^1 x(t)$ is the residual vector after approximating $x(t)$ in the direction of $g_{\gamma_0}(t)$. This process is continued by projecting the residue onto the dictionary, and after $M$ iterations<br />

$x(t) = \sum_{n=0}^{M-1} \langle R^n x, g_{\gamma_n}\rangle g_{\gamma_n}(t) + R^M x(t)$, (4)<br />

with $R^0 x(t) = x(t)$. There are two ways of stopping the iterative process: one is to use a pre-specified limiting number $M$ of TF atoms, and the other is to verify the energy of the residue $R^M x(t)$. A very high value of $M$ and a zero value for the residual energy will decompose the signal completely at the expense of increased computational complexity.<br />

In this work, M was chosen to be 1000 atoms, the<br />

residual energy limit was set to be zero, and only coherent<br />

structures were extracted (i.e., components determined to<br />

be noise by the MP algorithm were rejected).<br />
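The greedy projection loop of Eqns. 3 and 4 can be sketched as follows; this is an illustrative reimplementation over a generic unit-norm dictionary, not the authors' code, and the demo dictionary is deliberately trivial (orthonormal) so the decomposition is exact:<br />

```python
import numpy as np

def matching_pursuit(x, dictionary, n_atoms=10, tol=1e-10):
    """Greedy MP: repeatedly project the residue onto the dictionary.

    dictionary: (n_dict, n_samples) array of unit-norm atoms.
    Returns expansion coefficients, chosen atom indices, final residue.
    """
    residue = x.astype(float).copy()
    coeffs, chosen = [], []
    for _ in range(n_atoms):
        proj = dictionary @ residue          # <R^n x, g_gamma> for every atom
        k = int(np.argmax(np.abs(proj)))     # best-matching atom
        a = proj[k]
        coeffs.append(a)
        chosen.append(k)
        residue = residue - a * dictionary[k]  # R^{n+1} x
        if residue @ residue < tol:            # residual-energy stopping rule
            break
    return np.array(coeffs), chosen, residue

# Tiny demo: the standard basis as a (trivially complete) dictionary
D = np.eye(4)
x = np.array([3.0, 0.0, -2.0, 0.0])
a, idx, r = matching_pursuit(x, D, n_atoms=4)
# With an orthonormal dictionary, MP recovers x exactly
x_hat = sum(c * D[i] for c, i in zip(a, idx))
```

With a redundant Gabor dictionary the same loop applies; only the atom generation changes.<br />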

Now, the Wigner distribution of $x(t)$ based on the TF atoms is given as [6]:<br />

$W(t,\omega) = \sum_{n=0}^{M-1} |\langle R^n x, g_{\gamma_n}\rangle|^2\, W_{g_{\gamma_n}}(t,\omega) + \sum_{n=0}^{M-1} \sum_{\substack{m=0 \\ m \neq n}}^{M-1} \langle R^n x, g_{\gamma_n}\rangle \langle R^m x, g_{\gamma_m}\rangle^*\, W_{[g_{\gamma_n}, g_{\gamma_m}]}(t,\omega)$,<br />

Fig. 1. (a) A normal VAG signal. (b) TFD of (a) obtained using the matching pursuit method. au: Acceleration units.<br />

Fig. 2. (a) The VAG signal of a pathological knee with chondromalacia of the patella. (b) TFD of (a) obtained using the matching pursuit method. au: Acceleration units.<br />


where $W_{g_{\gamma_n}}(t,\omega)$ is the Wigner transform of the Gaussian window function. The double sum corresponds to the cross-terms of the Wigner distribution, and should be removed in order to obtain an interference-free energy distribution of $x(t)$ in the TF plane. Thus only the first term is retained, and the interference-free TFD is given by<br />

$W'(t,\omega) = \sum_{n=0}^{M-1} |\langle R^n x, g_{\gamma_n}\rangle|^2\, W_{g_{\gamma_n}}(t,\omega)$. (5)<br />
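Because the atoms are Gaussian, the Wigner transform of each atom in Eqn. 5 has a closed form, so the interference-free TFD can be accumulated atom by atom. A minimal sketch, assuming the standard Gabor parameterization (scale s, time centre u, frequency centre xi); the grids and function names are our own:<br />

```python
import numpy as np

def wigner_gabor(t, w, s, u, xi):
    """Closed-form Wigner distribution of a Gabor atom built from
    g(t) = 2**0.25 * exp(-pi * t**2), scaled by s and centred at (u, xi)."""
    return 2.0 * np.exp(-2.0 * np.pi * ((t - u) / s) ** 2
                        - (s ** 2) * (w - xi) ** 2 / (2.0 * np.pi))

def mp_tfd(atoms, t_grid, w_grid):
    """Interference-free TFD: sum of atom Wigner distributions,
    weighted by |coefficient|**2 as in Eqn. 5.
    atoms: list of (coeff, s, u, xi) tuples."""
    T, W = np.meshgrid(t_grid, w_grid, indexing="ij")
    tfd = np.zeros_like(T)
    for a, s, u, xi in atoms:
        tfd += np.abs(a) ** 2 * wigner_gabor(T, W, s, u, xi)
    return tfd

t = np.linspace(-2, 2, 64)
w = np.linspace(-10, 10, 64)
tfd = mp_tfd([(1.0, 1.0, 0.0, 3.0)], t, w)
i, j = np.unravel_index(np.argmax(tfd), tfd.shape)
# The single atom's energy peaks at its time-frequency centre (0, 3),
# and the distribution is positive everywhere
```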

Figs. 1(b) and 2(b) show the TFDs of the normal<br />

VAG signal in Fig. 1(a) and the abnormal VAG signal in<br />

Fig. 2(a). The TFD of the normal signal was computed<br />

using 392 TF atoms, and the abnormal signal’s TFD<br />

was computed using 441 TF atoms. The TFDs obtained<br />

are of very high resolution, and with synthetic signals of<br />

known TF dynamics, the TFDs based on the MP algo-<br />

rithm showed very good localization in both time and<br />

frequency. The bright spots in the TFD figures corre-<br />

spond to the TF atoms; the brightness increases with<br />

energy. In the illustration, the TFD of the abnormal sig-<br />

nal shows more high-frequency activity than the TFD of<br />

the normal signal. However, this need not be always true<br />

(especially in the case of normal noisy knees), and mere<br />

visual interpretation of the TFDs will not help in discrim-<br />

inating pathological knees from normal knees. In order<br />

to differentiate the signals, features of diagnostic value<br />

need to be extracted from the TF plane.<br />

C. Time-Frequency <strong>Signal</strong> Features<br />

As the TFD obtained using the TF atoms is an<br />

interference-free distribution and is always positive, fea-<br />

tures derived from the TF plane will posses a high de-<br />

gree of accuracy as compared to features obtained with<br />

Cohen’s class TF transforms. The features used in the<br />

present work were computed as marginal calculations of<br />

the TFDs. The four TF features of relevance derived<br />

from the TFDs of VAG signals were:<br />

Instantaneous energy (IE): As the TFDs were ob-<br />

tained using TF atoms that were coherent with the<br />

signal structure and the signal components that were<br />

determined to be noise by the MP algorithm were re-<br />

jected, the IE obtained as a function of time will have<br />

a high signal-to-noise ratio. The IE was computed<br />

as the mean of $W'(t,\omega)$ along each time slice, which<br />

gives a measure of energy variation with time. Sig-<br />

nals generated by pathological knees will be highly<br />

time-variant (i.e., they are highly nonstationary) be-<br />

cause of the differences in cartilage roughness and<br />

nonuniformity. Thus the IE of an abnormal signal is<br />

expected to show large variations with time.<br />

Instantaneous energy spread (IES): IES measures<br />

the spread of energy over frequency for each time<br />

slice. This was computed as the standard deviation<br />


of W’(t,w) along each time slice. This is a good<br />

measure if the signal is multicomponent in nature.<br />

Abnormal VAG signals generated as a result of fric-<br />

tion between rough cartilage surfaces may have more<br />

components because of the nonuniformity of the sur-<br />

faces, and a high signal energy spread is expected<br />

around the IE.<br />

Instantaneous mean frequency (IMF): IMF was com-<br />

puted as the first moment along each time slice, $\mathrm{IMF}(t) = \frac{\int \omega\, W'(t,\omega)\, d\omega}{\int W'(t,\omega)\, d\omega}$. IMF measures the frequency dynamics of the sig-<br />

nal. The movement of the knee during signal acqui-<br />

sition may cause some linear or nonlinear frequency<br />

modulation of the signal, with the modulation in-<br />

dex depending on the state of lubrication, stiffness,<br />

and roughness of the cartilage surfaces. Pathologi-<br />

cal knees have less lubricated and rougher cartilage<br />

surfaces than normal knees, and hence the IMF of<br />

pathological knees will be different from that of nor-<br />

mal knees.<br />

Instantaneous mean frequency spread (IMFS): IMFS<br />

is given by the second central moment along each<br />

time slice, $\mathrm{IMFS}(t) = \frac{\int [\omega - \mathrm{IMF}(t)]^2\, W'(t,\omega)\, d\omega}{\int W'(t,\omega)\, d\omega}$.<br />

IMFS gives the spread of frequency about the mean<br />

frequency for each time instant. The spread of fre-<br />

quency at a time instant arises as a result of am-<br />

plitude modulation. Amplitude modulation is pos-<br />

sible in VAG signals, and may be dependent on the<br />

quality and intensity of sound produced due to joint<br />

vibration. IMFS could be an excellent feature in<br />

identifying noisy knees.<br />

The four features derived above are dependent on the<br />

functional state of the cartilage surfaces in the knee joint,<br />

and are expected to be suitable for discriminating patho-<br />

logical knees from normal knees.<br />
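The four marginal features can be computed directly from a discretized TFD, with rows indexing time slices and columns indexing frequency. The normalizations below are one reasonable choice, since the paper does not spell them out; the function names are our own:<br />

```python
import numpy as np

def tf_features(tfd, freqs):
    """Per-time-slice marginals of a TFD (rows: time, cols: frequency).

    IE  : mean energy of each time slice
    IES : standard deviation of each slice (energy spread)
    IMF : first moment of each slice over frequency
    IMFS: second central moment of each slice about the IMF
    """
    energy = tfd.sum(axis=1) + 1e-12                 # slice normalizers
    ie = tfd.mean(axis=1)
    ies = tfd.std(axis=1)
    imf = (tfd * freqs).sum(axis=1) / energy
    imfs = (tfd * (freqs - imf[:, None]) ** 2).sum(axis=1) / energy
    return ie, ies, imf, imfs

def eight_parameters(tfd, freqs):
    """Summarize a VAG signal by the mean and spread of each feature."""
    feats = tf_features(tfd, freqs)
    return np.array([f.mean() for f in feats] + [f.std() for f in feats])

# Toy TFD: two time slices, each with energy at a single frequency
freqs = np.arange(4.0)
tfd = np.array([[0.0, 1.0, 0.0, 0.0],   # all energy at f = 1
                [0.0, 0.0, 0.0, 2.0]])  # all energy at f = 3
ie, ies, imf, imfs = tf_features(tfd, freqs)
```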

D. Pattern Classification<br />

The features discussed in the previous section are time<br />

dependent. This could be easily observed in the wave-<br />

forms shown in fig.3, which were derived from the TFD<br />

shown in fig.2(b). In order to facilitate a global deci-<br />

sion on the signal, the mean and variance of the features<br />

were calculated. Therefore, a given VAG signal will have<br />

eight parameters. The classification of knees as normal or<br />

pathological was achieved using a statistical pattern clas-<br />

sifier based on discriminant analysis of the parameters [7].<br />

The database was randomly split into two (almost) equal<br />

parts. The features of signals in one part of the database<br />


Fig. 3. Features of the abnormal signal in fig. 2. (a) IE waveform.<br />

(b) IES waveform. (c) IMF waveform. (d) IMFS waveform.<br />

were used to train the classifier. The classifier was then<br />

tested on the second part of the database. The signals<br />

used in the test stage were different from those in the<br />

training stage. The classification accuracy is given as the<br />

percentage of the number of correctly classified signals to<br />

the number of signals in the group.<br />
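A minimal sketch of this train/test discriminant classification; the paper used SPSS [7] for the discriminant analysis, so the Fisher linear discriminant below is only an illustrative stand-in, and the eight-parameter vectors here are synthetic, not real VAG data:<br />

```python
import numpy as np

def fit_lda(X, y):
    """Two-class Fisher linear discriminant: returns (weights, threshold)."""
    X0, X1 = X[y == 0], X[y == 1]
    m0, m1 = X0.mean(axis=0), X1.mean(axis=0)
    # Pooled within-class scatter, lightly regularized for stability
    Sw = np.cov(X0.T) * (len(X0) - 1) + np.cov(X1.T) * (len(X1) - 1)
    Sw += 1e-6 * np.eye(X.shape[1])
    w = np.linalg.solve(Sw, m1 - m0)
    thresh = 0.5 * (X0 @ w).mean() + 0.5 * (X1 @ w).mean()
    return w, thresh

def predict(X, w, thresh):
    return (X @ w > thresh).astype(int)

rng = np.random.default_rng(0)
# Hypothetical 8-parameter vectors: class 0 "normal", class 1 "abnormal"
train0 = rng.normal(0.0, 1.0, (9, 8))
train1 = rng.normal(2.5, 1.0, (10, 8))
X = np.vstack([train0, train1])
y = np.array([0] * 9 + [1] * 10)
w, th = fit_lda(X, y)
train_acc = (predict(X, w, th) == y).mean()
```

In practice the classifier would be fit on one half of the database and scored on the held-out half, as described above.<br />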

III. RESULTS AND DISCUSSION<br />

The classifier was trained with the TF parameters of<br />

19 signals (9 normal and 10 abnormal), and was tested<br />

with 18 signals (10 normal and 8 abnormal). The classifi-<br />

cation accuracy in training was 89.5%, and the test stage<br />

accuracy was 77.8%.<br />

Among the features used, IE and IES contributed<br />

significantly to the classification accuracy. This shows<br />

that abnormal VAG signals possess highly time-varying<br />

energy as compared to normal VAG signals. The perfor-<br />

mance of IES leads to the conclusion that VAG signals<br />

of pathological knees are more multicomponent in na-<br />


ture than those of normal knees. This may be due to<br />

the possibility that the roughness of cartilage surfaces in<br />

pathological knees gives rise to multiple sources of vibra-<br />

tion signals. Though IMF did not show high discrimina-<br />

tion, it can be used to detect linear frequency-modulated<br />

components or chirp signals (if any) in VAG signals. We<br />

are currently working on detection of chirp components<br />

in TFDs from an image processing perspective. While<br />

IMFS may not be a very good feature for classification of<br />

knees as normal or abnormal, it could be used to study<br />

the sound patterns of knees, and to investigate whether<br />

the sound patterns of pathological knees differ from those<br />

of normal knees.<br />

The use of the variance of the features could allow<br />

this method to be applied on a larger set of signals, and<br />

could avoid any bias as a result of variations in transducer<br />

placement, amplifier-filter settings, etc.<br />

IV. CONCLUSIONS<br />

A novel approach to analyze nonstationary VAG sig-<br />

nals was used. The method does not require any joint an-<br />

gle information to label the components of a VAG signal,<br />

and is independent of patient information such as age,<br />

activity level, and gender. Signal features of diagnostic<br />

relevance were extracted from their TFDs obtained using<br />

the MP method. The extracted TF features demonstrate<br />

the good potential of this method for screening of VAG sig-<br />

nals.<br />

Acknowledgements: We gratefully acknowledge support<br />

from the Alberta Heritage Foundation for Medical Re-<br />

search and the Arthritis Society of Canada.<br />

REFERENCES<br />

[1] M.L. Chu, I.A. Gradisar, and R. Mostardi. A noninvasive electroacoustical evaluation technique of cartilage damage in pathological knee joints. Medical and Biological Engineering and Computing, 16:437-442, 1978.<br />

[2] W.G. Kernohan, D.E. Beverland, G.F. McCoy, A. Hamilton, P. Watson, and R.A.B. Mollan. Vibration arthrometry. Acta Orthop. Scand., 61(1):70-79, 1990.<br />

[3] N.P. Reddy, B.M. Rothschild, M. Mandal, V. Gupta, and S. Suryanarayanan. Noninvasive acceleration measurements to characterize knee arthritis and chondromalacia. Annals of Biomedical Engineering, 23:78-84, 1995.<br />

[4] Z.M.K. Moussavi, R.M. Rangayyan, G.D. Bell, C.B. Frank, K.O. Ladly, and Y.T. Zhang. Screening of vibroarthrographic signals via adaptive segmentation and linear prediction modeling. IEEE Transactions on Biomedical Engineering, 43:15-23, 1996.<br />

[5] S. Krishnan, R.M. Rangayyan, G.D. Bell, C.B. Frank, and K.O. Ladly. Screening of knee joint vibroarthrographic signals by statistical analysis of dominant poles. In CDROM proceedings, 18th Annual International Conference of the IEEE Engineering in Medicine and Biology Society, Amsterdam, The Netherlands, October 1996.<br />

[6] S.G. Mallat and Z. Zhang. Matching pursuits with time-frequency dictionaries. IEEE Trans. on Signal Processing, 41(12):3397-3415, 1993.<br />

[7] SPSS Inc., Chicago, IL. SPSS Advanced Statistics User's Guide, 1990.<br />



Detection of Chirp and Other Components in the Time-Frequency Plane<br />

using the Hough and Radon Transforms<br />

Sridhar Krishnan and Rangaraj M. Rangayyan<br />

Dept. of Electrical and Computer Engineering, The <strong>University</strong> of Calgary,<br />

2500 <strong>University</strong> Drive NW, Calgary, Alberta T2N 1N4, CANADA.<br />

Email: (krishnan) (ranga)@enel.ucalgary.ca<br />

Abstract: We propose a novel approach to detect chirp<br />

(linear frequency modulated) components in multicomponent<br />

nonstationary signals in the time-frequency (TF) plane.<br />

The approach, based on the Hough and Radon transforms<br />

of TF distributions, can be used to detect chirp components<br />

with varying energy in unknown signal-to-noise ratio environments.<br />

In addition to detection of chirps, the proposed<br />

technique could also be used as a tool to evaluate the TF<br />

resolution provided by different TF analysis methods.<br />

I. INTRODUCTION<br />

Time-frequency distributions (TFD) give the energy dis-<br />

tribution of a signal in the time-frequency (TF) plane, and<br />

are suitable for analyzing nonstationary signals. In particu-<br />

lar, a TFD gives information about the time, frequency, and<br />

combined TF dynamics of a signal. Stationary, Dirac, and<br />

chirp (linear frequency modulated or FM) components of a<br />

signal appear as directional components in the TF plane.<br />

The directional components may be narrow or broad in the<br />

TF plane depending upon the resolution of the TF transfor-<br />

mation used and the energy spread of the component. If the<br />

signal energy is oriented only horizontally in the TF plane<br />

(i.e., a stationary component) or only vertically (i.e., a Dirac<br />

component), then signal detection is easy, and optimal de-<br />

tection can be achieved by computing the marginal densities<br />

of the TFD. However, in practice, chirp components may<br />

occur with arbitrary TF orientations.<br />

Detection of chirp components helps in understanding<br />

the underlying TF dynamics of a signal. Many methods of<br />

chirp detection have been proposed in the literature; typ-<br />

ical applications of chirp detection are found in synthetic<br />

aperture radar, communication over time-varying multipath<br />

channels, and seismology. A method for optimal detection<br />

of chirp components based on a maximum likelihood ap-<br />

proach was proposed by Kay and Boudreaux-Bartels [1].<br />

This method of chirp detection is equivalent to the Radon<br />

transform (RT) of the TFD obtained using the Wigner dis-<br />

tribution. The Radon-Wigner method of chirp detection is<br />

computationally expensive, and an efficient implementation<br />

based on a dechirping method was proposed by Wood and<br />

Barry [2]. The Hough transform (HT) could be used instead<br />

of the RT to detect arbitrary shapes in TF planes which<br />

are not necessarily straight lines (chirps). A Wigner-Hough<br />

method to detect chirp and nonlinear FM components was<br />

0-7803-3905-3/97/$10.00 © 1997 IEEE<br />



Authorized licensed use limited to: <strong>Ryerson</strong> <strong>University</strong> Library. Downloaded on July 6, 2009 at 16:05 from IEEE Xplore. Restrictions apply.<br />

Fig. 1. Block diagram of the proposed method: nonstationary signal → matching pursuit algorithm → time-frequency (TF) atoms → Wigner distribution of TF atoms (TFD) → Hough/Radon (HR) transform → threshold → detection of chirps.<br />

proposed by Barbarossa and Lemoine [3], [4].<br />

The motivation of this work is to apply a combined<br />

Hough-Radon (HR) transform (HRT) on TFDs obtained us-<br />

ing the matching pursuit (MP) method to detect energy-<br />

varying chirps with unknown parameters. The block dia-<br />

gram of the proposed method is shown in Fig.1. The use of<br />

the MP technique facilitates application of this method in<br />

environments with unknown signal-to-noise ratio (SNR).<br />

II. THE HOUGH-RADON TRANSFORM<br />

The TFD of a multicomponent nonstationary signal can<br />

be obtained using the MP method proposed by Mallat and<br />

Zhang [5]. In MP, the given signal is decomposed into a linear<br />

expansion of waveforms, known as TF atoms, selected<br />

from a large dictionary of Gabor functions. The TF atoms<br />

corresponding to only the coherent structures of the signal<br />

can be extracted, and the SNR of the signal with unknown<br />

noise power can be increased. The TFD obtained as a summation<br />

of the Wigner transforms of TF atoms is of high TF<br />

resolution and free of interference.<br />

To detect chirp components (straight lines at arbitrary<br />

orientations in the TF plane), the HT may be used. The HT<br />

is a common technique to detect lines and curves that satisfy<br />

a parametric constraint [6].<br />

The HT is most commonly used as follows: Consider a point $(x_i, y_i)$ in the image plane (please note that the image plane here denotes the TF plane). The general equation of a straight line in the slope-intercept form is $y_i = m x_i + b$, where $m$ is the slope, $b$ is the intercept with the $y$ axis, and<br />


the x and y axes correspond to the t and w axes respectively.<br />

There are an infinite number of lines that pass through a<br />

point (xi, yi) and still satisfy the equation yi = mxi + b, for<br />

varying values of the parameters m and b. Parameterizing<br />

the TF plane into the (m, b) parameter space poses a problem<br />

because of the unbounded nature of m and b. One way to<br />

avoid this problem is to use the normal representation of a<br />

line given by<br />

$x\cos\theta + y\sin\theta = \rho$. (1)<br />

The parameter space $(\rho, \theta)$, also known as the Hough do-<br />

main, is now bounded in $\theta$ to the interval $[0, \pi]$ and in $\rho$<br />

by the Euclidean distance to the farthest point in the image<br />

from the centre of the image.<br />

From Eqn. 1, for a specific point in the TF plane $(t_i, \omega_i)$,<br />

we obtain a sinusoidal curve in the Hough domain. All of<br />

the sinusoids resulting from a mapping of a line in the TF<br />

plane have a common point of intersection in the Hough<br />

domain. Thus, chirps in the TF plane will correspond to<br />

high-intensity points in the Hough domain.<br />
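The Hough mapping above, extended with the gray-level weighting used by the HRT, can be sketched as a weighted accumulator (our own minimal discretization; the count-times-sum cell weighting follows the paper's description of the HRT):<br />

```python
import numpy as np

def hough_radon(image, n_theta=180, n_rho=64):
    """Weighted Hough (HR) accumulator for a gray-level image/TFD.

    Each nonzero pixel (x, y) votes on every (theta, rho) cell satisfying
    x*cos(theta) + y*sin(theta) = rho. Per cell we keep both the sum of
    gray values and the count of collinear points, and return their product,
    so high-energy directional components map to bright HR cells.
    """
    thetas = np.linspace(0.0, np.pi, n_theta, endpoint=False)
    rho_max = np.hypot(*image.shape)
    sums = np.zeros((n_theta, n_rho))
    counts = np.zeros((n_theta, n_rho))
    ys, xs = np.nonzero(image)
    for x, y, g in zip(xs, ys, image[ys, xs]):
        rhos = x * np.cos(thetas) + y * np.sin(thetas)
        j = np.clip(np.round((rhos + rho_max) / (2 * rho_max) * (n_rho - 1)),
                    0, n_rho - 1).astype(int)
        sums[np.arange(n_theta), j] += g
        counts[np.arange(n_theta), j] += 1
    return thetas, sums * counts

# A vertical line of varying gray level (a "Dirac-like" TF component):
# its votes pile up at theta near 0 in the HR domain
img = np.zeros((32, 32))
img[:, 16] = np.linspace(1.0, 2.0, 32)
thetas, acc = hough_radon(img)
theta_peak = thetas[np.unravel_index(np.argmax(acc), acc.shape)[0]]
```

Thresholding the resulting accumulator (rather than binarizing the TFD beforehand) is what lets energy-varying chirps survive detection.<br />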

A. Hough-Radon Algorithm for Chirp Detection<br />

The computational attractiveness of the HT arises from subdivision of the Hough domain into accumulator cells. The cell at coordinates $(i, j)$, with accumulator value $A(i, j)$, corresponds to the square associated with the parameter coordinates $(\theta_i, \rho_j)$. Initially, the cells are set to zero.<br />

For every point $(t_k, \omega_k)$ in the TF plane, we let the parameter $\theta$ equal each of the allowed subdivision values on the $\theta$ axis and solve for the corresponding $\rho$ using Eqn. 1. The resulting $\rho$'s are then rounded off to the nearest allowed value on the $\rho$ axis. If a particular $\theta_i$ value results in the solution $\rho_j$, the corresponding accumulator $A(i, j)$ is incremented.<br />

At the end of the procedure, a value of $M$ in $A(i, j)$ corresponds to $M$ points in the TF plane lying on the line $t\cos\theta_i + \omega\sin\theta_i = \rho_j$. It is evident that more subdivisions in the Hough domain will lead to a more accurate estimate of collinear points, but at the expense of additional computational complexity. In this work, the full ranges of $\theta$ and $\rho$ were used.<br />

The main drawback of the HT is that it is usually performed on binary images, and hence may not be appropriate for gray-level images and TFDs. As energy values of chirps vary in the TF plane, they occupy different gray levels, with 255 corresponding to the highest scaled energy component. It is not appropriate to binarize the TF image, since the HT will then not be able to detect energy-varying chirp components. This drawback can be avoided by using the combined HRT.<br />

With the HRT, the algorithm is exactly similar to the one discussed earlier, except that instead of counting the number of collinear points in a cell, the gray values of collinear points are added to each cell and then multiplied with the total number of collinear points in that cell. In this way, chirp components appear as high-intensity points in the HR domain, and the brightness increases with the energy of the chirp.<br />

The HRT is a powerful way of determining directional elements (such as chirps) in gray-level images (such as TFDs), but lacks by itself the capability to eliminate components that do not contribute coherently to a particular directional pattern. The high-intensity components (features of interest) in the HR domain can be extracted by using gradient operators. A gradient operator that may be used in the Hough domain is the simple 3×3 mask shown below [7]:<br />

$\begin{bmatrix} 0 & -2 & 0 \\ 1 & 2 & 1 \\ 0 & -2 & 0 \end{bmatrix}$<br />

A drawback with this approach is that the filter was designed for detecting lines of one pixel width, and cannot be used to detect broad directional components. As chirps are normally broad components because of the inherent tradeoff between time and frequency resolution of TF transforms, the above filter may often fail.<br />

The problem discussed above can be overcome by not using a convolution mask, but instead using a method to detect the peaks corresponding to broad components by applying a suitable threshold in the HR domain. Values in cells greater than the threshold will be retained, and values lower than the threshold will be set to zero. The threshold is selected based on local statistics in the HR domain. First, the histogram (probability density function) of the HR image is calculated, and the mean is computed as $M = \frac{\sum_g g\, h(g)}{r \times c}$, where $h(g)$ is the frequency of occurrence of the $g$th gray level of the HR image with $r$ rows and $c$ columns. Then, the threshold is computed as<br />

threshold = signal factor $\times\ M$. (3)<br />

The signal factor is dependent on the type of the signal being analyzed.<br />

B. Mathematical Proof<br />

It can be mathematically shown that the HRT (or more generally, the RT) of a TFD provides maximum likelihood (ML) detection of a chirp signal. Wood and Barry [2] have stated that the RT of the general Wigner TFD is equivalent to the ML estimate of a chirp. In this paper, the above statement is mathematically proved, and the results can be directly extended to the interference-free TFD obtained using MP. The ML detection of a linear chirp is given by [1]:<br />

$L = \max_{\omega_0, m} \int_{-\infty}^{\infty} W(t, \omega_0 + mt)\, dt > \eta$, (4)<br />

where<br />

$W(t,\omega) = \frac{1}{2\pi} \int_{-\infty}^{\infty} x^*\!\left(t + \frac{\tau}{2}\right) x\!\left(t - \frac{\tau}{2}\right) e^{-j\omega\tau}\, d\tau$ (5)<br />



Fig. 2. Graphical illustration of the HRT of a TFD.<br />

is the Wigner distribution of signal $x(t)$, and $m$ is the slope of the path of integration in the TF plane as shown in Fig. 2. Geometrically, the line integration in Eqn. 4 is similar to the RT of $W(t,\omega)$ given by<br />

$\mathcal{R}\{W(t,\omega)\} = \int_{-\infty}^{\infty} W(\rho\cos\theta - s\sin\theta,\ \rho\sin\theta + s\cos\theta)\, ds$, (6)<br />

where $\mathcal{R}$ is the Radon operator. Using Eqn. 5 in Eqn. 6,<br />

$\mathcal{R}\{W(t,\omega)\} = \frac{1}{2\pi}\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} x^*(\rho\cos\theta - s\sin\theta + \tau/2)\, x(\rho\cos\theta - s\sin\theta - \tau/2)\, \exp(-j(\rho\sin\theta + s\cos\theta)\tau)\, d\tau\, ds$. (7)<br />

From Fig. 2, $m = -\cot\theta$ and $\omega_0 = \rho/\sin\theta$. For ML estimation of a chirp, $\omega = \omega_0 + mt$ from Eqn. 4. Therefore<br />

$\omega_0 + mt = \frac{\rho}{\sin\theta} - \frac{\cos\theta}{\sin\theta}(\rho\cos\theta - s\sin\theta) = \frac{\rho - \rho\cos^2\theta + s\sin\theta\cos\theta}{\sin\theta} = \rho\sin\theta + s\cos\theta$. (8)<br />

Now, transforming Eqn. 7 to Cartesian co-ordinates and using Eqn. 8, we get<br />

$\mathcal{R}\{W(t,\omega)\} = \frac{1}{2\pi}\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} x^*\!\left(t + \frac{\tau}{2}\right) x\!\left(t - \frac{\tau}{2}\right) \exp(-j(\omega_0 + mt)\tau)\, d\tau\, dt = \int_{-\infty}^{\infty} W(t, \omega_0 + mt)\, dt$. (9)<br />

This proves that the RT or the HRT of a general Wigner TFD (or the TFD obtained by MP) is equivalent to ML detection of chirps.<br />
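This equivalence can be checked numerically on a discrete TFD: the line-integral statistic of Eqn. 4, maximized over a small parameter grid, recovers the parameters of a synthetic chirp ridge (illustrative data, not from the paper):<br />

```python
import numpy as np

def line_integral(tfd, w0, m):
    """Discrete version of Eqn. 4: sum W(t, w0 + m*t) over time samples."""
    n_t, n_w = tfd.shape
    total = 0.0
    for t in range(n_t):
        w = int(round(w0 + m * t))
        if 0 <= w < n_w:
            total += tfd[t, w]
    return total

# Synthetic TFD containing one chirp ridge w = 5 + 2t over weak noise
rng = np.random.default_rng(1)
tfd = 0.01 * rng.random((20, 50))
for t in range(20):
    tfd[t, 5 + 2 * t] = 1.0

# Exhaustive ML search over a small (w0, m) grid
best = max(((line_integral(tfd, w0, m), w0, m)
            for w0 in range(10) for m in range(4)),
           key=lambda z: z[0])
_, w0_hat, m_hat = best
```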

C. Analysis of TF Resolution<br />

The proposed technique can also be used to evaluate<br />

the TF resolution provided by different TFDs. A test signal<br />

composed of a sine and a Dirac component is passed through<br />

the TFD generator (e.g., Choi-Williams, spectrogram), and<br />

the HRT of the output is computed. The TF resolution of<br />

Fig. 3. Testing the thresholded HR method with an image. (a) Original image, (b) HR image, (c) After applying the gradient mask operator to (b), and (d) After thresholding (b).<br />


Fig. 4. Directional concentration plots provided by two methods. (a)<br />

Gradient mask method, (b) Threshold method.<br />

the TFD can be evaluated by observing the peaks at 0 and 90<br />

degrees in the HR domain. A TFD with good time resolution<br />

will have a narrow component at 90 degrees, whereas a TFD<br />

with good frequency resolution will have a narrow component<br />

at 0 degree. At present there is no technique available to<br />

check the TF resolution provided by different TFDs. The<br />

proposed method should be a good tool to evaluate the TF<br />

resolution performance of different TF methods.<br />
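The resolution test described above amounts to measuring component widths for the sine and Dirac parts of the test signal. A simple discrete stand-in (our own formulation, measuring half-maximum widths of the TFD marginals on hypothetical sharp and smeared TFDs rather than in the HR domain itself):<br />

```python
import numpy as np

def width_at_half_max(profile):
    """Number of samples at or above half of the profile's maximum."""
    return int(np.sum(profile >= 0.5 * profile.max()))

def tf_resolution(tfd):
    """rows: time, cols: frequency.
    Time resolution: width of the Dirac component in the time marginal.
    Frequency resolution: width of the sine component in the frequency
    marginal. Smaller widths mean better resolution."""
    time_marginal = tfd.sum(axis=1)
    freq_marginal = tfd.sum(axis=0)
    return width_at_half_max(time_marginal), width_at_half_max(freq_marginal)

# Hypothetical TFDs of a sine + Dirac test signal: one sharp, one smeared
sharp = np.zeros((64, 64))
sharp[32, :] += 1.0      # Dirac at time sample 32
sharp[:, 20] += 1.0      # sine at frequency bin 20
blurred = np.zeros((64, 64))
for d in (-2, -1, 0, 1, 2):
    blurred[32 + d, :] += 0.5
    blurred[:, 20 + d] += 0.5
t_sharp, f_sharp = tf_resolution(sharp)
t_blur, f_blur = tf_resolution(blurred)
```

A high-resolution TFD (e.g., the MP-based one) would behave like `sharp`; a spectrogram with a long window would smear the Dirac and behave like `blurred` along time.<br />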

III. RESULTS<br />

Experiment 1: The proposed method was tested on the non-<br />

TF image shown in Fig. 3(a). The image has broad direc-<br />

tional features, comparable to what is expected with chirp<br />

signals in the TF plane. The HR image is shown in Fig.3(b),<br />

from which it is evident that the broad directional compo-<br />

nents in the test image correspond to bright, broad com-<br />

ponents in the HR domain. The HR image also has other<br />

less intense components that do not relate to the directional


Fig. 5. Results with synthetic signal 1. (a) Synthetic signal (x axis: time samples, y axis: amplitude), (b) TFD of (a) (x axis: time, y axis: frequency), (c) HR image (after thresholding), x axis: θ (0 to π), y axis: ρ.<br />

features. The less-intense components are reduced to some<br />

extent by using the 3x3 mask. The mask did not perform<br />

well in extracting the broad components. By thresholding<br />

the parameter space using the threshold as in Eqn.3, the<br />

broad components were extracted, as illustrated in Fig.3(d).<br />

Figs. 4(a) and 4(b) show the directional concentration of the<br />

HR distributions in Figs. 3(c) and 3(d). From Fig. 4(b) it is evi-<br />

dent that the directional components are well resolved by the<br />

threshold method as compared to the gradient mask method.<br />

Experiment 2: The proposed method of chirp detection was<br />

tested on two synthetic signals of known time and frequency<br />

dynamics. The synthetic signals were computed using a si-<br />

nusoid (frequency dynamics), and two chirps (TF dynamics)<br />

with some random noise. Both the synthetic signals had the<br />

above components, but with different time durations. Each<br />

signal was decomposed into TF atoms (Gabor functions) by<br />

using the MP algorithm, with the maximum number of TF<br />

atoms allowed being 200. The TFDs of the signals were com-<br />

puted by adding the Wigner distributions of the TF atoms.<br />

The TFDs of the two signals are shown in Figs.5 and 6. It<br />

is interesting to note that frequency dynamics (sine com-<br />

ponent) and TF dynamics (chirp components) are treated<br />

equally well by this representation. The TFD obtained us-<br />

ing MP is interference-free, and gives the best possible TF<br />

resolution among all the TFD methods available.<br />
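The MP-based TFD construction described above can be sketched numerically. The Gaussian-blob shape used here for each atom's Wigner distribution, and all atom parameters, are illustrative assumptions rather than the implementation used in the paper:<br />

```python
import numpy as np

def atom_wvd(t, f, t0, f0, scale):
    # Wigner distribution of a single Gabor atom modelled as a
    # cross-term-free Gaussian blob centred at (t0, f0); the fixed
    # frequency spread is chosen purely for display
    st, sf = scale, 0.02
    return np.exp(-0.5 * (((t - t0) / st) ** 2 + ((f - f0) / sf) ** 2))

def mp_tfd(atoms, n_t=256, n_f=256):
    # sum the atom Wigner distributions (weighted by squared MP
    # coefficients) to obtain an interference-free TFD
    t = np.linspace(0.0, 1.0, n_t)[None, :]
    f = np.linspace(0.0, 0.5, n_f)[:, None]
    tfd = np.zeros((n_f, n_t))
    for coeff, t0, f0, scale in atoms:
        tfd += coeff ** 2 * atom_wvd(t, f, t0, f0, scale)
    return tfd

# a chirp ridge approximated by a chain of atoms, plus a constant-frequency
# (sine) component -- both are treated identically by this representation
atoms = [(1.0, tc, 0.1 + 0.3 * tc, 0.05) for tc in np.linspace(0.1, 0.9, 20)]
atoms += [(0.8, tc, 0.25, 0.05) for tc in np.linspace(0.1, 0.9, 20)]
tfd = mp_tfd(atoms)
```

Because each atom contributes only its own blob, no cross terms appear between the chirp and the sine component.<br />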

The HRT was applied to the TFDs in Figs.5(b) and<br />

6(b). The two chirp components appear as bright spots in<br />

the HR image at angles of about 60° and 120°. The TFD<br />

of Fig.5(b) has broader components than that in Fig.6(b);<br />

this is because of the lower TF resolution of shorter-duration<br />

signals. This difference can be seen in Figs.5(c) and 6(c).<br />

Authorized licensed use limited to: <strong>Ryerson</strong> <strong>University</strong> Library. Downloaded on July 6, 2009 at 16:05 from IEEE Xplore. Restrictions apply.<br />

Fig. 6. Results with synthetic signal 2. (a) Synthetic signal, (b) TFD<br />

of (a), (c) HR image (after thresholding).<br />

IV. CONCLUSION<br />

A novel approach to detect chirps in unknown SNR<br />

environments has been proposed in this paper. The com-<br />

bined Hough and Radon transform of a TFD obtained using<br />

the MP method provides maximum likelihood detection of<br />

chirps. The problem of identifying directional components<br />

and their dynamics in the TF plane simplifies to locating<br />

high-intensity spots in the HR plane. The method could be<br />

extended to detect arbitrary patterns in the TF plane by<br />

using the generalized HT.<br />
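The core idea, locating directional TF components as high-intensity spots in a parameter plane, can be sketched with a plain rotate-and-sum Radon transform; the combined Hough-Radon transform and the generalized HT of the paper are not reproduced here, and the test image is made up:<br />

```python
import numpy as np
from scipy.ndimage import rotate

def radon(image, angles_deg):
    # discrete Radon transform by rotate-and-sum: each row of the output
    # is the projection of the image onto the x axis at one angle
    return np.stack([rotate(image, a, reshape=False, order=1).sum(axis=0)
                     for a in angles_deg])

def detect_line(tf_image, n_angles=180):
    # directional components map to high-intensity spots in the
    # (angle, offset) parameter plane; return the brightest one
    angles = np.linspace(0.0, 180.0, n_angles, endpoint=False)
    hr = radon(tf_image, angles)
    i, j = np.unravel_index(np.argmax(hr), hr.shape)
    return angles[i], j, hr

# synthetic TF image containing one straight (chirp-like) ridge
img = np.zeros((128, 128))
idx = np.arange(128)
img[idx, idx] = 1.0              # a diagonal line
theta, offset, hr = detect_line(img)
```

The diagonal ridge collapses into a single bright spot in the parameter plane, which is what makes peak picking sufficient for detection.<br />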

Acknowledgements: We gratefully acknowledge support from<br />

the Alberta Heritage Foundation for Medical <strong>Research</strong> and<br />

the Natural Sciences and Engineering <strong>Research</strong> Council of<br />

Canada.<br />

REFERENCES<br />

[1] S. Kay and G.F. Boudreaux-Bartels. On the optimality of the Wigner<br />
distribution for detection. IEEE ICASSP, pages 2331-2334, 1986.<br />
[2] J.C. Wood and D.T. Barry. Radon transformation of time-<br />
frequency distributions for analysis of multicomponent sig-<br />
nals. IEEE Transactions on Signal Processing, 42(11):3166-3177,<br />
November 1994.<br />
[3] S. Barbarossa. Analysis of multicomponent LFM signals by com-<br />
bined Wigner-Hough transform. IEEE Transactions on Signal Pro-<br />
cessing, 43(6):1511-1515, June 1995.<br />
[4] S. Barbarossa and O. Lemoine. Analysis of nonlinear FM signals<br />
by pattern recognition of their time-frequency representation. IEEE<br />
Signal Processing Letters, 3(4):112-115, April 1996.<br />
[5] S.G. Mallat and Z. Zhang. Matching pursuits with time-frequency<br />
dictionaries. IEEE Transactions on Signal Processing, 41(12):3397-<br />
3415, 1993.<br />
[6] R.C. Gonzalez and P. Wintz. Digital Image Processing. Addison-<br />
Wesley, Reading, MA, second edition, 1987.<br />
[7] W.A. Rolston. Directional image analysis. Master's thesis, Dept.<br />
of Elect. and Comp. Engg., The Univ. of Calgary, Calgary, AB,<br />
Canada, 1994.<br />


Spatial-Temporal Decorrelating Decision-Feedback Multiuser Detector<br />

for Synchronous Code-Division Multiple- Access Channels<br />

Sridhar Krishnan and Brent R. Petersen<br />

Dept. of Electrical and Computer Engineering, The University of Calgary,<br />

2500 <strong>University</strong> Drive NW, Calgary, Alberta T2N 1N4, CANADA.<br />

Email: {krishnan, bp}@enel.ucalgary.ca<br />

Abstract - In this paper, a new multiuser detector<br />

for synchronous code-division multiple-access channels is<br />

developed, and the performance is compared with other<br />

multiuser detectors. The proposed multiuser detector is<br />

based on spatial-temporal filtering and decision-feedback<br />

techniques, hence the name spatial-temporal decorrelat-<br />

ing decision-feedback (STDF) detector. An optimum<br />

STDF detector is expected to have an exponential com-<br />
plexity as the number of users grows. A suboptimum<br />

STDF detector shows a better performance in terms of<br />

probability of error (or SNR) and asymptotic efficiency<br />

as compared to the other suboptimum detectors. Simu-<br />

lation results under diverse channel conditions show that<br />

STDF is a bandwidth efficient technique, which is an es-<br />

sential requirement for modern wireless communications.<br />

Also the results indicate that STDF performance gains<br />

are more significant for relatively weak users.<br />

I. INTRODUCTION<br />

Multiuser communications has been an active area<br />

of research and has numerous applications, especially in<br />

the area of wireless communications. There are several<br />

different ways in which multiple users can communicate<br />

through the channel to the receiver. One method is to<br />

allow more than one user to share a channel or a sub-<br />

channel by use of a unique code sequence or signature<br />

waveforms that allows the user to spread the informa-<br />

tion signal across the assigned bandwidth. This multiple-<br />

access method is called the code-division multiple-access<br />

(CDMA). The objective of this work is to develop an effi-<br />

cient multiuser detector based on spatial filtering (beam-<br />

formers) and decision feedback for synchronous CDMA<br />

channels.<br />

In a CDMA system, the receiver may have an idea<br />

about the assigned signature waveforms and observes the<br />

sum of the transmitted signals in additive white Gaussian<br />

noise. The conventional single-user detector consists of a<br />

bank of single-user ma,tched filters followed by thresh-<br />

old detectors. If the assigned signature waveforms are<br />

orthogonal and if the powers of the users are not very<br />

different then the conventional detector would achieve<br />

optimum demodulation [l]. It has been shown that the<br />

optimum maximum likelihood receiver employing a gen-<br />

eralization of the Viterbi algorithm significantly outper-<br />

0-7803-3905-3/97/$10.00 © 1997 IEEE<br />

forms the conventional single-user detector at the expense<br />

of a high computational complexity [2]. Since these conditions<br />

are often difficult to satisfy in practice, several<br />

suboptimum detectors have been proposed in literature<br />

[1], [3], [4], [5].<br />

A. The Linear Decorrelating Detector<br />

The linear decorrelating detector, also known as the<br />
decorrelator, can significantly outperform the conven-<br />
tional single-user detector [1]. This is because the decor-<br />
relator takes into account the outputs of all the matched<br />
filters in making a decision, as opposed to the single<br />
matched-filter output used in the conventional detector. As the outputs of<br />

all the matched filters are considered in making decisions,<br />

the multiuser interference among users can be easily ex-<br />

ploited. The signal at the output of the matched filter is<br />

given as<br />

y = RWb + n, (1)<br />

where R is the crosscorrelation matrix of the signature waveforms, and<br />
W is a diagonal matrix with W_k,k = √E_k, k = 1, ..., K,<br />
E_k being the received energy of user k.<br />


The ordering of the users is explained in the next section. Also, a decorrelating<br />
decision-feedback (DF) detector for synchronous<br />
relating decision-feedback (DF) detector for synchronous<br />

and asynchronous CDMA channels have been proposed<br />

in the literature [3], [4]. Performance gains of DF detec-<br />

tors are more significant than those of the decorrelator, especially for<br />

relatively weak users [3], [4].<br />
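The decorrelator of Eqn. 1 can be illustrated numerically. The correlation, amplitude, and noise values below are made up; the matched-filter noise is drawn with covariance σ²R, and the weaker user's error rate is compared for the conventional detector and the decorrelator:<br />

```python
import numpy as np

rng = np.random.default_rng(0)

# two synchronous users; R models highly correlated signature waveforms
R = np.array([[1.0, 0.7],
              [0.7, 1.0]])
W = np.diag([1.0, 0.4])            # user 2 is the weaker user
b = rng.choice([-1.0, 1.0], size=(2, 5000))

sigma = 0.3
Lr = np.linalg.cholesky(R)
n = sigma * Lr @ rng.standard_normal(b.shape)   # noise covariance sigma^2 R
y = R @ W @ b + n                               # matched-filter outputs (Eqn. 1)

b_conv = np.sign(y)                        # conventional single-user detector
b_deco = np.sign(np.linalg.solve(R, y))    # decorrelator: R^{-1} y = W b + noise

err_conv = np.mean(b_conv[1] != b[1])      # weaker user's bit error rate
err_deco = np.mean(b_deco[1] != b[1])
```

With these values the strong user's interference exceeds the weak user's amplitude at the matched-filter output, so the decorrelator's error rate is markedly lower for the weak user, at the cost of noise enhancement by R⁻¹.<br />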

B. Spatial-Temporal Decorrelating (STD) Detector<br />

Integrated spatial-temporal processing of the re-<br />

ceived signal has been shown to provide significant per-<br />

formance improvement over the decorrelator [5]. In most<br />

cases the multiple users are distributed in space in such<br />

a way that they are intercepted at the detector from var-<br />

ious directions. By exploiting the signals' spatial dis-<br />

tribution (direction matrix) and the temporal properties<br />

(crosscorrelation matrix R) followed by linear decorrela-<br />

tion (decorrelator) a uniformly superior performance has<br />

been achieved. The signal at the output of the matched<br />

filter is given as<br />

y = MWb + n. (3)<br />

The above equation is similar to Eqn. 1 except for the<br />
matrix M, which here is the spatial-temporal<br />
crosscorrelation matrix. The next stage of processing is<br />

similar to the decorrelator and the output of the matrix<br />

filter in this case is<br />

ỹ = M⁻¹y = Wb + ñ, (4)<br />

where ñ is a Gaussian noise vector with the autocorrela-<br />
tion matrix σ²M⁻¹. Comparative results with<br />

other suboptimum detectors have shown superior perfor-<br />

mance gains of STD detectors [5]. The superior perfor-<br />

mance of STD detectors has been the motivation of this<br />

work, in which spatial filtering is combined with decision-<br />

feedback (STDF). The STDF detector is derived in the next<br />
section, and its performance is compared with the other<br />

suboptimum detectors.<br />

II. SPATIAL-TEMPORAL DECORRELATING<br />

DECISION-FEEDBACK (STDF) DETECTOR<br />

It has already been shown that for STD the complex-<br />

ity of the detector grows exponentially as the number of<br />

users increases [5]. Hence, the same complexity can be ex-<br />

pected for optimum STDF detectors. Here only a subop-<br />

timum STDF is considered. The suboptimum STDF de-<br />

tector is derived by exploiting the spatial-temporal cross-<br />

correlation matrix M given in Eqn.3. The matrix M can<br />

be written as the Hadamard or Schur product of matrices<br />

A and R.<br />

M = R ∘ (AᴴA), (5)<br />

where ∘ denotes the Hadamard product of matrices [6].<br />

A is the direction matrix comprising the direction vec-<br />
tors a₁, a₂, ..., a_K, i.e., A = [ a₁ a₂ ... a_K ], and the<br />

[Figure: block diagram of the detector (bank of matched filters followed by a spatial filter and feedback section); graphic not recoverable from the scan.]<br />

direction vector a_k = [ 1 e^{jθ_k2} ... e^{jθ_kP} ]ᵀ. The direction<br />

vector ak expresses the phases and gains of the P<br />

sensors relative to a reference sensor in the direction of<br />

arrival of the wavefront of user k. Aᴴ is the conjugate<br />
transpose of A. Analysis shows that R is a positive-definite<br />
matrix.<br />

Also, it can be easily shown that AᴴA is always a<br />
positive-definite matrix, which implies that M will also<br />
be a positive-definite matrix. If M is a positive-definite<br />

matrix, then it can be decomposed as<br />

M = LᴴL, (6)<br />

where Lᴴ is an upper-triangular matrix and L is a lower<br />

triangular matrix. The above method of matrix decom-<br />

position is known as the Cholesky decomposition [7]. The<br />

matrices Lᴴ and L correspond to causal and stable ma-<br />

trix filters. If the filters are represented as a spectrum<br />

(frequency domain representation), then the Cholesky de-<br />

composition can be viewed as a spectral factorization the-<br />

orem [8].<br />
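The construction of M and its Cholesky factorization can be sketched as follows. The steering-vector model (uniform linear array, half-wavelength spacing) and the angles are assumptions for illustration; note that numpy returns the lower factor C of M = C Cᴴ, whereas the paper writes the same factorization as M = LᴴL:<br />

```python
import numpy as np

def steering(theta, P):
    # direction vector of a P-sensor uniform linear array (half-wavelength
    # spacing assumed), phases relative to the reference sensor
    return np.exp(1j * np.pi * np.arange(P) * np.sin(theta))

R = np.array([[1.0, 0.7],
              [0.7, 1.0]])                    # temporal crosscorrelation
A = np.column_stack([steering(np.deg2rad(10.0), 4),
                     steering(np.deg2rad(60.0), 4)])

M = R * (A.conj().T @ A)                      # Hadamard product R o (A^H A)

# by Schur's product theorem M inherits positive definiteness from
# R and A^H A, so a Cholesky factorization exists
C = np.linalg.cholesky(M)
```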

In STDF, the sampled output y of the matched fil-<br />
ter is passed through the feedforward filter (Lᴴ)⁻¹, and<br />
the resulting output vector of the matrix filter (Lᴴ)⁻¹ is<br />
given as<br />

ỹ = (Lᴴ)⁻¹y = LWb + ñ, (7)<br />

where ñ is a white Gaussian noise vector whose autocorrelation<br />
matrix is diagonal with variance σ². Therefore, the<br />
feedforward filter (Lᴴ)⁻¹ is nothing but a whitening fil-<br />

ter. The model given in Eqn.7 is a white noise model of<br />

the CDMA channel. Also the expression given in Eqn.7<br />

makes analysis simpler. The kth component of the vector<br />


ỹ can be written as<br />

ỹ_k = L_k,k √E_k b_k + Σ_{j=1}^{k-1} L_k,j √E_j b_j + ñ_k.<br />

From the above equation it can be seen that the kth<br />
component has a desired term, an interference term, and<br />
a white noise term. With the interference removed by<br />
feedback, the SNR of the kth user is L_k,k² E_k / σ².<br />
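A toy simulation of the whitening-plus-decision-feedback chain built on Eqn. 7, using numpy's M = C Cᵀ Cholesky convention; with this convention the users are ordered by increasing energy, so the strongest user sits at the last index and is detected first (the mirror image of the paper's strongest-first ordering). All numerical values are made up:<br />

```python
import numpy as np

rng = np.random.default_rng(1)
K, nbits, sigma = 2, 5000, 0.3

M = np.array([[1.0, 0.6],
              [0.6, 1.0]])         # spatial-temporal crosscorrelation
W = np.diag([0.4, 1.0])            # user 0 weak, user 1 strong
b = rng.choice([-1.0, 1.0], size=(K, nbits))

C = np.linalg.cholesky(M)          # M = C C^T, C lower triangular
y = M @ W @ b + sigma * C @ rng.standard_normal((K, nbits))

# feedforward (whitening) filter: C^{-1} y = C^T W b + white noise
yt = np.linalg.solve(C, y)
U = C.T                            # upper triangular

# decision feedback: detect bottom-up, subtracting already-decided users
bhat = np.zeros_like(b)
for k in range(K - 1, -1, -1):
    interf = U[k, k + 1:] @ (W[k + 1:, k + 1:] @ bhat[k + 1:])
    bhat[k] = np.sign(yt[k] - interf)

err_weak = np.mean(bhat[0] != b[0])
err_strong = np.mean(bhat[1] != b[1])
```

The strongest user sees no feedback, so its error rate matches the STD case, while the weak user benefits from having the strong user's contribution cancelled before slicing.<br />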

The above model gives rise to the decision-<br />

feedback technique used in STDF. The users' signals<br />
in STDF are arranged in decreasing order of energy<br />
(E₁ ≥ E₂ ≥ ... ≥ E_K), where user 1 is the strongest<br />
user and user K the weakest.


Fig. 2. Probability of error curves for weaker user in a two-user system for different combinations of signal cross correlations and angular<br />

separations. Legend: corr- correlation, sep- separation, SD- spatial decorrelator, STDF- spatial-temporal decorrelating decision<br />

feedback, STD- spatial-temporal decorrelating, DF- decision feedback.<br />

received energy of the user divided by the power spectral<br />

density level (No) of the background thermal white Gaus-<br />

sian noise (not including interference from other users).<br />

In essence, the efficiency represents the performance loss<br />

due to multiuser interference. The desirable figure of<br />

merit is the asymptotic efficiency η_k, obtained as the back-<br />
ground Gaussian noise level goes to zero, i.e.,<br />

η_k = lim_{N₀→0} SNR_eff / SNR_act,<br />

which characterizes the underlying performance loss<br />

when the dominant impairment is the existence of other<br />

users rather than the additive channel noise.<br />

For the STDF detector, the expression in<br />
Eqn. 17 shows that the ideal probability of error does not depend on<br />
the noise level or the power of the interferers; the ideal<br />
asymptotic efficiency of STDF follows directly from that expression.<br />

In the following simulation experiments, the perfor-<br />

mance figures of the proposed STDF detector have been<br />

compared with that of the other suboptimum detectors.<br />

IV. SIMULATION RESULTS<br />

A. Signal-to-Noise Ratio<br />

For comparing the SNR of the multiuser detectors, two<br />

experiments were performed. In experiment 1, a two-user<br />

system was considered. For the two-user system different<br />

types of channel and direction matrix combinations were<br />

tried. The two channels considered were<br />

R1 has a low crosscorrelation factor, whereas the cross-<br />
correlations between the users' signature waveforms are<br />
relatively high in the case of R2. In simple words, R2 simu-<br />
lates a bandwidth-efficient channel. Also, two different<br />
spatial distributions of the users were considered.<br />

A1 corresponds to a low angular (13°) separation<br />

between the two users, whereas in case of A2 the users<br />

are separated by an angle of 67.5°.<br />

Fig.2 shows the probability of error graphs for the weaker<br />

user. In Fig.2(a) the users have low signal crosscorrela-<br />

tions and low angular separation between them. STDF<br />

performs slightly better than STD. Also, DF performs bet-<br />
ter than the decorrelator. The advantage of spatial filtering is<br />

clearly evident from the superior performances of STDF<br />

and STD over DF and decorrelator. The same observa-<br />

tions can be found in Fig.2(b), but the spatial decorrela-<br />

tor (SD) [5] shows some performance improvement, which<br />
is to be expected for highly separated users.<br />

Figs.2(c) and 2(d), corresponding to highly corre-<br />
lated user signature waveforms, show interesting results.<br />



Fig. 3. Probability of error curves for stronger user in a two-user system for different combinations of signal cross correlations and<br />

angular separations. Legend same as in Fig.2.<br />

The STDF detector clearly outperforms the other detec-<br />

tors. The results indicate that STDF can be used in<br />

bandwidth-efficient CDMA channels where the signature<br />

waveforms have significant crosscorrelations. In case of<br />

Fig.2(d) the SD shows a slight improvement over the<br />

decorrelator.<br />

Fig.3 shows the graphs for the stronger user of the<br />

two. The graphs clearly indicate that there is no perfor-<br />

mance difference between STDF and STD, and also be-<br />

tween DF and decorrelator. The only factor which makes<br />

STDF or STD better is the spatial filtering. As there is<br />

no feedback involved for the strongest user in STDF and DF,<br />
the error rates are identical to those of STD and the decor-<br />
relator, respectively (this agrees with theory). Also,<br />
in Fig.3(d) the SD shows a slightly better performance<br />

than the decorrelator, further confirming that in case of<br />

highly correlated and highly separated users, spatial fil-<br />

tering will make a significant contribution towards SNR<br />

improvement.<br />

In experiment 2, a four-user system was considered.<br />

The signal crosscorrelation matrix was given by<br />

     [ 1    0.5  0.4  0.2 ]<br />
R3 = [ 0.5  1    0.8  0.6 ]<br />
     [ 0.4  0.8  1    0.3 ]<br />
     [ 0.2  0.6  0.3  1   ]<br />

and the direction matrix of a sensor with respect to a<br />

reference sensor in the direction of arrival of the users'<br />

wavefronts is<br />

Fig.4 shows the probability of error curves for the<br />

above four-user system. The four-user system simulates a<br />

multiuser environment better than the two-user examples<br />

considered earlier. From the graphs, it is clearly evident<br />

that the STDF detector shows a better performance than<br />

the other suboptimum detectors considered in this work.<br />

As the users become stronger the performance difference<br />

between STDF and STD narrows down. The results con-<br />

firm that STDF is a very powerful detection technique<br />

for relatively weak users. Also, in case of the strongest<br />

user there is no performance difference between STDF<br />

and STD, which coincides with our earlier observations.<br />

B. Asymptotic Efficiency<br />

In experiment 3, the asymptotic efficiencies of differ-<br />
ent detectors were compared. As asymptotic efficiency<br />
basically measures the performance loss of a detector due<br />
to multiuser interference, a four-user system (R3 and<br />
A3) was considered: a four-user system with different<br />
signal crosscorrelations better simulates a hostile<br />
environment, with different levels of multiuser<br />
interference, than a two-user<br />

system. Fig.5 shows the histogram plot of asymptotic ef-<br />

ficiencies for all the four users. The plot shows that STDF<br />

is always more efficient than the other suboptimum de-<br />

tectors, except in the case of the strongest user, where STDF


[Fig. 4 panels (weakest user, 2nd strongest user; x axis: SNR in dB); graphics not recoverable from the scan.]<br />


18th Annual International Conference of the IEEE Engineering in Medicine and Biology Society, Amsterdam 1996<br />

4.2.2: Time-varying Analysis of Various Signals<br />

Screening of Knee Joint Vibroarthrographic <strong>Signal</strong>s by Statistical<br />

Pattern <strong>Analysis</strong> of Dominant Poles<br />

S. KRISHNAN¹, R.M. RANGAYYAN¹,², G.D. BELL²,³, C.B. FRANK²,³, K.O. LADLY³<br />

¹Dept. of Electrical and Computer Engineering, ²Dept. of Surgery, ³Sport Medicine Centre<br />

The <strong>University</strong> of Calgary, Alberta, T2N 1N4, CANADA. (Email : ranga@enel.ucalgary.ca)<br />

Abstract-<strong>Analysis</strong> of human knee joint vibration signals<br />

or vibroarthrographic (VAG) signals could lead to a non-<br />

invasive method for the diagnosis of cartilage pathol-<br />

ogy. In this study, the nonstationary VAG signals<br />

were adaptively segmented into locally stationary seg-<br />

ments. Autoregressive (AR) model coefficients were de-<br />

rived from the stationary segments by using the Burg-<br />

lattice method. The dominant poles of the models ex-<br />

tracted from the AR polynomials and a signal variability<br />

parameter were used as VAG signal features. The VAG<br />

signal features with a few relevant clinical parameters<br />

were used as feature vectors in statistical pattern classifi-<br />

cation experiments based on logistic regression analysis.<br />

The results indicated a classification accuracy of 81.7% in<br />

screening 90 VAG signals with no restriction imposed on<br />

the type of abnormal signals, and an accuracy of 93.7%<br />

in classifying 71 VAG signals with abnormal signals re-<br />

stricted to a specific type of articular cartilage pathology<br />

known as chondromalacia patella.<br />

I. INTRODUCTION<br />

Vibration signals emitted from human knee joints<br />

during normal movement of the leg, known as vi-<br />

broarthrographic (VAG) signals, are expected to be as-<br />

sociated with roughness, softening, or the state of lubri-<br />

cation of the cartilage surfaces, and may be useful indi-<br />

cators of early joint degeneration or disease. VAG sig-<br />

nal analysis could decrease the need for diagnostic use of<br />

arthroscopy. A variety of VAG signal analysis techniques<br />

have been proposed in the literature (for a review of pre-<br />

vious publications, please see Moussavi et al. [1]). The<br />

present work investigates, with a large database of sig-<br />

nals, the diagnostic potential of VAG based on pattern<br />

classification experiments performed using signal model<br />

parameters and a few clinical parameters as features.<br />

II. METHODS<br />

A. Data Acquisition<br />

In order to detect the VAG signal, a Dytran ac-<br />

celerometer (model 3115a) was placed on the surface of<br />

the skin at the mid-patella position of the knee, and the<br />

signal was recorded during swinging movement of the leg<br />

from 135° to 0° to 135° in a total time period of 4 s. An<br />

electronic goniometer was placed on the lateral side of the<br />

knee to measure the angle of motion. Before digitizing the<br />

signal at a sampling rate of 2.5 kHz and 12 bits/sample,<br />

the signal was amplified and conditioned using a 10 Hz<br />

to 1 kHz bandpass filter.<br />
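The signal-conditioning chain can be sketched with scipy; the paper specifies only the 2.5 kHz sampling rate and the 10 Hz to 1 kHz band, so the filter order and the zero-phase (filtfilt) filtering below are assumptions:<br />

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

fs = 2500.0                                  # 2.5 kHz sampling rate
# 10 Hz - 1 kHz bandpass conditioning (4th-order Butterworth assumed)
sos = butter(4, [10.0, 1000.0], btype="bandpass", fs=fs, output="sos")

t = np.arange(0.0, 4.0, 1.0 / fs)            # 4 s swing record
x = np.sin(2 * np.pi * 50.0 * t)             # toy in-band vibration
x += np.sin(2 * np.pi * 2.0 * t)             # out-of-band motion drift
y = sosfiltfilt(sos, x)                      # zero-phase conditioned signal
```

The slow 2 Hz drift (e.g., limb motion) is suppressed while the in-band 50 Hz component passes essentially unchanged.<br />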

Auscultation was performed during swinging move-<br />

ment of the leg by placing a stethoscope on the lateral,<br />

medial, and anterior surfaces of the knee. The sounds<br />

heard were categorized and coded along with the ap-<br />

proximate corresponding joint angle for use as features in<br />

classification experiments. For subjects who underwent<br />

arthroscopy, the location of the pathology was used to es-<br />

timate the joint angle at which the pathological surfaces<br />

would come in contact and contribute to the VAG signal.<br />

Two databases were used in this study: (1) Database<br />
A consists of 51 normal and 39 abnormal signals, with<br />
no restriction on the type of cartilage pathology; and (2)<br />
Database B, extracted from database A, consists of 51<br />
normal and 20 abnormal signals, with the abnormals re-<br />

stricted to chondromalacia patella only. (Chondromala-<br />

cia patella is a common type of articular cartilage pathol-<br />

ogy in which the cartilage softens, fibrillates, and finally<br />

the bone is exposed.)<br />

0-7803-3811-1/97/$10.00 © IEEE 968<br />

B. Feature Extraction<br />

Like many other biological signals, VAG signals are<br />

nonstationary. Hence, in order to apply standard sig-<br />

nal processing methods such as parametric modeling or<br />

spectral analysis, the signals have to be segmented into<br />

quasi-stationary segments. In this work, VAG signals were<br />

adaptively segmented into stationary segments by using a<br />

recursive least-squares lattice algorithm [2]. An example<br />

of a VAG signal of an abnormal subject with chondroma-<br />

lacia patella, along with the final segment boundaries, is<br />

illustrated in figure 1. The VAG signal segments were au-<br />

toregressive (AR) modeled using the Burg-lattice method<br />

[2]. The transfer function of the AR or the “all pole” filter<br />

may be written as<br />

H(z) = 1 / (1 + a₁z⁻¹ + a₂z⁻² + ... + a_M z⁻ᴹ), (1)<br />

where M is the model order, and aₖ are the AR coeffi-<br />

cients [3]. By factorizing the denominator, Eq. 1 can be<br />

rewritten as<br />

H(z) = 1 / [(z − p₁)(z − p₂)(z − p₃) ⋯ (z − p_M)], (2)<br />


Fig. 1. VAG signal of an abnormal subject with chondromalacia<br />

patella. The vertical lines represent segment boundaries. au:<br />

Arbitrary units.<br />

where p₁, p₂, ..., p_M are the complex poles of the model. A<br />

model order of 40 was used [2]. Since the model order was<br />

an even number, the poles occurred in conjugate pairs.<br />

The distance r of a pole from the origin in the complex<br />
z-plane determines its spectral bandwidth, f_B, as<br />

f_B = cos⁻¹[ ((1 + r²) − 2(1 − r)²) / (2r) ].<br />

Poles with a large r contribute to the dominant peaks<br />

in the signal spectrum [4]. The superior performance of<br />

poles in tracking the frequency or spectral behavior of a<br />

signal makes them an appropriate choice for parametric<br />

representation of signals with multi-peaked spectra, such<br />

as VAG signals.<br />
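A compact sketch of the Burg-lattice AR fit, the extraction of dominant poles, and the pole-bandwidth formula above; the test signal and the small model order are illustrative, not the paper's order-40 VAG models:<br />

```python
import numpy as np

def burg(x, order):
    # Burg-lattice AR estimate; returns [1, a1, ..., aM] for
    # A(z) = 1 + a1 z^-1 + ... + aM z^-M
    f = np.asarray(x, float).copy()
    b = f.copy()
    a = np.array([1.0])
    for _ in range(order):
        num = -2.0 * np.dot(f[1:], b[:-1])
        den = np.dot(f[1:], f[1:]) + np.dot(b[:-1], b[:-1])
        k = num / den                       # reflection coefficient, |k| < 1
        f, b = f[1:] + k * b[:-1], b[:-1] + k * f[1:]
        a = np.concatenate([a, [0.0]]) + k * np.concatenate([[0.0], a[::-1]])
    return a

def pole_features(a, n_dominant=3):
    poles = np.roots(a)
    poles = poles[poles.imag >= 0]          # one pole per conjugate pair
    poles = poles[np.argsort(-np.abs(poles))][:n_dominant]
    r = np.abs(poles)
    # spectral bandwidth of each pole; the clip keeps arccos in range
    # for poles far from the unit circle
    fb = np.arccos(np.clip(((1 + r**2) - 2 * (1 - r)**2) / (2 * r), -1.0, 1.0))
    return poles, r, fb

rng = np.random.default_rng(0)
n = np.arange(1000)
x = np.sin(2 * np.pi * 0.1 * n) + 0.05 * rng.standard_normal(1000)
a = burg(x, 6)
poles, r, fb = pole_features(a)
```

The dominant pole lands near the sinusoid's frequency with a radius close to 1, i.e., a narrow bandwidth, which is exactly what makes dominant poles good spectral-peak trackers.<br />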

C. Pattern <strong>Analysis</strong><br />

From the twenty poles (complex conjugate pole pairs<br />

were represented by only one pole from each pair) of the<br />

model of each VAG signal segment, six poles with the<br />

highest r were selected as the dominant poles. The six<br />

dominant poles; a signal variability parameter computed<br />

as the variance of the means (VM) of the segments of a<br />

VAG signal record; and a few clinical parameters such as<br />

the type of sound heard during auscultation, age, gender,<br />

and activity level of the subject were used to form feature<br />

vectors for use in classification experiments.<br />

The accuracy rate in classification of VAG signal<br />

segments into normal and abnormal groups was deter-<br />

mined by applying logistic regression analysis [5] on ran-<br />

dom splits of the databases into training and test sets.<br />

The VAG signal segments used in the test sets were to-<br />
tally different from and independent of the VAG signal<br />

segments used in the training sets. The overall accuracy<br />

rate was calculated as the percentage of the correctly-<br />

classified segments in the test set.<br />
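A minimal stand-in for the classification experiment: plain gradient-ascent logistic regression on synthetic feature vectors with a random training/test split. The paper used the SPSS logistic regression routine on real pole and clinical features; the data, dimensions, and optimizer here are assumptions:<br />

```python
import numpy as np

def train_logistic(X, y, lr=0.1, iters=3000):
    # gradient-ascent fit of a logistic model (bias + weights)
    Xb = np.hstack([np.ones((len(X), 1)), X])
    w = np.zeros(Xb.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-np.clip(Xb @ w, -30.0, 30.0)))
        w += lr * Xb.T @ (y - p) / len(y)
    return w

def accuracy(w, X, y):
    Xb = np.hstack([np.ones((len(X), 1)), X])
    return np.mean(((Xb @ w) > 0.0) == (y > 0.5))

rng = np.random.default_rng(0)
# synthetic stand-ins for the pole + clinical feature vectors
X = np.vstack([rng.normal(0.0, 1.0, (100, 7)),     # "normal" class
               rng.normal(1.0, 1.0, (100, 7))])    # "abnormal" class
y = np.r_[np.zeros(100), np.ones(100)]

idx = rng.permutation(200)              # random split: disjoint train/test
tr, te = idx[:140], idx[140:]
w = train_logistic(X[tr], y[tr])
acc = accuracy(w, X[te], y[te])
```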

III. RESULTS<br />

Several random split experiments were conducted<br />

with database A and database B. Table I shows<br />

TABLE I<br />
CLASSIFICATION ACCURACY WITH DATABASE A AND DATABASE B<br />

Database | Normal Segments | Abnormal Segments | Overall<br />
A        | 201/211 (95.3%) | 36/79 (45.6%)     | 237/290 (81.7%)<br />
B        | 188/195 (96.4%) | 35/43 (81.4%)     | 223/238 (93.7%)<br />

the best test results obtained. The use of poles instead<br />

of the AR model coefficients [2] has provided an increase<br />

in classification accuracy of about 2 to 3%.<br />

IV. DISCUSSION<br />

The results confirm that VAG signal analysis is in-<br />

deed a potential tool for noninvasive diagnosis of artic-<br />

ular cartilage pathology. The results with database B<br />

further indicate that the proposed methods have poten-<br />

tial in detecting chondromalacia patella with noninvasive<br />

procedures.<br />

The use of AR model poles has the advantage that<br />

the pole frequencies could be directly related to domi-<br />

nant frequency components present in the signals. Such a<br />

parametric representation of signals should facilitate bet-<br />

ter description and understanding of signal and system<br />

characteristics than the use of more abstract parameters<br />

such as the AR model coefficients.<br />

Future work will be directed towards wavelet analysis<br />

for improved feature analysis of the nonstationary VAG<br />

signals, which may overcome some of the approximations<br />

involved in our current segmentation-based approach.<br />

ACKNOWLEDGEMENT<br />

We gratefully acknowledge support of this project<br />

with grants from the Arthritis Society of Canada and the<br />

Alberta Heritage Foundation for Medical <strong>Research</strong>.<br />

REFERENCES<br />

[1] Z.M.K. Moussavi, R.M. Rangayyan, G.D. Bell, C.B. Frank,<br />
K.O. Ladly, and Y.T. Zhang. Screening of vibroarthrographic<br />
signals via adaptive segmentation and linear prediction model-<br />
ing. IEEE Transactions on Biomedical Engineering, 43:15-23,<br />
1996.<br />
[2] S. Krishnan. Adaptive filtering, modeling, and classification of<br />
knee joint vibroarthrographic signals. Master's thesis, Dept. of<br />
Electrical and Computer Engineering, The University of Cal-<br />
gary, Calgary, AB, Canada, April 1996.<br />
[3] S. Haykin. Adaptive Filter Theory. Prentice-Hall, Englewood<br />
Cliffs, NJ, 2nd edition, 1990.<br />
[4] O. Paiss and G.F. Inbar. Autoregressive modeling of surface<br />
EMG and its spectrum with application to fatigue. IEEE<br />
Transactions on Biomedical Engineering, BME-34(10):761-<br />
769, 1987.<br />
[5] SPSS Inc., Chicago, IL. SPSS Advanced Statistics User's<br />
Guide, 1990.<br />



RECURSIVE LEAST-SQUARES LATTICE-BASED ADAPTIVE<br />

SEGMENTATION AND AUTOREGRESSIVE MODELING OF KNEE JOINT<br />

VIBROARTHROGRAPHIC SIGNALS<br />

S. Krishnan¹, R.M. Rangayyan¹,², G.D. Bell²,³, C.B. Frank²,³, K.O. Ladly³<br />

¹Department of Electrical and Computer Engineering, ²Department of Surgery, ³Sport Medicine Centre<br />

The University of Calgary, Alberta, T2N 1N4, Canada. (Email : ranga@enel.ucalgary.ca)<br />

Abstract: Vibration signals emitted during movement of the knee, known as vibroarthrographic (VAG) signals, may bear diagnostic information. We propose a new adaptive segmentation technique based on the recursive least-squares lattice algorithm to segment the non-stationary VAG signals into locally-stationary components, which are then modeled using the autoregressive Burg-Lattice method. Classification of 90 VAG signals as normal or abnormal using the signal and clinical parameters provided an accuracy of 71.1% with the leave-one-out method. When the abnormal signals were restricted to chondromalacia patella only, the classification accuracy increased to 80.3%. The results indicate that VAG is a potential tool for non-invasive screening for chondromalacia patella.<br />

1 Introduction<br />

Based on the many investigations that have been carried<br />

out on vibroarthrographic (VAG) signal analysis in the<br />

past few years, there is good evidence to suggest that<br />

the VAG or knee joint sound signal has an exciting poten-<br />

tial for distinguishing between normal and abnormal car-<br />

tilage surfaces [1]. However, in previous studies on VAG,<br />

signal classification experiments were performed on a lim-<br />

ited number of signals. Using different adaptive signal<br />

processing techniques, the present work closely investi-<br />

gates the diagnostic potential of VAG based on extensive<br />

pattern classification experiments.<br />

In this paper, utilizing a reasonably large database<br />

of 90 subjects, the following approaches and techniques<br />

are addressed:<br />

• Improved adaptive segmentation of the non-<br />

stationary VAG signals using the recursive least-<br />

squares lattice (RLSL) algorithm;<br />

• Improved autoregressive (AR) modeling of VAG sig-<br />

nal segments using the Burg-Lattice method; and<br />

• Classification of VAG signals into two groups - Nor-<br />

mal and Abnormal - using logistic regression analysis<br />

and the leave-one-out method.<br />

CCECE'96 0-7803-3143-5/96/$4.00 © 1996 IEEE<br />


The proposed methods should be useful as clinical<br />

tools for diagnosis of cartilage pathology or as tests before<br />

arthroscopy or major surgery.<br />

2 Clinical Data Acquisition<br />

Subjects sat on a rigid table with both legs suspended,<br />

and repeatedly extended and flexed their legs at an approximate angular speed of 67°/s; the range of motion was<br />
approximately 135° to 0° to 135° in a total time period<br />
of 4 s [1]. It has been found that this rate of movement is<br />

the most comfortable rate for subjects to move their legs<br />

smoothly with consistency [2].<br />

Auscultation was performed during swinging move-<br />

ment of the leg by placing a stethoscope on the medial,<br />

lateral, and anterior surfaces of the knee. Sounds such as<br />

pops, clicks, grinds, and clunks heard during auscultation<br />

were coded along with the approximate corresponding<br />

joint angle for use as discriminant features in classification<br />

experiments. For patients who underwent arthroscopy,<br />

the position of the observed pathology was used to esti-<br />

mate the joint angle at which the affected surfaces could<br />

come into contact and contribute to VAG or sound sig-<br />

nals. For all subjects who participated in the study, the<br />

following information was also documented: age, gender,<br />

and number of times the subject exercised per week.<br />

3 VAG <strong>Signal</strong> Recording Setup<br />

The VAG signal was detected by a Dytran (Dytran,<br />

Chatsworth, CA) miniature accelerometer (model 3115a)<br />

placed on the skin over the mid-patella of the subject dur-<br />

ing dynamic movement of the knee. The signal was amplified and conditioned by a bandpass filter of bandwidth 10 Hz to 1 kHz using Gould (Gould, Cleveland, OH) isolation pre-amplifiers (model 11-5407-58) and Gould universal amplifiers (model 13-4615-18), and recorded on a Hewlett Packard (Hewlett Packard, San Diego, CA) instrumentation recorder (model 3968A). The bandpass filter minimizes low-frequency movement artifacts and also<br />

prevents aliasing effects. A National Instruments (Na-<br />

tional Instruments, Austin, TX) AT-MIO-16L data acquisition board and Lab Windows (National Instruments,<br />

Austin, TX) software on a Zenith (Zenith, Los Angeles,<br />

CA) 386 computer were used to digitize the signals at a<br />

sampling rate of 2.5 kHz and 12 bits/sample. The data<br />

were then transferred to a SUN (SUN, Cupertino, CA)<br />

Sparcstation for processing.<br />

An electronic goniometer to measure the angle of the<br />

limb during movement was placed on the lateral aspect of<br />

the knee with the axis of rotation at the joint line. The<br />

signal from the goniometer was converted after digitiza-<br />

tion to the real angle in degrees based on the voltage of<br />

the goniometer at 0' and 90'. In this study, two databases<br />

were used :<br />

• Database AB, which consists of VAG signals of 51<br />

normal subjects, includes historically normal as well<br />

as symptomatic subjects who underwent arthroscopy<br />

and were found to be normal, and VAG signals<br />

of 39 symptomatic subjects with arthroscopically-<br />

confirmed cartilage pathology; and<br />

• Database C, extracted from database AB, which con-<br />

sists of 51 normal VAG signals and 20 abnormal<br />

VAG signals (restricted to chondromalacia patella<br />

only). Among the 20 chondromalacia patella cases,<br />

17 had additional pathology such as meniscal tears<br />

and chondromalacia of the tibial plateau.<br />

4 Adaptive Segmentation<br />

VAG signals are recorded during swinging movement of<br />

the knee, over a range of motion of 135' to 0' (exten-<br />

sion) and 0' to 135' (flexion). This kind of movement<br />

causes the joint surfaces to rub against each other, and<br />

also against the under-surface of the patella. The regions<br />

of the joint surfaces coming in contact are different at<br />

each position during the swing. The contact area may<br />

not be the same for every swing even for the same angle<br />

position: further, the quality of the joint surfaces com-<br />

ing in contact may change with joint angle. This means<br />

that signals of different characteristics are expected at<br />

different joint angles. As the statistical characteristics of<br />

the VAG signals are time-variant, the signals are non-<br />

stationary in nature. Hence, in order to apply standard<br />

signal processing techniques such as parametric modeling<br />

or spectral analysis on VAG signals, the signals have to<br />

be first adaptively segmented into locally-stationary seg-<br />

ments or components.<br />

Adaptive segmentation of VAG signals has already<br />

been reported in the literature by Tavathia et al. [3]<br />

and Moussavi et al. [1]. The new adaptive segmenta-<br />

tion method developed in the present work is based on<br />

the RLSL algorithm. The advantage in using a lattice fil-<br />

ter for segmentation of VAG signals is that the statistical<br />


changes in the signals are well reflected in the filter pa-<br />

rameters, and hence segment boundaries can be detected<br />

by monitoring any one of the filter parameters such as<br />

the mean squared error, conversion factor, or the reflec-<br />

tion coefficients. Also, under certain circumstances, the<br />

required segmentation filter order can be predicted from<br />

the forward prediction error power. It was found that for<br />

VAG signals, the ensemble-averaged forward prediction<br />

error power (computed using 35 primary VAG signals)<br />

reaches a constant value after a model order of six. In<br />

this study, the conversion factor (γ) has been used to<br />
monitor statistical changes in the VAG signals. In a stationary environment, γ starts with an initial value of zero<br />
and remains small during the early part of the initialization period. After a few iterations, γ begins to increase<br />
rapidly towards a final value of unity [4]. In the case of<br />
non-stationary signals such as VAG, γ will fall from its<br />

steady value of unity whenever a change occurs in the<br />

statistics of the signal. This can be used in segmenting<br />

the VAG signal into locally-stationary components. The<br />

segmentation algorithm, in brief, is as follows:<br />

1. The VAG signal is passed twice through the segmen-<br />

tation filter. The first pass is used to allow the filter<br />

to converge, and the second pass is used to test the γ<br />

value at each sample with a preferred fixed threshold<br />

value for detection of segment boundaries.<br />

2. Whenever γ at a particular time sample during the<br />
second pass is less than the threshold value, a primary segment boundary (PSB) is marked.<br />

3. If the difference between a PSB and the previous<br />

PSB of the same signal is greater than or equal to the<br />

minimum desired segment length of 120 data points<br />

[1], the PSB is marked as a final segment boundary;<br />

if not, the PSB is deleted and the process continued<br />

until all the PSBs are tested.<br />
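The boundary-marking logic of steps 2 and 3 can be sketched as follows. This is a simplified, single-pass illustration in our own notation (not the authors' code): it assumes the conversion-factor series γ has already been produced by the second pass of the RLSL filter, and it tests each candidate boundary against the last accepted boundary.<br />

```python
def segment_boundaries(gamma, threshold, min_length=120):
    """Mark a primary segment boundary (PSB) wherever gamma drops below
    the fixed threshold; keep a PSB as a final boundary only if it lies
    at least min_length samples after the previously kept boundary."""
    boundaries = [0]  # a segment implicitly starts at sample 0
    for n, g in enumerate(gamma):
        if g < threshold and n - boundaries[-1] >= min_length:
            boundaries.append(n)
    return boundaries
```

With the minimum segment length of 120 samples used in the paper, closely spaced dips in γ collapse into a single boundary.<br />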

Test results of the RLSL-based adaptive segmenta-<br />

tion method on different non-stationary synthetic signals<br />

indicated the high efficiency of this method in detecting<br />

rapid and gradual changes in signals [5]. The main advan-<br />

tage of the new method of adaptive segmentation is that<br />

the threshold is a fixed value as opposed to a variable<br />

value that was adopted in the previous study of Moussavi et al. [1]. For some signals, especially normal VAG<br />

signals, it was found that the adaptive segmentation pro-<br />

cedure gave almost the same results as manual segmen-<br />

tation based on auscultation and/or arthroscopy. Figure<br />

2 shows the plot of γ for the corresponding VAG signal<br />

in figure 1. The dashed lines in figure 1 show the final<br />

segment boundaries for the corresponding VAG signal.<br />

On the average, eight segments per VAG signal were ob-<br />

tained.


[Plot: VAG signal amplitude (au), roughly -300 to 300, versus time samples 0 to 10000.]<br />

Figure 1: VAG signal of an arthroscopically abnormal<br />

subject (chondromalacia grade III) with the final segment<br />

boundaries shown by vertical dashed lines. The final seg-<br />

ment boundaries were determined by the RLSL adaptive<br />

segmentation algorithm. au: Arbitrary units.<br />

[Plot: conversion factor γ, roughly 0.995 to 0.996, versus number of iterations 0 to 10000.]<br />

Figure 2: Plot of the conversion factor (γ) for the abnor-<br />

mal VAG signal shown in figure 1. The horizontal dashed<br />

line is the fixed threshold line.<br />


5 Autoregressive Modeling<br />

Modeling techniques such as autoregressive (AR) mod-<br />

eling, also referred to as “all-pole” modeling, provide<br />

parameters which could potentially be correlated with<br />

the physiological system producing the signals. The AR<br />

model is a linear, second-moment stationary model. Al-<br />

though VAG signals are neither linear nor stationary,<br />

second-moment stationarity holds over VAG signal segments. Hence, appropriate analysis of VAG segments<br />
may be based on an AR model to extract all linearly-<br />

retrievable information from the signal in a minimum-<br />

variance manner. Some of the common ways to estimate<br />

the AR parameters are the autocorrelation or the Yule-Walker method [6], covariance method [6], Cholesky decomposition method [4], least-squares method [4], and the Burg-Lattice method [4]. In this study on VAG signals, an AR modeling method based on the Burg-Lattice algorithm is investigated.<br />

The Burg-Lattice method was applied to stationary<br />

VAG signal segments and the AR prediction coefficients<br />

were derived. The model order used was 40. This order<br />

was chosen based on application of the Akaike Informa-<br />

tion Criterion (AIC), and models of this order were ob-<br />

served to predict the VAG signal segments well [l]. How-<br />

ever, a performance analysis of AR model coefficients in<br />

terms of the classification accuracy rate indicated that<br />

the first six AR coefficients of VAG signal segments are<br />

adequate for pattern classification of VAG signals [5].<br />
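As an illustration of the Burg-Lattice estimation step itself, the following is a minimal NumPy sketch in our own notation (not the authors' implementation); it returns the prediction polynomial [1, a1, ..., ap] of the all-pole model x[n] + a1·x[n-1] + ... + ap·x[n-p] = e[n].<br />

```python
import numpy as np

def burg_ar(x, order):
    """Burg (lattice) estimate of AR coefficients; returns [1, a1, ..., ap]."""
    x = np.asarray(x, dtype=float)
    N = len(x)
    f = x.copy()  # forward prediction errors
    b = x.copy()  # backward prediction errors
    a = np.array([1.0])
    for m in range(1, order + 1):
        fm = f[m:].copy()           # forward errors available at stage m
        bm = b[m - 1:N - 1].copy()  # one-sample-delayed backward errors
        # reflection coefficient minimizing forward + backward error power
        k = -2.0 * np.dot(fm, bm) / (np.dot(fm, fm) + np.dot(bm, bm))
        f[m:] = fm + k * bm         # lattice order-update of the errors
        b[m:] = bm + k * fm
        # Levinson-Durbin update of the prediction polynomial
        a = np.concatenate((a, [0.0])) + k * np.concatenate(([0.0], a[::-1]))
    return a
```

The reflection coefficients k computed at each stage are the same lattice parameters whose changes the segmentation filter of Section 4 monitors.<br />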

6 VAG Pattern Classification<br />

As described in the previous section, the AR prediction<br />

coefficients were derived by modeling each VAG segment<br />

by the Burg-Lattice method. One of the obvious visual<br />

differences between normal and abnormal signals was that<br />

the abnormal signals were much more variable in ampli-<br />

tude across a swing cycle than the normal signals. How-<br />

ever, this difference is lost in the process of dividing the<br />

signals into segments and considering each segment as a<br />

separate signal. To overcome this problem, the means<br />

(time averages) of the segments of each subject’s signal<br />

were computed, and the variance of these means (VM)<br />

was computed across the various segments of the same<br />

signal. The first six AR model coefficients, the VM pa-<br />

rameter, and a few clinical parameters such as sound,<br />

age, gender, and activity level were used as discriminant<br />

features in the classification experiments.<br />
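The VM parameter can be computed directly; a small sketch in our own notation, where `segments` holds the locally-stationary segments of one subject's signal:<br />

```python
import numpy as np

def variance_of_means(segments):
    """Variance, across segments, of the per-segment means (time averages)
    of one subject's VAG signal -- the VM discriminant feature."""
    means = np.array([np.mean(seg) for seg in segments])
    return float(np.var(means))
```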

In this study, the classification of signals was done<br />

using the logistic analysis subroutine available in the Sta-<br />

tistical Package for Social Sciences (SPSS) software [7],<br />

and the leave-one-out method [8] was used to estimate<br />

the correct classification accuracy rate. In applying this<br />

method, all the segments of the VAG signal of one sub-<br />



Table 1: Comparison of different classification experi-<br />

ments by using the accuracy rates determined by applying<br />

the leave-one-out method, and the best test classification<br />

results obtained with the random split method.<br />

ject were excluded from the database, the classifier was<br />

designed with the segments of the VAG signals of the re-<br />

maining subjects, and then the VAG signal segments of<br />

the excluded subject were tested by the classifier. This<br />

operation was repeated to test all the subjects in each<br />

database. If segments spanning more than 10% of the<br />

duration of a subject’s signal were classified as abnormal,<br />

the subject was labeled as an abnormal subject; other-<br />

wise the subject was labeled as a normal subject. The<br />

number of correctly-classified subjects was then counted<br />

to estimate the classification accuracy rate. Since each<br />

test subject is excluded from the training sample set in<br />

turn, independence between the test set and the training<br />

set is maintained.<br />
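The subject-level decision rule just described can be sketched as follows (our own notation; the per-segment labels would come from the logistic-regression classifier, which we do not reproduce here):<br />

```python
def label_subject(segment_lengths, segment_labels, frac=0.10):
    """Label a subject abnormal (1) if segments classified abnormal span
    more than `frac` of the total signal duration, else normal (0)."""
    total = sum(segment_lengths)
    abnormal = sum(n for n, lab in zip(segment_lengths, segment_labels) if lab == 1)
    return 1 if abnormal > frac * total else 0
```

Note that a subject whose abnormal segments span exactly 10% of the signal is labeled normal, since the rule requires more than 10%.<br />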

Further, in another procedure, the accuracy rate in<br />

classifying the VAG signal segments into two groups was<br />

determined by applying logistic analysis on random splits<br />

of the databases into training and test sets. The VAG<br />

signal segments used in the test sets were totally different<br />

and independent from the VAG signal segments used in<br />

the training sets. The overall accuracy rate of a training<br />

or a test set was given as the percentage of the number of<br />

correctly-classified segments in the training/test stage to<br />

the total number of segments in the training/test stage.<br />

7 Results<br />

Table 1 shows the classification results with database AB<br />

and database C. Several random split experiments were<br />

conducted [5], and Table 1 shows the best test classifi-<br />

cation results obtained with the random split method.<br />

From the results of the leave-one-out and random split<br />

methods, we can infer that the proposed method shows a<br />

better classification result with database C, and is sensi-<br />

tive to chondromalacia patella cases.<br />

8 Discussion and Further Work<br />

Substantial numbers of normal and abnormal VAG sig-<br />

nals were analyzed in this work, and the results confirm<br />
that VAG signal analysis is indeed a potential tool<br />

for non-invasive diagnosis of articular cartilage pathology.<br />


Also, the proposed method has shown tremendous po-<br />

tential in detecting chondromalacia patella (results with<br />

database C) with non-invasive procedures.<br />

Future work will be directed towards time-<br />

scale/time-frequency analysis for improved feature anal-<br />

ysis of the non-stationary VAG signals, which may over-<br />

come the approximations involved in our current para-<br />

metric approach and the difficulties in segment-based<br />

analysis, and could lead to improved identification of dif-<br />

ferent types of cartilage pathology.<br />

Acknowledgements<br />

We gratefully acknowledge support of this project<br />

with grants from the Arthritis Society of Canada and the<br />

Alberta Heritage Foundation for Medical <strong>Research</strong>.<br />

References<br />


[1] Z.M.K. Moussavi, R.M. Rangayyan, G.D. Bell, C.B.<br />

Frank, K.O. Ladly, and Y.T. Zhang. Screening of<br />

vibroarthrographic signals via adaptive segmentation<br />

and linear prediction modeling. IEEE Transactions<br />

on Biomedical Engineering, 43(1):15-23, 1996.<br />

[2] K.O. Ladly. <strong>Analysis</strong> of patellofemoral joint vibration<br />

signals. Master’s thesis, The <strong>University</strong> of Calgary,<br />

Calgary, AB, Canada, 1992.<br />

[3] S. Tavathia, R.M. Rangayyan, C.B. Frank, G.D. Bell,<br />

K.O. Ladly, and Y.T. Zhang. <strong>Analysis</strong> of knee vi-<br />

bration signals using linear prediction. IEEE Trans-<br />

actions on Biomedical Engineering, 39(9):959-970,<br />

1992.<br />

[4] S. Haykin. Adaptive filter theory. Prentice-Hall, En-<br />

glewood Cliffs, N.J., 2nd edition, 1990.<br />

[5] S. Krishnan. Adaptive filtering, modeling, and classi-<br />

fication of knee joint vibroarthrographic signals. Mas-<br />

ter’s thesis, Submitted, Dept. of Electrical and Com-<br />

puter Engineering, The <strong>University</strong> of Calgary, Cal-<br />

gary, AB, Canada, April 1996.<br />

[6] J. Makhoul. Linear prediction: A tutorial review.<br />

Proc. IEEE, 63(4):561-580, 1975.<br />

[7] SPSS Inc., Chicago, IL. SPSS Advanced Statistics<br />

User’s Guide, 1990.<br />

[8] K. Fukunaga. Introduction to Statistical Pattern<br />

Recognition. Academic Press, Inc., San Diego, CA.,<br />

2nd edition, 1990.


Other Refereed Conference Papers<br />

T. Tabatabaei, S. Krishnan and A. Guergachi, Speech-based emotion recognition using<br />

sequence discriminant Support Vector Machines, 4 pages in CDROM Proc. Canadian Medical<br />

and Biological Engineering Conference (CMBEC), Toronto, Ontario, May 2007.<br />

O. Nedjah, A. Hussein, S. Krishnan, R. Sotudeh, CN tower lightning current derivative Heidler<br />

model for the validation of wavelet de-noising algorithm, In Proc. 18th International Wroclaw<br />

Symposium and Exhibition on Electromagnetic Compatibility, Wroclaw, Poland, pp:282 – 287,<br />

June 2006.<br />

A. Morrison, S. Krishnan, A. Anpalagan and B. Grush, Receiver autonomous mitigation of GPS<br />

non-line-of-sight multipath errors, 6 pages in Proc. ION National Technical Meeting, Monterey,<br />

California, January 2006.<br />

A. Ramalingam and S. Krishnan, Video fingerprinting using space-time and Gaussian mixture<br />

models, 4 pages in Proc. Canadian Workshop on Information Technology (CWIT), Montreal,<br />

Quebec, June 2005.<br />

K. Momen, S. Krishnan, D. Beal, E. Bouffet, B. Kavanagh, T. Chau, Self-organization of the<br />

communication space based on user range-of-motion: a framework for configuring non-contact<br />

augmentative communication devices, 4 pages in Proc. Canadian Medical And Biological<br />

Engineering Conference, Quebec City, Quebec, September 9-11, 2004<br />

J. Lukose and S. Krishnan, EEG signal analysis for screening alcoholics, 4 pages in Proc.<br />

International Dynamics of Continuous, Discrete, and Impulsive Systems (DCDIS) Conference,<br />

Guelph, Ontario, May 2003.<br />

K. Umapathy and S. Krishnan, Pathological voice screening using local discriminant bases,<br />

4 pages in Proc. International Dynamics of Continuous, Discrete, and Impulsive<br />

Systems (DCDIS) Conference, Guelph, Ontario, May 2003.<br />

S. Erkucuk, S. Krishnan and M. Zeytinoglu, A novel technique for digital audio watermarking,<br />

Student Contest Presentation at the IEEE International Conference on Multimedia and Expo<br />

(ICME), Lausanne, Switzerland, August 2002. (Won the IBM T.J. Watson <strong>Research</strong> award for<br />

innovative ideas)<br />

K. Umapathy, S. Krishnan, and S. Jimaa, Time-frequency analysis of wideband speech and<br />

audio, 2 pages in Proc. Micronet Annual Workshop, Aylmer, Quebec, April 2002.<br />

S. Krishnan, R.M. Rangayyan, and K. Umapathy, A time-frequency approach for auditory<br />

display of time-varying signals, in Proc. IASTED International Conference on <strong>Signal</strong> and Image<br />

Processing, Hawaii, USA, pp 236-241, August 2001.<br />

K. Umapathy and S. Krishnan, Joint time-frequency coding of audio signals, in Proc. 5th<br />

WSES/IEEE Multiconference on Circuits, Systems, Communications, and Computers, Crete,<br />

Greece, pp 32-36, July 2001.<br />



K. Umapathy and S. Krishnan, Low bit-rate time-frequency coding of wideband audio signals, in<br />

Proc. IASTED International Conference on <strong>Signal</strong> Processing, Pattern Recognition and<br />

Applications, Rhodes, Greece, pp 101-105, July 2001.<br />

R.M. Rangayyan, S. Krishnan, G.D. Bell, and C.B. Frank, Computer-aided auscultation of knee<br />

joint vibration signals. In Proc. European Medical and Biological Engineering Conference,<br />

Vienna, Austria, pp: 464-465, Nov. 1999.<br />

S. Krishnan and R.M. Rangayyan, Knee joint vibration signal analysis using adaptive time-frequency<br />

distributions. In Proc. European Medical and Biological Engineering Conference,<br />

Vienna, Austria, pp: 466-467, Nov. 1999.<br />

S. Krishnan and R.M. Rangayyan, Feature identification in the time-frequency distributions of<br />

knee joint vibroarthrographic signals using Hough and Radon transforms. In Proc. International<br />

Conference on Robotics, Vision, and Parallel Processing, Tronoh, Malaysia, pp: 82-89, July<br />

1999.<br />

R.M. Rangayyan, S. Krishnan, G.D. Bell, C.B. Frank, and K.O. Ladly, Impact of muscle<br />

contraction interference cancellation on vibroarthrographic screening, Proc. International<br />

Conference on Biomedical Engineering, Kowloon, Hong Kong, pp 16-19, June 1996. (invited<br />

paper)<br />

S. Krishnan, R.M. Rangayyan, G.D. Bell, C.B. Frank, and K.O. Ladly, Adaptive segmentation<br />

and cepstral analysis of vibroarthrographic signals for non-invasive diagnosis of knee joint<br />

cartilage pathology, Proc. 22nd Canadian Medical and Biological Engineering Conference,<br />

Charlottetown, PEI, Canada, pp 8-9, June 1996.<br />

N. Kumaravel and S. Krishnan, Knowledge based biosignal processing system for diagnosing<br />

heart disorders, Proc. International Conference on Robotics, Vision, and Parallel Processing,<br />

Ipoh, Malaysia, pp 602-609, May 1994.<br />

S. Krishnan, An expert diagnostic system using signal processing tool, in Proc. International<br />

conference on expert systems for development, Bangkok, Thailand, March 1994.<br />

