Signal Analysis Research (SAR) Group - RNet - Ryerson University

Signal Analysis Research (SAR) Group

Refereed Conference Papers

May 1996 - December 2007


Table of Contents

2007.12  Construction of Discriminative Positive Time-Frequency Distributions (pp. 1-5)
         K. Umapathy and S. Krishnan

2007.10  Combining Vocal Source and MFCC Features for Enhanced Speaker Recognition Performance Using GMMs (pp. 6-9)
         D. Hosseinzadeh and S. Krishnan

2007.08  Multiresolution Analysis and Classification of Small Bowel Medical Images (pp. 10-13)
         A. Khademi and S. Krishnan

2007.06  Interference Detection in Spread Spectrum Communication Using Polynomial Phase Transform (pp. 14-19)
         R. Zarifeh, S. Krishnan, A. Anpalagan and N. Alinier

2007.05  Emotion Recognition Using Novel Speech Signal Features (pp. 20-23)
         T.S. Tabatabaei, S. Krishnan and A. Guergachi

2007.04  A Watermarking Method for Speech Signals Based on the Time-Warping Signal Processing Concept (pp. 24-27)
         C. Ioana, A. Jarrot, A. Quinquis and S. Krishnan

2006.12  Chirp-Based Image Watermarking as Error-Control Coding (pp. 28-31)
         B. Ghoraani and S. Krishnan

2006.07  Automatic Content-Based Image Retrieval Using Hierarchical Clustering Algorithms (pp. 32-37)
         K. Jarrah, S. Krishnan and L. Guan

2006.07  Computational Intelligence Techniques and their Applications in Content-Based Image Retrieval (pp. 38-41)
         K. Jarrah, M. Kyan, S. Krishnan and L. Guan

2006.07  Discrete Polynomial Transform for Digital Image Watermarking Application (pp. 42-45)
         L. Le, S. Krishnan and B. Ghoraani

2006.05  Improving Position Estimates From a Stationary GNSS Receiver Using Wavelets and Clustering (pp. 46-50)
         M. Aram, B. Li, S. Krishnan and A. Anpalagan

2006.05  Keystroke Identification Based on Gaussian Mixture Models (pp. 51-54)
         D. Hosseinzadeh, S. Krishnan and A. Khademi

2006.05  Soccer Video Retrieval Using Adaptive Time-Frequency Methods (pp. 55-58)
         J. Marchal, C. Ioana, E. Radoi, A. Quinquis and S. Krishnan


2006.05  Support Vector Machines Based Approach for Chemical Phosphorus Removal Process in Wastewater Treatment Plant (pp. 59-63)
         T.S. Tabatabaei, T. Farooq, A. Guergachi and S. Krishnan

2005.11  Data Embedding in Miu-Law Speech with Spread Spectrum Techniques (pp. 64-67)
         L. Zhang, H. Ding and S. Krishnan

2005.09  Comparison of JPEG 2000 and Other Lossless Compression Schemes for Digital Mammograms (pp. 68-71)
         A. Khademi and S. Krishnan

2005.07  Gaussian Mixture Modeling Using Short Time Fourier Transform Features for Audio Fingerprinting (pp. 72-75)
         A. Ramalingam and S. Krishnan

2005.05  Multipath Mitigation of GNSS Carrier Phase Signals for an On-Board Unit for Mobility Pricing (pp. 76-79)
         R. Puri, A. El Kaffas, A. Anpalagan, S. Krishnan and B. Grush

2005.03  A Signal Classification Approach Using Time-Width VS Frequency Band Sub-Energy Distributions (pp. 80-83)
         K. Umapathy and S. Krishnan

2005.03  Indexing of NFL Video Using MPEG-7 Descriptors and MFCC Features (pp. 84-87)
         S.G. Quadri, S. Krishnan and L. Guan

2004.12  Audio Signal Feature Extraction and Classification Using Local Discriminant Bases (pp. 88-92)
         K. Umapathy, S. Krishnan and R.K. Rao

2004.05  A Novel Robust Image Watermarking Using a Chirp Based Technique (pp. 93-96)
         A. Ramalingam and S. Krishnan

2004.05  A Novel Way of Lossless Compression of Digital Mammograms Using Grammar Codes (pp. 97-100)
         X. Li, S. Krishnan and N.-W. Ma

2004.05  Content Based Audio Classification and Retrieval Using Joint Time-Frequency Analysis (pp. 101-104)
         S. Esmaili, S. Krishnan and K. Raahemifar

2004.05  Modified Local Discriminant Bases and its Applications in Signal Classification (pp. 105-108)
         K. Umapathy and S. Krishnan

2004.05  Radio Over Multimode Fiber for Wireless Access (pp. 109-112)
         R. Yuen, X.N. Fernando and S. Krishnan


2004.05  Sub-Dictionary Selection Using Local Discriminant Bases Algorithm for Signal Classification (pp. 113-116)
         K. Umapathy, S. Krishnan and A. Das

2003.09  Ultrasound Backscatter Signal Characterization and Classification Using Autoregressive Modeling and Machine Learning Algorithms (pp. 117-120)
         N.R. Farnoud, M. Kolios and S. Krishnan

2003.07  Robust Audio Watermarking Using a Chirp Based Technique (pp. 121-124)
         S. Erkucuk, S. Krishnan and M. Zeytinoglu

2003.07  Time-Frequency Filtering of Interferences in Spread Spectrum Communications (pp. 125-128)
         S. Erkucuk and S. Krishnan

2003.05  A General Perceptual Tool for Evaluation of Audio Codecs (pp. 129-132)
         K. Umapathy, S. Krishnan and G. Sinanian

2003.05  Non-Stationary Noise Cancellation in Infrared Wireless Receivers (pp. 133-137)
         S. Krishnan, X. Fernando and H. Sun

2003.04  Adaptive Denoising at Infrared Wireless Receivers (pp. 138-146)
         X.N. Fernando, S. Krishnan, H. Sun and K. Kazemi-Moud

2003.04  Audio Watermarking Using Time-Frequency Characteristics (pp. 147-151)
         S. Esmaili, S. Krishnan and K. Raahemifar

2002.10  Time-Frequency Modeling and Classification of Pathological Voices (pp. 152-153)
         K. Umapathy, S. Krishnan, V. Parsa and D. Jamieson

2002.08  Audio Signal Classification Using Time-Frequency Parameters (pp. 154-157)
         K. Umapathy, S. Krishnan and S. Jimaa

2002.05  Detection of Linear Chirp and Non-Linear Chirp Interferences in a Spread Spectrum Signal by Using Hough-Radon Transform (p. 158)
         S. Thayilchira and S. Krishnan

2002.05  Discrimination of Pathological Voices Using an Adaptive Time-Frequency Approach (pp. 159-162)
         K. Umapathy, S. Krishnan, V. Parsa and D. Jamieson

2002.05  Interference Excision in Spread Spectrum Communications Using Adaptive Positive Time-Frequency Distributions (p. 163)
         S. Erkucuk and S. Krishnan

2001.05  Fixed Block-Based Lossless Compression of Digital Mammograms (pp. 164-169)
         M.Y. Al-Saiegh and S. Krishnan


2001.05  Instantaneous Mean Frequency Estimation Using Adaptive Time-Frequency Distributions (pp. 170-175)
         S. Krishnan

2000.07  Sonification of Knee-joint Vibration Signals (pp. 176-179)
         S. Krishnan, R.M. Rangayyan, G.D. Bell and C.B. Frank

1999.05  Denoising Knee Joint Vibration Signals Using Adaptive Time-Frequency Representations (pp. 180-185)
         S. Krishnan and R.M. Rangayyan

1998.10  Comparative Analysis of the Performance of the Time-Frequency Distributions with Knee Joint Vibroarthrographic Signals (pp. 186-189)
         R.M. Rangayyan and S. Krishnan

1998.10  Detection of Nonlinear Frequency-Modulated Components in the Time-Frequency Plane Using an Array of Accumulators (pp. 190-193)
         S. Krishnan and R.M. Rangayyan

1997.10  Time-Frequency Signal Feature Extraction and Screening of Knee Joint Vibroarthrographic Signals Using the Matching Pursuit Method (pp. 194-197)
         S. Krishnan, R.M. Rangayyan, G.D. Bell and C.B. Frank

1997.08  Detection of Chirp and Other Components in the Time-Frequency Plane Using the Hough and Radon Transform (pp. 198-201)
         S. Krishnan and R.M. Rangayyan

1997.08  Spatial-Temporal Decorrelating Decision-Feedback Multiuser Detector for Synchronous Code-Division Multiple-Access Channels (pp. 202-207)
         S. Krishnan and B.R. Petersen

1996.10  Screening of Knee Joint Vibroarthrographic Signals by Statistical Pattern Analysis of Dominant Poles (pp. 208-209)
         S. Krishnan, R.M. Rangayyan, G.D. Bell, C.B. Frank and K.O. Ladly

1996.05  Recursive Least-Squares Lattice-Based Adaptive Segmentation and Autoregressive Modeling of Knee Joint Vibroarthrographic Signals (pp. 210-213)
         S. Krishnan, R.M. Rangayyan, G.D. Bell, C.B. Frank and K.O. Ladly

Other Refereed Conference Papers (pp. 214-215)


Construction of Discriminative Positive Time-frequency Distributions

Karthikeyan Umapathy and Sridhar Krishnan
Dept. of Electrical and Computer Engineering, Ryerson University, Toronto, Canada
Email: (karthi)(krishnan)@ee.ryerson.ca

Abstract— Positive time-frequency energy distributions (PTFD) are suitable for studying the non-stationary dynamics of a signal. Instantaneous features extracted from the PTFD are often used in classification applications where the discriminatory clue lies in the non-stationary behavior of the signal. From a classification point of view, it would be desirable to identify and extract instantaneous features that correspond to only the discriminative portion of the signal. By doing so we get an added advantage of eliminating the overlap from the non-discriminatory portion of the signal in the instantaneous feature extraction process. In this paper, we propose a front-end processing using a novel time-width versus frequency band mapping that facilitates the construction of PTFDs corresponding to only the discriminatory portion of the signal. Instantaneous features extracted from these PTFDs are readily discriminative and attractive for classification and characterization applications. The proposed method is demonstrated with a speech classification example.

Keywords: Time-frequency, Positive time-frequency distributions, Instantaneous features, Matching pursuits, Time-width versus frequency band mapping.

I. INTRODUCTION

Joint time-frequency (TF) analysis has been widely employed for extracting TF features from non-stationary signals. While parametric TF decomposition approaches are highly suitable for objective feature extraction, the non-parametric Cohen's class TF energy distributions (TFDs) are usually used to extract instantaneous TF features. In order to extract meaningful instantaneous features from a TFD, the TFD has to satisfy the following properties: (i) positivity and (ii) the time and frequency marginals [1]. Positivity means that the energy values of the TFD are >= 0. The marginal property states that integrating a TFD over the frequency and time directions should yield, respectively, the instantaneous energy and the energy spectral density of a signal.

In reality, the presence of cross-terms with multicomponent signals affects the positivity of a TFD. Although in some cases positive TFDs can be achieved by compromising TF resolution, these do not satisfy the marginals. Various methods and conditions have been proposed over the years to construct PTFDs that satisfy both positivity and the marginals [2], [3]. One known way of constructing PTFDs with high TF resolution and free from cross-terms is the adaptive TF transformation (ATFT, i.e., Matching Pursuits with TF dictionaries) approach [4]. While this approach produces cross-term free, high resolution PTFDs, it does not satisfy the marginal properties. A correction to the marginals using minimum cross entropy (MCE) optimization has been proposed in [5], which makes the ATFT based PTFDs suitable for instantaneous feature extraction. An added advantage of the ATFT based PTFD approach is that the building blocks of the PTFD are TF functions (represented by a set of decomposition parameters) whose parameters can be cleverly manipulated to achieve desirable effects in the PTFDs.

In the authors' previous works [6], [7], [8], a novel time-width versus frequency band (TWFB) mapping based on ATFT was used to identify the discriminative decomposition subspaces between classes of signals. Objective TF features were extracted from these subspaces and successfully applied in real world applications to achieve high classification accuracies. The TWFB mapping utilizes the parametric benefits of ATFT in identifying the discriminative decomposition subspaces that are suitable for objective feature extraction. To extend this approach to extracting instantaneous TF features from the discriminatory portion of the signal, the information provided by the TWFB mapping has to be combined with the ability of ATFT to construct PTFDs. This possibility is explored in this paper, and a methodology to construct discriminative PTFDs using TWFB mappings is presented. These discriminative PTFDs are constructed using only the discriminative portion of a signal, which ensures that the instantaneous features extracted from them truly reflect the discriminating dynamics between different classes of signals.

The block diagram of the proposed methodology is shown in Fig. 1; the solid lines show the main components of the proposed work. The paper is organized as follows: Section 2 covers the methodology, with subsections on TWFB mappings, discriminative subspace selection, and the construction of discriminative PTFDs; Section 3 presents the results and discussion; and conclusions are given in Section 4.

II. METHODOLOGY

A. Time-width versus Frequency Band Mappings

TWFB mappings are constructed using the decomposition parameters of an ATFT [6], [7], [8]. In ATFT, any given real signal is modeled as a sum of L TF functions selected from a redundant dictionary of TF functions. The TF functions used to model a real signal can be represented by five model (decomposition) parameters (a_i, s_i, p_i, f_i, and phi_i): a_i is the expansion coefficient of the TF function, s_i its time-width or scale, p_i its time position, f_i its center frequency, and phi_i its phase. The index i is the iteration number. In our study, Gaussian TF functions were used due to their excellent TF resolution properties [1].

1-4244-0983-7/07/$25.00 (c) 2007 IEEE. ICICS 2007
Authorized licensed use limited to: Ryerson University Library. Downloaded on July 7, 2009 at 11:34 from IEEE Xplore. Restrictions apply.

Fig. 1. Block diagram of the proposed methodology. (Class A and Class B training signals feed the TWFB Mapping, which selects the discriminative TF functions; the Wigner-Ville distributions of these functions, corrected against the signal marginals through MCE optimization, yield the discriminative PTFD. TWFB: Time-width vs Frequency Band Mapping; MCE: Minimum Cross Entropy; PTFD: Positive Time-frequency Distribution.)

In order to effectively utilize the ATFT decompositions for discriminant subspace selection, the decomposition parameters need to be rearranged in a pseudo-dictionary format. Of the five decomposition parameters above, only the energy parameter a_i, the time-width s_i, and the frequency f_i are relevant from a TF decomposition subspace point of view, because the main features of a TF function, and thereby of the decomposition itself, lie in these three parameters. The phase parameter phi_i describes how the TF functions combine to form the signal, which is of more relevance in a signal reconstruction scenario. The information provided by the time parameter p_i is not important for identifying global signal patterns, since most pattern recognition applications look only for global patterns irrespective of their occurrence in time (translation invariance). This time (p_i) independence is also the key to bringing generality and organization to the TWFB mapping. Hence only a_i, s_i, and f_i are used in the construction of the TWFB mapping. However, without the time and time-varying information, neither instantaneous features nor features related to time-triggered events can be extracted. So, after locating the discriminant subspaces using the TWFB mapping, all five ATFT decomposition parameters (including p_i and phi_i) of the TF functions that correspond to the discriminatory subspaces will be used to construct the PTFD.

Let us redefine the subscript of s_i to s_w, f_i to f_b, and the energy parameter a_i^2 to a^2_{(j, s_w, f_b)}. The index w in s_w represents the possible time-width values of the TF function; s_w then represents all the TF functions with a particular time-width w. Similarly, the index b in f_b represents the possible frequency band values; f_b then represents all the TF functions that occur within a particular frequency band b. The range of values for w and b is determined by the discrete implementation of the ATFT algorithm and the choice of frequency band resolution. Depending on the application requirements, a finer-resolution TWFB can be constructed by controlling the step sizes of the time-width and frequency parameters. The subscript (j, s_w, f_b) of the energy parameter corresponds to the j-th TF function that has a time-width value of w and occurs within the b-th frequency band. For every combination of (s_w, f_b), there may be j = 1, ..., M TF functions; M varies for each combination of (s_w, f_b) and is signal dependent.

The TWFB mapping can then be defined as the cumulative energy distribution of the TF functions for all the possible combinations of the time-widths (s_w) and frequency bands (f_b) and is given by

TWFB(s_w, f_b) = \sum_{j=1}^{M(s_w, f_b)} a^2_{(j, s_w, f_b)},    (1)

In the implementation used in this study, the index w takes values from 1 to 14 (which translates into a length of 2^w time samples) and b takes values from 1 to 4 (each covering 1/4 of the normalized frequency spectrum). This gives us 56 possible combinations of s_w and f_b. Let us call each of these combinations a cell (or tile) in the TWFB mapping. Each cell corresponds to the cumulative energy of all the TF functions with a particular s_w and f_b combination. Fig. 2 shows a sample signal, its spectrogram, and its TWFB mapping. The X axis of the TWFB map corresponds to the time-width or scale parameter s_w and the Y axis corresponds to the frequency bands f_b. Here we would like to stress that the TWFB mapping is independent of the time occurrences of a signal pattern, which is a desirable property (translation invariance) for pattern recognition. The Z axis (color) indicates the magnitude of the cumulative energy of the TF functions in each cell. The next subsection explains how the TWFB mapping can be used to identify the discriminative TF decomposition subspace.
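The cell-wise energy accumulation of Eq. 1 over the 14 time-width bins and 4 frequency bands can be sketched in a few lines of Python. The function name, tuple layout, and array orientation below are our own illustrative choices; only the grid sizes and the a^2 accumulation come from the text.

```python
import numpy as np

def twfb_map(atoms, n_widths=14, n_bands=4):
    """Cumulative-energy TWFB map (Eq. 1) from ATFT atom parameters.

    atoms: list of (a, s_w, f_b) tuples, where a is the expansion
    coefficient, s_w the time-width index (1..n_widths, i.e. a width of
    2**s_w samples) and f_b the frequency-band index (1..n_bands).
    """
    twfb = np.zeros((n_bands, n_widths))  # rows: f_b, cols: s_w (as in Fig. 2)
    for a, sw, fb in atoms:
        twfb[fb - 1, sw - 1] += a ** 2    # accumulate atom energy in its cell
    return twfb

# Toy decomposition: three atoms, two of which share the same cell.
atoms = [(2.0, 5, 3), (1.0, 5, 3), (0.5, 12, 1)]
m = twfb_map(atoms)
print(m[2, 4])  # cell (s_w=5, f_b=3): 2.0**2 + 1.0**2 = 5.0
```

Because only (a, s_w, f_b) enter the map, translated copies of the same pattern produce the same cells, which is the translation-invariance property stressed above.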

B. Discriminative Subspace Selection

TWFB maps of different classes of signals can be compared to arrive at the TWFB cells (a difference mapping) that exhibit high discrimination between the signal classes [6], [7], [8]. As an example, for a 2-class problem (class A and class B, as shown in Fig. 1), we compute the average TWFB


Fig. 2. A sample TWFB mapping of a synthetic signal. From top to bottom: time domain signal, spectrogram, and TWFB map.

mapping for each class using a set of training signals. The corresponding cells of the average TWFB mappings are then compared by applying a dissimilarity measure D to obtain a difference mapping. The choice of the dissimilarity measure D can be made depending on the application; in the proposed method, the simple absolute energy difference between the cells was used. The difference mapping is then given by

TWFB_diff(s_w, f_b) = |TWFB_A(s_w, f_b) - TWFB_B(s_w, f_b)|    (2)

After arranging the cells of the difference mapping in decreasing order of their absolute difference values, the top P cells are chosen as the discriminatory cells. The restricted span of s_w and f_b, or the discriminant TF decomposition subspace, corresponding to these P cells then represents the highly discriminatory clues of a signal. Once the span of s_w and f_b is identified using the training signals, the TF functions corresponding to this span can be used to construct the discriminative PTFD.
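A minimal sketch of the difference mapping (Eq. 2) and the top-P cell selection; the function name and the toy 2x2 maps are illustrative, not the paper's data.

```python
import numpy as np

def discriminative_cells(twfb_a, twfb_b, P=3):
    """Absolute-difference mapping (Eq. 2) and selection of the top-P cells."""
    diff = np.abs(twfb_a - twfb_b)                 # dissimilarity measure D
    flat = np.argsort(diff, axis=None)[::-1][:P]   # largest differences first
    return [tuple(int(k) for k in np.unravel_index(i, diff.shape))
            for i in flat]                         # (band, width) cell indices

a = np.array([[4.0, 0.1], [0.2, 0.3]])  # average TWFB map, class A
b = np.array([[1.0, 0.1], [0.2, 2.3]])  # average TWFB map, class B
print(discriminative_cells(a, b, P=2))  # -> [(0, 0), (1, 1)]
```

The returned cells define the restricted (s_w, f_b) span; all TF functions whose parameters fall in that span are then passed on to the PTFD construction step.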

C. Construction of Discriminative PTFD

The ATFT based PTFD is constructed by adding the Wigner-Ville distributions (WVDs) of the individual TF functions [4]. The WVD is a quadratic classical TFD and is known to have the best TF resolution [1]; however, it suffers from cross-term artifacts when dealing with multicomponent signals due to its bilinear nature. For notational convenience, from this point forward let us denote the ATFT based PTFD as AP. To explain the cross-term free nature of the AP, let us express a signal x(t) in terms of TF functions g_{gamma_i} after L ATFT iterations as

x(t) = \sum_{i=0}^{L-1} \langle R^i x, g_{\gamma_i} \rangle g_{\gamma_i}(t) + R^L x(t)

[4]. The first part of the preceding equation represents the signal modeled using the L TF functions and the second part represents the residue of the signal x(t). As explained in Section II-A, each TF function g_{gamma_i} is represented by a set of decomposition parameters (a_i, s_i, p_i, f_i, and phi_i). Now, assuming that the signal x(t) is completely modeled with L TF functions (i.e., the residue is zero after L iterations), we can express the WVD of x(t) in terms of the TF functions g_{gamma_i} as

WVD_x(t, f) = \sum_{i=0}^{L-1} |\langle R^i x, g_{\gamma_i} \rangle|^2 W_{g_{\gamma_i}}(t, f) + \sum_{i=0}^{L-1} \sum_{h \neq i} \langle R^i x, g_{\gamma_i} \rangle \langle R^h x, g_{\gamma_h} \rangle^* W_{[g_{\gamma_i}, g_{\gamma_h}]}(t, f),    (3)

where W_{g_gamma}(t, f) is the WVD of the TF function. It should be noted that the TF functions g_{gamma_i} used in this work are Gaussian, hence the WVDs of the individual TF functions W_{g_gamma}(t, f) are positive [4]. The second (double-sum) term corresponds to the cross-term artifacts that occur if x(t) is a multicomponent signal. These cross terms can be easily removed by retaining only the first term of Eq. 3, which yields the AP:

AP(t, f) = \sum_{i=0}^{L-1} |\langle R^i x, g_{\gamma_i} \rangle|^2 W_{g_{\gamma_i}}(t, f)    (4)

The AP generated this way is positive, free from cross terms, and has sufficiently high TF resolution for analyzing non-stationary signals. However, the AP does not satisfy the marginal properties; this will be addressed in the later part of this section.
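Eq. 4 can be sketched numerically using the fact that the WVD of a Gaussian atom is itself a 2D Gaussian centered at the atom's time position and center frequency, so the AP is simply a sum of positive blobs. The grid sizes, normalization, and frequency-spread constant below are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def ap_distribution(atoms, n_time=256, n_freq=128):
    """ATFT-based positive TFD (Eq. 4): sum of the atoms' own WVDs.

    atoms: list of (coeff, scale, position, freq) tuples; for a Gaussian
    atom the WVD is a 2D Gaussian concentrated over `scale` samples in
    time and roughly 1/scale in normalized frequency.
    """
    t = np.arange(n_time)[None, :]
    f = np.linspace(0.0, 0.5, n_freq)[:, None]
    ap = np.zeros((n_freq, n_time))
    for c, s, p, f0 in atoms:
        sigma_f = 1.0 / (4.0 * np.pi * s)  # illustrative frequency spread
        blob = np.exp(-((t - p) ** 2) / (2 * s ** 2)
                      - ((f - f0) ** 2) / (2 * sigma_f ** 2))
        ap += (c ** 2) * blob              # |<R^i x, g>|^2 * W_g, all >= 0
    return ap

ap = ap_distribution([(1.0, 16.0, 100, 0.1), (0.5, 8.0, 180, 0.3)])
print(ap.min() >= 0.0)  # positive by construction, no cross terms
```

Dropping the double-sum term of Eq. 3 is what removes the cross terms here: each blob is non-negative, so their sum can never oscillate the way interacting components do.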

Similar to the above case of constructing the AP for the complete signal x(t), we can compute the AP for the discriminatory portion of x(t) identified using the TWFB mappings. Let us denote the discriminatory portion of the signal as x^(t), corresponding to the Q TF functions that occurred within the restricted span of s_w and f_b (as explained in Section II-B). The x^(t) can then be written as

\hat{x}(t) = \sum_{i=0}^{Q-1} \langle R^i x, g_{\gamma_i} \rangle g_{\gamma_i}(t),

where Q < L, and, following Eq. 4, its AP is

AP(t, f)_{\hat{x}} = \sum_{i=0}^{Q-1} |\langle R^i x, g_{\gamma_i} \rangle|^2 W_{g_{\gamma_i}}(t, f).    (5)


As an added advantage of this approach, the AP is automatically denoised of the non-discriminant and overlapping signal structures. Here the term "noise" means signal information that is irrelevant for a particular application.

As mentioned earlier, the AP(t, f)_x^ still needs to be corrected for its marginals. The works of [2] and [3] have demonstrated successful ways of correcting the marginals of an improper TFD using minimum cross entropy (MCE) optimization and TF copulas, respectively. Although the TF copula based approach [3] is recent and computationally attractive, we chose the MCE approach since it has already been successfully applied to correct ATFT based TFDs in [5]. Moreover, the choice between these two approaches does not affect the main focus of this paper, the construction of the discriminative PTFD. MCE is an iterative process in which the correction factor for the marginals is computed as the ratio of the true marginals extracted from the signal to the marginals of the improper PTFD. This procedure is applied alternately in the time and frequency directions until the marginals of the PTFD match those of the actual signal. Let AP(t, f)'_x^ be the corrected ATFT based TFD that satisfies the marginals, u(t) the true time marginal, which can be extracted from the time domain signal, and u'(t) the actual time marginal of AP(t, f)_x^. After simplification, the first iteration is

AP^1(t, f)'_{\hat{x}} = AP(t, f)_{\hat{x}} \, \frac{u(t)}{u'(t)}    (6)

This operation scales AP(t, f)_x^ by the time marginal ratio; after it, AP^1(t, f)'_x^ is corrected in the time marginal. Now, to satisfy the frequency marginal, the operation is repeated with AP^1(t, f)'_x^ as the prior estimate, computing the correction factor using u(f) and u'(f). This process is repeated alternately to correct the time and frequency marginals. The only difference in the above procedure compared to the previous works is that the true time marginal u(t) and frequency marginal u(f) are computed from the discriminatory portion of the signal x^(t): x^(t) is reconstructed using the discriminant TF functions, and the true marginals are computed from it before being used with MCE. The block diagram in Fig. 1 shows this operation of extracting the marginals from the discriminant TF functions. After n iterations, AP^n(t, f)'_x^ becomes the corrected discriminative AP satisfying the marginal conditions, suitable for extracting meaningful instantaneous features.
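The alternating time/frequency ratio update reads like a Sinkhorn-style rescaling; the sketch below follows that reading (function name, grid sizes, and the random demo data are ours, and the actual MCE derivation in [2], [5] is what justifies the ratio update of Eq. 6).

```python
import numpy as np

def correct_marginals(tfd, u_t, u_f, n_iter=50, eps=1e-12):
    """Alternately rescale a positive TFD (freq rows x time cols) so its
    time and frequency marginals match the true marginals u_t and u_f."""
    ap = tfd.copy()
    for _ in range(n_iter):
        ap *= (u_t / (ap.sum(axis=0) + eps))[None, :]  # fix time marginal
        ap *= (u_f / (ap.sum(axis=1) + eps))[:, None]  # fix freq marginal
    return ap

rng = np.random.default_rng(0)
ap0 = rng.random((32, 64))              # improper PTFD: 32 freqs x 64 times
u_t = rng.random(64)
u_f = rng.random(32)
u_t *= u_f.sum() / u_t.sum()            # marginals must share the same total energy
ap = correct_marginals(ap0, u_t, u_f)
print(np.allclose(ap.sum(axis=0), u_t, atol=1e-4))
```

Each pass leaves the most recently corrected marginal exact and perturbs the other slightly; for a strictly positive distribution with consistent totals the alternation converges, matching the paper's "repeated alternately ... until the marginals match" description.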

III. RESULTS AND DISCUSSION

Fig. 3. Example of constructing the discriminatory PTFD: pathological speech classification application.

To demonstrate the proposed construction of the AP, we present a pathological speech classification (characterization) application. Fig. 3 has 5 rows and 2 columns containing 10 images. The top row shows normal and pathological speech segments of length 16384 samples in the time domain. The second row shows the spectrograms of the speech segments, which give an idea of their time-frequency

energy distribution. The TWFB mappings of the normal and pathological speech segments are shown in the third row of the figure; these were constructed using 1000 TF functions each. Please note that the X axis of the TWFB mappings is in time-width, not time. It is evident from the TWFB images that there is a significant difference in the cumulative energy distribution of the corresponding cells, especially between s_w of 5 to 11 and frequency bands f_b of 3 and 4. The difference mapping was computed, and the cells that exhibited a high difference between the normal and pathological speech segments were chosen; in the third row of the figure, the chosen cells are shown bounded by a rectangular box. All the Q TF functions that fell within these boxes were used to reconstruct the discriminatory portion of the normal and pathological speech segments (x^(t)); during the reconstruction, all five ATFT decomposition parameters were used. The discriminatory portions of the reconstructed signals are shown in the fourth row of the figure. By closely comparing the first row (the original signals x(t)) and the fourth row (x^(t)), it can be observed that the discriminatory cells of the TWFB mapping did capture the signal components that differ between the two speech segments. The discriminatory AP^n(t, f)'_x^ was then computed in 5 MCE iterations using the Q TF functions and the marginals extracted from x^(t). The AP^n(t, f)'_x^ of the normal and pathological speech segments are shown in the fifth


Fig. 4. Instantaneous mean frequency (IMF) extracted from the discriminatory AP^n(t, f)'_x^'s. Top: IMF of the normal speech segment; bottom: IMF of the pathological speech segment (normalized frequency versus time instants).

row of the figure. Extracting instantaneous features from these<br />

AP_n(t, f)'_x̂'s readily demonstrates the discrimination between<br />

the normal and pathological speech segments. As an example,<br />

we extracted the instantaneous mean frequency (IMF) from the<br />

AP_n(t, f)'_x̂'s of the normal and pathological speech segments;<br />

the results are shown in Fig. 4. The difference between the<br />

IMF patterns is evident from the figure. It should be noted<br />

that we achieved the above result using only the PTFD that<br />

was constructed using the discriminatory portion of the signal.<br />
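The IMF used in Fig. 4 is the first conditional moment in frequency of a positive TFD. The following is a minimal sketch, not the authors' implementation; the function name and the representation of the TFD as a time-by-frequency array of nonnegative energies are illustrative assumptions:

```python
def instantaneous_mean_frequency(P, freqs):
    """IMF(t): energy-weighted mean frequency of each time slice of a
    positive TFD, where P[t][k] is the energy at time t and frequency freqs[k]."""
    imf = []
    for row in P:
        total = sum(row)  # total energy of this time slice
        imf.append(sum(f * p for f, p in zip(freqs, row)) / total)
    return imf
```

For a slice whose energy is concentrated at a single frequency, the IMF equals that frequency; energy split between two bins yields their energy-weighted average.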

IV. CONCLUSIONS<br />

A novel approach to construct discriminatory PTFD for<br />

instantaneous feature extraction was presented. The proposed<br />

method used the TWFB mappings to identify the discriminatory<br />

decomposition subspaces and translated them into<br />

corresponding PTFD. The instantaneous features extracted<br />

from discriminatory PTFD are expected to be free of overlaps<br />

from irrelevant signal structures and ideally suitable for<br />

classification applications. Since PTFDs offer a principled way<br />

to extract meaningful instantaneous features from a signal, the<br />

proposed approach is well suited to classification tasks that<br />

demand instantaneous features. A pathological<br />

speech classification example was used to demonstrate<br />

the proposed technique and the results were discussed.<br />

Future work involves applying the proposed method to real-world<br />

applications and comparing its performance with the conventional<br />

ways of extracting instantaneous features. A TF-copula-based<br />

approach to constructing discriminative PTFDs will also be<br />

investigated.<br />

REFERENCES<br />

[1] L. Cohen, “Time-frequency distributions - a review,” Proceedings of the IEEE, vol. 77, no. 7, pp. 941–981, 1989.<br />

[2] P. Loughlin, J. Pitton, and L. Atlas, “Construction of positive time-frequency distributions,” IEEE Trans. on Signal Processing, vol. 42, pp. 2697–2705, 1994.<br />

[3] M. Davy and A. Doucet, “Copulas: a new insight into positive time-frequency distributions,” IEEE Signal Processing Letters, vol. 10, no. 7, pp. 215–218, 2003.<br />

[4] S. G. Mallat and Z. Zhang, “Matching pursuit with time-frequency dictionaries,” IEEE Trans. Signal Processing, vol. 41, no. 12, pp. 3397–3415, 1993.<br />

[5] S. Krishnan, R. M. Rangayyan, G. D. Bell, and C. B. Frank, “Adaptive time-frequency analysis of knee joint vibroarthrographic signals for noninvasive screening of articular cartilage pathology,” IEEE Trans. on Biomedical Engineering, vol. 47, no. 6, pp. 773–783, June 2000.<br />

[6] K. Umapathy and S. Krishnan, “Time-width versus frequency band mapping of energy distributions,” IEEE Transactions on Signal Processing, vol. 55, no. 3, pp. 978–989, Mar 2007.<br />

[7] K. Umapathy, S. Krishnan, and A. Das, “Sub-dictionary selection using local discriminant bases algorithm for signal classification,” in Proceedings of the IEEE Canadian Conference on Electrical and Computer Engineering, Niagara Falls, Canada, May 2004, pp. 2001–2004.<br />

[8] K. Umapathy and S. Krishnan, “A signal classification approach using time-width vs frequency band sub-energy distributions,” in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '05), Philadelphia, USA, Mar 2005, pp. V-477–480.<br />

Authorized licensed use limited to: Ryerson University Library. Downloaded on July 7, 2009 at 11:34 from IEEE Xplore. Restrictions apply.<br />


Combining Vocal Source and MFCC Features for Enhanced<br />

Speaker Recognition Performance Using GMMs<br />

Danoush Hosseinzadeh and Sridhar Krishnan<br />

Department of Electrical and Computer Engineering<br />

Ryerson University, Toronto, ON - M5B 2K3 Canada<br />

Email: (danoushh@hotmail.com) (krishnan@ee.ryerson.ca)<br />

Abstract— This work presents seven novel spectral features for speaker<br />

recognition. These features are the spectral centroid (SC), spectral<br />

bandwidth (SBW), spectral band energy (SBE), spectral crest factor<br />

(SCF), spectral flatness measure (SFM), Shannon entropy (SE) and Renyi<br />

entropy (RE). The proposed spectral features can quantify some of the<br />

characteristics of the vocal source or the excitation component of speech.<br />

This is useful for speaker recognition since vocal source information<br />

is known to be complementary to the vocal tract transfer function,<br />

which is usually obtained using the Mel frequency cepstral coefficients<br />

(MFCC) or linear prediction cepstral coefficients (LPCC). To evaluate<br />

the performance of the spectral features, experiments were performed<br />

using a text-independent cohort Gaussian mixture model (GMM) speaker<br />

identification system. Based on 623 users from the TIMIT database, the<br />

spectral features achieved an identification accuracy of 99.33% when<br />

combined with the MFCC based features and when using undistorted<br />

speech. This represents a 4.03% improvement over the baseline system<br />

trained with only MFCC and ΔMFCC features.<br />

I. INTRODUCTION<br />

Speaker recognition has many potential applications as a biometric<br />

tool for resources that can be accessed via the telephone or internet.<br />

In these applications, the identity of users cannot be verified because<br />

there is no direct contact between the user and the service provider.<br />

Hence, speaker recognition is a cost effective and practical technology<br />

that can be used for enhanced security.<br />

Often in the literature, the entire speech system is modeled with a<br />

time-varying excitation and a short-time-varying filter [1]. Using this<br />

model, the source and filter are assumed independent and hence the<br />

speech signal s(t) is modeled by the linear convolution:<br />

s(t) = x(t) ∗ h(t) (1)<br />

where x(t) is a periodic excitation (for voiced speech) or white<br />

noise (for unvoiced speech) and h(t) is a time-varying filter which<br />

constantly changes to produce different sounds. Although h(t) is<br />

time varying, it can be considered stable over a period of a few<br />

milliseconds (ms); frame lengths of 10-30 ms are commonly used in the<br />

literature [1]. This convenient short-time stationary behavior is exploited<br />

by many speaker recognition systems in order to characterize<br />

the vocal tract configuration given by h(t), which is known to be<br />

a unique speaker-dependent characteristic for a given sound. While<br />

assuming a linear model, this information can be easily extracted<br />

from speech signals using well established deconvolution techniques<br />

such as homomorphic filtering or linear prediction methods.<br />
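Equation (1) can be illustrated with a toy synthesis: a periodic impulse train (a voiced excitation) convolved with a short filter impulse response. This is only a sketch of the source-filter idea; the excitation period and filter taps below are arbitrary illustrative values, not from the paper:

```python
def convolve(x, h):
    """Linear convolution s = x * h, as in Eq. (1)."""
    y = [0.0] * (len(x) + len(h) - 1)
    for i, xi in enumerate(x):
        for j, hj in enumerate(h):
            y[i + j] += xi * hj
    return y

# Voiced-speech toy: an impulse train x(t) driven through a filter h(t).
# The filter's response repeats at each pitch pulse in the output s(t).
excitation = [1.0 if n % 4 == 0 else 0.0 for n in range(12)]
h = [1.0, 0.5, 0.25]
s = convolve(excitation, h)
```

Because the pulses are spaced farther apart than the filter length, each period of `s` is simply a copy of `h` followed by silence, which is the behavior the source-filter model captures for voiced speech.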

To date, the most effective features for speaker recognition have<br />

been the Mel frequency cepstral coefficient (MFCC) and the linear<br />

prediction cepstral coefficients (LPCC) [2][1][3]. These features can<br />

accurately characterize the vocal tract configuration of a speaker and<br />

can achieve good performance. Part of the success of these features<br />

is that they provide a compact representation of the vocal tract which<br />

can be modeled effectively. The first several MFCCs can characterize<br />

the speaker’s vocal tract configuration and LPCCs generally define<br />

lower order polynomials [1]. Additionally, the first derivative of the<br />

MFCC feature (ΔMFCC) is largely uncorrelated with the MFCC<br />

feature and has been shown to enhance recognition performance.<br />

Although the MFCC and LPCC based features have proven to be<br />

effective for speaker recognition, they do not provide a complete<br />

description of the speaker’s speech system. Hence, vocal source<br />

information can complement these traditional features by quantifying<br />

some speaker-dependent characteristics such as pitch, harmonic<br />

structure and spectral energy distribution [4][5].<br />

This work proposes seven novel spectral features for speaker<br />

recognition that can quantify the vocal source. These features are<br />

the spectral centroid (SC), spectral bandwidth (SBW), spectral band<br />

energy (SBE), spectral crest factor (SCF), spectral flatness measure<br />

(SFM), Shannon entropy (SE), and Renyi entropy (RE). These<br />

spectral features can be used to complement the MFCC or LPCC<br />

features since they can quantify characteristics of the vocal source.<br />

It is also known that there is some degree of coupling between<br />

the vocal source and vocal tract [6][4] - i.e. the linear model<br />

assumed when calculating MFCC and LPCC is not entirely accurate.<br />

Therefore, the vocal source signal is to some extent predictable<br />

for a given vocal tract configuration. Given these factors, features<br />

that characterize the vocal source can be expected to improve the<br />

performance of existing speaker recognition systems. In this work,<br />

the seven proposed spectral features are extracted from the speech<br />

spectrum and used to enhance the performance of MFCC-based<br />

features in order to illustrate their effectiveness.<br />

Others have attempted to use the vocal source for improving<br />

performance of speaker recognition systems. Attempts have been<br />

made to develop features from the LPCC residual [7][8] with some<br />

success. In these cases, the authors have noted improved performance<br />

by complementing vocal tract features with vocal source information.<br />

The paper is organized as follows. Section II defines the baseline<br />

system used for testing and presents the spectral features. Section<br />

III presents the results as well as the experimental conditions and<br />

Section IV concludes the paper.<br />

II. PROPOSED TESTING METHOD<br />

GMM based speaker recognition systems have become the most<br />

popular method to date. This is because GMMs can capture the<br />

acoustic phenomena or acoustic classes that are present in speech<br />

[2]. In fact, some of the GMM clusters have been found to be<br />

highly correlated with particular phonemes [9]. As a result, good<br />

recognition performance can be achieved with GMM based systems.<br />

The performance of the proposed spectral features will be compared<br />

to the baseline system, which is a cohort text-independent<br />

GMM classifier trained with 14-dimensional MFCC vectors and 14-dimensional<br />

ΔMFCC vectors extracted from 30ms speech frames.<br />

The log-likelihood function is used to find the user model that best<br />

matches a given utterance.<br />

1-4244-1274-9/07/$25.00 ©2007 IEEE<br />

MMSP 2007<br />


TABLE I<br />

SUBBAND ALLOCATION USED TO CALCULATE SPECTRAL FEATURES.<br />

Subband Lower Edge (Hz) Upper Edge (Hz)<br />

1 300 627<br />

2 628 1060<br />

3 1061 1633<br />

4 1634 2393<br />

5 2394 3400<br />

A. Training and GMM Estimation<br />

The expectation maximization (EM) algorithm was used to estimate<br />

the parameters of the GMM models. In the past, model orders<br />

of 8-32 have commonly been used in the literature; however, good results<br />

have been obtained with cohort GMM systems using as few as<br />

16 components [2][10]. A model order of 24 was used in this work to<br />

account for the additional features being used in the system; also,<br />

preliminary experimental results indicated that this model order was<br />

optimal for the proposed feature set among models of order<br />

16, 20, 24, 28 and 32. The k-means algorithm was used to obtain<br />

the initial estimate for each cluster since it has been shown that the<br />

initial grouping of data does not significantly affect the performance<br />

of GMM based recognition systems [2].<br />

A diagonal covariance matrix was used to estimate the variances<br />

of each cluster in the models since it is well known that diagonal<br />

covariance matrices are much more computationally efficient than full<br />

covariance matrices. Furthermore, diagonal covariance matrices can<br />

provide the same level of performance as full covariance matrices<br />

because they can capture the correlation between the features if<br />

a larger model order is used [11]. For these reasons, diagonal<br />

covariance matrices have been used almost exclusively in previous<br />

speaker recognition works. Each element of these matrices is limited<br />

to a minimum value of 0.01 during the EM estimation process to<br />

prevent singularities in the matrix, as recommended by [2].<br />
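The scoring step described above (Section II's log-likelihood matching against each cohort model) can be sketched as follows. This is a minimal illustration of diagonal-covariance GMM scoring only; EM training, k-means initialization, and the 0.01 variance floor are omitted, and the function and speaker names are hypothetical:

```python
import math

def gmm_loglik(frames, weights, means, variances):
    """Total log-likelihood of feature frames under a diagonal-covariance GMM.
    weights: K mixture weights; means/variances: K lists of D per-dimension values."""
    total = 0.0
    for x in frames:
        comp = []
        for w, mu, var in zip(weights, means, variances):
            ll = math.log(w)
            for xi, m, v in zip(x, mu, var):
                # per-dimension Gaussian log-density (diagonal covariance)
                ll += -0.5 * ((xi - m) ** 2 / v + math.log(2 * math.pi * v))
            comp.append(ll)
        mx = max(comp)  # log-sum-exp over the K components, numerically stable
        total += mx + math.log(sum(math.exp(c - mx) for c in comp))
    return total

def identify(frames, models):
    """Cohort identification: pick the speaker model maximizing log-likelihood."""
    return max(models, key=lambda name: gmm_loglik(frames, *models[name]))
```

For example, an utterance whose frames lie near one model's mean is attributed to that speaker; in a full system the models would be 24-component GMMs over the 33-dimensional feature vectors of Eq. (9).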

B. Spectral Features<br />

The proposed spectral features can be expected to improve the<br />

performance of MFCC or LPCC features because they can capture<br />

complementary information related to the vocal source such as pitch,<br />

harmonic structure, energy distribution, bandwidth of the speech<br />

spectrum and even voiced or unvoiced excitation. To illustrate the<br />

effectiveness of these features, they are extracted from the speech<br />

spectrum and used to enhance the performance of MFCC and<br />

ΔMFCC features.<br />

Spectral features should be extracted from multiple subbands,<br />

as shown in Table I. This extraction method will provide better<br />

discrimination between different speakers because the trend for a<br />

given feature can be captured from the spectrum. This is better than<br />

obtaining one global value from the spectrum, which is not likely to<br />

show speaker-dependent characteristics.<br />

The proposed subbands are linearly spaced on the Mel scale and<br />

span the range of a practical telephone channel (300 Hz-3.4 kHz).<br />

This allocation scheme reflects the fact that most of the energy<br />

of the speech signal is located in the lower frequency regions and<br />

therefore, narrowly defined subbands are used in the lower frequency<br />

regions in order to capture more detail. This is also consistent with<br />

the non-linearities of human auditory perception, which shows<br />

more sensitivity to lower frequencies than higher frequencies. This<br />

non-linearity has been shown to be important for cepstral based<br />

features such as the MFCC feature [3].<br />
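The Table I edges can be reproduced by spacing the five bands linearly on the Mel scale between 300 Hz and 3.4 kHz. A sketch assuming the common 2595·log10(1 + f/700) Mel mapping (the paper does not state which Mel formula it used, so agreement to within a few Hz is the most one should expect):

```python
import math

def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_subband_edges(f_lo=300.0, f_hi=3400.0, n_bands=5):
    """Edges of n_bands subbands linearly spaced on the Mel scale."""
    m_lo, m_hi = hz_to_mel(f_lo), hz_to_mel(f_hi)
    step = (m_hi - m_lo) / n_bands
    return [mel_to_hz(m_lo + k * step) for k in range(n_bands + 1)]
```

Running `mel_subband_edges()` gives edges near 300, 626, 1059, 1632, 2392 and 3400 Hz, matching Table I's narrower low-frequency bands.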

Spectral features are extracted from 30 ms speech frames as<br />

follows. Let s_i[n], for n ∈ [0, N], represent the i-th speech frame<br />

and S_i[f] the spectrum of this frame. Then S_i[f] can<br />

be divided into M non-overlapping subbands, where each subband<br />

b is defined by a lower frequency edge l_b and an upper frequency<br />

edge u_b. Each of the seven spectral features can now be calculated<br />

from S_i[f] as shown below.<br />

Spectral Centroid (SC) - SC as given below is the weighted<br />

average frequency for a given subband, where the weights are the<br />

normalized energy of each frequency component in that subband.<br />

Since this measure captures the center of gravity of each subband it<br />

can locate large peaks in subbands. These peaks correspond to the<br />

approximate location of formants [12] or pitch frequencies.<br />

SC_{i,b} = ( Σ_{f=l_b}^{u_b} f |S_i[f]|² ) / ( Σ_{f=l_b}^{u_b} |S_i[f]|² )   (2)<br />

Spectral Bandwidth (SBW) - SBW as given below is the weighted<br />

average distance from each frequency component in a subband to<br />

the spectral centroid of that subband. Here, the weights are the<br />

normalized energy of each frequency component in that subband.<br />

This measure quantifies the relative spread of each subband for<br />

a given sound and therefore, it might characterize some speaker-<br />

dependent information.<br />

SBW_{i,b} = ( Σ_{f=l_b}^{u_b} (f − SC_{i,b})² |S_i[f]|² ) / ( Σ_{f=l_b}^{u_b} |S_i[f]|² )   (3)<br />

Spectral Band Energy (SBE) - SBE as given below is the energy of<br />

each subband normalized with the combined energy of the spectrum.<br />

The SBE gives the trend of energy distribution for a given sound and<br />

therefore, it contains some speaker-dependent information.<br />

SBE_{i,b} = ( Σ_{f=l_b}^{u_b} |S_i[f]|² ) / ( Σ_{f,b} |S_i[f]|² )   (4)<br />

Spectral Flatness Measure (SFM) - SFM as given below is a<br />

measure of the flatness of the spectrum, where white noise has a<br />

perfectly flat spectrum. This measure is useful for discriminating<br />

between voiced and un-voiced components of speech [13].<br />

SFM_{i,b} = ( Π_{f=l_b}^{u_b} |S_i[f]|² )^{1/(u_b−l_b+1)} / ( (1/(u_b−l_b+1)) Σ_{f=l_b}^{u_b} |S_i[f]|² )   (5)<br />

Spectral Crest Factor (SCF) - SCF as given below provides a<br />

measure for quantifying the tonality of the signal. This measure is<br />

useful for discriminating between wideband and narrowband signals<br />

by indicating the relative peak of a subband. These peaks correspond<br />

to the most dominant pitch frequency in each subband.<br />

SCF_{i,b} = max_f |S_i[f]|² / ( (1/(u_b−l_b+1)) Σ_{f=l_b}^{u_b} |S_i[f]|² )   (6)<br />

Renyi Entropy (RE) - RE as given below is an information theoretic<br />

measure that quantifies the randomness of the subband. Here, the<br />

normalized energy of the subband can be treated as a probability<br />

distribution for calculating entropy and α is set to 3, as commonly<br />

found in literature [14]. This RE trend is useful for detecting the<br />

voiced and unvoiced components of speech.<br />

RE_{i,b} = (1/(1−α)) log ( Σ_{f=l_b}^{u_b} ( |S_i[f]|² / Σ_{f=l_b}^{u_b} |S_i[f]|² )^α )   (7)<br />

Shannon Entropy (SE) - SE as given below is also an information<br />

theoretic measure that quantifies the randomness of the subband.<br />

Here, the normalized energy of the subband can be treated as a<br />



probability distribution for calculating entropy. Similar to the RE<br />

trend, the SE trend is also useful for detecting the voiced and unvoiced<br />

components of speech.<br />

SE_{i,b} = − Σ_{f=l_b}^{u_b} ( |S_i[f]|² / Σ_{f=l_b}^{u_b} |S_i[f]|² ) log ( |S_i[f]|² / Σ_{f=l_b}^{u_b} |S_i[f]|² )   (8)<br />
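The seven per-subband features of Eqs. (2)-(8) can be sketched directly from a frame's power spectrum. This is an illustrative implementation under stated assumptions, not the authors' code: the input is a list of power values |S_i[f]|² indexed by FFT bin, subbands are (lo, hi) bin ranges, α = 3 as in Eq. (7), and the SFM's geometric mean assumes strictly positive bins:

```python
import math

def spectral_features(S2, bands, alpha=3):
    """Per-subband SC, SBW, SBE, SFM, SCF, RE and SE (Eqs. (2)-(8)).
    S2: power spectrum |S[f]|^2 per bin; bands: inclusive (lo, hi) bin ranges."""
    total = sum(S2)
    out = {k: [] for k in ("SC", "SBW", "SBE", "SFM", "SCF", "RE", "SE")}
    for lo, hi in bands:
        band = S2[lo:hi + 1]
        n = hi - lo + 1
        e = sum(band)                                   # subband energy
        sc = sum(f * S2[f] for f in range(lo, hi + 1)) / e
        sbw = sum((f - sc) ** 2 * S2[f] for f in range(lo, hi + 1)) / e
        geo = math.exp(sum(math.log(x) for x in band) / n)  # geometric mean
        p = [x / e for x in band]                       # normalized band energy
        out["SC"].append(sc)
        out["SBW"].append(sbw)
        out["SBE"].append(e / total)
        out["SFM"].append(geo / (e / n))                # geometric / arithmetic mean
        out["SCF"].append(max(band) / (e / n))          # peak / arithmetic mean
        out["RE"].append(math.log(sum(pi ** alpha for pi in p)) / (1 - alpha))
        out["SE"].append(-sum(pi * math.log(pi) for pi in p))
    return out
```

A perfectly flat (white-noise-like) subband gives SFM = SCF = 1 and maximal entropies, while a tonal subband drives SFM toward 0, SCF upward and the entropies downward, which is the voiced/unvoiced behavior described above.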

To the best of our knowledge, these features are being used for<br />

the first time in speaker recognition although they have previously<br />

been used in other areas [15]. These spectral features along with the<br />

MFCC and ΔMFCC features will be extracted from each speech<br />

frame and appended together to form a combined feature matrix for<br />

the speech signal. These vectors can then be modeled and used for<br />

speaker recognition. Equation 9 shows the feature matrix that can<br />

be extracted based on only one spectral feature, say the SC feature,<br />

from i frames; where the bracketed number is the length of the<br />

feature. It should be noted that any other spectral feature can be<br />

substituted in for the SC feature in the feature matrix.<br />

F = [ MFCC_1(14)   ΔMFCC_1(14)   SC_1(5)<br />

      ⋮            ⋮             ⋮<br />

      MFCC_i(14)   ΔMFCC_i(14)   SC_i(5) ]   (9)<br />
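The per-frame concatenation of Eq. (9) amounts to appending the feature vectors row-wise. A trivial sketch (the function name is an illustrative assumption):

```python
def build_feature_matrix(mfcc, d_mfcc, spectral):
    """Row i = MFCC_i ++ dMFCC_i ++ spectral_i, as in Eq. (9).
    Each argument is a list of per-frame feature vectors."""
    return [list(m) + list(d) + list(s) for m, d, s in zip(mfcc, d_mfcc, spectral)]
```

With 14 MFCCs, 14 ΔMFCCs and one 5-dimensional spectral feature per frame, each row has 33 dimensions.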

The spectral features are expected to be largely uncorrelated with<br />

the MFCC based features because the spectral features can capture<br />

some information about the vocal source, whereas the MFCC features<br />

tend to capture information about the vocal tract. Among the spectral<br />

features, there may be some correlation between the SC and the<br />

SCF features because they both quantify information about the peaks<br />

(locations of energy concentration) of each subband. The difference is<br />

that the SCF feature describes the normalized strength of the largest<br />

peak in each subband while the SC feature describes the center of<br />

gravity of each subband. Therefore, these features will perform well<br />

if the largest peak in a given subband is much larger than all other<br />

peaks in that subband. The RE and SE features are also correlated<br />

since they are both entropy measures. However, the RE feature is<br />

much more sensitive to small changes in the spectrum because of<br />

the exponent term α. Therefore, although these features quantify the<br />

same type of information, their performance may be different for<br />

speech signals.<br />

III. EXPERIMENTAL RESULTS<br />

All speech samples used in these experiments were obtained from<br />

623 speakers of the TIMIT speech corpus. Since the TIMIT database<br />

has a sampling frequency of 16kHz, the signals were down sampled<br />

to 8kHz which is well suited for telephone applications. Features<br />

were extracted from 30ms long frames with 15ms of overlap with the<br />

previous frames and a Hamming window was applied to each frame<br />

to ensure a smooth frequency transition between frames. Twenty<br />

seconds of undistorted speech from each speaker was used to train the<br />

system and the remaining samples were used for testing. Although the<br />

tests were performed with undistorted audio, it is expected that some<br />

of these features will remain robust to different linear and non-linear<br />

distortions [15].<br />
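The framing described above (30 ms frames, 15 ms overlap, Hamming window) can be sketched as follows; a minimal illustration, assuming the 8 kHz post-downsampling rate:

```python
import math

def frame_signal(x, fs=8000, frame_ms=30, hop_ms=15):
    """Split a signal into Hamming-windowed frames (30 ms, 50% overlap)."""
    n = int(fs * frame_ms / 1000)    # 240 samples at 8 kHz
    hop = int(fs * hop_ms / 1000)    # 120-sample hop -> 15 ms overlap
    w = [0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]
    frames = []
    for start in range(0, len(x) - n + 1, hop):
        frames.append([x[start + i] * w[i] for i in range(n)])
    return frames
```

Each returned frame would then be transformed to obtain S_i[f] before computing the MFCC and spectral features.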

A. Results and Discussions<br />

MFCC based features are very effective for characterizing the<br />

vocal tract configuration. Although this is the main reason for the<br />

success of the MFCC based features, they do not provide a complete<br />

description of the speaker’s speech system. The proposed spectral<br />

features are expected to increase identification accuracy of MFCC<br />


TABLE II<br />

EXPERIMENTAL RESULTS USING 7S TEST UTTERANCES (298 TESTS)<br />

Feature Accuracy(%)<br />

MFCC & ΔMFCC (Baseline system) 95.30<br />

MFCC & ΔMFCC & SC 97.32<br />

MFCC & ΔMFCC & SBE 97.32<br />

MFCC & ΔMFCC & SBW 96.98<br />

MFCC & ΔMFCC & SCF 96.31<br />

MFCC & ΔMFCC & SFM 81.55<br />

MFCC & ΔMFCC & SE 90.27<br />

MFCC & ΔMFCC & RE 98.32<br />

MFCC & ΔMFCC & SBE & SC 96.98<br />

MFCC & ΔMFCC & SBE & RE 96.98<br />

MFCC & ΔMFCC & SC & RE 99.33<br />

based systems because they provide some information about the<br />

vocal source.<br />

Table II demonstrates the identification accuracy of the system<br />

when using spectral features in addition to the MFCC based features<br />

with undistorted speech. The table also shows several combinations<br />

of the best performing features. The accuracy rate represents the<br />

percentage of test samples that were correctly identified by the<br />

system, as shown below.<br />

Accuracy = Samples Correctly Identified / Total Number of Samples   (10)<br />

It is evident from these results that there is some speaker-dependent<br />

information captured by most of the proposed features since they improved<br />

identification rates when combined with the standard MFCC<br />

based features. In fact, when two of the best performing spectral<br />

features (SC and RE) were simultaneously combined with the MFCC<br />

based features, an identification accuracy of 99.33% was achieved,<br />

which represents a 4.03% improvement over the baseline system.<br />

These results suggest that the proposed spectral features provide<br />

complementary and discriminatory information about the speaker’s<br />

vocal source and system, which leads to enhanced identification<br />

accuracies.<br />

The best performing feature was the RE feature. This feature is<br />

very effective at quantifying voiced speech which is quasi-periodic<br />

(relatively low entropy) and un-voiced which is often represented<br />

by AWGN (relatively high entropy). However, we suspect that the<br />

RE feature may also be characterizing a phenomenon other<br />

than voiced and unvoiced speech. This is likely since the SE feature<br />

did not show any performance benefits and it too is an entropy<br />

measure capable of discriminating between voiced and unvoiced<br />

speech. One possibility is that the exponential term α in the RE<br />

definition is contributing to this performance improvement. Since the<br />

spectrum is normalized to the range of [0,1] before calculating<br />

these features, the exponent term α has the effect of significantly<br />

reducing the contributions of the low energy components relative to<br />

the high energy components. Therefore, the RE feature is likely to<br />

produce a more reliable measure since it heavily relies on the high<br />

energy components of each subband. However, the entropy features<br />

in general are susceptible to random noise and will not perform well<br />

in all conditions.<br />

Figure 1(a) shows that the SC feature can capture the center of<br />

gravity of each subband. Since the subband’s center of gravity is<br />

related to the spectral shape of the speech signal, it implies that the SC<br />

feature can also detect changes in pitch and harmonic structure since<br />

they fundamentally affect the spectrum. Pitch and harmonic structure<br />

convey some speaker-dependent information and are complementary<br />



[Figure 1 appears here: four panels plotting magnitude against frequency (0-4000 Hz), showing (a) the location of the SC, (b) the location of the SCF, (c) the SBW per subband (8%, 18%, 2%, 33%, 38%), and (d) the SBE per subband (46%, 5%, 3%, 2%, 2%).]<br />

Fig. 1. Plot of the spectral features. Subband boundaries are indicated<br />

with dark solid lines and feature location is indicated with dashed lines. (a)<br />

Location of the SC (b) Location of the SCF (c) SBW as a percentage of the<br />

five subbands. (d) SBE as a percentage of the whole spectrum.<br />

to the vocal tract transfer function for speaker recognition. In addition,<br />

the SC feature can also locate the approximate location of the<br />

dominant formant in each of the subbands since formants will tend<br />

towards the subband’s center of gravity. These properties of the SC<br />

feature provide complementary information and led to the improved<br />

performance of the MFCC based classifier.<br />

The SCF feature shown in Figure 1(b) quantifies the normalized<br />

strength of the dominant peak in each subband. Given that the<br />

dominant peak in each subband corresponds to a particular pitch<br />

frequency harmonic, it shows that the SCF feature is pitch dependent<br />

and therefore, it is also speaker-dependent for a given sound. This<br />

dependence on pitch frequency is useful when the vocal tract configuration<br />

(i.e. MFCC) is known as seen by the enhanced performance.<br />

Moreover, the SCF feature is a normalized measure and should not<br />

be significantly affected by the intensity of speech from different<br />

sessions.<br />

The SBE feature, shown in Figure 1(d), also performed well in<br />

the experiments. This feature provides the distribution of energy in<br />

each subband as a percentage of the entire spectrum, which is another<br />

measure that can quantify the harmonic structure of the signal. The<br />

SBE feature is also a normalized energy measure and should not<br />

be significantly affected by the intensity (or relative loudness) of<br />

speech from different sessions. The results in Table II suggest that<br />

for a given vocal tract configuration the SBE trend is predictable and<br />

complementary for speaker recognition.<br />

The SBW feature is largely dependent on the SC feature and the<br />

energy distribution of each subband; therefore, it has also performed<br />

well for the reasons mentioned above. Figure 1(c) shows the SBW<br />

for each subband as a percentage of all subbands.<br />

The SFM feature did not perform well because it quantifies characteristics<br />

that are not well defined in speech signals. For example,<br />

the SFM feature measures the tonality of the subband, a characteristic<br />

that is difficult to define in the speech spectrum since its energy is<br />

distributed across many frequencies.<br />

IV. CONCLUSION<br />

Features such as the SC, SCF and SBE provide vocal source<br />

information as it relates to harmonic structure, pitch frequency and<br />

spectral energy distribution, while the entropy features quantify the<br />


spectrum in terms of voiced and unvoiced speech. The proposed<br />

features were shown to be complementary in nature and enhanced<br />

performance when used with the vocal tract transfer function (i.e.<br />

MFCC). This is mainly because the vocal tract transfer function is<br />

the most discriminating feature for speaker recognition and it greatly<br />

influences the spectral shape and harmonic structures of speech.<br />

Experimental results show that the proposed spectral features<br />

improve the performance of MFCC based features. Based on 623<br />

users from the TIMIT database, the combined feature set of MFCC,<br />

ΔMFCC, SC and RE achieved an identification accuracy of 99.33%<br />

(for clean speech) by incorporating information about the vocal<br />

source. This represents a 4.03% improvement over the baseline<br />

system, which only used the MFCC based features.<br />

The good performance of spectral features for speaker recognition<br />

in this speaker identification system is very promising. These features<br />

should also produce good results if used with more sophisticated<br />

speaker recognition techniques, such as universal background model<br />

(UBM) based approaches.<br />


Authorized licensed use limited to: Ryerson University Library. Downloaded on July 7, 2009 at 11:54 from IEEE Xplore. Restrictions apply.


Proceedings of the 29th Annual International<br />

Conference of the IEEE EMBS<br />

Cité Internationale, Lyon, France<br />

August 23-26, 2007.<br />

Multiresolution Analysis and Classification of Small Bowel Medical<br />

Images<br />

Abstract— This is the first reported work in the area of small<br />

bowel image classification; a novel analysis system was<br />

developed. Principles of human texture perception were used to<br />

design features which can discriminate between abnormal and<br />

normal images. The proposed method extracts statistical features<br />

from the wavelet domain, which describe the homogeneity<br />

of localized areas within the small bowel images. To ensure that<br />

robust features were extracted, a shift-invariant discrete wavelet<br />

transform (SIDWT) was explored. LDA classification was used<br />

with the leave one out method to improve classification under<br />

the small database scenario. A total of 75 abnormal and normal<br />

bowel images were used for experimentation resulting in high<br />

classification rates: 85% specificity and 85% sensitivity. The<br />

success of the system can be attributed to the discriminatory<br />

and robust feature set (translation, scale and semi-rotational<br />

invariant), which successfully classified various sizes and types<br />

of pathologies at multiple viewing angles.<br />

Index Terms— Biomedical image processing, feature extraction,<br />

computer-aided diagnosis, content-based image retrieval<br />

I. INTRODUCTION<br />

The PillCam™ SB is a tiny capsule endoscope (10 mm ×<br />

27 mm [1]), which is swallowed by the patient. As natural<br />

peristalsis moves the capsule through the gastrointestinal<br />

tract, it captures video and wirelessly transmits it to a data<br />

recorder the patient is wearing around his or her waist.<br />

This video provides visualization of the 21 foot long small<br />

bowel, which was originally seen as a “black box” to<br />

doctors [2]. Video is recorded for approximately 8 hours<br />

and then the capsule is excreted naturally. Clinical results<br />

for the PillCam™ show that it provides superior diagnostic<br />

capabilities for diseases of the small intestine [2].<br />

In the small intestine, there are four main types of cancers,<br />

which are named after the cell they originate from:<br />

adenocarcinoma, sarcoma, carcinoid and lymphoma. These<br />

types of cancers can occur in various sizes and shapes, and<br />

may be found anywhere along the small bowel tract. Since an<br />

internal view of the small bowel was previously not available,<br />

the PillCam™ offers gastroenterologists a new method of<br />

detecting disease. The drawback of this technology is that the<br />

doctor has to watch and diagnose approximately 8 hours of<br />

footage. Viewing this footage is a very laborious task for<br />

physicians, which could cause them to miss important clues<br />

due to fatigue, boredom or the repetitive nature of<br />

the task. Therefore, to aid doctors with this laborious<br />

task, a computer-aided diagnosis (CAD) system may be<br />

used to offer a secondary opinion of the images. Such a<br />

system would automatically isolate suspicious video instants<br />

April Khademi and Sridhar Krishnan<br />

Dept. of Electrical and Computer Engineering<br />

Ryerson University, Toronto, ON, Canada<br />

akhademi@ieee.org, krishnan@ee.ryerson.ca<br />

1-4244-0788-5/07/$20.00 ©2007 IEEE<br />


(images) for the doctor. The extracted features may also<br />

be used for content-based image retrieval (CBIR), where<br />

physicians can locate abnormal image(s) based on their<br />

semantic content, not based on text annotations.<br />

There are several challenges associated with the development<br />

of an automated classification scheme for small bowel<br />

imagery: the camera angle can be expected to be different<br />

from patient to patient, suspicious regions may occur in<br />

several different places along the gastrointestinal tract and<br />

pathologies come in various forms, sizes and shapes. This<br />

work aims to develop a unified feature extraction algorithm<br />

which can account for all these scenarios. This is the first<br />

reported work in the area of small bowel image classification<br />

and the system aims to detect both malignant and benign<br />

pathologies with a high classification rate. The small bowel<br />

images (video instants) are stored as lossy JPEG images, so<br />

feature extraction is completed in the compressed domain.<br />

II. METHODOLOGY<br />

To extract highly discriminatory features, image processing<br />

techniques are needed to analyze and understand the biomedical<br />

images. Since biomedical signals (including small<br />

bowel images) contain a combination of information which is<br />

localized spatially (i.e. transients, edges) as well as structures<br />

which are more diffuse (i.e. small oscillations, texture) [3], a<br />

technique which can exploit both these characteristics (which<br />

may be related to the diagnosis) is required. To perform this<br />

task, the discrete wavelet transform (DWT) will be utilized<br />

due to its excellent space-localization properties [4] [6].<br />

A. DWT Properties for Feature Extraction<br />

The DWT is scale-invariant since a complete decomposition<br />

will contain all the basis functions needed to decompose<br />

various scaled versions of the input image. Since pathological<br />

features do not come in a predefined size, scale-invariance<br />

will help to capture pathologies of different sizes.<br />

Although the DWT offers good localization and scale-invariance<br />

properties, it is well known that the DWT is shift-variant<br />

[4]. Different translations of an input image result<br />

in a different set of DWT coefficients. Therefore, extracting<br />

robust features from the wavelet domain is a challenging<br />

task.<br />

B. Shift-Invariant DWT<br />


To extract a consistent feature set, the 2-D version of<br />

Beylkin’s shift-invariant DWT (SIDWT) is utilized [5]. This<br />



algorithm computes the DWT for all circular translates of<br />

the image, in a computationally efficient manner. Coifman<br />

and Wickerhauser’s best basis algorithm [6] is employed to<br />

ensure the same set of coefficients are chosen, regardless<br />

of the input shift. This permits the selection of a<br />

consistent set of DWT coefficients, therefore allowing for<br />

the extraction of robust, shift-invariant features.<br />
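The idea behind the SIDWT can be sketched in 1-D with a Haar filter: compute the DWT of every circular translate of the input and keep the minimum-cost one. This is only a toy stand-in for the 2-D algorithm of [5], and the l1 cost below replaces the entropy cost of the best basis algorithm:<br />

```python
import numpy as np

def haar_dwt(x):
    # single-level orthonormal Haar DWT (length of x must be even)
    a = (x[0::2] + x[1::2]) / np.sqrt(2)   # approximation coefficients
    d = (x[0::2] - x[1::2]) / np.sqrt(2)   # detail coefficients
    return a, d

def sidwt(x):
    """Evaluate the DWT of every circular translate of x and keep the
    translate whose detail coefficients have minimal l1 cost, so that any
    shifted version of x maps to the same coefficient set."""
    best = min(range(len(x)),
               key=lambda s: np.abs(haar_dwt(np.roll(x, s))[1]).sum())
    return haar_dwt(np.roll(x, best))
```

Because a shifted input generates the same family of candidate decompositions, the minimum-cost coefficient set (up to ordering) does not change with the shift, which is what makes the downstream features consistent.<br />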

C. Image Texture<br />

When a textured surface is viewed, the human visual system<br />

can discriminate between textured regions quite easily.<br />

To understand how the human visual system can easily differentiate<br />

between textures, Julesz defined textons, which are<br />

elementary units of texture [7]. Various textured regions can<br />

be decomposed using these textons, which include elongated<br />

blobs, lines, terminators and more. It was also found that the<br />

frequency content, scale, orientation and periodicity of these<br />

textons can provide important clues on how to differentiate<br />

between two or more textured areas [7] [4]. Therefore,<br />

a robust texture analysis scheme would take into account<br />

the localized spatial properties of images to understand the<br />

orientation, periodicity, scale or frequency content of the<br />

primitive texture elements. Consequently, to differentiate<br />

between normal and pathological cases of the small intestine,<br />

the proposed work aims to develop an automated system<br />

which mimics the human visual system to understand the<br />

texture content of the small bowel images.<br />

Small Bowel Texture: Normal small bowel images<br />

contain smooth, homogeneous texture elements with very<br />

little disruption in uniformity except for folds and crevices.<br />

This is shown in Figure 1(d)-(f). Abnormal small bowel<br />

images (benign and malignant) can contain various pathologies<br />

(polyp, Kaposi’s sarcoma, carcinoma, etc.). These diseases<br />

may occur in various sizes, shapes, orientations and<br />

locations within the gastrointestinal tract. Although there<br />

are many types of diseases, small bowel pathologies have<br />

some common textural characteristics: (1) diseased regions<br />

contain a variety of textured areas simultaneously and (2)<br />

diseased areas are mostly composed of heterogeneous texture<br />

components. Typical abnormal cases are shown in Figure<br />

1(a)-(c). Another important factor which must be considered<br />

is that the camera angle will vary from image to image.<br />

Therefore, textural characteristics may appear in several<br />

orientations.<br />

D. Features<br />

To extract texture-based features, normalized gray-level co-occurrence<br />

matrices (GCMs) are used. Let each entry of<br />

the normalized GCM be represented as p(l1, l2, d, θ), where<br />

l1 and l2 are two graylevels at a distance d and angle θ.<br />

Normalized GCMs allow statistical quantities to be computed<br />

which reflect the textural properties of the region of interest.<br />

To exploit the textural characteristics of the small bowel<br />

images, texture features which describe the relative homogeneity<br />

or non-uniformity of the images are used since these<br />

texture properties differentiate between the normal and the<br />

abnormal images. The features used are homogeneity (h),<br />


which describes how uniform the texture is and entropy (e),<br />

which is a measure of nonuniformity or the complexity of<br />

the texture.<br />

h(θ) = Σ_{l1=0}^{L−1} Σ_{l2=0}^{L−1} p²(l1, l2, d, θ),  (1)<br />

e(θ) = −Σ_{l1=0}^{L−1} Σ_{l2=0}^{L−1} p(l1, l2, d, θ) log₂ p(l1, l2, d, θ).  (2)<br />
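As a concrete sketch, a normalized GCM and the two statistics of Equations 1 and 2 can be computed as follows; the code is illustrative (hypothetical names, 8 gray levels assumed), and in the proposed method p(l1, l2, d, θ) is estimated inside each wavelet subband rather than on the raw image:<br />

```python
import numpy as np

OFFSETS = {0: (0, 1), 45: (-1, 1), 90: (-1, 0), 135: (-1, -1)}  # (row, col)

def glcm(img, d=1, theta=0, levels=8):
    """Normalized gray-level co-occurrence matrix at distance d, angle theta."""
    img = np.asarray(img)
    # quantize to a few gray levels so the co-occurrence counts are dense
    q = (img.astype(float) * levels // (img.max() + 1)).astype(int)
    dr, dc = (d * o for o in OFFSETS[theta])
    rows, cols = q.shape
    p = np.zeros((levels, levels))
    for r in range(max(0, -dr), min(rows, rows - dr)):
        for c in range(max(0, -dc), min(cols, cols - dc)):
            p[q[r, c], q[r + dr, c + dc]] += 1
    return p / p.sum()   # joint probability of gray-level pairs

def homogeneity(p):
    # Eq. (1): sum of squared joint probabilities; uniform texture -> large h
    return float((p ** 2).sum())

def entropy(p):
    # Eq. (2): entropy of the co-occurrence distribution; complex texture -> large e
    nz = p[p > 0]
    return float(-(nz * np.log2(nz)).sum())
```

A perfectly flat region concentrates all co-occurrence mass in one cell (h = 1, e = 0), while heterogeneous texture spreads it out, which is exactly the normal/abnormal contrast the features exploit.<br />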

To gain a highly descriptive representation, textural features<br />

are computed from the wavelet domain. Extracting<br />

features from the wavelet domain will result in a localized<br />

texture description, since the DWT has excellent space-localization<br />

properties. To account for oriented texture, the<br />

GCMs are computed at various angles in the wavelet domain<br />

at d = 1 to account for fine texture. Typically, the DWT is not<br />

used for texture analysis due to its shift-variant property.<br />

However, using the SIDWT algorithm previously described<br />

will permit the extraction of a consistent feature set,<br />

thus allowing for multiscale texture analysis. This scheme is<br />

devised to be in accordance with human texture perception.<br />

1) Multiresolutional Features: To examine texture features<br />

at various scales, GCMs p (l1, l2, d, θ) are computed<br />

from the wavelet domain for each scale j at several angles<br />

θ. Each subband isolates different frequency components -<br />

the HL band isolates horizontal edge components, the LH<br />

subband isolates horizontal edges, the HH band captures the<br />

diagonal high frequency components and LL band contains<br />

the lowpass filtered version of the original. Consequently,<br />

to capture these oriented texture components, the GCM is<br />

computed at 0 ◦ in the HL band, 90 ◦ in the LH subband,<br />

45 ◦ and 135 ◦ in the HH band and 0 ◦ , 45 ◦ , 90 ◦ and 135 ◦ in<br />

the LL band to account for any directional elements which<br />

may still be present in the low frequency subband.<br />

From these GCMs, homogeneity h(θ) and entropy e(θ) are<br />

computed for each decomposition level using Equations 1 and<br />

2. For each decomposition level j, more than one directional<br />

feature is generated for the HH and LL subbands. The<br />

features in these subbands are averaged so that: features<br />

are not biased to a particular orientation of texture and<br />

the representation will offer some rotational invariance. The<br />

features generated in these subbands (HH and LL) are<br />

shown below. Note that the quantity in parenthesis is the<br />

angle at which the GCM was computed.<br />

h^j_HH = (1/2) [h^j_HH(45°) + h^j_HH(135°)],  (3)<br />

e^j_HH = (1/2) [e^j_HH(45°) + e^j_HH(135°)],  (4)<br />

h^j_LL = (1/4) [h^j_LL(0°) + h^j_LL(45°) + h^j_LL(90°) + h^j_LL(135°)],  (5)<br />

e^j_LL = (1/4) [e^j_LL(0°) + e^j_LL(45°) + e^j_LL(90°) + e^j_LL(135°)].  (6)<br />


Fig. 1. Typical images of the small bowel captured by the PillCam™ SB, which exhibit textural characteristics. (a) Small bowel lymphoma, (b) GIST<br />

tumour, (c) polypoid mass, (d) healthy small bowel, (e) normal small bowel, (f) normal colonic mucosa.<br />

As a result, for each decomposition level j, two feature sets<br />

are generated:<br />

F^j_h = {h^j_HL(0°), h^j_LH(90°), h^j_HH, h^j_LL},  (7)<br />

F^j_e = {e^j_HL(0°), e^j_LH(90°), e^j_HH, e^j_LL},  (8)<br />

where h^j_HH, h^j_LL, e^j_HH and e^j_LL are the averaged texture<br />

descriptions from the HH and LL bands previously described<br />

and h^j_HL(0°), e^j_HL(0°), h^j_LH(90°) and e^j_LH(90°) are homogeneity<br />

and entropy texture measures extracted from the HL<br />

and LH bands. Since directional GCMs are used to compute<br />

the features in each subband, the final feature representation<br />

is not biased for a particular orientation of texture and may<br />

provide a semi-rotational invariant representation.<br />
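The orientation selectivity that motivates the angle-per-subband scheme can be checked with a one-level separable Haar decomposition (a sketch; the paper does not state which wavelet filter was used, and the subband naming follows its HL/LH/HH/LL convention):<br />

```python
import numpy as np

def haar_dwt2(img):
    """One-level separable Haar DWT returning the LL, HL, LH, HH subbands
    (HL: high-pass along rows -> responds to vertical edges;
     LH: high-pass along columns -> responds to horizontal edges)."""
    lo = (img[:, 0::2] + img[:, 1::2]) / 2     # low-pass across columns
    hi = (img[:, 0::2] - img[:, 1::2]) / 2     # high-pass across columns
    ll = (lo[0::2, :] + lo[1::2, :]) / 2
    lh = (lo[0::2, :] - lo[1::2, :]) / 2       # horizontal edges
    hl = (hi[0::2, :] + hi[1::2, :]) / 2       # vertical edges
    hh = (hi[0::2, :] - hi[1::2, :]) / 2       # diagonal detail
    return ll, hl, lh, hh
```

Vertical stripes excite only the HL subband and horizontal stripes only the LH subband, which is why the paper evaluates the GCM at 0° in HL and 90° in LH.<br />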

E. Classifier<br />

A large number of test samples are required to evaluate a<br />

classifier with low error (misclassification) rates since a small<br />

database will cause the parameters of the classifier to be estimated<br />

with low accuracy. This requires the biomedical image<br />

database to be large, which may not always be the case since<br />

acquiring the images is not always easy and the number<br />

of pathologies may be limited. If the extracted features are<br />

strong (i.e. the features are mapped into nonoverlapping<br />

clusters in the feature space) the use of a simple classification<br />

scheme will be sufficient in discriminating between classes.<br />

Therefore, linear discriminant analysis (LDA) will be the<br />

classification scheme used in conjunction with the Leave One<br />

Out Method (LOOM). LOOM combats the small sample size<br />

scenario by removing one sample from the whole set and<br />

generating the discriminant functions from the remaining<br />

N − 1 data samples. Using these discriminant scores, the<br />

left out sample is classified. This procedure is completed<br />

for all N samples. LOOM allows classifier parameters to be<br />

estimated with least bias [8].<br />
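A minimal numeric sketch of LDA evaluated with the leave one out method, using synthetic two-class feature vectors in place of the wavelet texture features (all names and the regularization constant are illustrative):<br />

```python
import numpy as np

def lda_train(X, y):
    """Two-class LDA: pooled within-class covariance and class means give
    a linear discriminant w.x + b > 0 -> class 1."""
    m = [X[y == c].mean(axis=0) for c in (0, 1)]
    S = sum(np.cov(X[y == c].T, bias=True) * (y == c).sum() for c in (0, 1))
    S = S / len(y) + 1e-6 * np.eye(X.shape[1])   # regularize for small samples
    w = np.linalg.solve(S, m[1] - m[0])
    b = -0.5 * (m[1] + m[0]) @ w + np.log((y == 1).mean() / (y == 0).mean())
    return w, b

def loom_accuracy(X, y):
    # leave one out: train on N-1 samples, classify the held-out one
    hits = 0
    for i in range(len(y)):
        keep = np.arange(len(y)) != i
        w, b = lda_train(X[keep], y[keep])
        hits += int((X[i] @ w + b > 0) == (y[i] == 1))
    return hits / len(y)
```

Each of the N samples is held out exactly once, so every classification decision is made by a classifier that never saw that sample, which is what reduces the bias of the estimated error rate.<br />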

III. RESULTS AND DISCUSSION<br />

The objective of the proposed system is to automatically<br />

classify various pathologies from normal regions throughout<br />

the small bowel tract. The small intestine images used<br />

are 256×256, 24bpp and lossy (.jpeg). Forty-one normal<br />

and 34 abnormal (including: various sized lesions such as<br />

submucosal masses, lymphomas, jejunal carcinomas, polypoid<br />

masses, Kaposi’s sarcomas, multifocal carcinomas, etc.)<br />

images were used for experimentation (ground truth is<br />

supplied with the database and images are acquired from<br />


TABLE I<br />

CONFUSION MATRIX CONTAINING THE NUMBER OF CORRECTLY<br />

CLASSIFIED SMALL BOWEL IMAGES AS EITHER NORMAL OR ABNORMAL<br />

(ROWS: TRUE CLASS; COLUMNS: ASSIGNED CLASS).<br />

Normal Abnormal<br />

Normal 35 (85%) 6 (15%)<br />

Abnormal 5 (15%) 29 (85%)<br />

various patients). The images were converted to grayscale<br />

prior to any processing to examine the feature set in this<br />

domain. Features were extracted for the first five levels of<br />

decomposition. Further decomposition levels will result in<br />

subbands of 8×8 or smaller, which will result in skewed<br />

probability distribution (GCM) estimates and thus were not<br />

included in the analysis. Therefore, the extracted features are<br />

F^j_h and F^j_e for j = {1, 2, 3, 4, 5}. The block diagram of the<br />

proposed system is shown in Figure 2.<br />

In order to find the optimal sub-feature set, an exhaustive<br />

search was performed (i.e. all possible feature combinations<br />

were tested using the proposed classification scheme). The<br />

optimal classification performance was achieved by combining<br />

homogeneity features from the first and third decomposition<br />

levels with entropy from the first decomposition level.<br />

These three feature sets are shown below:<br />

F^1_h = {h^1_HL(0°), h^1_LH(90°), h^1_HH, h^1_LL},  (9)<br />

F^3_h = {h^3_HL(0°), h^3_LH(90°), h^3_HH, h^3_LL},  (10)<br />

F^1_e = {e^1_HL(0°), e^1_LH(90°), e^1_HH, e^1_LL}.  (11)<br />

Using the above features in conjunction with LOOM and<br />

LDA, the classification results for the small bowel images<br />

are shown as a confusion matrix in Table I. A total of 75<br />

abnormal and normal bowel images were classified with<br />

85% specificity and 85% sensitivity. The classification<br />

rates are high even though: (1) the angle of the<br />

camera (or viewing angle) is different from image to image,<br />

(2) the images came from various patients and different<br />

regions within the gastrointestinal tract, (3) the pathologies<br />

were not restricted to a specific type, but in fact included<br />

many diseases and (4) the masses and lesions were of<br />

various sizes and shapes. The success of the system can be<br />

attributed to several factors. Firstly, the utilization of the<br />

DWT was important to gain a space-localized representation<br />

of the images’ nonstationary properties. Secondly, the choice<br />

of wavelet-based statistical texture measures (entropy and<br />

homogeneity) was critical in differentiating between the<br />

localized texture properties of the images, since abnormal<br />



images contain localized heterogeneous texture elements,<br />

whereas normal images are smooth (uniform). Utilization of<br />

the SIDWT allowed for the extraction of consistent (i.e. shiftinvariant)<br />

features. Furthermore, due to the scale-invariant<br />

basis functions of the DWT, pathologies of varying sizes<br />

were captured within one transformation (i.e. the features<br />

were scale-invariant).<br />

The system is relatively robust to the different camera<br />

angles by design. Since the viewing angle is different from<br />

image to image, features were collected at various angles (0 ◦ ,<br />

45 ◦ , 90 ◦ , 135 ◦ ) in the respective subbands in order to account<br />

for the texture properties, regardless of the orientation. The<br />

feature set thus offered a semi-rotational invariant representation<br />

which could account for oriented textural properties at<br />

various angles within the gastrointestinal tract.<br />

Since this is the first work in the area of small bowel<br />

image classification, the results are promising and show<br />

great potential for applications such as CAD and CBIR. This<br />

is especially true since all features were extracted in a fullyautomated<br />

manner without any intervention or assistance<br />

from a gastroenterologist. This means that such a system<br />

could in fact be used as a tool which could either (1) sort<br />

the 8 hours of film and highlight suspicious regions or (2)<br />

automatically retrieve a specific region or mass, without<br />

having to use text annotations.<br />

Although the classification results are high, any misclassification<br />

can be attributed to cases where there is a lack<br />

of statistical differentiation between the texture uniformity<br />

of the abnormal and normal small bowel images. Additionally,<br />

normal tissue can sometimes assume the properties<br />

of abnormal regions; for example, consider a normal small<br />

bowel image which has more than the average amount of<br />

folds. This may be characterized as non-uniform texture and<br />

consequently would be misclassified.<br />

Another important consideration arises from the size of<br />

the database. As was stated in Section II-E, the number of<br />

images used for classification can determine the accuracy<br />

of the estimated classifier parameters. Since only a modest<br />

number of images were used, misclassification could result<br />

due to the lack of proper estimation of the classifier’s parameters<br />

(although the scheme tried to combat this with LOOM).<br />

Additionally, finding the right trade-off between the number of<br />

features and database size is an ongoing research topic and<br />

has yet to be perfectly defined [8].<br />

A last point for discussion is the fact that features were<br />

successfully extracted from the compressed domain. Since<br />

many forms of multi-media are being stored in lossy formats,<br />

it is important that classification systems may also be<br />

successful when utilized in the compressed domain.<br />

Fig. 2. System block diagram for the classification of small bowel images.<br />


IV. CONCLUSIONS<br />

A unified feature extraction and classification scheme was<br />

developed using the DWT for small bowel images and this<br />

is the first reported work in the area. Textural features<br />

were extracted from the wavelet domain in order to obtain<br />

localized numerical descriptors of the relative homogeneity<br />

of the small bowel images. To ensure the DWT representation<br />

was suitable for the consistent extraction of features, a shiftinvariant<br />

discrete wavelet transform (SIDWT) was computed.<br />

To combat the small database size, a small number of<br />

features and LDA classification were used in conjunction<br />

with the LOOM to gain a more accurate approximation of<br />

the classifier’s parameters.<br />

Seventy-five abnormal and normal bowel images were<br />

classified with 85% specificity and 85%<br />

sensitivity. The success of the system can be attributed<br />

to the semi-rotational invariant, scale-invariant and shiftinvariant<br />

features, which permitted the extraction of discriminating<br />

features for multiple camera angles and various sized<br />

pathologies. Due to the success of the proposed work, it may<br />

be used in a CAD scheme or a CBIR application, to assist the<br />

gastroenterologists in diagnosing and sorting 8 hours of footage.<br />

REFERENCES<br />

[1] B. Kim, S. Park, C. Jee, and S. Yoon, “An earthworm-like locomotive<br />

mechanism for capsule endoscopes,” in Proc. International Conference on<br />

Intelligent Robots and Systems, Aug. 2005, pp. 2997–3002.<br />

[2] Given Imaging Ltd., PillCam™ SB Capsule Endoscopy. [Online],<br />

2006, http://www.givenimaging.com/.<br />

[3] M. Unser and A. Aldroubi, “A review of wavelets in biomedical<br />

applications,” Proceedings of the IEEE, vol. 84, no. 4, pp. 626–638,<br />

Apr. 1996.<br />

[4] S. Mallat, A Wavelet Tour of Signal Processing. USA: Academic Press,<br />

1998.<br />

[5] J. Liang and T. Parks, “Image coding using translation invariant wavelet<br />

transforms with symmetric extensions,” IEEE Transactions on Image<br />

Processing, vol. 7, pp. 762–769, May 1998.<br />

[6] A. Khademi, “Multiresolutional analysis for classification and compression<br />

of medical images,” Master’s thesis, Ryerson University, Canada,<br />

2006.<br />

[7] B. Julesz, “Textons, the elements of texture perception, and their<br />

interactions,” Nature, vol. 290, no. 5802, pp. 91–97, Mar. 1981.<br />

[8] K. Fukunaga and R. Hayes, “Effects of sample size in classifier design,”<br />

IEEE Transactions on Pattern Analysis and Machine Intelligence,<br />

vol. 11, no. 8, pp. 873–885, Aug. 1989.<br />



This full text paper was peer reviewed at the direction of IEEE Communications Society subject matter experts for publication in the ICC 2007 proceedings.<br />

Interference Detection in Spread Spectrum<br />

Communication Using Polynomial Phase Transform<br />

Randa Zarifeh and Nandini Alinier<br />

School of Electronics, Communication<br />

and Electrical Engineering<br />

University of Hertfordshire<br />

Hertfordshire, UK<br />

Email: rzarifeh@ee.ryerson.ca, n.d.alinier@herts.ac.uk<br />

Abstract—In this paper we propose an interference detection<br />

technique for detecting time varying jamming signals in spread<br />

spectrum communication systems. The technique is based on<br />

Discrete Polynomial Phase Transform (DPPT), where the jamming<br />

signal is synthesized from the modulated spread spectrum<br />

signal using the DPPT. The technique has shown good performance<br />

under low interference conditions with 2 dB SJR, where the<br />

correlation coefficient between the synthesized chirp signal and<br />

the reference chirp is 0.9. The computational complexity of the<br />

proposed technique is low compared to other techniques such as<br />

Hough-Radon Transform. This interference detection technique<br />

can be applied for different interference excision methods in<br />

military and wireless communication applications.<br />

I. INTRODUCTION<br />

The most commonly used type of spread spectrum signal is<br />

the direct sequence (DS/SS) spread spectrum signal, where a<br />

pseudorandom (PN) sequence is superimposed upon the data<br />

bits to achieve data spreading over a wider bandwidth. This<br />

increase in the bandwidth yields a processing gain, defined<br />

as the ratio of the bandwidth of the transmitted signal to the<br />

bandwidth of the message signal. The spread spectrum signal<br />

is not easily detected since it appears to be noise-like except<br />

at the intended receiver where the PN sequence is known.<br />
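Spreading and despreading can be sketched numerically; the parameters below (8 BPSK data bits, a length-31 random code standing in for a PN m-sequence) are illustrative:<br />

```python
import numpy as np

rng = np.random.default_rng(1)
bits = rng.choice([-1.0, 1.0], size=8)     # BPSK data bits
pn = rng.choice([-1.0, 1.0], size=31)      # stand-in for a PN m-sequence

# spreading: each bit is multiplied by the whole PN chip sequence,
# widening the bandwidth by the processing gain G = 31 (about 14.9 dB)
tx = np.repeat(bits, pn.size) * np.tile(pn, bits.size)

# channel adds noise; the transmitted chips look noise-like to others
rx = tx + 0.5 * rng.normal(size=tx.size)

# despreading at the intended receiver, which knows the PN code:
# correlate each chip block against the code and take the sign
decisions = np.sign(rx.reshape(bits.size, pn.size) @ pn)
```

The correlator output is 31 times the bit amplitude plus a small noise term, which is exactly the processing gain advantage: the wanted signal adds coherently over the chips while noise does not.<br />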

The DS spread spectrum signals are often used for their<br />

interference rejection capabilities in military and wireless communications.<br />

While SS systems can strongly reject narrowband<br />

interference, they fail in rejecting wideband interference. In<br />

practical systems, it is not possible to transmit a high-power<br />

wideband jamming signal due to power limitations. For this<br />

reason, most jamming signals are considered to<br />

be wideband signals with a narrowband instantaneous frequency,<br />

such as chirp signals or linear or nonlinear FM signals. The<br />

performance of the SS system can be further improved by<br />

detecting the interference/jamming and excising it prior to data<br />

despreading and detection.<br />
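For an order-2 (linear FM) jammer, DPPT-based detection can be sketched as follows: the conjugate-delay product of a chirp is a pure tone whose frequency gives the chirp rate, demodulation then exposes the linear term, and the chirp synthesized from the two estimates is correlated with the observation. The code and thresholds are illustrative, not the paper's implementation:<br />

```python
import numpy as np

def dppt2(x, tau):
    """Order-2 discrete polynomial phase transform: for x(n) with phase
    a2*n^2 + a1*n + a0, x(n)*conj(x(n-tau)) is a tone at frequency 2*a2*tau."""
    return x[tau:] * np.conj(x[:-tau])

def estimate_chirp(x, tau, nfft=8192):
    w = 2 * np.pi * np.fft.fftfreq(nfft)          # angular frequency grid
    a2 = w[np.argmax(np.abs(np.fft.fft(dppt2(x, tau), nfft)))] / (2 * tau)
    n = np.arange(x.size)
    demod = x * np.exp(-1j * a2 * n ** 2)         # peel off the quadratic phase
    a1 = w[np.argmax(np.abs(np.fft.fft(demod, nfft)))]
    return a2, a1

def chirp_correlation(x, a2, a1):
    # synthesize the reference chirp and measure the correlation coefficient
    n = np.arange(x.size)
    ref = np.exp(1j * (a2 * n ** 2 + a1 * n))
    return np.abs(np.vdot(ref, x)) / (np.linalg.norm(ref) * np.linalg.norm(x))
```

A correlation coefficient above a chosen threshold (the abstract reports 0.9 at 2 dB SJR) declares a chirp jammer present; only two FFTs are required, compared with a search over all lines in the TF plane for the Hough-Radon transform.<br />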

Several research efforts have addressed the area of<br />

interference (chirp) detection; some of the proposed methods<br />

use adaptive filters [1], evolutionary algorithms [2], and<br />

maximum likelihood estimation. The optimal method is the<br />

maximum likelihood technique which integrates along the<br />

Instantaneous Frequency (IF) ridge in the time frequency<br />

distribution. However, if the initial information on the position of<br />

1-4244-0353-7/07/$25.00 ©2007 IEEE<br />

Sridhar Krishnan and Alagan Anpalagan<br />

Department of Electrical<br />

and Computer Engineering<br />

Ryerson University<br />

Toronto, Canada<br />

Email: (krishnan)(alagan)@ee.ryerson.ca<br />

the IF is not available, the integration will be taken along all<br />

possible lines in the TF domain. The maximum likelihood can<br />

also be applied on the IF ridge which is the result of wavelet<br />

transforms as done in [3]. This method has a high computational<br />

complexity especially when the initial estimation of the<br />

IF is not available. Another technique was proposed by Amin<br />

et al. [12], who evaluated the Wigner-Ville Distribution<br />

(WVD) of the observed signal and estimated the IF parameters<br />

from the WVD. Once the parameters have been estimated,<br />

an adaptive time varying filter can be set up to suppress the<br />

interference. One of the problems related to this method is<br />

that, if the interference is low with respect to the SS signal or<br />

the noise, the estimation of the interference parameter can fail<br />

and the suppression filter can track the useful signal instead<br />

of the interference.<br />

Linear chirp signals have also been detected by applying the Hough-Radon Transform (HRT) to the WVD or the spectrogram of the signal [4][5]. The HRT is an optimal technique for detecting directional lines in an image, but it requires a high degree of computational complexity. Other chirp detection techniques are based on signal synthesis. Early work on signal synthesis from a bilinear Time-Frequency Distribution (TFD) was done by Boudreaux-Bartels and Parks [6], who synthesized the signal from the WVD using a least-squares approximation. Krattenthaler and Hlawatsch extended the work in [6] and synthesized the chirp signal from smoothed versions of the WVD [7]. These techniques are based on least-squares approximation and therefore also have high computational complexity.<br />

The objective of this work is to detect a jamming/interfering chirp signal in a spread spectrum communication system. The interfering signal could come from an intentional jammer (hostile source) or from multipath effects in the channel. In this paper we propose a chirp detection technique based on signal synthesis, in which a parametric signal analysis approach is used to represent the time-domain chirp signal. The proposed DPPT-based solution detects the chirp jammer/interferer even under low jamming power and with low computational complexity, and is hence better than the existing approaches. The proposed technique is a good interference detection tool that can be applied prior to interference<br />


Authorized licensed use limited to: <strong>Ryerson</strong> <strong>University</strong> Library. Downloaded on July 7, 2009 at 11:52 from IEEE Xplore. Restrictions apply.<br />



excision blocks in communication systems.<br />

The paper is organized as follows: Section II describes the signal and system model, the spread spectrum system, and chirp signals. Section III defines the discrete polynomial phase transform technique. Section IV outlines numerical and simulation results, and Section V concludes the paper.<br />

II. SIGNAL AND SYSTEM MODEL<br />

A. Spread Spectrum System<br />

Assuming Binary Phase Shift Keying (BPSK) modulation, the transmitted spread spectrum signal s(t) is the product of the message signal m(t) and the spreading signal p(t),<br />

s(t) = m(t)p(t), \qquad (1)<br />

where<br />

m(t) = \sum_k b_k \,\mathrm{rect}_{T_m}(t - kT_m), \qquad (2)<br />

b_k \in \{+1, -1\} are the message bits, \mathrm{rect}_{T_m} is a rectangular pulse of duration T_m, and<br />

p(t) = \sum_{n=0}^{L-1} c_n \,\mathrm{rect}_{T_p}(t - nT_p), \qquad (3)<br />

where c_n \in \{+1, -1\} is the nth chip of the L-element PN sequence, so that<br />

s(t) = \sum_k b_k \, p(t - kT_m). \qquad (4)<br />

During transmission, additive white Gaussian noise n(t) (with zero mean and variance \sigma^2) and interference i(t) are added to the signal in the channel, and the following signal is received:<br />

r(t) = s(t) + n(t) + i(t). \qquad (5)<br />

This full text paper was peer reviewed at the direction of IEEE Communications Society subject matter experts for publication in the ICC 2007 proceedings.<br />

At the receiver, the received signal r(t) is synchronized and correlated with the same PN sequence (known to the intended receiver), and an estimate \hat{m}_k of the message signal is made based on the polarity of the recovered message bits,<br />

\hat{m}_k = \langle r(t), p(t)\rangle = m_k \langle p(t), p(t)\rangle + \langle n(t), p(t)\rangle + \langle i(t), p(t)\rangle, \qquad (6)<br />

where \langle\cdot,\cdot\rangle is the correlation operator. From (6) it can be seen that correlating the received signal with the PN sequence p(t) recovers the message signal while spreading both the noise and the interference. If the interference-to-signal power ratio is so large that the processing gain cannot suppress the interference, the estimate of the message bit will be wrong. The SS system (shown in Fig. 1) is able to recover the correct data bits at low interference, but when the interference is strong and time varying the SS system fails.<br />
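The spreading and despreading chain of (1)-(6) can be sketched numerically. The paper gives no implementation, so the chip count, bit count, noise level, and the single-tone interferer below are illustrative assumptions of this sketch:

```python
import numpy as np

rng = np.random.default_rng(0)

L = 128                                   # chips per data bit (processing gain)
bits = rng.choice([-1, 1], size=100)      # BPSK message bits b_k
pn = rng.choice([-1, 1], size=L)          # PN chip sequence c_n

# Spread: each bit modulates one full period of the PN sequence, Eqs. (1)-(4)
s = np.repeat(bits, L) * np.tile(pn, bits.size)

# Channel: AWGN plus a weak narrowband (single-tone) interferer, Eq. (5)
noise = 0.5 * rng.standard_normal(s.size)
tone = 2.0 * np.cos(2 * np.pi * 0.05 * np.arange(s.size))
r = s + noise + tone

# Despread: correlate each bit interval with the PN sequence, Eq. (6)
corr = r.reshape(-1, L) @ pn
recovered = np.sign(corr)

print((recovered == bits).mean())         # fraction of correctly recovered bits
```

Because the correlation spreads the tone while coherently summing the 128 signal chips, the weak narrowband interferer leaves the bit decisions intact in this example.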

Fig. 1. Spread Spectrum System Block Diagram.<br />

B. Chirp <strong>Signal</strong><br />

Chirp signals arise in many areas of science and engineering, including natural signals such as animal calls and whistles. Because of their interference rejection ability, they are widely used in spread spectrum communications, military communications, and radar and sonar applications.<br />

Mathematically, chirp signals are modeled as nonstationary signals with polynomial phase parameters. A polynomial phase signal y(n) can be expressed as<br />

y(n) = b_0 \exp\{j\phi(n)\} = b_0 \exp\Big\{ j \sum_{m=0}^{M} a_m (n\Delta)^m \Big\}, \qquad (7)<br />

where \phi(n) is the phase of the signal, M is the polynomial order, N is the total signal length, \Delta is the sampling interval, and b_0 is the signal amplitude.<br />

In this paper we deal with linear and parabolic (nonlinear) chirp signals as interference, whose phases are second- and third-order polynomial functions (M = 2, 3).<br />
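Equation (7) is straightforward to realize directly. A minimal sketch; the coefficient values are illustrative assumptions, chosen so the instantaneous frequency stays below half the sampling rate:

```python
import numpy as np

def poly_phase_signal(a, b0=1.0, N=1024, delta=1.0):
    """Polynomial phase signal of Eq. (7): y(n) = b0 exp(j sum_m a_m (n*delta)^m)."""
    n = np.arange(N)
    phi = sum(am * (n * delta) ** m for m, am in enumerate(a))
    return b0 * np.exp(1j * phi)

# Linear chirp: second-order phase (M = 2); parabolic chirp: third-order (M = 3).
# Coefficients keep the IF inside [0, 0.5) cycles/sample (illustrative values).
linear = poly_phase_signal([0.0, 2 * np.pi * 0.05, 2 * np.pi * 0.2 / 2048])
parabolic = poly_phase_signal([0.0, 2 * np.pi * 0.1, 0.0,
                               2 * np.pi * 0.3 / (3 * 1024 ** 2)])
```

The instantaneous frequency is the derivative of the phase: linear in n for the M = 2 signal, quadratic in n for the M = 3 signal.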

Figures 2 and 3 show the time-frequency representations of the linear and parabolic chirp signals, respectively.<br />

Fig. 2. TF representation of Linear Chirp.<br />

III. DISCRETE POLYNOMIAL PHASE TRANSFORM (DPPT)<br />

The DPPT is a parametric signal analysis approach for estimating the phase parameters of a polynomial phase signal [10] [11] [14]. Normally, the phase parameters of a signal are determined by applying a least-squares approximation to fit a polynomial to the phase curve. This poses difficulty, especially when the phase is not available. The DPPT, on the other hand, is applied directly to the signal and works quite well even in the presence of noise.<br />

Fig. 3. TF representation of Parabolic Chirp.<br />

The principle of the DPPT is as follows: when a DPPT of order M is applied to a signal with a polynomial phase of order M, it produces a spectral peak. The position of this peak, at frequency \omega_0, provides an estimate \hat{a}_M of the leading coefficient. After \hat{a}_M is estimated, the order of the polynomial is reduced from M to M - 1 by multiplying the signal by the conjugate of the estimated phase term. The coefficient \hat{a}_{M-1} is then estimated in the same way by applying a DPPT of order M - 1 to the signal. The procedure is repeated until all of the coefficients are estimated.<br />

The DPPT of order M of a polynomial phase signal y(n) is the Fourier transform of the higher-order operator DP_M[y(n), \tau]:<br />

\mathrm{DPPT}_M[y(n), \omega, \tau] = \sum_{n=(M-1)\tau}^{N-1} \mathrm{DP}_M[y(n), \tau] \exp(-j\omega n\Delta), \qquad (8)<br />

where \tau is a positive number and<br />

\mathrm{DP}_1[y(n), \tau] := y(n), \qquad (9)<br />

\mathrm{DP}_2[y(n), \tau] := y(n)\, y^*(n - \tau), \qquad (10)<br />

\mathrm{DP}_M[y(n), \tau] := \mathrm{DP}_2[\mathrm{DP}_{M-1}[y(n), \tau], \tau]. \qquad (11)<br />

The coefficient a_M is estimated based on the following formula:<br />

\hat{a}_M = \frac{1}{M!\, (\tau\Delta)^{M-1}} \arg\max_\omega \{ |\mathrm{DPPT}_M[y(n), \omega, \tau]| \}, \qquad (12)<br />

where \mathrm{DPPT}_M[y(n), \omega, \tau] is calculated as in Equation (8).<br />

The formulas for the DPPT of order one to three are shown below:<br />

\mathrm{DPPT}_1[y(n), \omega, \tau] = \mathrm{fft}\{ y(n) \}, \qquad (13)<br />

\mathrm{DPPT}_2[y(n), \omega, \tau] = \mathrm{fft}\{ y(n)\, y^*(n - \tau) \}, \qquad (14)<br />

\mathrm{DPPT}_3[y(n), \omega, \tau] = \mathrm{fft}\{ y(n)\, [y^*(n - \tau)]^2\, y(n - 2\tau) \}. \qquad (15)<br />
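The operators (8)-(15) can be sketched with an FFT. When the DPPT order matches the polynomial phase order, the DP operator collapses to a single tone and the spectrum shows one sharp peak; the chirp parameters below are illustrative assumptions:

```python
import numpy as np

def dp(y, order, tau):
    """Higher-order DP operator of Eqs. (9)-(11), applied recursively."""
    out = y.copy()
    for _ in range(order - 1):
        out = out[tau:] * np.conj(out[:-tau])   # DP2 step, Eq. (10)
    return out

def dppt(y, order, tau, nfft=8192):
    """DPPT of Eq. (8): magnitude of the Fourier transform of DP_M."""
    return np.abs(np.fft.fft(dp(y, order, tau), nfft))

# Linear chirp (second-order polynomial phase); tau = N/M as recommended later.
N = 1024
n = np.arange(N)
y = np.exp(1j * 2 * np.pi * (0.05 * n + 0.1 * n ** 2 / N))
tau = N // 2

# DP2 of this chirp is a pure tone at 0.2*tau/N = 0.1 cycles/sample,
# so the matched-order DPPT concentrates into a single peak there.
spec = dppt(y, 2, tau)
f_peak = np.argmax(spec) / 8192
```

Applying a mismatched order (e.g. a plain FFT, order 1) leaves the energy smeared across the chirp's swept band instead of concentrated in one bin.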

After the estimation of a_M, the order of the signal phase is reduced by multiplying the signal y(n) by \exp\{-j\hat{a}_M (n\Delta)^M\}:<br />

y^{(M-1)}(n) = y(n) \exp\{-j\hat{a}_M (n\Delta)^M\}. \qquad (16)<br />

To determine a_{M-1}, the DPPT of order M - 1 is applied to the signal y^{(M-1)}(n) from Equation (16). The process is repeated until all the remaining coefficients are calculated. The coefficients a_0 and b_0 are determined by the following formulas:<br />

\hat{a}_0 = \mathrm{phase}\Big\{ \sum_{n=0}^{N-1} y(n) \exp\Big( -j \sum_{m=1}^{M} \hat{a}_m (n\Delta)^m \Big) \Big\}, \qquad (17)<br />

\hat{b}_0 = \frac{1}{N} \Big| \sum_{n=0}^{N-1} y(n) \exp\Big( -j \sum_{m=1}^{M} \hat{a}_m (n\Delta)^m \Big) \Big|. \qquad (18)<br />

The final synthesized signal is<br />

\hat{y}(n) = \hat{b}_0 \exp\Big\{ j \sum_{m=0}^{M} \hat{a}_m (n\Delta)^m \Big\}. \qquad (19)<br />

Figures 4 and 5 show the result of applying second-order and third-order DPPTs to a nonlinear (parabolic) chirp with a third-order polynomial phase. The spectral peak appears only for the third-order DPPT, corresponding to the third-order polynomial phase.<br />
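Putting (12)-(19) together gives the full iterative estimation loop. The sketch below is one possible rendering of that procedure (τ = N/M, peak picking on a zero-padded FFT); the test chirp and its coefficients are illustrative assumptions, not the authors' code:

```python
import math
import numpy as np

def dp(y, order, tau):
    """Higher-order DP operator of Eqs. (9)-(11)."""
    out = y.copy()
    for _ in range(order - 1):
        out = out[tau:] * np.conj(out[:-tau])
    return out

def dppt_estimate(y, M, delta=1.0, nfft=1 << 16):
    """Estimate all phase coefficients: peak picking per Eq. (12), order
    reduction per Eq. (16), then a0 and b0 per Eqs. (17)-(18)."""
    N = len(y)
    n = np.arange(N)
    tau = N // M                                  # tau = N/M, per [10][11]
    a = np.zeros(M + 1)
    z = y.copy()
    for m in range(M, 0, -1):
        k = np.argmax(np.abs(np.fft.fft(dp(z, m, tau), nfft)))
        omega = 2 * np.pi * k / (nfft * delta)
        a[m] = omega / (math.factorial(m) * (tau * delta) ** (m - 1))   # Eq. (12)
        z = z * np.exp(-1j * a[m] * (n * delta) ** m)                   # Eq. (16)
    a[0] = np.angle(z.sum())                      # Eq. (17)
    b0 = np.abs(z.sum()) / N                      # Eq. (18)
    return a, b0

# Recover an illustrative linear chirp (M = 2) and score the fit with the
# correlation coefficient used in Section IV.
N = 1024
n = np.arange(N)
a_true = np.array([0.7, 2 * np.pi * 0.05, 2 * np.pi * 0.1 / N])
y = np.exp(1j * (a_true[0] + a_true[1] * n + a_true[2] * n ** 2))
a_hat, b0_hat = dppt_estimate(y, M=2)
y_hat = b0_hat * np.exp(1j * np.polyval(a_hat[::-1], n))                # Eq. (19)
corr = np.abs(np.vdot(y_hat, y)) / (np.linalg.norm(y_hat) * np.linalg.norm(y))
```

The estimation error of each coefficient is bounded by the FFT bin spacing divided by the Eq. (12) scale factor, which is why the leading coefficient is recovered very precisely.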

Fig. 4. Second Order DPPT.<br />

For the DPPT algorithm to work, the location \omega_0 of the spectral peak obtained when taking the Fourier transform of the DP operator has to be smaller than half the sampling frequency \omega_s:<br />

|\omega_0| = M!\, (\tau\Delta)^{M-1} |a_M| \le \frac{\omega_s}{2}. \qquad (20)<br />


Fig. 5. Third Order DPPT.<br />

This condition translates into the following requirement on the range of the coefficient a_M:<br />

|a_M| \le \frac{\pi}{M!\, \tau^{M-1} \Delta^M}, \qquad (21)<br />

and when M = 1 we have |a_1| \le \omega_s/2, which is the Nyquist criterion.<br />

The accuracy of the DPPT method depends on many factors, such as the level and type of noise, the length of the chirp signal, and the chosen value of \tau [10] [11]. The best signal estimation is achieved when \tau = N/M, where N is the signal length and M is the order of the polynomial phase of the chirp (M = 2 for a linear chirp, M = 3 for a parabolic chirp).<br />

The SNR of the jamming signal at the spectral peak \omega_0 should be at least 14 dB for good detection. The number of points used in the Fourier transform also affects the accuracy of the estimated peak position. Increasing the polynomial order likewise affects the estimation error. For example, for a third-order polynomial, any error in the coefficient a_3 makes it impossible to remove this term completely from the polynomial during the phase unwrapping step; the estimate of the next coefficient a_2 therefore suffers error as well. Similarly, the error in a_2 affects the precision of a_1 and a_0.<br />

The computational complexity of the DPPT is determined by the number of multiplications needed to synthesize a chirp of length N. The DPPT process involves calculating the ambiguity function and then taking its Fourier transform. The complexity of the ambiguity function calculation is O(N) and the complexity of the fast Fourier transform is O(N \log_2 N), so the total complexity is only O(N \log_2 N).<br />

IV. NUMERICAL AND SIMULATION RESULTS<br />

We used 128 chips per data bit for spreading the message signal and assumed a Gaussian channel, with a constant-amplitude linear or parabolic chirp as the interference source. We first evaluated the bit error rate (BER) in the presence of a linear chirp at jamming ratios between [0, 60] dB, assuming an SNR (with Gaussian noise) of -10 dB in each case. As seen in Figure 6, the bit error rate increases as the jamming ratio increases. The spread spectrum system is able to recover the data bits at a low jamming ratio of 10 dB, but as the ratio increases the system fails to recover the correct data bits.<br />
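The BER-vs-JSR trend can be reproduced in a small Monte-Carlo sketch. It mirrors the parameters stated above (128 chips per bit, -10 dB SNR), while the chirp sweep, bit count, and power-scaling conventions are assumptions of this sketch:

```python
import numpy as np

rng = np.random.default_rng(1)

def ber(jsr_db, snr_db=-10.0, L=128, nbits=2000):
    """Monte-Carlo BER of a DSSS receiver facing a linear chirp jammer.
    The unit-power signal convention and chirp sweep are assumptions."""
    bits = rng.choice([-1, 1], size=nbits)
    pn = rng.choice([-1, 1], size=L)
    s = np.repeat(bits, L) * np.tile(pn, nbits)
    t = np.arange(s.size)
    # Noise and jammer amplitudes derived from the requested SNR and JSR (dB)
    noise = np.sqrt(10 ** (-snr_db / 10)) * rng.standard_normal(s.size)
    chirp = np.sqrt(2 * 10 ** (jsr_db / 10)) * np.cos(
        2 * np.pi * (0.05 * t + 0.2 * t ** 2 / (2 * s.size)))
    r = s + noise + chirp
    recovered = np.sign(r.reshape(-1, L) @ pn)
    return (recovered != bits).mean()
```

At 0 dB JSR the 21 dB processing gain keeps the BER near zero; at 60 dB JSR the despread jammer dominates every decision statistic and the BER approaches 0.5, matching the trend of Figure 6.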

Fig. 6. BER vs. JSR results for a self-excised SS system.<br />

Next we used the proposed DPPT technique to synthesize the jamming chirp in the same spread spectrum system. The chirp used was a linear chirp with normalized instantaneous frequency varying from 0 to 0.5 Hz. We again used 128 chips per bit and 100 data bits, for a total of 12800 chips. Table I shows the correlation coefficient between the reference linear chirp and the chirp synthesized using the second-order DPPT. The first simulation was for a -0.5 dB signal-to-noise ratio (SNR) with the signal-to-jamming ratio (SJR) ranging over [6, -20] dB, and the second simulation for a -5 dB SNR over the same SJR range. The DPPT showed good results, detecting the chirp even at relatively low jamming power (correlation coefficient 0.9879 at -2 dB SJR).<br />

In the next simulation we used a parabolic chirp signal as the interference source. Figure 7 shows the spectrogram of the received signal r(t) with interference and noise added.<br />

Table II shows the correlation coefficient between the reference chirp and the synthesized chirp, with a third-order DPPT applied to the signal. The first simulation was for a -0.5 dB signal-to-noise ratio with the signal-to-jamming ratio in the range [6, -20] dB, and the second simulation for -5 dB<br />


TABLE I<br />
RESULTS WITH LINEAR CHIRP<br />

SJR (dB)   Corr-Coeff (SNR = -0.5 dB)   Corr-Coeff (SNR = -5 dB)<br />
6          0.7916                       0.0469<br />
4          0.9878                       0.9480<br />
2          0.9879                       0.9874<br />
0          0.9878                       0.9879<br />
-2         0.9879                       0.9879<br />
-4         0.9881                       0.9880<br />
-6         0.9880                       0.9880<br />
-8         0.9880                       0.9880<br />
-10        0.9880                       0.9880<br />
-15        0.9880                       0.9880<br />
-20        0.9880                       0.9880<br />

Fig. 7. Received signal with chirp interference and Gaussian noise.<br />

signal-to-noise ratio with the jamming ratio in the same range, [6, -20] dB.<br />

In these simulations the DPPT performed better on the parabolic chirp than on the linear chirp because the frequency variation of the parabolic chirp ([0.6, 0.9] Hz) was smaller than that of the linear chirp ([0, 0.5] Hz). The experimental results show that the proposed technique can successfully detect chirp-like interference in spread spectrum systems.<br />

TABLE II<br />
RESULTS WITH PARABOLIC CHIRP<br />

SJR (dB)   Corr-Coeff (SNR = -0.5 dB)   Corr-Coeff (SNR = -5 dB)<br />
6          0.0943                       0.0482<br />
4          0.1117                       0.0895<br />
2          0.9969                       0.1672<br />
0          0.9981                       0.9670<br />
-2         0.9987                       0.9979<br />
-4         0.9991                       0.9986<br />
-6         0.9994                       0.9993<br />
-8         0.9995                       0.9993<br />
-10        0.9997                       0.9994<br />
-15        0.9997                       0.9994<br />
-20        0.9997                       0.9996<br />

Figure 8 shows the TF representation of the synthesized parabolic chirp (Fig. 3) under a 6 dB signal-to-jamming ratio (low correlation coefficient = 0.0015).<br />


Fig. 8. Detected parabolic chirp with 6dB SJR.<br />

Previous work [5] on chirp detection using the Hough-Radon Transform (HRT) showed good performance but was computationally expensive. In that work the received signal was decomposed into its TF functions using an adaptive signal decomposition algorithm; the TF functions were mapped onto the TF plane and treated as an image, and chirps present in the TF plane were detected using the HRT. The HRT is an optimal technique for detecting lines in an image, but it requires a high degree of computational complexity.<br />

The proposed technique also outperforms previous TF distribution techniques, which provide a distribution of the signal spectrum over a period of time but, unlike the DPPT, do not inherently provide chirp parameters. In addition, TF distribution functions always suffer from a tradeoff between resolution and interference terms, which can result in incorrect synthesis and detection of the interfering signal.<br />

V. CONCLUSION<br />

A new technique was introduced for modulated interference detection in spread spectrum systems. The simulation results show that the new method provides accurate detection and estimation of linear and parabolic chirp interference. Unlike previous techniques, it detects the chirp signals even at low jamming ratios and has low computational complexity. In future work, the method will be extended to include excision of the detected interference.<br />

REFERENCES<br />

[1] Genyuan Wang and Xiang-Gen Xia, &ldquo;An adaptive filtering approach to chirp estimation and ISAR imaging of maneuvering targets,&rdquo; IEEE 2000 International Radar Conference, pp. 481-486, May 2000.<br />

[2] J.S. Dhanoa, E.J. Hughes, and R.F. Ormondroyd, &ldquo;Simultaneous detection and parameter estimation of multiple linear chirps,&rdquo; Proc. IEEE Intl. Conference on Acoustics, Speech, and <strong>Signal</strong> Processing (ICASSP), vol. 6, pp. VI-129-32, Apr 2003.<br />

[3] M. Morvidone and B. Torresani, &ldquo;Time scale approach for chirp detection,&rdquo; International Journal of Wavelets, Multiresolution and Information Processing, vol. 1, no. 1, pp. 1949, 2003.<br />




[4] A. Ramalingam and S. Krishnan, “A novel robust image watermarking<br />

using a chirp based technique,” In Proc. Canadian Conference on<br />

Electrical and Computer Engineering, pp. 1889-1892, May 2004.<br />

[5] S. Erkucuk, S. Krishnan, and M. Zeytinoglu, “Robust audio watermarking<br />

using a chirp based technique,” Proc. International Conference on<br />

Multimedia and Expo, 2003. ICME 03, pp. II - 513-16, July 2003.<br />

[6] G.F. Boudreaux-Bartels and T. Parks, &ldquo;Time-varying filtering and signal estimation using Wigner distribution synthesis techniques,&rdquo; IEEE <strong>Signal</strong> Processing Magazine, vol. 9, no. 2, pp. 21-67, April 1992.<br />

[7] W. Krattenthaler and F. Hlawatsch, &ldquo;Bilinear signal synthesis,&rdquo; IEEE Transactions on Signal Processing, vol. 40, no. 2, pp. 352-363, Feb 1992.<br />

[8] W. Krattenthaler and F. Hlawatsch, &ldquo;General signal synthesis algorithms for smoothed versions of the Wigner distribution,&rdquo; Proc. IEEE International Conference on Acoustics, Speech, and <strong>Signal</strong> Processing (ICASSP), no. 3, pp. 1611-1614, Apr 1990.<br />

[9] A. Francos and M. Porat, &ldquo;<strong>Analysis</strong> and synthesis of multicomponent signals using positive time-frequency distributions,&rdquo; IEEE Transactions on <strong>Signal</strong> Processing, vol. 47, no. 2, pp. 493-504, Feb. 1999.<br />

[10] S. Peleg and B. Friedlander, “Multicomponent signal analysis using<br />

the polynomial-phase transform,” IEEE Transactions on Aerospace and<br />

Electronic Systems, vol. 32, no. 1, pp. 378-387, Jan 1996.<br />

[11] S. Peleg and B. Friedlander, “The discrete polynomial-phase transform,”<br />

IEEE Transactions on <strong>Signal</strong> Processing,vol. 43, no. 8, pp. 1901-1914,<br />

August 1995.<br />

[12] M.G. Amin, “Interference mitigation in spread spectrum communication<br />

systems using time-frequency distributions,” IEEE Trans. <strong>Signal</strong><br />

Processing, vol. 45, no. 1, pp. 90–101, Jan 1997.<br />

[13] J.D. Laster and J.H. Reed, “Interference rejection in digital wireless<br />

communication,” IEEE <strong>Signal</strong> Processing Mag., pp. 37–62, May 1997.<br />

[14] L. Lee, “Time-frequency signal synthesis and its application in multimedia<br />

watermark detection,” Master Thesis, <strong>Ryerson</strong> <strong>University</strong>, 2005.<br />



Emotion Recognition Using Novel Speech <strong>Signal</strong><br />

Features<br />

Talieh Seyed Tabatabaei, Sridhar Krishnan<br />

Department of Electrical and Computer Engineering<br />

<strong>Ryerson</strong> <strong>University</strong><br />

Toronto, Canada<br />

{tseyedta, krishnan}@ee.ryerson.ca<br />

Abstract&mdash;Automatic Emotion Recognition (AER) is a recent research topic in the Human-Computer Interaction (HCI) field that still has much room to grow. In this contribution, a set of novel acoustic features and Least Squares Support Vector Machines (LS-SVMs) are proposed to build a speaker-independent automatic human emotion recognition system. Six discrete emotional states are classified throughout this work: happiness, sadness, anger, surprise, fear, and disgust. Different multi-class SVM methods are implemented in order to obtain the best result. The result achieved by the LS-SVM is then compared with that of a linear classifier. We achieved an overall accuracy of 81.3%.<br />

I. INTRODUCTION<br />

<strong>Research</strong> on emotion has been conducted in psychology and physiology for a long time; more recently it has become a subject of interest to engineers. Its most important application is intelligent human-machine interaction. As computers have become an integral part of our lives, the need for a more natural communication interface between humans and machines has arisen. To accomplish this goal, a computer would have to be able to perceive its present situation and respond differently depending on that perception. To make Human-Computer Interaction (HCI) more natural and friendly, it would be beneficial to give computers the ability to recognize situations the same way a human does.<br />

In today&rsquo;s HCI systems, machines can recognize the speaker and the content of the speech using speaker identification and speech recognition techniques. If machines are also equipped with emotion recognition, they can know &ldquo;how it is said,&rdquo; react more appropriately, and make the interaction more natural. Other potential applications of Automatic Emotion Recognition (AER) include psychiatric diagnosis, intelligent toys, lie detection, learning environments, customer service, educational software, and detection of the emotional state in telephone call center conversations to provide feedback to an operator or supervisor for monitoring purposes.<br />

One of the most important human communication channels is the auditory channel, which carries speech and vocal<br />

1-4244-0921-7/07 $25.00 © 2007 IEEE.<br />


Aziz Guergachi<br />

Department of Information Technology Management<br />

<strong>Ryerson</strong> <strong>University</strong><br />

Toronto, Canada<br />

a2guerga@ryerson.ca<br />

intonation. In fact, people can perceive each other&rsquo;s emotional state by the way they talk. Therefore, in this work we analyze the speech signal in order to build an automatic system that recognizes the human emotional state. Different researchers have settled on different numbers and kinds of emotional states: 3 categories of positive (joy), negative (anger, irritation), and neutral in [7]; 4 categories of neutral, anger, Lombard, and loud in [9]; and 5 categories of neutral, happiness, sadness, anger, and fear in [8]. In this work we automatically categorize six different human emotional states: anger, happiness, fear, surprise, sadness, and disgust.<br />

Some researchers have developed speaker-dependent speech emotion recognition systems [7, 12]. We believe that speaker independence is one of the intrinsic requirements of an AER system: a person-dependent system is more accurate, but it must be retrained for each new person, which is a major drawback. Here we therefore aim for high accuracy with a person-independent system by choosing the right acoustic features and a powerful classifier. While some researchers have utilized both acoustic characteristics and the textual content of an emotional spoken utterance [10, 12], we conduct our work using commonly used and newly proposed acoustic features of the speech signal only.<br />

Various classifiers have been considered for categorizing emotional states. The most common are Hidden Markov Models (HMMs) [13] and Neural Networks (NNs) [15], whereas relatively few works use Support Vector Machines (SVMs) [12]. The SVM is a relatively new approach in machine learning with a number of advantages over conventional and popular classifiers such as NNs. In this contribution we use Least Squares Support Vector Machines (LS-SVMs), a reformulation of the original SVM.<br />

The paper is organized as follows: Section II explains the emotion database used in this research. Section III demonstrates the structure of the AER system proposed in this work and the corresponding steps. In Section IV the theory of the SVM is briefly discussed. In Section V the experimental results are presented, and Section VI is the conclusion.<br />

II. THE EMOTION DATABASE<br />

The database used in this research is the one created in [16]. We believe that the results obtained in emotion recognition experiments depend strongly on the database used, so the lack of a common, good-quality database makes it hard for researchers to compare the performance of their proposed systems.<br />

The audio-visual emotion database presented in [16] is a professional reference database for testing and evaluating video, audio, or joint audio-visual emotion recognition algorithms. The final version of the database contains 42 subjects from 14 different nationalities; 81% of the subjects are men and the remaining 19% are women. First, each subject is asked to listen carefully to a short story for each of the six emotions (happiness, sadness, surprise, disgust, fear, and anger) and to immerse themselves in the situation. Once the subject is ready, he or she may read, memorize, and pronounce the five proposed utterances (one at a time), which constitute five different reactions to the given situation. The subjects are asked to be as expressive as possible, producing a message that contains only the emotion to be elicited. All subjects speak English, though possibly with different accents. All utterances were validated by two experts to ensure they are genuine.<br />

III. SPEECH EMOTION RECOGNITION SYSTEM<br />

The structure of the speech emotion recognition system<br />

used in this paper is depicted in Fig. 1.<br />

A. Preprocessing<br />

In the preprocessing stage, each signal is first de-noised by soft-thresholding its wavelet coefficients. Since the silent parts of the signal carry no useful information, those parts, including the leading and trailing edges, are eliminated by thresholding the energy of the signal. The signals are then divided into frames using a Hamming window of length 23 ms.<br />
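The framing and silence-removal steps can be sketched as follows; the sampling rate and energy threshold are assumptions of this sketch, and the wavelet soft-thresholding step is omitted (it would typically rely on a wavelet library):

```python
import numpy as np

def frame_signal(x, fs, frame_ms=23.0, energy_ratio=0.01):
    """Split a signal into 23 ms frames, drop low-energy (silent) frames,
    and apply a Hamming window; the 1% energy threshold is an assumption."""
    n = int(fs * frame_ms / 1000)
    nframes = len(x) // n
    frames = x[:nframes * n].reshape(nframes, n)
    energy = (frames ** 2).sum(axis=1)
    keep = energy > energy_ratio * energy.max()       # silence removal
    return frames[keep] * np.hamming(n)               # Hamming windowing

# Synthetic utterance: a voiced tone with leading and trailing silence
fs = 16000
t = np.arange(fs) / fs
x = np.concatenate([np.zeros(fs // 4),                 # leading silence
                    0.5 * np.sin(2 * np.pi * 220 * t), # voiced segment
                    np.zeros(fs // 4)])                # trailing silence
frames = frame_signal(x, fs)
```

The fully silent leading and trailing frames are discarded, so only the voiced portion proceeds to feature extraction.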

B. Feature Extraction<br />

We propose a set of novel acoustic features in this experiment. Most researchers use prosodic features and their statistical characteristics to classify emotions [8, 11, 13, 14, 15]. In this contribution we use the set of features listed in Table I. Among them, only the Mel Frequency Cepstrum Coefficients (MFCC) and the Zero Crossing Rate (ZCR) have previously been used for speech emotion recognition [9, 10, 11]; the rest are used for the first time in this application. All features are extracted from each frame, and the mean and standard deviation of each feature are then taken to constitute the feature vector.<br />
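Of the listed features, the ZCR is easy to illustrate. The sketch below computes per-frame ZCR and short-time energy and reduces them to mean and standard deviation as described; the other features in Table I are omitted, and the random frames merely stand in for windowed speech:

```python
import numpy as np

def zcr(frame):
    """Zero crossing rate of one frame: fraction of adjacent sign changes."""
    return np.mean(np.sign(frame[1:]) != np.sign(frame[:-1]))

def feature_vector(frames):
    """Per-frame features (here: ZCR and short-time energy) reduced to their
    mean and standard deviation across frames, as described in the text."""
    feats = np.array([[zcr(f), np.sum(f ** 2)] for f in frames])
    return np.concatenate([feats.mean(axis=0), feats.std(axis=0)])

rng = np.random.default_rng(0)
frames = rng.standard_normal((50, 368))   # stand-in for windowed 23 ms frames
v = feature_vector(frames)                # [mean ZCR, mean E, std ZCR, std E]
```

White-noise frames have a ZCR near 0.5; voiced speech frames would sit much lower, which is what gives the feature its discriminative value.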

C. Feature Selection<br />

Figure 1. The structure of the speech emotion recognition system.<br />

The performance of a pattern recognition system depends<br />

strongly on the discriminative ability of its features. By selecting<br />

the most relevant subset of the original feature set, we can<br />

increase the performance of the classifier and, on the other<br />

hand, decrease the computational complexity. We are using the<br />

forward selection method for each single binary classifier in<br />

our system in order to select the most efficient subset of<br />

features. At each step, the feature that most increases the<br />

performance of the classifier is added to the feature<br />

subset. Fig. 3 illustrates the concept.<br />
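The forward selection loop can be sketched minimally as follows; the `score` function stands in for whatever classifier evaluation is used (e.g. validation accuracy of a binary LS-SVM) and is an assumption:

```python
def forward_selection(score, n_features, k):
    """Greedy forward selection: starting from an empty set, repeatedly
    add the feature whose inclusion most improves the classifier score."""
    selected = []
    for _ in range(k):
        remaining = [f for f in range(n_features) if f not in selected]
        best = max(remaining, key=lambda f: score(selected + [f]))
        selected.append(best)
    return selected

# Toy score: features 2 and 5 are the informative ones.
informative = {2, 5}
score = lambda subset: len(informative & set(subset)) - 0.01 * len(subset)
print(forward_selection(score, n_features=8, k=2))  # [2, 5]
```

The small penalty on subset size in the toy score mimics preferring compact feature sets.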

D. Classification<br />

The recognition of human emotion is essentially a pattern<br />

recognition problem. We are using LS-SVM (described in the next<br />

section) as a classifier in this research. Since we are dealing<br />

with a multi-class classification problem, we need a method to<br />

extend our two-class support vector classification<br />

methodology to a multi-class problem. Several schemes<br />

for multi-category SVM are described in the literature,<br />

among which one-against-all and one-against-one (pairwise)<br />

are the most popular. In this paper we compare the<br />

results achieved by one-against-all, fuzzy one-against-all,<br />

pairwise, and fuzzy pairwise [17].<br />
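The one-against-one (pairwise) scheme can be sketched by majority voting over all class pairs; `binary_decide` is a hypothetical stand-in for a trained binary classifier such as the paper's binary LS-SVMs:

```python
from itertools import combinations

def pairwise_vote(binary_decide, classes, x):
    """One-against-one multi-class decision: a binary classifier for
    every pair of classes casts a vote; the most-voted class wins."""
    votes = {c: 0 for c in classes}
    for a, b in combinations(classes, 2):
        votes[binary_decide(a, b, x)] += 1
    return max(votes, key=votes.get)

# Toy stand-in: the class whose index is closest to x wins each pairing.
decide = lambda a, b, x: a if abs(a - x) < abs(b - x) else b
print(pairwise_vote(decide, classes=[0, 1, 2, 3], x=2.2))  # 2
```

The fuzzy variants replace the hard votes with membership degrees to resolve ties near class boundaries.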

For the purpose of comparison, we also apply a linear<br />

classifier trained with a gradient descent optimization<br />

algorithm.<br />

IV. SUPPORT VECTOR MACHINES<br />

SVM was first introduced by Vapnik and co-workers, and<br />

it is such a powerful method that, in the few years since its<br />

introduction, it has outperformed most other systems in a wide<br />

variety of applications. SVM is used in applications of<br />

regression and classification; however, it is mostly used as a<br />

binary classifier. SVM is based on the principle of structural<br />

risk minimization. The optimal boundary is found in such a<br />

way that maximizes the margin between two classes of data-<br />

TABLE I. LIST OF ACOUSTIC FEATURES USED FOR SPEECH EMOTION<br />

RECOGNITION<br />

Spectral Features<br />

• Shannon entropy<br />

• Renyi entropy<br />

• Spectral bandwidth<br />

• Spectral centroid<br />

• Spectral flux<br />

• Spectral roll-off frequency<br />

Authorized licensed use limited to: <strong>Ryerson</strong> <strong>University</strong> Library. Downloaded on July 7, 2009 at 11:29 from IEEE Xplore. Restrictions apply.<br />



points [1, 2, 3] (Fig. 2). SVM is based on kernel functions,<br />

which are used to map data points to a higher dimensional<br />

feature space in order to be linearly separable. The<br />

optimization problem here is the dual optimization problem<br />

which is solved by the Lagrangian method, making use of the<br />

important Karush-Kuhn-Tucker (KKT) conditions. Equation<br />

(1) shows the optimization problem for SVM classifiers:<br />

\[ \min_{w,\,b} \ \ \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \xi_i \qquad (1) \]<br />

subject to<br />

\[ y_i\left(\langle w, x_i\rangle + b\right) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1, \ldots, n \]<br />

where C is a regularization parameter which sets a trade-off<br />

between the empirical risk (reflected by the second term in<br />

(1)) and model complexity (reflected by the first term in (1)),<br />

and ξi are slack variables which are introduced to relax the<br />

constraints and make the system more noise-tolerant.<br />

The corresponding dual representation is:<br />

\[ W(\alpha) = \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{n} y_i y_j \alpha_i \alpha_j K(x_i, x_j) \qquad (2) \]<br />

subject to the constraints<br />

\[ \sum_{i=1}^{n} \alpha_i y_i = 0, \quad \alpha_i \ge 0, \quad i = 1, \ldots, n \]<br />

where $\alpha_i \ge 0$ are the Lagrange multipliers and $K(x_i, x_j)$ is<br />

the kernel function. Note that we don’t need to know the<br />

underlying mapping function, however it is necessary to<br />

define the kernel function. Among the different kernel<br />

functions, the most common kernels are polynomial,<br />

Gaussian Radial Basis Function (RBF) and multi-layer<br />

perceptron (MLP).<br />

Our final decision rule can be expressed as<br />

\[ f(x, \alpha^{*}, b) = \sum_{i=1}^{N_{sv}} y_i \alpha_i^{*} K(x_i, x) + b \qquad (3) \]<br />

where $N_{sv}$ and $\alpha_i^{*}$ denote the number of support vectors and<br />

the non-zero Lagrange multipliers corresponding to the<br />

support vectors respectively. This result reveals the important<br />

fact that only support vectors contribute to the final boundary.<br />

In fact, this is a way to mitigate the curse of dimensionality, which<br />

is a major concern for most classifiers. The dimension of the<br />

input space can be as high as it needs to be, without having to<br />

worry about having too many free parameters which usually<br />

leads to overfitting.<br />
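The decision rule of Eq. (3) can be sketched with an RBF kernel; the support vectors and multipliers below are illustrative values, not the result of training:

```python
import numpy as np

def rbf_kernel(a, b, gamma=1.0):
    """Gaussian RBF kernel K(a, b) = exp(-gamma * ||a - b||^2)."""
    return np.exp(-gamma * np.sum((a - b) ** 2))

def svm_decision(x, support_vectors, y_sv, alpha_sv, b, kernel=rbf_kernel):
    """Decision rule of Eq. (3): only the support vectors (the points
    with non-zero Lagrange multipliers) contribute to the output."""
    s = sum(y * a * kernel(sv, x)
            for sv, y, a in zip(support_vectors, y_sv, alpha_sv))
    return np.sign(s + b)

svs = [np.array([0.0, 0.0]), np.array([2.0, 2.0])]
print(svm_decision(np.array([0.1, 0.1]), svs,
                   y_sv=[-1, 1], alpha_sv=[0.5, 0.5], b=0.0))  # -1.0
```

Note that the mapping function never appears: only kernel evaluations against the support vectors are needed, which is what keeps the input dimension from inflating the number of free parameters.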

In this paper for training SVM we use the LS-SVM (Least<br />

Squares Support Vector Machine) MATLAB toolbox. LS-<br />

SVMs are reformulations to the original SVMs which lead to<br />

solving linear KKT systems [6]. In LS-SVMs the inequality<br />

constraints in SVM are replaced with equality constraints. As<br />

a result, the solution follows from solving a set of linear<br />

equations instead of the quadratic programming problem<br />

found in Vapnik's original SVM formulation, which<br />

yields a faster algorithm.<br />

The primal problem of the LS-SVMs is defined as:<br />

\[ \min_{w,\,b,\,e} \ J_p(w, b, e) = \frac{1}{2}\|w\|^2 + \gamma\,\frac{1}{2} \sum_{i=1}^{d} e_i^2 \qquad (4) \]<br />

subject to<br />

\[ y_i\left[w^{T}\varphi(x_i) + b\right] = 1 - e_i, \quad i = 1, \ldots, d \]<br />

Figure 2. A linear SVM classifier. Support vectors are those<br />

elements of the training set which are on the boundary hyperplanes of the two<br />

classes.<br />

where γ is a parameter analogous to SVM’s regularization<br />

parameter (C).<br />

The main characteristic of LS-SVMs is the low<br />

computational complexity compared to SVMs, without<br />

quality loss in the solution.<br />
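The LS-SVM training step, i.e. solving a linear KKT system instead of a quadratic program, can be sketched as follows. This is a minimal sketch of the standard LS-SVM dual system; the parameter values and toy data are illustrative assumptions:

```python
import numpy as np

def lssvm_train(X, y, gamma=10.0, kgamma=0.5):
    """Train a binary LS-SVM by solving the linear system
    [[0, y^T], [y, Omega + I/gamma]] [b; alpha] = [0; 1],
    where Omega_ij = y_i * y_j * K(x_i, x_j)."""
    n = len(y)
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    K = np.exp(-kgamma * d2)                      # RBF kernel matrix
    omega = np.outer(y, y) * K
    A = np.zeros((n + 1, n + 1))
    A[0, 1:] = y
    A[1:, 0] = y
    A[1:, 1:] = omega + np.eye(n) / gamma
    sol = np.linalg.solve(A, np.concatenate([[0.0], np.ones(n)]))
    b, alpha = sol[0], sol[1:]
    def predict(x):
        k = np.exp(-kgamma * np.sum((X - x) ** 2, axis=1))
        return np.sign(np.sum(alpha * y * k) + b)
    return predict

X = np.array([[0.0, 0.0], [0.2, 0.1], [2.0, 2.0], [1.9, 2.1]])
y = np.array([-1.0, -1.0, 1.0, 1.0])
predict = lssvm_train(X, y)
print(predict(np.array([0.1, 0.0])), predict(np.array([2.0, 1.9])))  # -1.0 1.0
```

A single call to a dense linear solver replaces the iterative QP solve of the original SVM, which is where the speed advantage comes from.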

V. EXPERIMENTAL RESULTS<br />

Our database consists of 1260 utterances, 60%<br />

of which were used exclusively for the training phase and the<br />

remaining 40% for evaluating the trained classifiers (the<br />

division was done at random). All the binary LS-SVM<br />

classifiers are trained using RBF kernel function with different<br />

regularization and kernel parameters. The linear classifiers are<br />

trained using the gradient descent algorithm and perceptron<br />

Figure 3. The performance of one of the binary LS-SVMs as a<br />

new feature is added at each iteration of the Forward Selection algorithm.<br />



criterion function. The confusion matrix and the final<br />

recognition results are presented in Table II and Table III<br />

respectively. The abbreviations in Table II stand for the six<br />

different emotions: anger, fear, disgust, happiness, sadness,<br />

and surprise, and FS in Table III means Feature Selection.<br />

As shown in Table III, the best performance (81.3%)<br />

belongs to the fuzzy-pairwise LS-SVM using the features selected<br />

by the Forward Selection algorithm. Table II shows that the most<br />

difficult emotion to recognize in our experiment is surprise,<br />

and the easiest are sadness and happiness. Fear and<br />

sadness have the highest probability of being confused with each<br />

other.<br />

VI. CONCLUSION<br />

In this contribution, we introduced a set of new acoustic<br />

features which are used for the first time in the application of<br />

AER. For classification we used the LS-SVM, which is a recent<br />

and powerful classifier with many advantages over other<br />

conventional and popular classifiers such as Neural Networks.<br />

We also implemented different schemes to adapt our binary<br />

classifiers to a multi-category problem. The result of a Linear<br />

Classifier is compared with LS-SVM performance. We<br />

achieved an overall classification accuracy of 81.3% with<br />

fuzzy-pairwise LS-SVM.<br />

TABLE II. CONFUSION MATRIX OF THE LS-SVM CLASSIFIER<br />

(FUZZY PAIRWISE WITH FEATURE SELECTION)<br />

Recognized Emotions (%)<br />

Ang Fea Dis Hap Sad Sur<br />

Ang 83.3 0 2.7 6.4 2.7 4.6<br />

Fea 1.8 71.9 7.4 1.8 13 3.7<br />

Dis 4.6 5.5 79.6 0 3.7 6.4<br />

Hap 1.8 1.8 0 92.4 1.8 1.8<br />

Sad 0 6.1 0.9 0 90.5 2.3<br />

Sur 11.1 9.2 5.5 4.6 13.8 55.5<br />

TABLE III. FINAL RECOGNITION RESULTS<br />

Recognition Rate<br />

One-Vs-All SVM 44.9%<br />

fuzzy One-Vs-All SVM 53.6%<br />

Pairwise SVM 74.5%<br />

fuzzy pairwise SVM 78.4%<br />

fuzzy pairwise SVM, FS 81.3%<br />

fuzzy pairwise LDA 37.7%<br />


REFERENCES<br />

[1] N. Cristianini and J. Shawe-Taylor, An Introduction to Support Vector<br />

Machines and Other Kernel-based Methods. United Kingdom: Cambridge<br />

<strong>University</strong> Press, 2000.<br />

[2] C. J. Burges, “A tutorial on support vector machines for pattern<br />

recognition,” Data Mining and Knowledge Discovery, vol. 2, pp. 121-167,<br />

June 1998.<br />

[3] I. El-Naqa, Y. Yang, M. N. Wernick, N. P. Galatsanos, and R. M.<br />

Nishikawa, “A support vector machine approach for detection of<br />

microcalcifications,” IEEE Trans. Med. Imag., vol. 21, no. 12,<br />

December 2002.<br />

[4] P. H. Chen, C. J. Lin, and B. Schölkopf, “A tutorial on ν-support<br />

vector machines,” unpublished.<br />

[5] B. Schölkopf and A. J. Smola, Learning with Kernels: Support Vector<br />

Machines, Regularization, Optimization, and Beyond. Cambridge, MA:<br />

MIT Press, 2002.<br />

[6] J. A. K. Suykens, T. V. Gestel, J. D. Brabanter, B. D. Moor, and J.<br />

Vandewalle, Least Squares Support Vector Machines. Singapore: World<br />

Scientific Publishing Co. Pte. Ltd., 2002.<br />

[7] S. Hoch, F. Althoff, G. McGlaun, and G. Rigoll, “Bimodal fusion of<br />

emotional data in an automotive environment,” Proceedings of the IEEE<br />

International Conference on Acoustics, Speech, and <strong>Signal</strong> Processing,<br />

vol. 2, pp. 1085-1088, 18-23 March 2005.<br />

[8] C. A. Martinez and A. B. Cruz, “Emotion recognition in non-structured<br />

utterances for human-robot interaction,” IEEE International Workshop<br />

on Robot and Human Interactive Communication, pp. 19-23, 13-15 Aug.<br />

2005.<br />

[9] T. Nguyen and I. Bass, “Investigation of combining SVM and decision<br />

tree for emotion classification,” Seventh IEEE International<br />

Symposium on Multimedia, pp. 540-544, 2005.<br />

[10] Z. J. Chuang and C. H. Wu, “Emotion recognition using acoustic features<br />

and textual content,” IEEE International Conference on Multimedia and<br />

Expo, vol. 1, pp. 53-56, 27-30 June 2004.<br />

[11] Y. L. Lin and G. Wei, “Speech emotion recognition based on HMM and<br />

SVM,” Proceedings of the International Conference on Machine Learning<br />

and Cybernetics, vol. 8, pp. 4898-4901, 18-21 Aug. 2005.<br />

[12] B. Schuller, G. Rigoll, and M. Lang, “Speech emotion recognition<br />

combining acoustic features and linguistic information in a hybrid<br />

support vector machine-belief network architecture,” Proceedings of the<br />

IEEE International Conference on Acoustics, Speech, and <strong>Signal</strong><br />

Processing, vol. 1, pp. I-577-580, 17-21 May 2004.<br />

[13] B. Schuller, G. Rigoll, and M. Lang, “Hidden Markov model-based<br />

speech emotion recognition,” Proceedings of the IEEE International<br />

Conference on Acoustics, Speech, and <strong>Signal</strong> Processing, pp. I-401-404,<br />

6-10 April 2003.<br />

[14] J. Nicholson, K. Takahashi, and R. Nakatsu, “Emotion recognition in<br />

speech using Neural Networks,” Proceedings of the 6th International<br />

Conference on Neural Information Processing, vol. 2, pp. 495-501,<br />

1999.<br />

[15] V. A. Petrushin, “Creating emotion recognition agents for speech<br />

signal,” unpublished.<br />

[16] O. Martin, I. Kotsia, B. Macq, and I. Pitas, “The eNTERFACE’05<br />

audio-visual emotion database,” Proceedings of the 22nd International<br />

Conference on Data Engineering Workshops, 3-7 April 2006.<br />

[17] D. Tsujinishi, Y. Koshiba, and S. Abe, “Why pairwise is better than<br />

one-against-all or all-at-once,” Proceedings of the IEEE International<br />

Conference on Neural Networks, vol. 1, pp. 693-698, July 2004.<br />



A WATERMARKING METHOD FOR SPEECH SIGNALS BASED ON THE TIME–WARPING<br />

SIGNAL PROCESSING CONCEPT<br />

Cornel Ioana (1) , Arnaud Jarrot (2) , André Quinquis (2) , Sridhar Krishnan (3)<br />

(1) LIS Laboratory<br />

BP 46, 961 rue de la Houille Blanche<br />

38402 Saint-Martin d’Hères cedex, FRANCE<br />

phone: +33(0) 476 826 422<br />

email: cornel.ioana@lis.inpg.fr<br />

ABSTRACT<br />

This paper deals with the watermarking of audio speech signals,<br />

which consists of introducing an imperceptible mark into<br />

a signal. To this end, we suggest using an amplitude modulated<br />

signal that mimics a formantic structure present in the<br />

signal. This allows us to exploit the time-masking effect occurring<br />

when two signals are close in the time–frequency plane.<br />

From this embedding scheme, a watermark extraction method<br />

based on nonstationary linear filtering and matched filter detection<br />

is proposed in order to recover the information carried by<br />

the watermark. Numerical results obtained on a real speech<br />

signal show that the watermark is not likely to be audible and the information<br />

carried by the watermark is easily retrievable.<br />

Index Terms— Watermarking, Time–warping signal processing,<br />

Time–frequency analysis.<br />

1. INTRODUCTION<br />

Today’s digital media have opened the door to an information<br />

era where the true value of a product is generally dissociated<br />

from any physical medium. While it enables a high degree<br />

of flexibility in its distribution, the commerce of data without<br />

any physical media raises serious copyright issues. Data can<br />

be easily duplicated turning piracy into a simple data copy<br />

process.<br />

In order to secure the identity of the owner of a medium, one<br />

solution consists of hiding digital subcodes inside the data, since<br />

no physical medium can be used for this purpose. This problem<br />

is generally referred to as watermarking [1]. The main<br />

rules in the watermarking context are:<br />

• The watermark should not be discernible from the<br />

media in order to keep the integrity of the media.<br />

• The watermark should be easily retrievable. Given the<br />

appropriate a priori information, the inserted watermark should be recovered,<br />

as well as the digital subcodes carried by the<br />

watermark.<br />

(2) E 3 I 2 Laboratory (EA 3876) – ENSIETA,<br />

2 Rue François Verny, 29806, Brest,<br />

FRANCE<br />

phone: +33(0) 298 348 720<br />

emails: [jarrotar, quinquis]@ensieta.fr<br />

(3) Department of Electrical Engineering –<br />

<strong>Ryerson</strong> <strong>University</strong><br />

350 Victoria Street, Toronto, CANADA<br />

phone: 416.979.5000 x6086<br />

email: krishnan@ee.ryerson.ca<br />

• The watermark should be robust to attacks (e.g. compression<br />

or noise insertion), since these phenomena<br />

often occur in media transmissions.<br />

In this paper we propose a watermarking procedure that<br />

attempts to exploit the time–frequency region available between<br />

two formants. We suggest using, for the watermark, an<br />

amplitude modulated signal whose carrier frequency is modulated<br />

according to the modulation law of a formant. In this<br />

way, the time-frequency content of the watermark follows the<br />

time-frequency content of the formant. This allows us to put<br />

the watermark signal very close to the formant. As will be<br />

seen, this embedding strategy makes the watermark nearly<br />

imperceptible from an acoustical point of view. The recovery<br />

of the watermark is ensured by nonstationary linear filtering<br />

and a matched filtering method. Numerical results show that<br />

the watermark can be easily recovered, as well as the coded<br />

sequence carried by the watermark.<br />

The paper is organized as follows. Section 2 is devoted<br />

to a short presentation of the time–warping signal processing<br />

concept. Based on this concept, a new watermarking procedure<br />

is proposed in Section 3. Numerical results presented<br />

in Section 4 illustrate the benefits of the proposed technique.<br />

Concluding remarks are given in Section 5.<br />

2. TIME–WARPING SIGNAL PROCESSING<br />

CONCEPT<br />

2.1. Non-unitary Time–Warping Operators<br />

Let $x(t) \in L^2(\mathbb{R})$ be a square-integrable signal. The set of<br />

unitary time-warping operators $\{W : w(t) \in C^1,\ \dot w(t) \ge 0 :<br />

x(t) \to (Wx)(t)\}$ is defined in [2] by<br />

\[ (Wx)(t) = |\dot w(t)|^{1/2}\, x(w(t)), \qquad (1) \]<br />

where $\dot w(t)$ stands for the derivative of the warping function<br />

$w(t)$ with respect to $t$. Properties of this transformation<br />

include linearity and unitary equivalence, since the envelope<br />

$|\dot w(t)|^{1/2}$ preserves the energy of the signal at the output of $W$.<br />
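On sampled signals, the operator of Equ. 1 can be sketched as follows (the discretization by linear interpolation is an implementation assumption, not part of the paper):

```python
import numpy as np

def warp(x, t, w):
    """Unitary time-warping: (Wx)(t) = |w'(t)|^(1/2) * x(w(t)),
    with x(w(t)) obtained by linear interpolation of the samples."""
    dw = np.gradient(w(t), t)                     # numerical w'(t)
    return np.sqrt(np.abs(dw)) * np.interp(w(t), t, x)

fs = 1000
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 50 * t ** 2)               # a quadratic-phase chirp
w = lambda t: np.sqrt(t)                          # warping law t -> sqrt(t)
y = warp(x, t, w)
# x(w(t)) = sin(2*pi*50*t): the warp turns the chirp into a pure tone
print(y.shape)  # (1000,)
```

This "stationarization" of a chirp into a tone is exactly the effect the paper later exploits for time-varying filtering.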

1­4244­0728­1/07/$20.00 ©2007 IEEE II ­ 201<br />

24<br />

ICASSP 2007<br />

Authorized licensed use limited to: <strong>Ryerson</strong> <strong>University</strong> Library. Downloaded on July 7, 2009 at 11:18 from IEEE Xplore. Restrictions apply.


In what follows, we deal with a modified version of the time-warping<br />

operators that no longer fulfills the unitary equivalence<br />

property.<br />

We define the class of non-unitary time-warping operators<br />

by the set $\{\breve{W} : w(t) \in C^1,\ \dot w(t) \ge 0 : x(t) \to (\breve{W}x)(t)\}$,<br />

for which<br />

\[ (\breve{W}x)(t) = \int_{\mathbb{R}} x(t')\,\delta(w(t) - t')\, dt' \qquad (2) \]<br />

Because $\dot w(t) \ge 0$, $w^{-1}(t)$ exists, and we can define the inverse<br />

projector by<br />

\[ (\breve{W}^{-1}x)(t) = \int_{\mathbb{R}} x(t')\,\delta(w^{-1}(t) - t')\, dt' \qquad (3) \]<br />

2.2. Time–warping convolution operator<br />

The stationary convolution operator applied to $x(t), h(t) \in L^2(\mathbb{R})$<br />

is given by<br />

\[ x(t) * h(t) = \int_{\mathbb{R}} x(t')\,h(t - t')\, dt' \qquad (4) \]<br />

From this definition, it is natural to ask whether the convolution<br />

operator has an equivalent expression in the time-warped<br />

space. We define the time-warping convolution operator by<br />

\[ x(t) \overset{w(.)}{*} h(t) = \breve{W}^{-1}\!\left[ (\breve{W}x)(t) * h(t) \right] \qquad (5) \]<br />

where $\overset{w(.)}{*}$ stands for the time-warping convolution operator<br />

along the warping function $w(t)$. Using Equ. 2, Equ. 3 and<br />

Equ. 4, some straightforward algebraic manipulations lead to<br />

\[ x(t) \overset{w(.)}{*} h(t) = \int_{\mathbb{R}} x(t')\,\frac{dw^{-1}(t')}{dt'}\, h\!\left(w^{-1}(t) - w^{-1}(t')\right) dt' \qquad (6) \]<br />

2.3. Time–warping filter<br />

From Equ. 2, one can show that any signal $x(t)$ of the form<br />

$x(t) = \exp(2i\pi f_0 w^{-1}(t))$, $f_0 \in \mathbb{R}$, is transformed via the non-unitary<br />

time-warping operators into<br />

\[ \breve{W}x(t) = \exp\!\left(2i\pi f_0\, w^{-1}(w(t))\right) \qquad (7) \]<br />

\[ \phantom{\breve{W}x(t)} = \exp(2i\pi f_0 t) \qquad (8) \]<br />

which is a pure harmonic signal with frequency $f_0$. One can<br />

exploit this stationarization effect to design efficient time-varying<br />

filters. Let $h^H_{f_c}(t)$ be the impulse response of a linear<br />

time-invariant highpass filter, and $h^L_{f_c}(t)$ be the impulse<br />

response of a linear time-invariant lowpass filter. Both filters<br />

are designed to have a cutoff frequency equal to $f_c$. Using<br />

the time-warping convolution operator defined in Equ. 6, we<br />

define $x^H(t)$ and $x^L(t)$ by<br />

\[ x^H(t) = x(t) \overset{w(.)}{*} h^H_{f_c}(t), \qquad (9) \]<br />

\[ x^L(t) = x(t) \overset{w(.)}{*} h^L_{f_c}(t). \qquad (10) \]<br />


Then, Equ. 9 and Equ. 10 define a non-stationary filtering<br />

procedure for which<br />

\[ e(t) = f_c\,\dot{w}^{-1}(t) \qquad (11) \]<br />

is the time-varying cutoff frequency of the time-varying filter.<br />

3. TIME–WARPING–BASED<br />

AUDIO–WATERMARKING<br />

3.1. Watermark embedding<br />

Fig. 1. Watermark embedding procedure.<br />

The proposed watermark embedding scheme is depicted<br />

in Fig. 1. Roughly speaking, the embedding of the watermark<br />

is carried out in two steps. First, the watermark is<br />

matched to the specificity of the audio signal by means of an<br />

adapted warping operator. Then, the watermark is added to<br />

the original signal.<br />

The human ear is sensitive to frequency-spread signals,<br />

which are perceived as a shuffling noise [3]. For this reason we<br />

suggest using a watermark $m(t)$ that belongs to the class of<br />

frequency-coherent signals expressed by<br />

\[ m(t) = a(t)\, e^{j2\pi f_0 t}, \quad f_0 \in \mathbb{R}^{+} \qquad (12) \]<br />

where a(t) is assumed to be a positive slowly time–varying<br />

signal. This class of signals is concentrated around the carrier<br />

frequency f0.<br />

In the proposed method, the rule of insertion of the watermark<br />

is based on the fact that two close signals with similar<br />

instantaneous frequency laws sound very similar from an auditory<br />

point of view [3]. Therefore one can exploit this time-<br />

masking effect by choosing an area, in the time–frequency<br />

plane, where the watermark is designed to mimic some frequency-<br />

concentrated component which is present in the signal.<br />

In what follows, we refer to such a component as the “masking<br />

component”. In the case of speech signals, a natural<br />

choice for the masking component is to select a formant<br />

that has a long enough time–duration.<br />

Let $f(t)$ be the model of a formant described by<br />

\[ f(t) = a_f(t)\, e^{j2\pi\varphi_f(t)}, \quad t \in [t_i, t_f], \; t_i < t_f \qquad (13) \]<br />


In order to exploit the masking effect provided by the masking<br />

component f(t), the time–warped watermark ˘ Wm(t) should<br />

be as close as possible to the formant in the time-frequency<br />

plane. Therefore, we define the time-warped watermark by<br />

\[ \breve{W}m(t) = a(w(t))\, e^{j2\pi(\varphi_f(t) + \varepsilon t)}, \quad t \in [t_i, t_f], \qquad (14) \]<br />

where ε ∈ R is the frequency shift of the watermark. The<br />

choice of ε depends on a trade–off between the separability<br />

of the watermark and the performance of the masking effect. If<br />

ε is too large, the masking effect decreases. If ε is too small,<br />

the watermark cannot be retrieved because of the proximity<br />

of the formant.<br />

Beyond the stealthiness of the watermark, another goal<br />

of the watermarking concept is the coding of some specific<br />

information in the signal. To this end, we suggest using<br />

the amplitude of the watermark for information coding.<br />

Let the atom $g(t) \ge 0$, $t \in \left[-\frac{T}{2}, \frac{T}{2}\right]$, be a positive compactly<br />

supported function for which $T$ is small compared to<br />

the time-duration of the masking component $t_f - t_i$. Based<br />

on this definition, we suggest constructing the amplitude of<br />

the watermark $a(t)$ as a superposition of time-delayed versions<br />

of the atom $g(t)$.<br />

The choice of the $g(t)$ function can be guided by physiological<br />

aspects of the human ear. It is generally accepted<br />

that the ear is very sensitive to fast variations of signals, since<br />

they produce a large spread in the frequency domain [3]. For<br />

this reason, we force the atom $g(t)$ to be as smooth as possible,<br />

which can be translated into mathematical notation by<br />

requiring the atom $g(t)$ to be of class $C^{\infty}$, the class of<br />

infinitely differentiable functions. In the remainder of this paper we<br />

define $g(t)$ as a scaled version of the mother atom $g_m(t)$<br />

\[ g_m(t) = \begin{cases} \exp\!\left(\dfrac{-(t/a)^2}{1-(t/a)^2}\right), & t \in [-1, 1], \\ 0, & t \notin [-1, 1], \end{cases} \qquad (15) \]<br />
where $a \in \mathbb{R}^{+}$ is the scaling factor. From empirical evidence,<br />

we saw that, for detection reasons, the atoms $g(t)$ have<br />

to be separated from each other by at least $5\sigma_g$, where $\sigma_g^2$ is the<br />

variance of $g(t)$.<br />

Let $\tau$ be the digital information that has to be watermarked<br />

in the audio signal, expressed in binary as $(\tau)_2 = \tau_0\tau_1\ldots\tau_N$,<br />

where $\tau_i$ are the bits of $\tau$. Then, the amplitude<br />

$a(t)$ of the watermark is encoded as follows:<br />

\[ a(t) = \sum_{i=0}^{N} \tau_i\, g(t - 5i\sigma_g), \qquad (16) \]<br />

which is known as an amplitude modulation coding scheme.<br />
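The amplitude coding of Equ. 15 and Equ. 16 can be sketched as follows (the atom width and spacing values are illustrative assumptions):

```python
import numpy as np

def bump(t):
    """Smooth, compactly supported mother atom g_m(t) of Eq. (15)."""
    out = np.zeros_like(t)
    inside = np.abs(t) < 1
    out[inside] = np.exp(-t[inside] ** 2 / (1 - t[inside] ** 2))
    return out

def encode_amplitude(bits, t, sigma_g=0.05):
    """Amplitude coding of Eq. (16): superpose time-delayed atoms,
    one per '1' bit, spaced 5*sigma_g apart."""
    return sum(b * bump((t - 5 * i * sigma_g) / (2 * sigma_g))
               for i, b in enumerate(bits))

t = np.linspace(-0.2, 2.0, 2000)
a_t = encode_amplitude([0, 1, 0, 0, 1, 1], t)   # the paper's test data 010011
print(a_t.shape)  # (2000,)
```

The bump atoms keep `a(t)` of class $C^{\infty}$, which is what limits the frequency spread of the watermark.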

3.2. Watermark recovery<br />

Once a signal has been watermarked, the next step is to deal with<br />

the recovery of the watermark sequence. However, because<br />

of different aspects related to the transmission of the signal<br />

(compression, quantization, noise, ...) this recovery is generally<br />

performed on a modified version $\tilde{x}_m(t)$ of $x_m(t)$. In the<br />

proposed method, the watermark is said to be recovered if the<br />

digital information $\tau$ has been estimated from $\tilde{x}_m(t)$ without<br />

error. The recovery procedure is depicted in Fig. 2, where<br />

the symbol $(\hat{.})$ denotes an estimation of the quantity $(.)$. As<br />

can be seen, the watermark recovery is carried out in three steps.<br />

Fig. 2. Watermark extraction procedure: time-warped highpass and lowpass<br />

filtering (with warping functions $w_1(t) = [\varphi_f(t) + (\varepsilon - \Delta)t]^{-1}$ and<br />

$w_2(t) = [\varphi_f(t) + (\varepsilon + \Delta)t]^{-1}$), unwarping, and matched-filter<br />

estimation of the bits $\tau_i$.<br />

The first step corresponds to the extraction of the time-warped<br />

watermark $\breve{W}m(t)$ by means of time-warped filters (blocks<br />

➊ and ➋). Two time-varying filters are necessary to extract<br />

the watermark: one highpass (block ➊), and one lowpass<br />

(block ➋). This filtering stage defines a time-varying pass-band<br />

filter expressed by<br />

\[ \begin{cases} \dot{\varphi}_f(t) + \varepsilon + \Delta, & \text{the upper cutoff frequency,} \\ \dot{\varphi}_f(t) + \varepsilon - \Delta, & \text{the lower cutoff frequency.} \end{cases} \qquad (17) \]<br />

It is well known that the frequency spread of a time-varying<br />

signal around its instantaneous frequency law depends<br />

on the regularity of its amplitude. Because the amplitude of<br />

the watermark is of class $C^{\infty}$, the frequency decay is faster<br />

than any power of $f$. Therefore, only a small $\Delta$ value is necessary<br />

to extract the time-warped watermark.<br />

The second step corresponds to the unwarping of the estimated<br />

time-warped sequence (block ➌) in order to recover an estimation<br />

of the original sequence $m(t)$.<br />

The last step corresponds to the estimation of the bits $\tau_i$ with<br />

matched filtering (block ➍). The estimation is performed as<br />

follows:<br />

\[ \hat{\tau}_i = \int_{\mathbb{R}} \hat{m}(t)\, g(t - 5i\sigma_g)\, dt \;\underset{0}{\overset{1}{\gtrless}}\; \frac{\|g\|^2}{2}, \quad i = 1 \ldots N, \qquad (18) \]<br />

where $\|g\|$ is the norm of $g(t)$.<br />
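The matched-filter decision of Equ. 18 can be sketched as follows; a Gaussian atom stands in for the compactly supported bump, and the noise level is an illustrative assumption:

```python
import numpy as np

def atom(t, sigma=0.05):
    """Gaussian stand-in for the smooth atom g(t)."""
    return np.exp(-t ** 2 / (2 * sigma ** 2))

def estimate_bits(m_hat, t, n_bits, sigma=0.05):
    """Matched-filter bit estimation of Eq. (18): correlate the
    recovered amplitude with each delayed atom and threshold the
    result at half the atom energy, ||g||^2 / 2."""
    dt = t[1] - t[0]
    g_energy = np.sum(atom(t) ** 2) * dt
    return [int(np.sum(m_hat * atom(t - 5 * i * sigma)) * dt > g_energy / 2)
            for i in range(n_bits)]

t = np.linspace(-0.5, 2.0, 5000)
bits = [0, 1, 0, 0, 1, 1]                        # the paper's test data 010011
m_hat = sum(b * atom(t - 5 * i * 0.05) for i, b in enumerate(bits))
m_hat += 0.05 * np.random.default_rng(1).standard_normal(len(t))  # channel noise
print(estimate_bits(m_hat, t, 6))  # [0, 1, 0, 0, 1, 1]
```

The $5\sigma_g$ spacing keeps adjacent atoms essentially non-overlapping, so each correlation isolates one bit.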

4. NUMERICAL RESULTS<br />

The test signal is a male utterance of the word “bingo” sampled<br />

at 8 kHz. The Log–spectrogram of the test signal is depicted<br />

in Fig. 3. The selected masking component is the<br />



formant referenced by the black arrow.<br />

Fig. 3. Log–spectrogram of the test signal. Male utterance of<br />

the word “bingo” with a sampling rate of 8 kHz.<br />

The watermark is embedded as described in Sec. 3.1. First, the data $(\tau)_2 = 010011$<br />

is used to generate the amplitude of the watermark by means<br />

of Equ. 16. Then the insertion zone is manually chosen in<br />

order to define the warping operator used to generate<br />

the time–warped watermark. Finally, the time–warped watermark<br />

is added to the original signal. Result of the watermark<br />

embedding is shown in Fig. 4. As can be seen, the time-warped<br />

watermark is very close to the original formant. As<br />

expected, the frequency spread decreases very fast, thanks to<br />

the smoothness of the amplitude of the watermark sequence.<br />

Fig. 4. Zoomed log–spectrogram of the original and watermarked<br />

test signal.<br />

We found the embedding strategy satisfactory, since we<br />

were not able to guess whether the signal was watermarked or<br />

not during blind tests. In order to provide a more objective<br />

comparison criterion, we make use of the “Auditory Toolbox”<br />

[4] to generate auditory representations of the original and<br />

watermarked signals, a pseudo-time-frequency representation<br />

based on physiological aspects of the human ear.<br />

Auditory representations of the original and watermarked signals<br />

are depicted in Fig. 5. Both representations are very similar,<br />

which confirms the stealthiness of the watermark.<br />

The next step consists of the recovery of the watermark sequence.<br />

We tested the proposed approach on the true watermarked<br />

signal, and on two different deteriorated versions of<br />

the watermarked signal: the first is an MP3 compression attack,<br />

and the second is an additive Gaussian noise attack with<br />

a signal-to-noise ratio of 0 dB.<br />

Results of the matched filtering estimation are presented<br />

in Tab. 1. Results of the estimation step show that the watermark<br />

is perfectly extracted and has resisted the MP3 attack<br />

as well as the white-noise attack.<br />


Fig. 5. Auditory representation of the original signal and the<br />

watermarked signal.<br />

τ τ1 τ2 τ3 τ4 τ5 τ6<br />

True 0 1 0 0 1 1<br />

No attack 0 1 0 0 1 1<br />

MP3 attack 0 1 0 0 1 1<br />

Noise attack 0 1 0 0 1 1<br />

Table 1. Results of the estimation of the set {τi} by matched<br />

filtering.<br />

5. CONCLUSION<br />

In this paper we have proposed a new watermarking method<br />

for speech signals, based on the time–warping signal processing<br />

concept. We have shown that it is possible to exploit physiological<br />

aspects of the human ear in order to carry information<br />

while keeping the stealthiness of the inserted watermark.<br />

We have then developed a complete extraction method based<br />

on time-varying filters, time-warping operators and matched filtering<br />

to recover the watermark sequence. Numerical results<br />

show that the watermark is not likely to be audible and the numerical<br />

information carried by the watermark is retrievable. Future<br />

work will include a closer study of the robustness of the method<br />

against various attacks. For real applications, another topic<br />

is the unsupervised embedding of the watermark according to<br />

the position of the formants. This issue is left for future work.<br />

6. REFERENCES<br />

[1] M. Arnold, “Audio watermarking: Features, applications<br />

and algorithms,” in IEEE International Conference on<br />

Multimedia and Expo, New York, USA, July 2000.<br />

[2] R. Baraniuk, “Unitary equivalence: A new twist on signal<br />

processing,” IEEE Trans. on <strong>Signal</strong> Processing, vol. 43,<br />

no. 10, pp. 2269–2282, Oct. 1995.<br />

[3] M.C. Botte, G. Canevet, L. Demany, and C. Sorin, Psychoacoustique<br />

et perception auditive, Inserm, 1989.<br />

[4] M. Slaney, “Auditory toolbox, version 2.0,” Available at http://www.slaney.org/malcolm/pubs.html, 1994.<br />

Authorized licensed use limited to: <strong>Ryerson</strong> <strong>University</strong> Library. Downloaded on July 7, 2009 at 11:18 from IEEE Xplore. Restrictions apply.


Chirp-based image watermarking as error-control coding<br />

Behnaz Ghoraani and Sridhar Krishnan<br />

Dept. of Elec. & Comp. Eng., <strong>Ryerson</strong> <strong>University</strong>, Toronto, Canada<br />

E-mail: bghoraan@rnet.ryerson.ca, krishnan@ee.ryerson.ca<br />

Abstract<br />

In this paper, we use post-processing methods to compensate for the bit errors that occur in watermark embedding and extraction. Forward error correction (FEC)-based and chirp-based techniques are applied to encode and shape the embedded watermark message so that, even in the presence of bit errors in the extracted watermark, the watermarking algorithm is able to estimate the correct embedded watermark message. Repetition and Bose-Chaudhuri-Hocquenghem (BCH) codes are used as two well-known FEC schemes, and the discrete polynomial-phase transform (DPPT) and Hough-Radon transform (HRT) are utilized as two chirp detectors in chirp-based watermarking. The robustness of all the proposed post-processing methods is tested against the Checkmark benchmark attacks, and we found that chirp-based watermarking using the DPPT chirp detector offers the highest watermark extraction rate and the best bit error compensation, even at bit error rates (BERs) higher than 17%.<br />

1. Introduction<br />

The worldwide trend of using the Internet to distribute multimedia electronically offers many advantages, such as huge cost reductions and considerable time savings in transportation, to both owners and consumers. However, the available methods for distributing multimedia lack privacy and proof of ownership. One of the suggested solutions for protecting copyrights and preventing illegal use of multimedia is watermarking. Watermarking embeds a hidden signature into the multimedia signal that carries information for content authentication, access control and copy protection, and identification and traitor tracing in the case of illegal distribution. The embedded watermark signal should be imperceptible and should not affect the quality of the multimedia content. Also, since users normally apply many signal manipulations, such as lossy compression, to a multimedia signal, the watermark should be<br />

robust to these typical signal operations.<br />
<br />
Proceedings of the 2006 International Conference on Intelligent Information Hiding and Multimedia Signal Processing (IIH-MSP'06), 0-7695-2745-0/06 $20.00 © 2006<br />
<br />
However, even using the most robust embedding techniques, there will be some<br />

bit errors in the received watermark message. Therefore, the watermark detection process encounters difficulties in extracting the exact watermark message and retrieving the hidden information. Shaping the embedded signature such that the extractor can compensate for the bit error rate (BER) of the message, using prior knowledge about the watermark structure, can be useful in estimating the exact watermark message.<br />

In this study, we focus on utilizing chirps as watermark message structures [1][7] and on comparing the results with forward error correction (FEC) schemes. In our experiments, we use the spread-spectrum method to embed the watermark messages into the discrete cosine transform (DCT) coefficients of the image. After the watermark message is extracted from the watermarked image, there is a post-processing stage. We utilize BCH and repetition codes for FEC-based post-processing, and the discrete polynomial-phase transform (DPPT) and Hough-Radon transform (HRT) detectors in chirp-based watermarking. In this paper, we present the results of the chirp-based and FEC-based post-processing and show that chirp-based watermarking is comparable with the FEC schemes.<br />

2. Watermarking method<br />

In this study we use the spread-spectrum watermarking scheme, a correlation method that embeds a pseudo-random sequence and detects the watermark by calculating the correlation between the pseudo-random noise sequence and the watermarked signal. The spread-spectrum scheme is the most popular scheme and has been well studied in the literature [2]. The spread-spectrum method can be applied in the time domain or in a transformed (frequency) domain. We utilized the DCT coefficients, which are widely used in compression applications and make it easier to impose human visual system (HVS) constraints. Figure 1 shows the block diagram of the watermark embedding and extraction schemes [7].<br />
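The correlation embedding and detection just described can be sketched in a few lines (a minimal sketch of the spread-spectrum principle only: the array `coeffs` stands in for the image's mid-band DCT coefficients, and the gain `alpha`, chip counts and seed are illustrative assumptions, not the paper's settings):<br />

```python
import numpy as np

def ss_embed(coeffs, bits, alpha=2.0, seed=7):
    """Spread-spectrum embedding: add a bit-modulated pseudo-random
    (PN) sequence to the host transform coefficients."""
    rng = np.random.default_rng(seed)
    chips = len(coeffs) // len(bits)           # PN chips per message bit
    pn = rng.choice([-1.0, 1.0], size=chips * len(bits))
    symbols = np.repeat([1.0 if b else -1.0 for b in bits], chips)
    marked = coeffs.copy()
    marked[:chips * len(bits)] += alpha * symbols * pn
    return marked

def ss_extract(marked, n_bits, seed=7):
    """Correlate each chip segment with the same PN sequence; the sign
    of the correlation recovers the embedded bit."""
    rng = np.random.default_rng(seed)
    chips = len(marked) // n_bits
    pn = rng.choice([-1.0, 1.0], size=chips * n_bits)
    corr = (marked[:chips * n_bits] * pn).reshape(n_bits, chips).sum(axis=1)
    return [int(c > 0) for c in corr]
```

Because the host coefficients are uncorrelated with the PN sequence, each per-bit correlation is dominated by the term alpha times the chip count, so the bit sign is recovered reliably when enough chips are spread per bit.<br />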

As mentioned earlier, because of intentional and unintentional signal processing, there will be some bit errors in the received message.<br />
<br />
Figure 1. Detection and extraction block diagram of the watermarking method.<br />
<br />
In the next section we try to compensate for these bit errors by concentrating on the structure of the watermark message.<br />

3. Post processing in watermarking<br />

In the case of severe signal manipulations, the extracted watermark has some bit errors. One post-processing method that can be useful in correcting these bit errors is encoding the watermark signature with an FEC scheme before embedding it in the multimedia signal.<br />

3.1 Forward Error Correction schemes<br />

FEC schemes, or channel codes, protect digital communication by inserting redundant bits into the data. These additional bits help detect and correct errors that occur in the data. Due to the similarity between watermarking and communication systems, FEC methods have been commonly used to increase the bit error compensation capacity of watermarking techniques. BCH, turbo and repetition codes are the most commonly used FEC schemes in watermarking applications [4]. In this study, we utilized BCH and repetition codes as two well-known FEC schemes.<br />

3.1.1 Bose-Chaudhuri-Hocquenghem (BCH) coding<br />

BCH is a block coding scheme. A binary BCH code (n, k) segments the data into blocks of k bits and transforms each k-bit data block into an n-bit block. The (n − k) additional bits are called redundant bits, and the code rate is k/n. Since our target in this study is comparing different types of post-processing methods, all the watermark messages used in each method have almost the same number of bits and the same redundancy rates. The values of n and k for the BCH code that give a watermark message length closest to 180 bits with the highest redundancy rate are 63 and 7, respectively. BCH (63, 7) encodes a 21-bit watermark signature into a 189-bit embedded message with a 10.7/12 redundancy rate.<br />
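The arithmetic behind these numbers can be checked directly (a worked check of the stated rates only, not an implementation of BCH encoding):<br />

```python
n, k = 63, 7                  # BCH(63, 7): 7-bit blocks -> 63-bit blocks
signature_bits = 21
blocks = signature_bits // k  # 21-bit signature = 3 blocks of 7 bits
embedded_bits = blocks * n    # 3 * 63 = 189-bit embedded message
redundancy = (n - k) / n      # 56/63 ~ 0.889, i.e. about 10.7/12
print(embedded_bits, round(redundancy * 12, 1))
```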

3.1.2 Repetition coding<br />

Repetition coding is a very simple and well-known coding technique. Repetition coding with repetition number n repeats each bit n times, resulting in a redundancy rate of n/(n+1). We choose n = 11 to encode a 15-bit watermark into a 180-bit embedded watermark with a redundancy rate of 11/12.<br />
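A repetition code with majority-vote decoding can be sketched in a few lines (here n counts the total copies of each embedded bit; whether the paper counts the original bit among the 11 repetitions is not spelled out, so the numbers below are illustrative):<br />

```python
def rep_encode(bits, n=12):
    """Repeat each bit so that n copies are embedded."""
    return [b for b in bits for _ in range(n)]

def rep_decode(coded, n=12):
    """Majority vote per n-chip group; corrects up to floor((n-1)/2)
    flipped chips in each group."""
    return [int(sum(coded[i:i + n]) * 2 > n) for i in range(0, len(coded), n)]

bits = [1, 0, 1, 1, 0]
coded = rep_encode(bits)           # 5 * 12 = 60 embedded bits
for i in (0, 1, 13, 25, 26):       # flip a few chips (below the vote limit)
    coded[i] ^= 1
recovered = rep_decode(coded)
```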

Prior knowledge about the structure of the watermark message can also be useful in increasing the compensation of BERs in the extracted watermark. The structure that has been proposed is embedding chirp signals as the watermark message, in chirp-based watermarking.<br />

3.2. Chirp-based watermarking<br />

In chirp-based watermarking, the idea is to embed chirps in the multimedia signal as watermark signatures. Before the watermark signal is embedded into the image, the watermark message is encoded into the embedded chirp according to a predefined codebook. Chirps are signals with time-varying frequency and are best detected in the time-frequency (TF) plane; different chirp rates represent different watermark messages. Because the extracted watermark message should be in the form of a chirp, applying a post-processing step such as a chirp detector to the extracted watermark allows the embedded chirp to be estimated successfully even in the presence of some bit errors. The HRT and DPPT are the two chirp detection tools used in our experiments.<br />

3.2.1 Hough-Radon transform (HRT)<br />

The HRT is a parametric tool to detect the pixels that belong to a parametric constraint, either a line or a curve, in a gray-level image [8]. The HRT divides the Hough-Radon parameter space into cells and then calculates the accumulator value for each cell in the parameter space. The cell with the highest accumulator value gives the parameters of the HRT constraint. Since, in the application of post-processing for chirp-based watermarking, we are looking for the embedded chirp as a straight line in the TF plane, we can apply the HRT to detect the embedded chirp. First, the extracted watermark bits are transformed to the TF plane; then the HRT detects the line representing the chirp in the TFD. To achieve good detection performance, the Wigner-Ville transform (WV) is used as the TFD representation of the signal. In this study, the HRT space has 182 × 182 cells, which supports a 15-bit watermark message. HRT-based post-processing has a redundancy rate of 11/12.<br />
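The accumulator-voting principle can be sketched as a minimal straight-line Hough transform over a binary image (a toy illustration of the idea only, not the paper's 182 × 182 HRT over a Wigner-Ville distribution; the grid sizes are our own choices):<br />

```python
import numpy as np

def hough_line(img, n_theta=180):
    """Minimal Hough accumulator: each 'on' pixel votes, for every
    angle theta on a grid, for the cell (rho, theta) with
    rho = x*cos(theta) + y*sin(theta); the most-voted cell gives the
    dominant line's parameters."""
    ys, xs = np.nonzero(img)
    thetas = np.linspace(0, np.pi, n_theta, endpoint=False)
    diag = int(np.hypot(*img.shape)) + 1
    acc = np.zeros((2 * diag, n_theta), dtype=int)
    for x, y in zip(xs, ys):
        rhos = np.round(x * np.cos(thetas) + y * np.sin(thetas)).astype(int)
        acc[rhos + diag, np.arange(n_theta)] += 1   # one vote per angle
    r, t = np.unravel_index(acc.argmax(), acc.shape)
    return int(r - diag), thetas[t]
```

As the next paragraph notes, the finer this accumulator grid, the more easily noise shifts the peak into a neighbouring cell, which is the weakness observed for the HRT-based detector.<br />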



3.2.2 Discrete Polynomial Phase Transform (DPPT)<br />

The DPPT is a parametric signal analysis tool for estimating the phase parameters of constant-amplitude polynomial-phase signals, even in the presence of some bit errors in the signal [5]. The embedded watermark in chirp-based watermarking is of the form:<br />
<br />
chirp(n) = exp[j(a0 + a1·n + a2·n²)]   (1)<br />

The DPPT gives estimates of a0, a1 and a2, which enables us to synthesize the original chirp. The DPPT algorithm defines ambiguity functions; applying the second-order ambiguity function to a constant-amplitude chirp transforms the broadband signal into a single-tone signal with a frequency related to a2. The position of this spectral peak provides an estimate â2 of the coefficient. Multiplying the signal by exp(−j·â2·n²) reduces the order of the polynomial phase to 1, and repeating the procedure gives estimates of all the parameters. The judgment about successful watermark extraction is made using one of the following methods:<br />

Threshold-based (DPPT[T]) [3] - The decision on correct detection of the watermark is based on the correlation between the estimated chirp and the embedded watermark. The threshold for a 182-bit embedded chirp and a 15-bit watermark signature is experimentally set to 0.9.<br />

Correlation-based (DPPT[C]) - Searches for the chirp in the codebook that has the highest correlation coefficient with the estimated chirp. In this case, to obtain a better watermark extraction rate, the correlation between the chirps in the codebook is limited to a maximum of 0.93, which offers a redundancy rate of 11.08/12 for a chirp length of 182 bits.<br />

Initial and final frequency-based (DPPT[F]) - Finds the chirp in the codebook whose initial and final frequencies are closest to the estimated initial and final frequencies of the recovered chirp. Due to the BER of the received watermark signal, the DPPT estimates the initial and final frequencies with some variation from the original ones. To increase the watermark extraction rate, the minimum difference between the initial and final frequencies of chirps in the codebook is set to 4 Hz for 182-bit chirps. This setting gives a redundancy rate of 11.09/12.<br />
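The two-step estimation described above (an FFT peak of the second-order ambiguity function for â2, then demodulation and a second peak search for â1) can be sketched as follows; the lag choice, FFT size and test chirp parameters are illustrative assumptions, not the paper's settings:<br />

```python
import numpy as np

def dppt_order2(s, tau=None):
    """Estimate a1, a2 of s(n) = exp(j(a0 + a1*n + a2*n^2)) (a sketch).
    Requires |2*a2*tau| < pi so the ambiguity-function tone is unaliased."""
    n = np.arange(len(s))
    tau = len(s) // 2 if tau is None else tau
    N = 1 << 16                                # zero-padded FFT grid
    w = 2 * np.pi * np.fft.fftfreq(N)          # angular frequency axis (rad/sample)
    # 2nd-order ambiguity function: s(n+tau)*conj(s(n)) is a single
    # tone at angular frequency 2*a2*tau
    y = s[tau:] * np.conj(s[:-tau])
    a2 = w[np.argmax(np.abs(np.fft.fft(y, N)))] / (2 * tau)
    # demodulate the quadratic phase and repeat: a tone at a1 remains
    z = s * np.exp(-1j * a2 * n ** 2)
    a1 = w[np.argmax(np.abs(np.fft.fft(z, N)))]
    return a1, a2
```

On a clean 182-sample chirp this recovers the linear and quadratic phase coefficients to within the FFT grid spacing; a0 could then be estimated from the residual phase.<br />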

4. Results<br />

To measure the robustness of the post-processing algorithms, we applied the Checkmark benchmark attacks [6] to 10 different images of size 512 × 512. The PN sequence is 100,000 samples long, and the watermark sampling frequency is 1 kHz. For a fair comparison of all the post-processing techniques, the methods have been set up with almost equal message lengths and redundancy rates. Figure 2 shows the important features of the applied schemes.<br />
<br />
Figure 2. Features of each coding scheme used to code the watermark message.<br />

Figure 3 shows the results of all the post-processing techniques.<br />
<br />
Figure 3. Watermark detection results for the Checkmark benchmark attacks.<br />
<br />
The first column lists the different types of attacks applied to the watermarked image, and the number shows the<br />

number of attacks. The number under each column represents the percentage of successful watermark detections under each class of attack. Although the DPPT[T] method shows higher results than DPPT[C] and DPPT[F], it results in a 13% false positive error rate, which is too high to be applicable to a multi-user watermarking system. Therefore, DPPT[T] is useful in watermarking applications where we are interested in detecting the watermark rather than extracting the watermark message. The DPPT[F]-based method offers higher, or in some cases equal, results compared to the DPPT[C]-based method. In addition, the DPPT[F]-based method does not require the long process of correlating the estimated chirp with all chirps in the codebook and is faster than DPPT[C]. Thus, we conclude that the initial and final frequency comparison is the best DPPT-based method for finding the embedded message in the codebook.<br />

Although the HRT is an optimal line detection tool, having a large number of cells in the Hough-Radon space shifts<br />

the peak of the accumulator into neighbouring cells and consequently results in a wrong detection of the slope in the TFD. Therefore, we see in Figure 3 that the DPPT-based method outperforms the HRT-based algorithm for most attack types, with a total detection rate of 92% compared to 87%. Also, comparing the complexity order of the HRT-based and DPPT-based techniques, we conclude that the DPPT-based method is more practical for real-time applications. Figure 4 shows the complexity order and running time measured on a Pentium IV with a 2.66 GHz CPU and 512 MB of RAM. DPPT-based watermark extraction is about 55 times faster than the HRT-based algorithm.<br />

As we observe in Figure 3, the detection results for the BCH and repetition codes have almost the same rates, but the DPPT-based method offers better, or in some cases equal, results when compared to the REP and BCH codes. Figure 5 shows the detection results as a function of the BER of the received message. As we see in this figure, both DPPT and BCH detect 100% of the watermark messages successfully up to a BER of 17%. However, the maximum BER at which BCH detects a watermark correctly is 22%, with a 17% detection rate, while DPPT shows a 50% detection rate at a BER of 28%. To highlight the outstanding performance of DPPT at high BERs, we calculated the watermark detection rate for BERs greater than 17%: the DPPT-based method offers a 52% detection rate at these higher BERs, while the BCH and repetition codes have 47% and 41% detection rates, respectively.<br />
<br />
Figure 4. Order of complexity of each coding scheme used to code the watermark message.<br />

5. Conclusions<br />

In this paper, we compared FEC-based and chirp-based post-processing methods for watermarking. The robustness of the proposed techniques was tested against the Checkmark benchmark attacks. The DPPT-based and BCH-based methods were able to compensate for BERs of up to 17%. The DPPT-based post-processing offered the highest detection rate of 92%, and showed the highest detection rate for BERs higher than 17%. We also compared the computational complexity of the proposed methods; the BCH, repetition<br />

and DPPT-based methods have almost the same complexity, while the HRT has a complexity about 55 times higher than the other methods; this is because the HRT operates on the TF plane and calculates the accumulator value for all the cells in the Hough-Radon plane.<br />
<br />
Figure 5. Watermark detection under different bit error rates (successful watermark estimation (%) versus the BER of the received watermark message, for the HRT, REP, BCH and DPPT schemes).<br />

References<br />

[1] S. Erkucuk, S. Krishnan, and M. Zeytinoglu. Robust audio<br />

watermarking using a chirp based technique. IEEE Intl. Conf.<br />

on Multimedia and Expo, 2:513–516, July 2002.<br />

[2] D. Kirovski and H. Malvar. Spread-spectrum watermarking<br />

of audio signals. IEEE Transactions on <strong>Signal</strong> Processing,<br />

special issue on data hiding, 51:1020–1034, April 2003.<br />

[3] L. Le and S. Krishnan. Time-frequency signal synthesis<br />

and its application in multimedia watermark detection.<br />

EURASIP Journal on Applied <strong>Signal</strong> Processing, 2006:Article<br />

ID 86712, 14 pages, 2006.<br />

[4] J. Lee, H. Kim, and J. Lee. Information extraction method<br />

without original image using turbo code. Proc. International<br />

Conference on Image Processing, Greece, 3:880–883, October<br />

2001.<br />

[5] S. Peleg and B. Friedlander. The discrete polynomial-phase transform. IEEE Transactions on Signal Processing,<br />

43:1901–1914, August 1995.<br />

[6] S. Pereira, S. Voloshynovskiy, M. Madueno, S. Marchand-<br />

Maillet, and T. Pun. Second generation benchmarking and<br />

application oriented evaluation. In Information Hiding Workshop<br />

III, Pittsburgh, PA, USA, April 2001.<br />

[7] A. Ramalingam and S. Krishnan. Robust image watermarking<br />

using a chirp detection-based technique. IEE Proceedings on<br />

Vision, Image and <strong>Signal</strong> Processing, 152:771–778, December<br />

2005.<br />

[8] R. Rangayyan and S. Krishnan. Feature identification in the<br />

time-frequency plane by using the Hough-Radon transform.<br />

IEEE Trans. on Pattern Recognition, 34:1147–1158, 2001.<br />



2006 International Joint Conference on Neural Networks<br />

Sheraton Vancouver Wall Centre Hotel, Vancouver, BC, Canada<br />

July 16-21, 2006<br />

Automatic Content-Based Image Retrieval Using Hierarchical<br />

Clustering Algorithms<br />

Kambiz Jarrah, Student Member, IEEE, Sri Krishnan, Senior Member, IEEE, and Ling Guan, Senior<br />

Member, IEEE<br />

Abstract-The overall objective of this paper is to present a methodology for guiding adaptations of an RBF-based relevance feedback network, embedded in automatic content-based image retrieval (CBIR) systems, through the principle of unsupervised hierarchical clustering. The self-organizing tree map (SOTM) is especially attractive for our approach since it not only extracts global intuition from an input pattern space but also injects some degree of localization into the discriminative process, such that maximal discrimination becomes a priority at any given resolution. The main focus of this paper is twofold: introducing a new member of the SOTM family, the Directed SOTM (DSOTM), which not only provides partial supervision of cluster generation by forcing divisions away from the query class but also presents a flexible verdict on the resemblance of the input pattern as its tree structure grows; and modifying the current structure of the normalized graph cuts (Ncut) process by enabling the algorithm to determine the appropriate number of clusters within an unknown dataset prior to its recursive clustering scheme, through the principle of self-organizing normalized graph cuts (SONcut). Comprehensive comparisons with the self-organizing feature map (SOFM), SOTM, and Ncut algorithms demonstrate the feasibility of the proposed methods.<br />

I. INTRODUCTION<br />

Content-based image retrieval (CBIR) relies on the characterization of images based on their visual contents. These visual contents consist of low-level features, including colour, texture, and shape, that offer a multidimensional vector representation of an image within the feature space.<br />

One of the major requirements for designing an effective CBIR system is to reduce the gap between low-level features and high-level concepts by tailoring the human perceptual subjectivity to the retrieval process. Human-computer interaction (HCI) systems have demonstrated successful behavior for this purpose [1]. In such systems, users directly supervise the learning process by constantly providing suitable samples (images) and training the system with them. This dependency of the system on users' inputs may add excessive human errors to the adaptation process, due to subjective interpretations of image contents by each individual.<br />

To overcome this problem, an unsupervised learning approach with a hierarchical architecture is required to guide these adaptations automatically and more toward relevant samples. The SOTM has shown effective behavior in minimizing human interactions and automating the search process by efficiently classifying an unknown and non-uniform data space into more meaningful clusters.<br />

The main focus of this paper is as follows: a) introducing a new member of the SOTM family, the Directed SOTM (DSOTM), which dynamically controls the generation of new centres and decides on the resemblance of input samples, with respect to the query, during the learning phase of the algorithm; and b) modifying the current structure of the normalized graph cuts (Ncut) algorithm [2] to make it more adaptive to the nature of the input pattern, by adding a self-determination mechanism that decides on the appropriate number of clusters prior to its iterative clustering process. We call the modified Ncut the self-organizing Ncut (SONcut).<br />

This paper provides details on both the DSOTM and SONcut algorithms in Section 2; Section 3 presents an overall description of the structure of the CBIR system used in this work; a comprehensive comparison between the proposed classifiers and their conventional counterparts is presented in Section 4; Section 5 concludes the paper with some remarks.<br />

II. UNSUPERVISED CLUSTERING APPROACHES<br />
<br />
In this section, an overview of the proposed unsupervised hierarchical clustering algorithms using DSOTM and SONcut is presented.<br />
<br />
A. Directed Self-Organizing Tree Map (DSOTM)<br />
<br />
The SOTM [3] is an unsupervised machine learning algorithm inspired by principles found in Kohonen's self-organizing feature map (SOFM) [4]. The tree structure of the SOTM is constructed by randomly selecting an isolated root node (centre) and repeatedly presenting the remaining patterns to the network. The pattern (sample) which is found to be closest to the centre with respect to the current similarity measurement<br />
<br />
This work was supported by the Natural Sciences and Engineering Research Council of Canada (NSERC) and the Ryerson Graduate Scholarship Program. K. Jarrah (barrak@ee.ryerson.ca) is affiliated with the Multimedia Research Laboratory (RML) and the Signal Analysis Research Group (SAR), Ryerson University (www.ryerson.ca), Toronto, Canada. S. Krishnan (krishnan@ee.ryerson.ca) is the chair of the Department of Electrical and Computer Engineering, Ryerson University, Toronto, Canada, and is affiliated with the Signal Analysis Research Group (SAR) at the same university. L. Guan (lguan@ee.ryerson.ca) is the Canada Research Chair, Ryerson University, Toronto, Canada, and is affiliated with the Multimedia Research Laboratory (RML) at the same university.<br />
<br />
0-7803-9490-9/06/$20.00 © 2006 IEEE<br />


Fig. 1. Two-dimensional mapping: (left) clustering using SOFM and (right) clustering using SOTM. It is evident that redundant nodes in the lattice topology of SOFM can produce unnecessary boundaries by having some of the centres trapped in low-density regions of the input pattern.<br />

is declared the winner. Every such presentation of input patterns slightly modifies the winning node's position in the network: a position that eventually evolves toward the centre of mass of the current class. This gradual adaptation of the node's position is controlled by an exponentially decaying function called the learning rate. The learning rate resets to its initial value each time a new centre is generated. Therefore, sufficient time is given to the network to adapt itself to the presence of new samples; thus, the tree grows larger and the similarity measurement tends to become more accurate. The generation of new centres (branches of the tree) is controlled by a hierarchy function, called the threshold function, which decreases over time. If an input node is encountered whose distance to the existing centres exceeds this threshold function (i.e., it is significantly different from all nodes in the current SOTM map), a new node is generated. The new node is attached as a leaf node of its closest representation in the current SOTM mapping; thus, over time, a tree structure evolves [3].<br />
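The construction just described can be condensed into a small sketch (an assumption-laden toy: a fixed threshold and made-up learning-rate constants stand in for the paper's time-decaying threshold function, and the explicit parent-child tree is omitted):<br />

```python
import numpy as np

def sotm_cluster(X, thresh, lr0=0.1, decay=0.99):
    """Minimal SOTM-style clustering: the first pattern seeds the root
    centre; each later pattern either moves its nearest centre toward
    itself (winner update, decaying learning rate) or, if farther than
    `thresh`, spawns a new centre."""
    centres = [X[0].astype(float)]
    lr = lr0
    for x in X[1:]:
        d = [np.linalg.norm(x - c) for c in centres]
        w = int(np.argmin(d))
        if d[w] > thresh:
            centres.append(x.astype(float))
            lr = lr0                            # learning rate resets on a new centre
        else:
            centres[w] += lr * (x - centres[w])  # winner drifts toward the sample
            lr *= decay                          # exponential decay of the learning rate
    return np.array(centres)
```

On data with two well-separated dense regions, this grows exactly one centre per region, with each centre settling near its region's centre of mass.<br />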

Similar to SOFM, SOTM aims at preserving the<br />

topological relationships between patterns in the original<br />

input space. However, unlike SOFM, SOTM classifies a<br />

large group of patterns by building and evolving a tree<br />

structure that tends to form neighborhood relationships by<br />

reflecting a degree of similarity between the new and<br />

already classified patterns.<br />

Although image indexing with the SOFM was perceived to be a robust and effective solution that tolerates even very high input vector dimensionalities [5], the lattice topology of the SOFM network makes it essentially undesirable for clustering purposes, due to the concentration of a fraction of the nodes in the map resulting from the best-matching unit computation [6]. The SOFM has nodes that can easily get trapped in regions of low density and can therefore simply lose the ability to represent the underlying topology of the input pattern.<br />

input pattern. For instance, let us assume there are two high<br />

density regions in the input space, representing two distinct<br />

clusters. Let us also assume that there are maximum two<br />

nodes in the structure of SOFM to correctly allocate both<br />

regions. If those two nodes were separated by a third node<br />

and each converged to the two adjacent regions of high<br />

density, then the third node could easily get trapped in<br />

between the regions. As a result, it can change the true<br />

boundaries of high density classes by pulling some of the<br />


samples from the two real clusters and allocating them to the<br />

middle node as is illustrated in Fig. 1. In this figure, the<br />

second node of the SOFM network has dragged some of the<br />

data points from the first cluster and has generated a new but<br />

redundant class. The tree structure of SOTM, however, is<br />

successful in determining the high-density regions.<br />

The problem with the SOTM algorithm is two-fold: it cannot suitably decide on the relevant number of classes,<br />

and it often loses track of the true query position. The decision on<br />

which clusters are relevant in the SOTM is postponed until<br />

after the algorithm has converged. This is because there is<br />

no innate controlling process available for the algorithm to<br />

influence cluster generation around the query centre (the<br />

SOTM clusters entirely independently). Losing a sense of<br />

query location within the input space can have an undesired<br />

effect on the true structure of the relevant class and can<br />

force the SOTM algorithm to spawn new clusters and form<br />

unnecessary boundaries within the query class as is<br />

illustrated in Fig. 2. In this figure, the SOTM forms a<br />

boundary near the query, contaminating relevant samples,<br />

whereas some supervision is maintained in the DSOTM<br />

case, preventing unnecessary boundaries from forming.<br />

Therefore, retaining some degree of supervision on the<br />

cluster generation around the query class seems to be vital.<br />

Due to the limitations of SOTM, we propose the<br />

Directed SOTM (DSOTM) algorithm in this work. In the<br />

DSOTM algorithm, the decision on the association of an input pattern with the query image is made gradually as<br />

each sample is presented to the system. It also contains a controlling<br />

mechanism that keeps track of the query centre by forcing<br />

the centre of relevant class to remain in the vicinity of the<br />

query position. Therefore, it can dynamically control<br />

generation of new centres and can determine the relevance<br />

of input samples, with respect to the query, as the tree<br />

structure grows. On the other hand, it limits the synaptic<br />

vector adjustments according to its reinforced learning rules<br />

and constrains cluster generation by preventing the<br />

spawning of redundant centres around the query position;<br />

since this part of the map is already occupied by relevant<br />

class centre.<br />

The DSOTM algorithm is summarized as follows:<br />

Initialization: Choose a root node at random from the available set of input vectors, {x_k}, k = 1, ..., K.<br />

J is the total number of centroids (initially set to 1) and K is<br />

the total number of inputs (i.e., images);<br />

Authorized licensed use limited to: <strong>Ryerson</strong> <strong>University</strong> Library. Downloaded on July 7, 2009 at 11:49 from IEEE Xplore. Restrictions apply.


Fig. 2: Two-dimensional mapping. (Left) Input pattern with 5 distinct clusters; (Middle) 14 centres are generated using SOTM; (Right) 5 centres are<br />

generated using DSOTM. Over-classification around the query (triangle) will result in erroneous relevance identification.<br />

Similarity Measurement: Randomly select a new data point, x, and find the best-matching (winning) centroid, j*, by<br />

minimizing a predefined Euclidean distance criterion in (1).<br />

Updating: If ||x(t) - w_{j*}(t)|| <= H(t), where H(t) is the hierarchy function used to control the levels of the tree and<br />

decays exponentially over time from its initial value, H(t_0) > alpha, according to H(t+1) = lambda * H(t) * exp(-t/rho),<br />

where rho = max(t) / log10[H(t_0)] and lambda is the threshold constant, 0 < lambda < 1, then the winning<br />

centre is close enough to the query: mark x(t) as a relevant sample and update its centroid (winning neuron) toward the<br />

query position according to the degree of resemblance of the sample.<br />
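The hierarchy-driven spawn-or-update logic described here can be sketched in a few lines. This is an illustrative reconstruction, not the authors' implementation: the function names, the learning rates `alpha` and `beta`, and the simplified one-step decay are all assumptions.

```python
import numpy as np

def hierarchy(h0, lam, rho, t):
    """Hierarchy function H(t): decays exponentially from its initial
    value h0, controlling when the tree may spawn a new level."""
    return lam * h0 * np.exp(-t / rho)

def directed_update(x, centroids, query, h, alpha=0.1, beta=0.5):
    """One DSOTM-style step.  If the sample x is farther than H(t) from
    every centre, spawn a new centre (a new tree node); otherwise move
    the winning centre toward the sample and, in the 'directed' spirit,
    also pull it toward the query so the relevant class stays anchored
    near the query position."""
    d = np.linalg.norm(centroids - x, axis=1)
    j = int(np.argmin(d))
    if d[j] > h:
        centroids = np.vstack([centroids, x])   # spawn a new centroid
        return centroids, len(centroids) - 1
    centroids[j] = centroids[j] + alpha * (x - centroids[j])
    centroids[j] = centroids[j] + beta * alpha * (query - centroids[j])
    return centroids, j
```

A far-away sample spawns a second node, while a nearby sample only nudges the existing centre toward the sample and the query.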


Fig. 3: Two-dimensional mapping. (Left) Input pattern with 7 distinct clusters; (Middle) 8 centres are generated using Ncut; (Right) 7 centres are<br />

generated using SONcut. Over-classification around the query (triangle) will result in erroneous classification of the relevant class.<br />

nodes in the input pattern), and assoc(A, A) and assoc(B, B) measure the total intra-cluster similarity<br />

(association) in A and B; assoc(A, V) is the total connection from nodes in cluster A to all nodes in the<br />

graph, and assoc(B, V) is defined similarly. w_mn is a nonnegative weight function measuring the degree of<br />

similarity between two samples of the input pattern and is defined in (8), where d(p, q) is a pre-defined<br />

distance metric (i.e., the Euclidean distance) and k is a user-defined constant that controls the decreasing<br />

rate of the weight function; it is empirically set to 0.2. By using this function, the smallest eigenvector<br />

remains constant and Ncut can find relatively correct partitions [2].<br />

As Shi and Malik have also discussed, the optimal partitioning (minimum possible Ncut) can be computed by<br />

solving the generalized eigenvalue system. The second smallest eigenvector of the generalized eigensystem is<br />

then used to partition the graph.<br />

In this paper we have used the Ncut algorithm [2] for the purpose of unsupervised data clustering. The Ncut<br />

partitioning method can be recursively applied to the input pattern to generate more than two clusters.<br />

Deciding on the maximum number of centres in the input pattern at which to stop the clustering process is a<br />

challenging problem. In this work, we have integrated the Ncut algorithm with the principles found in DSOTM<br />

to automatically estimate the appropriate number of clusters in the input pattern and set the maximum allowed<br />

Ncut accordingly. We call this Ncut algorithm with self-oriented centre detection the Self-Organizing<br />

Normalized cuts, SONcut.<br />

The proposed SONcut algorithm is as follows:<br />

Initialization: Choose a root node at random from the available set of input vectors, {x_k}, k = 1, ..., K.<br />

N is the maximum allowed Ncut (initially set to 1) and K the total number of inputs;<br />

Similarity Measurement: Randomly select a new data point, x, and find the winning centroid, n*, by minimizing<br />

a predefined distance criterion in (1);<br />

Maximum Allowed Ncut Estimation: If ||x(t) - w_{n*}|| > H(t), where H(t) is defined similarly to the hierarchy<br />

function used in the DSOTM algorithm, then increment the maximum allowed Ncut by 1;<br />

Continuation: Continue with the Similarity Measurement step until no noticeable changes in the feature map are<br />

observed;<br />

Graph Generation: Given the input pattern, set up a weighted graph, G = (V, E), compute the weight, w_mn, on<br />

each edge, E_mn, using (8) and create the affinity, W, and diagonal, D, matrices;<br />

Eigensystem Transformation: Solve (D - W)x = lambda * D * x for the eigenvector with the smallest eigenvalue;<br />

Graph Bipartition: Use the eigenvector with the second smallest eigenvalue to bipartition the graph;<br />

Partitioning Continuation: Consider the current partitions for supplementary subdivision. Continue with<br />

repartitioning until the Ncut value reaches its maximum allowed.<br />

In summary, we have proposed an unsupervised hierarchical Ncut algorithm that is able to estimate the maximum<br />

number of allowed Ncuts by training the algorithm using the principles found in the DSOTM architecture. Thus,<br />

by dynamically adapting the Ncut algorithm to the nature of the input pattern, the problem of over-partitioning<br />

the relevant class can be prevented. Fig. 3 depicts the importance of adopting such a predictive mechanism for<br />

the Ncut clustering algorithm and illustrates the effectiveness of employing it to avoid over-classification<br />

around the query centre.<br />
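The graph-generation and bipartition steps of normalized cuts can be sketched numerically. This is a generic illustration with scipy: the Gaussian affinity and the `k = 0.2` scale follow the text, but everything else, including the sign-based split of the Fiedler vector, is a simplifying assumption.

```python
import numpy as np
from scipy.linalg import eigh

def ncut_bipartition(X, k=0.2):
    """Split the rows of X into two clusters with normalized cuts:
    build the affinity matrix W from pairwise distances, form the
    degree matrix D, solve the generalized eigensystem
    (D - W) v = lam * D * v, and threshold the second-smallest
    eigenvector (the Fiedler vector) at zero."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    W = np.exp(-(d ** 2) / k)        # nonnegative similarity weights
    D = np.diag(W.sum(axis=1))
    _, vecs = eigh(D - W, D)         # symmetric generalized eigensolver
    return vecs[:, 1] >= 0           # sign split on the Fiedler vector

# two tight, nearby 2-D blobs should land in different partitions
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.05, (5, 2)),
               rng.normal(1.0, 0.05, (5, 2))])
labels = ncut_bipartition(X)
```

`scipy.linalg.eigh(a, b)` solves the generalized symmetric problem directly, which matches the (D - W)x = lambda*D*x formulation in the text.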

Previously in [7], we proposed an automatic CBIR engine<br />

that was structured around an unsupervised learning<br />

algorithm, the DSOTM. To reduce the gap between high-<br />

level concepts (semantics) and low-level statistical features<br />

and to evolve the search process according to what the<br />

system believes to be the significant content within the<br />

query, the above engine was integrated with a process of<br />

feature weight detection using genetic algorithms (GA) as<br />

illustrated in Fig. 4b. In this paper we use a relatively simpler CBIR architecture (see Fig. 4a and Fig. 5)<br />

solely to compare the abilities of the proposed hierarchical clustering algorithms for data classification<br />

with those of three other techniques: SOTM, SOFM, and Ncut.<br />



Fig. 4a: Machine-controlled CBIR system [1]. Fig. 4b: Machine-controlled GA-based CBIR system [7].<br />

Fig. 5: Another representation of the CBIR system of Fig. 4a.<br />

The whole idea of the automatic image retrieval process is to unsupervisedly tailor the retrieval process to<br />

the user's notion of similarity by utilizing an unsupervised learning technique to perform the required<br />

decision making about the relevance of retrieved images instead of the human user.<br />

This automatic refinement mechanism is made possible by adopting a flexible architecture that lets the<br />

classifier learn from and re-adjust to the nature of the input pattern to a greater extent using predefined<br />

competitive learning algorithms: a method that is capable of giving the network a conforming ability to<br />

perform a variety of computational tasks [5].<br />

In Fig. 5, the second unit of the Initial Search module, Feature Extraction 1, deals with calculating features<br />

from a high-volume image database. Consequently, a standard set of content descriptors (an example might be<br />

MPEG-7) is extracted in this module to provide a more generic and rapid interface to existing databases based<br />

on the chosen standard. The extracted features are used to retrieve the most similar images (in the relative<br />

sense) based on a predefined distance metric. The top Q images are then displayed back to the user through an<br />

Interface block. Subsequently, the user may request an automatic search, which operates independently. Upon<br />

this request, control of the system is switched to the Automatic Search module, wherein another set of<br />

features with higher perceptual quality is extracted from the top Q retrieved images from the initial search,<br />

using Feature Extraction 2. Although computation of those features could be intensive, this module allows for<br />

the use of more proprietary or specialty features. Such a module may enhance perceptual discrimination beyond<br />

that which might otherwise be possible through standard features alone. These features are then used as seeds<br />

to train the Unsupervised Classifier. A new query, based upon the images selected from the previous<br />

iterations, is then chosen to substantially represent the relevant class through the Query Modification module.<br />

In this work, Colour Histograms, Colour Moments, Wavelet Moments, and Fourier Descriptors were used in Feature<br />

Extraction 1, whereas Hu's seven moment invariants (HSMI) and Gabor Descriptors, accompanied by Colour<br />

Histograms and Colour Moments, were used in Feature Extraction 2. Colour histograms and colour moments were<br />

computed in the HSV and RGB colour spaces, respectively. Wavelet Moments were extracted from the mean, mu, and<br />

standard deviation, sigma, of a three-level wavelet transform applied to an image. Boundary-based Fourier<br />

shape parameters were extracted by converting the edge parameters from Cartesian to Polar coordinates and,<br />

subsequently, applying the Fast Fourier Transform (FFT) to obtain the top 10 low-frequency components. Texture<br />

features were computed from the mu and sigma of Gabor-filtered images to construct 48-dimensional feature<br />

vectors, and finally, region-based HSMI shape parameters were extracted by converting the colour images into<br />

binary segmented ones and then extracting the shape parameters from those segmented images.<br />

A number of experiments were conducted to compare the behaviour of the automatic CBIR engine of Fig. 5 in the<br />

presence of various unsupervised clustering algorithms: the Ncut, SONcut, SOFM, SOTM, and DSOTM classifiers.<br />

The simulations were carried out using a subset of the Corel image database consisting of nearly 5,100 JPEG<br />



TABLE I<br />
EXPERIMENTAL RESULTS IN TERMS OF RR FOR THE CBIR SYSTEM WITH<br />
NO FEATURE WEIGHT DETECTION MECHANISM (FIG. 4)<br />

Classifier^a | Set A | Set B | Set C | Average<br />
Ncut         | 41.3% | 47.3% | 51.9% | 46.8%<br />
SOFM         | 51.1% | 44.3% | 56.6% | 50.7%<br />
SOTM         | 51.5% | 51.1% | 58.4% | 53.7%<br />
SONcut       | 51.9% | 52.8% | 57.0% | 53.9%<br />

TABLE II<br />
EXPERIMENTAL RESULTS IN TERMS OF RR FOR THE CBIR SYSTEM WITH<br />
GA-BASED FEATURE WEIGHT DETECTION ALGORITHM (FIG. 5) [7]<br />

Classifier^a | Set A | Set B | Set C | Average<br />
Ncut         | 67.3% | 64.5% | 65.1% | 65.6%<br />
SOFM         | 65.1% | 66.7% | 68.3% | 66.7%<br />
SOTM         | 66.8% | 72.1% | 74.4% | 71.1%<br />
SONcut       | 68.8% | 72.8% | 73.9% | 71.8%<br />
DSOTM        | 78.3% | 76.7% | 80.5% | 78.5%<br />

^a Ncut: Normalized Graph Cuts; SONcut: Self-Organizing Normalized Graph Cuts; SOFM: Self-Organizing Feature<br />
Maps; SOTM: Self-Organizing Tree Maps; DSOTM: Directed Self-Organizing Tree Maps.<br />

colour images, covering a wide range of real-life photos, from 51 different categories. Each category consisted<br />

of 100 visually associated objects to simplify measurements of the retrieval accuracy (RR) during the<br />

experiments. Three sets of 51 images were drawn from the database to form sets A, B, and C. Each set consists<br />

of randomly selected images such that no two images were from the same class. Retrieval results were<br />

statistically calculated from each of the 3 sets. In the simulations, a total of the 16 most relevant images<br />

were retrieved to evaluate the precision of the retrieval.<br />
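As a concrete reading of this protocol, the per-query retrieval rate is simply the fraction of the top-16 returned images that share the query's category; the function name below is an illustrative assumption.

```python
def retrieval_rate(retrieved_labels, query_label):
    """Precision of one retrieval: the fraction of returned images
    whose category label matches the query's category."""
    hits = sum(1 for lab in retrieved_labels if lab == query_label)
    return hits / len(retrieved_labels)

# e.g. 12 of the 16 retrieved images share the query's category
rr = retrieval_rate(['cat'] * 12 + ['dog'] * 4, 'cat')  # 0.75
```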

The experiment results are illustrated in Table 1. In the Ncut, SOFM, and SOTM algorithms, the maximum number<br />

of allowed clusters was set to P, P < Q, where Q is the total number of retrieved images from the initial<br />

search; P was empirically set to 8. A 4x2 grid topology was used in the SOFM structure to locate a maximum of<br />

8 possible cluster centres. A hard decision on the resemblance of the input samples was made: if a sample is<br />

closer to one centre than to any other centre, in terms of a predefined distance metric, it belongs to that<br />

centre. Table 2 illustrates the results after feature weight detection using the GA-based method described in [7].<br />

Although the Ncut algorithm is a top-down classification process that aims to extract global impressions of<br />

the input pattern and present a hierarchical description of it, employing a predictive mechanism to estimate<br />

the true number of clusters prior to spawning new neurons proves to be beneficial. This predictive mechanism<br />

enforces a frontier on the classification process and inhibits unnecessary centres from being generated around<br />

the query position. As a result, a more accurate impression of relevance can be achieved by using the SONcut<br />

algorithm.<br />

The SOTM algorithm not only extracts global intuitions of the input pattern, it also introduces some degree of<br />

localization into the discriminative process to achieve maximal discrimination at any given resolution (or<br />

number of classes). Moreover, the ability of the SOTM to span and force division in the extremes of the data<br />

early on, delaying the division of the most similar aspects until later stages of learning, together with its<br />

flexible tree-like topology (more plastic than that of the SOFM), makes it essentially sensitive to the most<br />

dominant differences in the data and, thus, less prone to classification errors and more attractive for<br />

retrieval applications.<br />

Despite all the above advantages of using SOTM-based classifiers, retaining some degree of supervision to<br />

prevent unnecessary boundaries from forming around the query class seems to be crucial. The DSOTM algorithm<br />

not only provides partial supervision of cluster generation by forcing divisions away from the query class, it also makes a<br />

soft decision on resemblance of the input patterns by<br />

constantly modifying each sample's membership during<br />

learning phase of the algorithm. As a result, a more robust<br />

tree structure as well as a better sense of likeness can be<br />

finally achieved.<br />

V. CONCLUSION<br />

The framework for a novel unsupervised clustering<br />

algorithm based on DSOTM was introduced in this work. A<br />

modification on the current structure of Ncut algorithm was<br />

also proposed in this paper. This modification provides a-<br />

priori knowledge for the algorithm to determine appropriate<br />

number of clusters, based on principles found in DSOTM,<br />

prior to its hierarchical clustering operation. Performance of<br />

the proposed methods was compared with other<br />

conventional clustering methods (i.e., Ncut, SOFM, and<br />

SOTM) using a computer-controlled CBIR system.<br />

SOTM outperforms both Ncut and SOFM and performs<br />

fairly close to SONcut even with its blind top-down data<br />

exploration. This is due to its flexible tree-shape structure as<br />

well as its competitive learning algorithm that injects some<br />

degree of localization into the discriminative process. The<br />

experimental results also illustrate promising performance of<br />

utilizing DSOTM in the structure of automatic CBIR<br />

engines.<br />

REFERENCES<br />

[1] P. Muneesawang and L. Guan, "Minimizing user interaction by automatic and semiautomatic relevance feedback<br />
for image retrieval," Proc. IEEE Int. Conf. on Image Processing, Rochester, USA, vol. 2, pp. 601-604, Sept. 2002.<br />

[2] J. Shi and J. Malik, "Normalized cuts and image segmentation," IEEE Trans. Pattern Analysis and Machine<br />
Intelligence, vol. 22, no. 8, pp. 888-905, Aug. 2000.<br />

[3] H. S. Kong, "The Self-Organizing Tree Map and its Applications in Digital Image Processing," PhD Thesis,<br />
<strong>University</strong> of Sydney, Australia, 1998.<br />

[4] T. Kohonen, "The self-organizing map," Proc. of the IEEE, vol. 78, no. 9, pp. 1464-1480, Sept. 1990.<br />

[5] S. Haykin, Neural Networks: A Comprehensive Foundation, 2nd ed., Prentice Hall, Inc., 1999.<br />

[6] J. Randall, L. Guan, X. Li, and W. Zhang, "Investigations of the self-organizing tree map," Proc. of 6th<br />
International Conference on Neural Information Processing, vol. 2, pp. 724-728, Nov. 1999.<br />

[7] K. Jarrah, M. Kyan, S. Krishnan, and L. Guan, "Computational intelligence techniques and their applications<br />
in content-based image retrieval," IEEE Int. Conf. on Multimedia &amp; Expo (ICME), submitted for publication, 2006.<br />



1-4244-0367-7/06/$20.00 ©2006 IEEE<br />

ICME 2006<br />



DISCRETE POLYNOMIAL TRANSFORM FOR DIGITAL IMAGE WATERMARKING<br />

APPLICATION<br />

Lam Le, Sridhar Krishnan, and Behnaz Ghoraani<br />

Dept. of Elec. & Comp. Eng., <strong>Ryerson</strong> <strong>University</strong>, Toronto, Canada<br />

E-mail: (lle)(krishnan)@ee.ryerson.ca and bghoraan@rnet.ryerson.ca<br />

ABSTRACT<br />

In this study, we propose a new way to detect the image watermark<br />

messages modulated as linear chirp signals. The<br />

spread spectrum image watermarking algorithm embeds linear<br />

chirps as watermark messages. The phase of the chirp<br />

represents watermark message such that each phase corresponds<br />

to a different message. We extract the watermark message<br />

using a phase detection algorithm based on Discrete<br />

Polynomial Phase Transform (DPT). The DPT models the signal<br />

as polynomial and uses ambiguity function to estimate the<br />

signal parameters. The proposed method not only detects the presence of the watermark, but also extracts the<br />

embedded watermark bits and ensures the message is received correctly. The robustness of the proposed<br />

detection scheme has been evaluated using Checkmark benchmark attacks, and we found that, up to a maximum bit<br />

error rate of 15%, the watermark message is still correctly detected using the DPT.<br />

Keywords: Image Watermarking, Spread Spectrum, Data Hiding,<br />

Discrete Phase Polynomial Transform, Hough-Radon Transform,<br />

Chirp Modulation<br />

1. INTRODUCTION<br />

Chirp signals are present ubiquitously in many areas of science<br />

and engineering, so the Discrete Polynomial Phase Transform<br />

(DPT) [1][2] has been extensively studied in recent years<br />

to estimate the phase parameters of the chirp signals. One of<br />

the recent applications of chirp signals is in data watermarking.<br />

The huge success of the Internet allows for the transmission,<br />

wide distribution, and access of electronic data in an<br />

effortless manner. Content providers are therefore faced with the challenge of how to protect their electronic<br />

data, and watermarking is one of the possible solutions: multimedia data creators and distributors are able to<br />

prove ownership of intellectual property rights without forbidding other individuals from copying the<br />

multimedia content itself. In<br />

this study, we propose a chirp-based detection method to detect<br />

watermark messages in an image watermarking scheme<br />

[3][4] which embeds linear chirps as imperceptible and statistically<br />

undetectable watermark messages. Different chirp<br />

rates, i.e., phases, represent watermark messages such that<br />

each phase corresponds to a different message. The narrowband<br />

watermark messages are spread with a watermark key<br />

(PN sequence) across a wider range of frequencies before embedding.<br />

The resulting wideband noise is added to the perceptually<br />

significant regions of the original image. We use<br />

the block-based discrete cosine transform (DCT) scheme for<br />

inserting the watermark. As a result of image manipulations, some message bits extracted by the detector may<br />

be in error, potentially resulting in the detection of the wrong watermark message. The proposed watermark<br />

detection algorithm detects the presence of the watermark and extracts the embedded watermark message bits<br />

even in the presence of bit errors in the received watermark message. Our motivation for using the DPT<br />

technique as a watermark detector is to achieve high estimation accuracy with less computational complexity.<br />

2. DISCRETE POLYNOMIAL PHASE TRANSFORM<br />

(DPT)<br />

DPT is a parametric signal analysis approach for estimating<br />

the phase parameters of polynomial phase signals. The phase<br />

of many man-made signals such as those used in radar, sonar<br />

and communications can be modeled as a polynomial. The<br />

discrete version of a polynomial phase signal can be expressed as:<br />

x(n) = b_0 \exp\left( j \sum_{m=0}^{M} a_m (n\Delta)^m \right)   (1)<br />

where M is the polynomial order (M = 2 for a chirp signal),<br />

0 ≤ n ≤ N − 1, N is the signal length and ∆ is the sampling<br />

interval. The principle of DPT is as follows. When the DPT is<br />

applied to a mono-component signal with polynomial phase<br />

of order M, it produces a spectral line [1]. The position of this spectral line at frequency ω0 provides an<br />

estimate of the coefficient â_M. After â_M is estimated, the order of the polynomial is reduced from M to M-1<br />

by multiplying the signal with exp(-j â_M (nΔ)^M). This reduction of order is called phase unwrapping. The<br />

next coefficient â_{M-1} is estimated in the same way by taking the DPT of the polynomial phase signal of<br />

order M-1 above. The procedure is repeated until all the coefficients of the polynomial phase are estimated.<br />

The DPT of order<br />



M of a continuous phase signal x(n) is defined as the Fourier transform of the higher-order DP_M[x(n), τ]<br />
operator:<br />

DPT_M[x(n), \omega, \tau] \equiv \mathcal{F}\{ DP_M[x(n), \tau] \} = \sum_{n=(M-1)\tau}^{N-1} DP_M[x(n), \tau] \exp(-j\omega n\Delta),   (2)<br />

where τ is a positive number and:<br />

DP_1[x(n), \tau] := x(n)   (3)<br />

DP_2[x(n), \tau] := x(n)\, x^*(n - \tau)   (4)<br />

DP_M[x(n), \tau] := DP_2[\, DP_{M-1}[x(n), \tau],\ \tau \,]   (5)<br />

The coefficients a_M (a_1 and a_2) are estimated by applying the following formula [1]:<br />

\hat{a}_M = \frac{1}{M!\,(\tau_M \Delta)^{M-1}} \arg\max_{\omega} \{ |DPT_M[x(n), \omega, \tau]| \},   (6)<br />

where<br />

DPT_1[x(n), \omega, \tau] = \mathcal{F}\{ x(n) \},   (7)<br />

DPT_2[x(n), \omega, \tau] = \mathcal{F}\{ x(n)\, x^*(n - \tau) \},   (8)<br />

and<br />

\hat{a}_0 = \mathrm{phase}\left( \sum_{n=0}^{N-1} x(n) \exp\left( -j \sum_{m=1}^{M} \hat{a}_m (n\Delta)^m \right) \right)   (9)<br />

\hat{b}_0 = \frac{1}{N} \left| \sum_{n=0}^{N-1} x(n) \exp\left( -j \sum_{m=1}^{M} \hat{a}_m (n\Delta)^m \right) \right|   (10)<br />

The estimated coefficients are used to synthesize the polynomial phase signal:<br />

\hat{x}(n) = \hat{b}_0 \exp\left( j \sum_{m=0}^{M} \hat{a}_m (n\Delta)^m \right)   (11)<br />
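For a linear chirp (M = 2), the estimate in (6) reduces to locating the spectral peak of the DP2 product x(n)x*(n - τ). The sketch below, with Δ = 1 and a simple zero-padded FFT peak search (both simplifying assumptions), recovers the chirp rate this way.

```python
import numpy as np

def estimate_chirp_rate(x, tau):
    """Order-2 DPT estimate of a2 for x(n) = exp(j(a1*n + a2*n^2)),
    with Delta = 1.  DP2[x](n) = x(n) * conj(x(n - tau)) is a pure
    tone at omega0 = 2*a2*tau, so the spectral peak gives
    a2 = omega0 / (2*tau)."""
    dp2 = x[tau:] * np.conj(x[:-tau])        # the DP2 operator
    nfft = 8 * len(dp2)                      # zero-pad for a finer grid
    spec = np.abs(np.fft.fft(dp2, nfft))
    omega0 = np.fft.fftfreq(nfft)[np.argmax(spec)] * 2 * np.pi
    return omega0 / (2 * tau)

# synthesize a unit-amplitude linear chirp and recover its rate
N, a1, a2 = 512, 0.3, 0.002
n = np.arange(N)
x = np.exp(1j * (a1 * n + a2 * n ** 2))
a2_hat = estimate_chirp_rate(x, tau=N // 2)
```

The multiply-and-conjugate step is exactly the phase-unwrapping idea of the text: it collapses the quadratic phase into a single tone whose frequency encodes a2.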

3. CHIRP-BASED WATERMARKING<br />

The watermarking method used in this study is a novel linear-chirp-based technique applied to image and audio<br />

signals [3][4]. The chirp signal x(t) (or m) is quantized to the values -1 and 1, yielding m^q, which is then<br />

embedded into the multimedia files. The quantization process introduces harmonics in the time-frequency<br />

representation, but the slope of the quantized chirp is the same as that of the chirp signal x(t). The details<br />

of watermark embedding and extraction follow.<br />


3.1. Watermark embedding<br />

Each bit m_k^q of m^q is spread with a cyclically shifted version p_k of a binary PN sequence called the<br />

watermark key. The results are summed together to generate the wideband noise vector w:<br />

w = \sum_{k=0}^{N-1} m_k^q\, p_k,   (12)<br />

where N is the number of watermark message bits in m^q.<br />
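Equation (12) can be sketched directly. The despreading step, correlating with each shifted key, is not spelled out in this section and is included here only as an illustrative assumption of how the bits come back out.

```python
import numpy as np

def spread(m_q, p):
    """Sum of +/-1 message bits times cyclic shifts of the PN key p
    (Eq. 12)."""
    w = np.zeros(len(p))
    for k, bit in enumerate(m_q):
        w += bit * np.roll(p, k)     # p_k: the key cyclically shifted by k
    return w

def despread(w, p, n_bits):
    """Recover each bit from the sign of the correlation with its
    shifted key; for a good PN sequence, the other bits' contributions
    largely average out."""
    return [int(np.sign(w @ np.roll(p, k))) for k in range(n_bits)]

rng = np.random.default_rng(1)
p = rng.choice([-1.0, 1.0], size=255)    # binary PN watermark key
bits = [1, -1, 1, 1, -1]
w = spread(bits, p)
recovered = despread(w, p, len(bits))
```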

The wideband noise w is then carefully shaped and added to the audio signal or to the DCT blocks of the image<br />

so that it causes an imperceptible change in signal quality. In the audio watermarking<br />

application, to make the watermark message imperceptible,<br />

the amplitude level of the wideband noise w is scaled<br />

down to be about 0.3 of the dynamic range of the signal. In<br />

the image watermarking application, the length of w to be embedded<br />

depends on the perceptual entropy of the image. To<br />

embed the watermark into the image, the model based on the<br />

just noticeable difference (JND) paradigm was utilized. The<br />

JND model based on DCT was used to find the perceptual entropy<br />

of the image and to determine the perceptually significant<br />

regions to embed the watermark. In this method, the image<br />

is decomposed into 8×8 blocks. Taking the DCT on the<br />

block b results in the matrix Xu,v,b of the DCT coefficients.<br />
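The spreading step of Eq. (12) can be sketched in numpy as follows; the message length and random key below are illustrative assumptions, not the paper's actual parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: 4 message bits spread by a length-256 binary PN key
pn = rng.choice([-1.0, 1.0], size=256)   # watermark key
bits = rng.choice([-1.0, 1.0], size=4)   # quantized chirp bits m^q_k

# Eq. (12): each bit modulates a cyclically shifted copy of the key,
# and the modulated copies are summed into the wideband noise vector w
w = sum(b * np.roll(pn, k) for k, b in enumerate(bits))

# Despreading with the same shifts recovers each bit's sign (cf. Eq. 16)
recovered = np.sign([w @ np.roll(pn, k) for k in range(len(bits))])
```

Because distinct cyclic shifts of a long random ±1 key are nearly orthogonal, the inner product with $p_k$ is dominated by the $k$-th bit.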

The watermark encoder for the DCT scheme is described as<br />

$X^*_{u,v,b} = \begin{cases} X_{u,v,b} + t^C_{u,v,b}\, w_{u,v,b}, & \text{if } X_{u,v,b} > t^C_{u,v,b}; \\ X_{u,v,b}, & \text{otherwise} \end{cases}$ (13)<br />

where $X_{u,v,b}$ are the DCT coefficients, $X^*_{u,v,b}$ are the watermarked DCT coefficients, $w_{u,v,b}$ is obtained from the wideband noise vector $w$, and the threshold $t^C_{u,v,b}$ is the computed JND determined for various viewing conditions such as minimum viewing distance, luminance sensitivity and contrast masking. Fig. 1 shows the block diagram of the described watermark embedding scheme.<br />

[Block diagram: linear chirp message $m^q$ and PN sequence $p$ → circular shifter ($p_k$) → modulator ($w$) → watermark insertion, using JNDs $J_{u,v}$ computed from the block-based DCT $X_{u,v}$ of the original image → watermarked coefficients $X^*_{u,v}$ → watermarked image.]<br />
Fig. 1. Watermark embedding scheme.<br />
Authorized licensed use limited to: <strong>Ryerson</strong> <strong>University</strong> Library. Downloaded on July 7, 2009 at 11:25 from IEEE Xplore. Restrictions apply.<br />


3.2. Watermark detection<br />

Fig.2 shows the block diagram of the described watermark<br />

decoding scheme. The detection scheme for the DCT based<br />

watermarking can be expressed as<br />

$\hat{w}_{u,v,b} = \dfrac{\hat{X}^*_{u,v,b} - X_{u,v,b}}{t^C_{u,v,b}}$, (14)<br />
$\hat{w} = \begin{cases} \hat{w}_{u,v,b}, & \text{if } X_{u,v,b} > t^C_{u,v,b}; \\ 0, & \text{otherwise} \end{cases}$ (15)<br />

where $\hat{X}^*_{u,v,b}$ are the coefficients of the received watermarked image, and $\hat{w}$ is the received wideband noise vector. Due to intentional and non-intentional attacks such as lossy compression, shifting and down-sampling, the received chirp message $\hat{m}^q$ will differ from the original message $m^q$ by a bit error rate (BER). We use the watermark key $p_k$ to despread $\hat{w}$, and integrate the resulting sequence to generate a test statistic $\langle \hat{w}, p_k \rangle$. The sign of the expected value of the statistic depends only on the embedded watermark bit $m^q_k$. Hence the watermark bits can be estimated using the decision rule:<br />

$\hat{m}^q_k = \begin{cases} +1, & \text{if } \langle \hat{w}, p_k \rangle > 0; \\ -1, & \text{if } \langle \hat{w}, p_k \rangle < 0. \end{cases}$ (16)<br />

We repeat the bit estimation process until we have an estimate of all the transmitted bits. Though it is possible to form an estimate of the chirp sequence from the received bits, we improve the robustness of the detection algorithm by detecting the chirp using the Discrete Polynomial Phase Transform (DPT), a phase detection algorithm.<br />
Fig. 2. The proposed watermark detection scheme.<br />

3.3. DPT-based watermark estimation method<br />

The embedded watermarks in this algorithm are linear chirps, and the received watermark can be represented as<br />
$x(n) = \exp\left( j\, a_1 (n\Delta) + j\, a_2 (n\Delta)^2 \right)$ (17)<br />

Since the DPT is able to estimate the polynomial coefficients of chirp signals with a very short computation time, we apply it to estimate the coefficients $a_1$ and $a_2$. Fig. 3 shows the original and estimated watermark messages at bit error rates of 13.6% and 19.3%; the correlation coefficients between the original and estimated watermarks are 0.9891 and 0.9516, respectively.<br />
Fig. 3. Original and estimated watermarks at (a) a BER of 13.6% and a correlation coefficient of 0.9891, and (b) a BER of 19.3% and a correlation coefficient of 0.9516.<br />

Our computer simulations show that the required calculation time is about 630 times shorter than that of a similar chirp watermark detection scheme [3][4][5].<br />

4. RESULTS AND DISCUSSION<br />

We evaluated the proposed scheme using 10 different images of size 512×512. The sampling frequency $f_{sb}$ of the watermarks equals 1 kHz. Hence the initial and final frequencies $f_{0b}$ and $f_{1b}$ of the linear chirps representing all watermark messages are constrained to [0, 500] Hz. We embed these chirps into the images with a chip length of 10000 samples. In our tests, we used a single watermark sequence having 182 message bits. To measure the robustness of the watermarking algorithm, we performed the attacks specified in the Checkmark benchmark [6]. Table 1 shows the watermark detection results for ten watermarked images after performing these attacks. The numbers in brackets under the 'Attack' category represent the number of attacks in that particular class. The 'Detection Average' represents the percentage of attacks for which the watermark was detected under each class. We make the decision on correct detection of the watermark based on the correlation between the estimated chirp and the embedded watermark. Experimentally, the threshold for the correlation coefficient is set to 0.9. The maximum BER for the MAP attack with a 100% detection rate is 15%, and in the case of the JPEG attack, in which the maximum BER is 19.9%,<br />



the DPT was able to detect 100% of the watermark messages. Table 2 shows the performance of the DPT-based technique for one of the images under the specified attacks. The results demonstrate that the proposed DPT-based scheme reliably estimates the watermark messages up to a BER of 15%; in many cases it also detects the watermark up to a BER of 19%.<br />
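The detection decision just described (a correlation-coefficient threshold of 0.9 between the estimated and embedded chirps) amounts to the following check; the function name is our own, while the threshold comes from the text.

```python
import numpy as np

def watermark_detected(w_embedded, w_estimated, thresh=0.9):
    """Declare a detection when the correlation coefficient between the
    embedded and estimated watermark chirps exceeds the threshold."""
    r = np.corrcoef(w_embedded, w_estimated)[0, 1]
    return r >= thresh
```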

Attack             | DPT  | HRT<br />
Remodulation (4)   | 65   | 57.5<br />
Copy (1)           | 100  | 97<br />
MAP (6)            | 100  | 100<br />
Wavelet (10)       | 84   | 84<br />
JPEG (12)          | 100  | 97.5<br />
ML (7)             | 57   | 56<br />
Filtering (3)      | 100  | 100<br />
Resampling (2)     | 85   | 90<br />
Colour Reduce (2)  | 35   | 45<br />
Table 1. Watermark detection averages (%) of 10 images for the Checkmark benchmark attacks.<br />

Attack           | BER (%) | Correlation<br />
dpr(3)           | 2.84    | 0.9988<br />
dpr(5)           | 11.93   | 0.9954<br />
dprcorr(3)       | 6.25    | 0.9871<br />
dprcorr(5)       | 14.77   | 0.9916<br />
medfilt(2)       | 1.7     | 0.9950<br />
medfilt(3)       | 3.4     | 0.9884<br />
medfilt(4)       | 23.3    | 0.1355<br />
trimmedmean(3)   | 13.63   | 0.9891<br />
trimmedmean(5)   | 31.82   | 0.1274<br />
midpoint(3)      | 3.98    | 0.9965<br />
midpoint(5)      | 23.4    | -0.0008<br />
dither           | 6.25    | 0.9936<br />
thresh           | 19.31   | 0.9516<br />
Table 2. Bit error rates and correlation coefficients of the proposed method for image1 under the specified attacks.<br />

The performance of the algorithm is compared with a similar approach based on the Hough-Radon Transform (HRT) [3][4][5]. Table 1 compares the detection results for the DPT- and HRT-based methods at the same watermarking capacity. The DPT-based algorithm achieves a higher or equal detection rate for seven of the attack types, and has lower computational complexity than the HRT-based method. Typically, the running time of the DPT-based method is about 6000 times less than that of the HRT-based method in Matlab. The watermarking capacity of the DPT-based technique depends on the values of the coefficients $a_1$ and $a_2$. As expected, using a higher resolution for $a_1$ and $a_2$ increases the watermarking capacity; however, it also reduces the robustness of the method. Compared to the previous HRT-based method, the proposed method has a high capacity of 182×182; it is also more robust, as indicated in Table 1.<br />

5. CONCLUSION<br />

In this paper, we proposed a watermark detection method for an image watermarking algorithm that embeds linear chirps as watermark messages. The watermark message is added to the perceptually significant regions of the image to ensure robustness of the watermark to common image processing attacks. A phase detection algorithm based on the DPT detects the phase of the watermark message. The proposed technique is able to detect the chirp message embedded in signals subjected to different BERs due to attacks on the image watermark, and provides fast detection with high accuracy. Our studies confirm the robustness of the algorithm to the Checkmark benchmark attacks.<br />

6. REFERENCES<br />


[1] S. Peleg and B. Friedlander, "The discrete polynomial-phase transform," IEEE Transactions on Signal Processing, vol. 43, no. 8, pp. 1901–1914, 1995.<br />
[2] S. Peleg and B. Friedlander, "Multicomponent signal analysis using the polynomial-phase transform," IEEE Transactions on Aerospace and Electronic Systems, vol. 32, no. 1, pp. 378–387, 1996.<br />
[3] S. Erkucuk, S. Krishnan and M. Zeytinoglu, "Robust Audio Watermarking Using a Chirp Based Technique," IEEE Intl. Conf. on Multimedia and Expo, vol. 2, pp. 513–516, 2002.<br />
[4] A. Ramalingam and S. Krishnan, "A Novel Robust Image Watermarking Using a Chirp Based Technique," IEEE Canadian Conf. on Electrical and Computer Engineering, vol. 4, pp. 1889–1892, 2004.<br />
[5] R.M. Rangayyan and S. Krishnan, "Feature identification in the time-frequency plane by using the Hough-Radon transform," Pattern Recognition, vol. 34, pp. 1147–1158, 2001.<br />
[6] S. Pereira, S. Voloshynovskiy, M. Madueno, S. Marchand-Maillet, and T. Pun, "Second Generation Benchmarking and Application Oriented Evaluation," Information Hiding Workshop III, Pittsburgh, PA, USA, April 2001.<br />


IMPROVING POSITION ESTIMATES FROM A STATIONARY<br />

GNSS RECEIVER USING WAVELETS AND CLUSTERING<br />

Mohammad Aram, Baice Li, Sridhar Krishnan, Alagan Anpalagan<br />

<strong>Ryerson</strong> <strong>University</strong> Multipath Mitigation (RUMM) Lab<br />

Department of Electrical and Computer Engineering, <strong>Ryerson</strong> <strong>University</strong>, Toronto, ON, M5B 2K3<br />
{maram, bli, krishnan, alagan}@ee.ryerson.ca<br />
Bern Grush<br />
Applied Location Corporation, 34 Dodge Rd, Toronto, ON, M1N 2A7, bgrush@appliedlocation.com<br />

Abstract<br />

Many positioning applications utilize global navigation<br />

satellite systems (GNSS) derived position estimates for<br />

stationary positions. Inexpensive navigation-grade receivers<br />

provide estimates within a few meters in relatively open skies,<br />

while more specialized devices, typically distinguished by<br />

specialized antenna design and additional post processing can<br />

achieve sub-meter accuracy. These latter devices can be two<br />

orders of magnitude more expensive than navigation-grade<br />

receivers but are still subject to measurement error due to<br />

severe multipath in built-up areas.<br />

In our experiments we post-process positions computed by<br />

an inexpensive receiver by applying wavelet filtering followed by clustering and characterization. This produces a reliable<br />

and significant reduction in variance of the estimate, a<br />

normalization of the data scatter-distribution and a<br />

characterization of the estimate that is amenable to a wider<br />

range of statistical comparisons and tests than would be<br />

possible for unfiltered, highly non-Gaussian distributions,<br />

especially as occur in urban canyon circumstances.<br />

Keywords. GNSS; Urban canyon; Multipath mitigation;<br />

Wavelets; k-means; RAIM<br />

1. Introduction<br />

Ongoing developments in GNSS space segment (Galileo and<br />

GPS modernization) are poised to provide significantly more<br />

and better ranging signals for positioning applications. Recent<br />

innovation in high-sensitivity receiver technology (HSGNSS)<br />

enables the acquisition of attenuated and obstructed signals.<br />

These additional signals dramatically lower the probability of a<br />

gap [5,6,7] (loss of lock on enough signals to compute a<br />

position) in challenging signal environments such as in "urban<br />

canyon", heavy foliage, indoors, etc. While inertial navigation<br />

may fill in those gaps in dynamic applications (navigation,<br />

logistics tracking), it cannot help stationary or near-stationary<br />

applications such as survey, E911, asset or personnel location,<br />

or metered parking.<br />

HSGNSS signal measurements are biased and especially<br />

noisy due to excessive multipath and low-power signals [2].<br />

Taken together, GPS modernization, Galileo and HSGNSS,<br />

means the potential opportunity of many more applications, but<br />

generally in harsh signal environments. Specific noise sources<br />

are entirely dependent on conditions local to the antenna of the<br />


IEEE CCECE/CCGEI, Ottawa, May 2006<br />


receiver in question and are not addressable by augmentation<br />

such as differential GPS (DGPS) or wide area augmentation<br />

systems (WAAS), or broad-area correction, such as atmospheric,<br />

etc. Even traditional receiver autonomous integrity<br />

monitoring (RAIM) has diminished utility since it was<br />

developed for signal environments with an assumption of zero<br />

or one fault in a field of 5 to 11 signals. We can now project<br />

near-future, integrated GPS/Galileo applications with 4 to 22<br />

signals in harsh environments where many or all signals are<br />

disturbed.<br />

To tackle these harsher signal environments, new antenna<br />

designs [2] and new fault detection and elimination techniques<br />

(FDE) that extend RAIM approaches [6,7,8] are being<br />

developed. Specialized antennas add system costs and the FDE<br />

techniques are computationally complex so that they may be<br />

impractical for larger signal sets.<br />

This paper describes an alternate approach: a process that<br />

includes wavelet filtering, weighted clustering and<br />

characterization of position estimates from a stationary<br />

receiver. This approach results in reduced variance of the<br />

estimate and a normalization of the data-scatter which, in turn,<br />

provides an inexpensive method for applications that require<br />

accuracy of 1-2m for short-dwell readings (under ten minutes)<br />

in many multipath circumstances. As space segment<br />

improvements (Galileo, GPS modernization) and receiver<br />

design improvements (high sensitivity) continue to come onstream,<br />

multipath mitigation such as we propose here tends to<br />

reduce the relative difference in accuracy between open skies<br />

and urban canyon.<br />

The next section of this paper describes our experimental<br />

methods, including data collection and processing algorithms.<br />

The third section describes and demonstrates our results for<br />

each of wavelet filtering, windowing and clustering.<br />

2. Experimental Methods<br />
2.1. Data Collection<br />

To support a variety of experiments, we gathered street-level,<br />

urban canyon, carrier phase and position data at multiple<br />

locations in downtown Toronto (Canada). Four sites were<br />

selected to represent distinct levels of urban canyon effects<br />

ranging from moderate to extreme multipath interference. At<br />



each location we collected ten 15-minute samples over five<br />

sidereal days, for a total of forty 900-second data sets.<br />

Figure 1 shows the data collection setup we used.<br />

For this particular experiment, we simply used the 3D<br />

position estimates generated by the receiver without<br />

consideration for outlier removal. Fig 2 shows a typical sample<br />

showing high positioning variability. Fig 3 demonstrates that<br />

even the geometric mean of a 15-minute sample can be highly<br />

variable in severe multipath. We wish to mitigate both forms of<br />

variability.<br />

Figure 1: Data collection equipment consisted of: u-blox TM-LP 15 (not HS) evaluation kit with u-center ANTARIS software and a laptop. An active antenna was mounted on a portable antenna mount 1 m above the ground. We did not use an external ground plane.<br />

[Fig. 2/Fig. 3 plot area: northing deviation (m) vs. easting deviation (m); histogram of easting deviation of geometric means of 15-min samples (m).]<br />

Figure 3: We sampled the same locations in 15-minute samples over<br />

5 sidereal days. The geometric mean of each of these samples can<br />

drift considerably. At this location, a spread of about 40 meters in<br />

both Easting and Northing is apparent over the 10 samples taken.<br />

2.2. Processing Algorithms<br />

Our process comprises two fundamental steps: filtering<br />

using wavelet analysis, and an inverse-variance weighted<br />

estimate of the mean position using either a moving window or<br />

a k-means algorithm to cluster the data.<br />

[Process flow: position data from receiver → wavelet filtering → moving-window or k-means inverse-variance weighting → Gaussified data scatter with lower variance.]<br />

Since multipath error is a time varying process, wavelet<br />

analysis can be used effectively to mitigate its effects. We<br />

tested various wavelets including Daubechies, Coiflets,<br />

Symlets, Morlet, and Meyer. Although the results from<br />

Symlets and Daubechies were very similar, the analysis was<br />

carried out using the 'Daubechies order 7 (db7)' filter and<br />

wavelet coefficients were modified based on thresholding [4].<br />

Outlier removal was applied to the wavelet output by excluding all points exceeding 3σ from the mean of the filtered data (where σ is the standard deviation).<br />
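The filtering and outlier-rejection steps above can be sketched as follows. This is a simplified stand-in under stated assumptions: a one-level Haar transform with soft thresholding replaces the paper's db7 filtering, and the threshold value is a toy choice rather than a rule from Donoho [4].

```python
import numpy as np

def haar_denoise(x, thresh):
    """One-level Haar wavelet soft-threshold denoiser (toy stand-in for db7).

    x must have even length; `thresh` plays the role of the wavelet
    coefficient thresholding rule cited from [4].
    """
    a = (x[0::2] + x[1::2]) / np.sqrt(2.0)                 # approximation
    d = (x[0::2] - x[1::2]) / np.sqrt(2.0)                 # detail
    d = np.sign(d) * np.maximum(np.abs(d) - thresh, 0.0)   # soft threshold
    y = np.empty_like(x)
    y[0::2] = (a + d) / np.sqrt(2.0)                       # inverse Haar
    y[1::2] = (a - d) / np.sqrt(2.0)
    return y

def remove_outliers(x, k=3.0):
    """Drop points farther than k standard deviations from the mean."""
    return x[np.abs(x - x.mean()) <= k * x.std()]
```

Shrinking only the detail coefficients reduces the variance of the track while leaving its mean untouched, which matches the behaviour reported in Section 3.1.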

Rather than simply computing the geometric mean of the<br />

wavelet filtered data as a new position estimate, a subsequent,<br />

independent process was applied to the position data output<br />

from the wavelet filter. Noting that the variance of positioning<br />

data, especially in urban canyon, is non-stationary (varies with<br />

time), we reasoned that weighting each datum inversely with<br />

its local variance would tend to suppress the contribution to the mean estimated position from high-velocity data segments. Such data segments can be caused by a satellite rising or falling at the horizon, or changing from line-of-sight to non-line-of-sight multipath (or vice versa), and other biasing effects.<br />
Figure 2: 15 minutes of data collected with the equipment in Fig 1. This is typical of about half of our street-level readings in downtown Toronto.<br />

and spatial. Temporal variance weighting is easily achieved by<br />

computing the variance of short temporal data segments<br />

(windows) and then by inversely weighting the local means of<br />

those temporal windows relative to their local variance.<br />

Spatial variance weighting can be achieved via spatial data<br />

clustering. Over a 15 minute sample in a harsh signal<br />

environment one can observe the spatial non-stationarity of the<br />

position estimate as two or more clusters of points in the<br />

scatter (fig 4). If we use a statistical clustering algorithm, such<br />

as k-means, we would tend to group spatially similar estimates<br />

regardless of whether they are temporally adjacent. The mean<br />

of each such cluster can then be weighted by the inverse of its<br />

variance. k-means is more computationally intensive than a<br />

moving temporal window, but it can be expected to perform<br />

somewhat better. This is because a cluster is unlikely to span a<br />

positioning discontinuity, while a temporal window is more<br />

likely to do so.<br />

3. Experimental Results<br />

3.1. Wavelet filtering<br />

The effect of our wavelet filtering was to always reduce<br />

variance (fig 4) and to often Gaussify a sample (normalize its<br />

data scatter) by reducing both skew and kurtosis (table 1).<br />

[Figure 4: position data scatter before (+) and after (o) wavelet filtering.]<br />


The output of this wavelet filtering is consistent: lower<br />

variance (fig 5) and Gaussified data (fig 6), centered very<br />

nearly at the same geometric mean. However we know that the<br />

mean itself "wanders" over time (fig 3), so that the first<br />

moment still retains a bias effect that we now wish to reduce.<br />

3.2. Windowing<br />

Recognizing that the variance process for these data sets is<br />

non-stationary, we wish to weigh more heavily data segments<br />

that are momentarily stationary (low instantaneous velocity)<br />

and weigh less heavily data segments that exhibit high<br />

instantaneous velocity. While such a process cannot<br />

necessarily discriminate between multipath and non-multipath<br />

contaminated data, it does take advantage of the fact that<br />

ranging signals undergoing rapid change in multipath<br />

circumstances exhibit more high-velocity bursts, hence we can<br />

reduce the impact of these momentary data subsets for a<br />

stationary receiver.<br />

For our temporal windowing process, we experimented with<br />

several window lengths and window overlaps. We report here<br />

using windows of 20 seconds that overlap by 10 seconds. We<br />

then inversely weighted each local mean by the local variance<br />

and computed a new weighted mean for the full sample as:<br />

$\bar{x} = \left( \sum_{i=1}^{N} \frac{x_i}{\sigma_i^2} \right) \Big/ \left( \sum_{i=1}^{N} \frac{1}{\sigma_i^2} \right)$,<br />
where $x_i$ and $\sigma_i^2$ are the mean and variance of the $i$-th window.<br />
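The windowed inverse-variance estimate with the 20 s window and 10 s hop reported above can be sketched as follows; the 1 Hz sampling rate and the variable names are our own assumptions.

```python
import numpy as np

def inverse_variance_mean(x, win=20, hop=10):
    """Inverse-variance weighted mean of sliding-window local means.

    x is one coordinate (e.g. easting) of the wavelet-filtered track,
    sampled at 1 Hz; windows of `win` samples advance by `hop` samples.
    """
    means, weights = [], []
    for start in range(0, len(x) - win + 1, hop):
        seg = x[start:start + win]
        if seg.var() > 0:                    # skip degenerate windows
            means.append(seg.mean())
            weights.append(1.0 / seg.var())
    w = np.asarray(weights)
    return float(np.asarray(means) @ w / w.sum())
```

Quiet (low-velocity) windows receive large weights, so a burst of high-velocity multipath data contributes little to the final position estimate.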

This process has the effect of causing a set of means from a<br />

single location to show reduced scatter. In other words, this<br />

process tends to remove noise from the mean of a 15-minute<br />

data set collected in urban canyon (fig 7).<br />

[Figure 7 plot area: Northing (m) vs. Easting (m).]<br />
Figure 7: Shows the relative shift in final position estimates for all 40 15-minute datasets in our experiment. There is one black and one red ellipse representing the 3σ bounds for each of the four locations, with 10 means each calculated from 900 per-second samples for each location. The black points and ellipses are for the wavelet output and the red are the same for the output after the windowing process.<br />


3.3. k-means clustering<br />

When examining raw GPS data plots, especially the noisier<br />

ones, one often sees areas of two or more clusters of data<br />

connected by sparse, high-velocity segments. We reasoned that<br />

if we could isolate those clusters, calculate local means and<br />

once again weight them by their inverse variance we would see<br />

an even greater improvement in the ability to reduce the spread<br />

in the geometric means, sample-over-sample.<br />

To do this, we applied a k-means clustering algorithm<br />

(k=15), calculated the mean and variance for each of these 15<br />

clusters and computed a weighted mean for the entire dataset,<br />

as before.<br />
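The cluster-then-weight step can be sketched with a plain Lloyd's k-means; for brevity this sketch uses k = 2 and a deterministic spread-out initialization of our own choosing, whereas the paper uses k = 15.

```python
import numpy as np

def kmeans(points, k, iters=50):
    """Plain Lloyd's k-means with deterministic spread-out initialization."""
    centers = points[np.linspace(0, len(points) - 1, k).astype(int)].copy()
    for _ in range(iters):
        # Squared distance from every point to every center
        d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = points[labels == j].mean(axis=0)
    return labels, centers

def cluster_weighted_position(points, k=2):
    """Weight each cluster's mean position by the inverse of its variance."""
    labels, _ = kmeans(points, k)
    est, wsum = np.zeros(points.shape[1]), 0.0
    for j in range(k):
        c = points[labels == j]
        if len(c) < 2:                       # skip near-empty clusters
            continue
        w = 1.0 / c.var(axis=0).sum()
        est += w * c.mean(axis=0)
        wsum += w
    return est / wsum
```

A tight cluster of low-multipath fixes dominates the weighted estimate, while a sparse, high-variance cluster is largely discounted.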

The overall result of this latter approach (fig 8) provides a<br />

further improvement over the windowing approach (fig 7)<br />

reducing the variance in the "wandering means" (fig 3). By<br />

examining the concentric black-red 3σ ellipses one can see a<br />

reduction of 20-35%.<br />

4. Conclusions<br />

Wavelet filtering can be used to reduce variance, skew and<br />

kurtosis in GPS position data collected by a stationary receiver.<br />

Temporal windowing and spatial clustering of that output<br />

can be used to further reduce data biases in urban canyon that<br />

tend to make even aggregated mean estimates "wander" about<br />

their true position.<br />

These experiments, while successful, leave considerable<br />

room for refinement. Future work includes: setting dynamic<br />

thresholds for wavelet filtering, non-linear treatment of the<br />

inverse-weighting for the moving windows, determining k<br />

dynamically for the k-means algorithm, or using fuzzy c-means.<br />

Indeed, the fixed, 15-minute sampling period of this<br />

experiment can also be dynamic allowing greater accuracy<br />

when time/cost permits and more rapid results in locations of<br />

modest multipath.<br />

[Figure 8 plot area: Northing (m) vs. Easting (m).]<br />
Figure 8: Shows the same information as in fig 7, except that the red data and ellipses represent the output after the k-means process. It is as evident in individual results as it is in these summary plots that k-means out-performs the moving window process in our experiment.<br />



Acknowledgements<br />

This work of <strong>Ryerson</strong> <strong>University</strong> Multipath Mitigation<br />

(RUMM) labs was supported by <strong>Ryerson</strong> <strong>University</strong> (Toronto,<br />

Canada) and a grant from the Natural Sciences and<br />

Engineering <strong>Research</strong> Council of Canada (NSERC).<br />

References<br />

[1] M. Aram, S. Krishnan, A. Anpalagan, and B. Grush, "Wavelet<br />

<strong>Analysis</strong> and Data Processing of GPS <strong>Signal</strong>s for High Precision<br />

Position Computation", unpublished.<br />

[2] T. H. D. Dao, H. Kuusniemi, and G. Lachapelle, "HSGPS<br />

Reliability Enhancements Using a Twin-Antenna System",<br />

Proceedings of The European Navigation Conference GNSS<br />

2004, Rotterdam, 17-19 May 2004.<br />


[3] I. Daubechies, "Ten Lectures on Wavelets", CBMS-NSF Regional Conference Series in Applied Mathematics, vol. 61, SIAM,<br />

Philadelphia, 1992.<br />

[4] D. L. Donoho, "De-noising by Soft-Thresholding" IEEE Trans.<br />

on Information Theory, Volume 41, Issue 3, May 1995 p.613.<br />

[5] Y. Feng, "Predictions Using GPS with a Virtual Galileo Constellation<br />

- Future GNSS Performance", GPS World, March 2005<br />

[6] H. Kuusniemi, "User-Level Reliability and Quality Monitoring<br />

in Personal Satellite Navigation", PhD Thesis, Tampere<br />

<strong>University</strong> of Technology, Finland, 2005.<br />

[7] A. Morrison, S. Krishnan, A. Anpalagan, and B. Grush,<br />

"Receiver Autonomous Mitigation of GPS Non Line-of-Sight<br />

Multipath Errors", Institute of Navigation, National Technical<br />

Meeting (ION-NTM) 2006<br />

[8] R. Puri, A. El Kaffas, A. Anpalagan, S. Krishnan, and B. Grush,<br />

"Multipath Mitigation of GNSS Carrier Phase <strong>Signal</strong>s for an On-<br />

Board Unit for Mobility Pricing," CCECE, 2004.<br />

Authorized licensed use limited to: <strong>Ryerson</strong> <strong>University</strong> Library. Downloaded on July 7, 2009 at 10:57 from IEEE Xplore. Restrictions apply.


KEYSTROKE IDENTIFICATION BASED ON GAUSSIAN MIXTURE MODELS<br />

Danoush Hosseinzadeh, Sridhar Krishnan, April Khademi<br />

Department of Electrical and Computer Engineering<br />

<strong>Ryerson</strong> <strong>University</strong>, Toronto, ON - M5B 2K3 Canada<br />

Email: (danoushh@hotmail.com) (krishnan@ee.ryerson.ca) (akhademi@ieee.org)<br />

ABSTRACT<br />

Many computer systems rely on the username and password<br />

model to authenticate users. This method is widely used, yet it<br />

can be highly insecure if a user’s login information has been<br />

compromised. To increase security, some authors have proposed<br />

keystroke patterns as a biometric tool for user authentication;<br />

they can be used to recognize users based on how<br />

they type. This paper introduces a novel method that applies<br />

GMMs to keystroke identification. The major benefit of this<br />

method is the ability to update the user’s model each time he<br />

or she is authenticated. Therefore, as time goes on, each user<br />

model accurately reflects the changes in that user’s keystroke<br />

pattern. Using this method, an FAR and an FRR of approximately 2% were achieved. However, it should be noted that 50% of the test subjects were traditional "two finger" typists and, therefore, this had a disproportionately negative impact on the results.<br />

1. INTRODUCTION<br />

Undeniably, computers have become an essential part of daily<br />

life for many people around the world. One of the main reasons<br />

for this trend is that computers allow us to access information<br />

from any part of the globe. Additionally, they allow<br />

us to perform many functions that would otherwise require a<br />

physical presence else where, such as banking, shopping and<br />

some personal tasks such as online chatting and so on.<br />

Despite their importance, computer systems are generally<br />

protected with primitive techniques, such as usernames and<br />

passwords. Since passwords can be stolen, accidentally revealed<br />

or even cracked by dictionary programs, there have been<br />

a great number of electronic crimes in recent years. In fact,<br />

reports indicate that in 2002, online retailers lost an estimated<br />

US$1.64 billion in fraudulent sales and an additional<br />

US$1.82 billion in legitimate sales that looked suspicious [1].<br />

To prevent crime and increase security, access should only<br />

be given to the correct users. To achieve this goal, some authors<br />

have suggested the use of keystroke identification as a<br />

method of preventing unauthorized users from accessing a<br />

computer system [2][3][4][5]. Keystroke identification is a<br />

biometric tool based on the principle that every person has<br />

a unique typing pattern, similar to a handwritten signature [2][5]. In particular, for regularly typed strings, this pattern<br />

can be very consistent and therefore, it can be effective for<br />

user identification. Furthermore, we argue that a person’s<br />

keystroke pattern would be harder to duplicate than a signature<br />

because an intruder does not have an unlimited number of<br />

tries to practice it. In a commercial system, a user who cannot<br />

successfully log in after a predetermined number of attempts could be locked out of the system, thereby limiting the<br />

intruder’s practice time. Studies have also shown that even<br />

among professional typists there is a great deal of variability<br />

in the keystroke patterns [6]. This makes user forgery very<br />

difficult.<br />

By exploiting these keystroke patterns, we can add an additional<br />

layer of security to the username/password model.<br />

Even if authorized persons reveal their passwords, no unauthorized<br />

user can gain access to the system. This idea has<br />

many internet-based applications, especially for online banking,<br />

email and user account protection, just to name a few. In<br />

fact, we can completely change the username/password security<br />

model to a model which only relies on keystroke patterns.<br />

Aside from increased security, this model would benefit users<br />

because they will not have to remember many different username/password pairs for different accounts. Also, the possibility<br />

of a user forgetting their password or a user having a<br />

password that is easy to decipher would be reduced.<br />

In this paper, a brief review will be presented on what features<br />

could be extracted from keystroke patterns and under<br />

what conditions good features can be acquired. Also, a new<br />

method for modeling these features based on Gaussian Mixture<br />

Models (GMMs) is proposed. For completeness, a brief<br />

review of GMMs is presented before describing the novel<br />

algorithm used. Lastly, the results and conclusions are presented.<br />

2. KEYSTROKE FEATURES<br />

2.1. Features From Keystrokes<br />

It has been shown that for a given user at least two unique features<br />

can be extracted from keystroke patterns [6]. Keystroke<br />

patterns, which are produced by the user during typing, exhibit<br />

unique timing characteristics. One such characteristic is<br />

the keystroke latencies (KL), which is the time between striking two consecutive keys.<br />

1­4244­0469­X/06/$20.00 ©2006 IEEE. ICASSP 2006.<br />

Another characteristic (feature) is<br />

the key down time (KD), which is the time a particular key<br />

is held down. These features have been used in previous research<br />

to produce good results in user identification.<br />

For a string of length N, there are N − 1 KL data points<br />

and N KD data points. These data points can be used to create<br />

two feature vectors. Fig. 1 shows the KL and KD plot<br />

for a particular user (one of the authors) that has typed his<br />

name repeatedly. Fig. 1 is included to illustrate the stability<br />

and strong correlation that exists between each of the feature<br />

vectors, KD and KL.<br />

[Figure: plots of the KL feature vector (top) and KD feature vector (bottom) for repeated typings of "danoush_hosseinzadeh".]<br />

Fig. 1. Several plots of the keystroke latency (KL) and key down time (KD) feature vectors for one user. The bold line is the average of the vectors. The space character is represented by "_".<br />
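As a concrete illustration of the two features, the KL and KD vectors can be computed from timestamped key events. The sketch below is not from the paper; it assumes each keystroke is logged as a (key, press, release) tuple in milliseconds, and the names are illustrative.<br />

```python
def keystroke_features(events):
    """Compute KD and KL feature vectors from keystroke events.

    events: list of (key, press_ms, release_ms) tuples, in typing order.
    Returns (kd, kl): N key-down times and N - 1 keystroke latencies,
    matching the counts stated in the text for a string of length N.
    """
    # KD: how long each key is held down.
    kd = [release - press for _, press, release in events]
    # KL: time between striking two consecutive keys (press-to-press).
    kl = [events[i + 1][1] - events[i][1] for i in range(len(events) - 1)]
    return kd, kl

# Toy example: three keys struck 150 ms apart, each held for 100 ms.
events = [("d", 0, 100), ("a", 150, 250), ("n", 300, 400)]
kd, kl = keystroke_features(events)
# kd == [100, 100, 100], kl == [150, 150]
```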

2.2. Designing Good Features<br />

For keystroke identification, a robust feature pattern is one<br />

that is stable over repeated trials. To produce a stable feature<br />

pattern, the typist should be able to type the given text<br />

without any hesitation. Strings that require the typist to stop<br />

and think about the next letter or cause them to pause and<br />

search for a certain key, will result in an unstable pattern. As<br />

mentioned before, research has shown that the best results are<br />

obtained when users type familiar text, such as their first and<br />

last names. Such strings are intuitively easy to type because<br />

they have been used for many years. Therefore, a distinct pattern<br />

can be seen when users type their name.<br />

Another important consideration when selecting appropriate<br />

text, is the number of characters. Shorter text tends to<br />

increase classification error because it can be more easily reproduced by others [5]. This is true because a smaller number of characters produces a less complex pattern that can be imitated<br />

more easily by imposters. The same problem exists with hand<br />

written signatures, where short and simple signatures are often<br />

easy to copy.<br />

In previous work, it has been suggested that no less than<br />

ten characters should be used for keystroke identification [5].<br />

In this work, the user is required to enter at least ten characters,<br />

which can be easily accomplished with the first and last<br />

name of the individual. At the same time, no additional effort was made to increase the minimum character<br />

length, because it might be difficult or annoying for some<br />

users to meet the requirement. This would also pose a strict<br />

requirement if the user’s full name does not meet the minimum<br />

character requirement, or if the user chooses a different<br />

string. These factors could have a negative impact on false<br />

acceptance rates (FAR) and false rejection rates (FRR).<br />

2.3. Data Acquisition Model<br />

To collect timing information, a data acquisition application<br />

named 'KbApp' was designed for the Windows operating system. With this application, keystroke timing error was minimized to less than 0.5 milliseconds, with the option of reducing it to 100 nanoseconds. However, this error does not have a significant impact on the results because the average feature has a time value on the order of 100 milliseconds.<br />
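The timing bookkeeping behind such a logger can be sketched as follows. KbApp itself is a Windows application whose internals are not described in the paper, so the class name and hook methods below are purely illustrative; only the use of a nanosecond-resolution clock reflects the text.<br />

```python
import time


class KeyTimer:
    """Minimal keystroke timing logger sketch. A real event source
    (e.g. an OS keyboard hook) would call key_down/key_up.
    """

    def __init__(self):
        self.events = []   # completed (key, press_ms, release_ms) tuples
        self._down = {}    # keys currently held, mapped to press time

    def _now_ms(self):
        # Nanosecond clock converted to milliseconds, comfortably below
        # the ~0.5 ms error budget reported for the KbApp logger.
        return time.perf_counter_ns() / 1e6

    def key_down(self, key):
        self._down[key] = self._now_ms()

    def key_up(self, key):
        self.events.append((key, self._down.pop(key), self._now_ms()))
```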

2.4. Review of GMMs<br />

GMMs are a well known method for modeling the probability<br />

distribution of random events. With a weighted sum of several L-dimensional Gaussian functions, it is possible to closely approximate any distribution, provided that enough training data is<br />

available. The complete GMM can be expressed by the mean<br />

vector µi, covariance matrix Σi and mixture weights wi as<br />

given below:<br />

λ = {wi, µi, Σi}, i = 1, ..., K. (1)<br />

Using the model λ, we can obtain the likelihood that x belongs to the model λ by<br />

p(x|λ) = Σ_{i=1}^{K} wi bi(x), (2)<br />

where bi is given by an L-dimensional Gaussian PDF as shown below:<br />

bi(x) = (2π)^{−L/2} |Σi|^{−1/2} exp( −(1/2) (x − µi)^t Σi^{−1} (x − µi) ). (3)<br />

GMMs can be very effective in modeling the type of distributions<br />

found in keystroke patterns, which are shown in Fig. 1.<br />

To verify the likelihood that a given feature vector x belongs<br />

to a model λ, the natural logarithm of the associated<br />

probability is used. This value, which we call the Log-Likelihood (LL), is given below:<br />

LL = log{p(x|λ)} = log( Σ_{i=1}^{K} wi bi(x) ). (4)<br />
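Equations (2)-(4) can be illustrated with a minimal univariate sketch. The paper uses full L-dimensional Gaussians; the 1-D simplification below is an assumption made for brevity, and the function name is illustrative.<br />

```python
import math


def gmm_log_likelihood(x, weights, means, variances):
    """Log-likelihood log p(x|λ) of a 1-D GMM λ = {w_i, µ_i, σ²_i}, i = 1..K.

    A univariate special case of Eqs. (2)-(4); the paper's full-covariance
    form follows the same pattern with matrix algebra.
    """
    p = 0.0
    for w, mu, var in zip(weights, means, variances):
        # b_i(x): Gaussian density of component i, as in Eq. (3).
        b = math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2 * math.pi * var)
        p += w * b  # mixture: p(x|λ) = Σ w_i b_i(x), Eq. (2)
    return math.log(p)


# Single standard-normal component evaluated at its mean.
ll = gmm_log_likelihood(0.0, [1.0], [0.0], [1.0])
```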


3. A NOVEL KEYSTROKE IDENTIFICATION<br />

METHOD<br />

The novel method proposed in this paper uses GMMs to model<br />

keystroke timing information and uses the log-likelihood measure<br />

to authenticate the user based on a threshold.<br />

3.1. GMM Training and Verification<br />

To produce a GMM, the user is first required to enroll into the<br />

system by typing their full name ten times. These ten samples<br />

produce twenty feature vectors; ten KL vectors and ten<br />

KD vectors. From these two sets of ten sample vectors, two<br />

GMMs can be trained, one for the KD feature and one for the<br />

KL feature. The expectation maximization (EM) algorithm<br />

was used to train the GMMs.<br />

Upon verification, the user is required to re-enter their full<br />

name. From this test vector, the KL and KD feature vectors<br />

are extracted and compared with the user’s model, which is<br />

obtained from the enrolment session. Equation (4) is used to calculate the log-likelihood that the test vector x belongs to the given model. This result is then compared with the user's<br />

threshold before access is granted or denied.<br />

The results show the statistics for the system when access<br />

is based on using the KD feature, the KL feature and a combination<br />

of KL and KD features. In the latter case, the test<br />

vector is compared with both user models (KL model and KD<br />

model) before access is granted. Also, each time the user is<br />

authenticated successfully, both GMM models and thresholds<br />

are updated with the new information.<br />
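The combined KL & KD decision described above can be sketched as follows. The function names and the convention that a log-likelihood at least as large as the stored threshold grants access are illustrative assumptions, with loglik_fn standing in for Equation (4).<br />

```python
def verify(test_kl, test_kd, model_kl, model_kd, thr_kl, thr_kd, loglik_fn):
    """Combined KL & KD check: the test sample must pass both per-feature
    thresholds for access to be granted (the strictest of the three
    schemes compared in the paper).
    """
    ok_kl = loglik_fn(test_kl, model_kl) >= thr_kl
    ok_kd = loglik_fn(test_kd, model_kd) >= thr_kd
    return ok_kl and ok_kd
```

With a toy similarity score in place of the GMM log-likelihood, `verify(1.0, 1.0, 0.0, 0.0, -2.0, -2.0, lambda x, m: -abs(x - m))` grants access, while a test value far from the model is rejected.<br />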

3.2. Calculating Model Thresholds<br />

To obtain the user’s threshold, the Leave-One-Out-Method<br />

(LOOM) is used. The LOOM is as follows: for N feature<br />

vectors, N − 1 vectors are used to train the model and the<br />

last vector is used to test the likelihood that it belongs to that<br />

model, using Equation 4. This test can be performed N times,<br />

where each time a different vector is used to test the model.<br />

The LOOM produces N likelihood measures<br />

and can be expressed by<br />

LLj = log{p(xj|λ)}, j = 1, 2, ..., N (5)<br />

where λ is a GMM that has been trained with N − 1 vectors<br />

not including the jth vector and xj is the test vector.<br />

These N log-likelihood results are further processed before<br />

selecting the model's threshold. From these likelihood<br />

values, the minimum value that falls within the range of three<br />

standard deviations away from the mean is set as the model<br />

threshold, as given below:<br />

Threshold = min_{j} { LLj : |LLj − mean(LL)| < 3σ }, (6)<br />

where mean(LL) is the mean and σ is the standard deviation of the LL values obtained from the leave-one-out method.<br />
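The LOOM threshold computation of Equations (5)-(6) can be sketched as below. Here train_fn and loglik_fn are placeholders for the paper's EM-trained GMM and Equation (4); they are not part of the original text.<br />

```python
import statistics


def loom_threshold(vectors, train_fn, loglik_fn):
    """Leave-one-out threshold per Eqs. (5)-(6).

    vectors: the N enrolment feature vectors.
    train_fn: trains a model from N - 1 vectors (e.g. EM for a GMM).
    loglik_fn: log-likelihood of one vector under a model.
    """
    lls = []
    for j in range(len(vectors)):
        # Train on everything except vector j, then score vector j: Eq. (5).
        model = train_fn(vectors[:j] + vectors[j + 1:])
        lls.append(loglik_fn(vectors[j], model))
    mean, sigma = statistics.mean(lls), statistics.pstdev(lls)
    # Smallest LL still within three standard deviations of the mean: Eq. (6).
    return min(ll for ll in lls if abs(ll - mean) < 3 * sigma)
```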

The model generation and threshold calculation procedures<br />

are repeated every time the user has been verified so that<br />

the model and threshold are adaptive and can change with the<br />

user over time.<br />

3.3. Authenticating The User<br />

User authentication is the main goal of this work. To achieve<br />

this, the keystroke model should be robust enough to produce<br />

a low false rejection rate (FRR) and a low false acceptance<br />

rate (FAR). FAR is the rate at which intruders can gain access<br />

to a valid user’s account, and FRR is the rate at which valid<br />

users are denied access to their own account. Obviously, both<br />

these measures should be as low as possible.<br />

In this approach, authentication is performed in two stages.<br />

If the user is denied access in the first stage, they are given a second chance to enter their name. Using this method, a significant improvement was seen in the FRR, as discussed in the results section.<br />

4. EXPERIMENTAL RESULTS<br />

Before presenting the results, the reader is reminded that the<br />

number of initial training vectors used to calculate the model<br />

thresholds was ten. Because it is desired to have an accurate<br />

threshold based on the training vectors, the LOOM was used,<br />

as described in Section 3.2. It has been shown that the LOOM<br />

provides the least biased estimate for small databases [7].<br />

Therefore, the model thresholds used to authenticate the users<br />

are optimal given the size of the database.<br />

The results for FRRs and FARs for three different cases<br />

are presented in Table 1. It should be noted by the reader<br />

that the algorithm should also function well in terms of FRR<br />

and FAR, over time. The main reason for this behavior is that<br />

the proposed method adaptively selects the threshold that best<br />

suits the individual user, based on the LOOM. Also, the algorithm<br />

has shown that using a two-stage verification process (i.e., the user is given two chances for authentication) decreases the FRR significantly.<br />

To perform imposter tests, two typists were chosen to observe<br />

and imitate the other users’ typing pattern. The results<br />

indicate an average FAR and FRR of about 2% using both<br />

features. These figures are comparable to other techniques<br />

however, a direct comparison with other methods cannot be<br />

justified because in each experiment a different database has<br />

been used. In our database, four out of the eight typists were traditional "two-finger" typists. We believe this led to<br />

poor performance in both the FAR and the FRR because these<br />

types of users do not produce a very stable keystroke pattern<br />

and at the same time can be copied easily. Therefore, because<br />

their finger patterns can be easily seen and imitated by the<br />

imposter users, the FAR results presented here are skewed. In<br />

terms of FRR, these users also do not perform well because<br />

they have a lot of variation in their typing pattern. In fact,<br />



Table 1. Experimental Results for FRR and FAR<br />

KL Feature KD Feature KL & KD Features<br />

User FRR(%) FAR(%) FRR(%) FAR(%) FRR(%) FAR(%)<br />

1 0 0 0 0 0 0<br />

2 0 0 0 0 0 0<br />

3 5.3 14.3 0 14.3 5.3 7.1<br />

4 0 9.5 0 0 0 0<br />

5 0 0 0 0 0 0<br />

6 5.6 0 5.6 0 8.3 0<br />

7 0 50 0 10 0 10<br />

8 0 0 5.9 20 5.9 0<br />

Average(%) 1.4 9.2 1.4 5.5 2.4 2.1<br />

more users should be enrolled before the performance can be<br />

fully evaluated.<br />

This experiment obtained two features from the keystroke<br />

data and performed three similarity tests. The combination<br />

of the KL and KD features should produce a lower FAR and<br />

higher FRR compared to using either of the features individually.<br />

This is due to the fact that a user must correctly produce<br />

both features simultaneously. These trends were observed in<br />

the results, as can be seen from Table 1. A major benefit of<br />

this method over existing techniques is the ability to update<br />

the user model each time he or she is successfully authenticated.<br />

Therefore, as time goes on, each user’s model accurately<br />

reflects the changes in that person’s keystroke pattern.<br />

5. CONCLUSIONS<br />

A novel method for authenticating computer users based on<br />

keystroke identification was presented. Upon verification, the<br />

keystroke latencies and key hold-down times for the user’s<br />

keyboard inputs were recorded and compared with a predefined<br />

individualistic model. Access was granted if the user’s<br />

input reached a certain threshold. A new method for calculating<br />

the model threshold was also introduced using the LOOM<br />

and log-likelihood of the feature vectors.<br />

Ideally the FAR and the FRR should be very small with<br />

more emphasis given to the former because a security breach is<br />

more critical than a valid user being forced to re-authenticate.<br />

Based on this logic, the best results were obtained using both<br />

the KL and KD features simultaneously, which produced a<br />

FRR of 2.4% and a FAR of 2.1%.<br />

Despite the fact that these results are based on a small<br />

database, it has been shown by this work that GMMs can be<br />

used effectively to identify users based on their keystroke patterns.<br />

Furthermore, despite the fact that 100% classification<br />

accuracy was not achieved, more users should be enrolled<br />

using this approach before a definitive answer can be given<br />

about the capability of the system. As mentioned earlier, the<br />

results presented are skewed because of the type of users enrolled<br />

(50% of the users were two-finger typists). This technique<br />

could be further improved to incorporate the varied nature of<br />

the different typists.<br />

GMMs may be used with other metrics to improve both<br />

the FAR and FRR, or the threshold procedure can be modified<br />

to produce more accurate results. In future work, we<br />

intend to investigate these areas with a larger database.<br />

6. REFERENCES<br />

[1] Alen Peacock, Xian Ke, and Matthew Wilkerson, “Typing<br />

patterns: A key to user identification,” IEEE Security<br />

& Privacy Magazine, vol. 2, no. 5, pp. 40–47, Oct. 2004.<br />

[2] Rick Joyce and Gopal Gupta, “Identity authentication<br />

based on keystroke latencies,” Communications of the<br />

ACM, vol. 33, no. 2, pp. 168–176, February 1990.<br />

[3] Oscar Coltell, Jose M. Dabia, and Guillermo Torres,<br />

“Biometric identification system based on keyboard filtering,”<br />

in Proc. IEEE 33rd Int. Carnahan Conf. on<br />

Security Technology, Madrid, Oct. 1999, pp. 203–209.<br />

[4] Saleh Bleha, Charles Slivinsky, and Bassam Hussien,<br />

“Computer-access security systems using keystroke dynamics,”<br />

Pattern <strong>Analysis</strong> and Machine Intelligence,<br />

IEEE Transactions on, vol. 12, no. 12, pp. 1217–1222,<br />

December 1990.<br />

[5] Livia C. F. Araujo, Luiz H. R. Sucupira Jr., Miguel G.<br />

Lizarraga, Lee L. Ling, and Joao B. T. Yabu-Uti, “User<br />

authentication through typing biometrics features,” <strong>Signal</strong><br />

Processing, IEEE Transactions on, vol. 53, no. 2, pp.<br />

851–855, Feb. 2005.<br />

[6] R. Gaines, W. Lisowski, S. Press, and N. Shapiro, “Authentication<br />

by keystroke timing: Some preliminary results,”<br />

Tech. Rep. R-256-NSF, Rand Corporation, Santa<br />

Monica, CA, USA, May 1980.<br />

[7] Keinosuke Fukunaga, Introduction to Statistical Pattern<br />

Recognition (2nd ed.), Academic Press Professional Inc.,<br />

San Diego, CA, USA, 1990.<br />



SOCCER VIDEO RETRIEVAL USING ADAPTIVE TIME-FREQUENCY METHODS<br />

Jonathan Marchal*, Cornel Ioana*, Emanuel Radoi*, André Quinquis*, Sridhar Krishnan**<br />

* : ENSIETA, E3I2 Laboratory, 2 rue François Verny, Brest - FRANCE<br />

E-mails : marchajo@ensieta.fr, ioanaco@ensieta.fr, radoiem@ensieta.fr, quinquis@ensieta.fr<br />

** : Dept. Electrical and Computer Engineering, <strong>Ryerson</strong> <strong>University</strong>, Toronto – CANADA<br />

E-mail : krishnan@ee.ryerson.ca<br />

ABSTRACT<br />

The retrieval of soccer highlights is a suitable technique<br />

for video indexing, required by the multimedia database<br />

management or for the development of television on<br />

demand. For these purposes, it would be useful to have an automatic annotation of the events that happen in soccer games. One solution consists of analyzing the audio soundtrack associated with the soccer video to detect the interesting frames.<br />

In this paper we use the adaptive time-frequency<br />

decomposition of the soundtrack as a feature extraction<br />

procedure. This decomposition is based on the Matching<br />

Pursuit concept and a dictionary composed of Gabor<br />

functions. The parameters provided by these<br />

transformations constitute the input of the classification<br />

stage. The results obtained on real soccer video demonstrate the efficiency of the adaptive time-frequency representation<br />

as a feature extraction stage.<br />

1. INTRODUCTION<br />

Soccer video highlights retrieval is not only a subject of<br />

research but also a need considering the huge amount of<br />

data that we can find on the internet. Most of the video<br />

archives are not indexed, and an automatic parsing approach is a marketable proposition. The development of television on demand also creates this need for video indexing. Viewers could<br />

have access to the information they need, without having to<br />

watch hours and hours of videos.<br />

Several methods have been proposed, based on the<br />

information provided by video frames, such as camera<br />

motion, court lines detection, motion vectors, location and<br />

movements of the players as in [1], others on audio/video<br />

features extraction [2] or only on audio features, for instance<br />

dominant and excited speech [3].<br />

In this paper, we propose a method based on audio<br />

feature extraction. There is typically a tremendous amount<br />

of crowd activity, which differs depending on the type of<br />

highlight in a game. For instance, when a goal is scored, the<br />

crowd cheering increases progressively before it and continues for a few seconds after. For a penalty or free-kick<br />

goal, the crowd cheers are sudden, whereas when a goal is<br />

missed, crowd cheers begin and stop soon after. Finally,<br />

during a normal game sequence, crowd activity is usually<br />

not particularly intense. Considering these observations, we<br />

assert that if the human ear is able to distinguish the crowd reaction, signal processing tools should be able to do so as well.<br />

The idea behind this work is to use an adaptive time<br />

frequency decomposition (ATFD) [4,5] on the audio<br />

soundtrack of the sequences as a starting point for the<br />

feature extraction and classification.<br />

The paper is organized as follows. In Section 2 a brief<br />

presentation of the adaptive time-frequency decomposition concept is given. The classification of the soccer sequences,<br />

based on the parameters provided by the ATFD, is described<br />

in Section 3. The efficiency of the proposed method is<br />

analyzed through the results in Section 4. We conclude<br />

our discussions in Section 5.<br />

2. ADAPTIVE TIME-FREQUENCY TECHNIQUES<br />

Most of the natural signals are non-stationary. Since<br />

their structure is generally complex, transformations into more intuitive representation spaces are usually well<br />

suited. Linear expansions in a single parameter basis,<br />

whether it is a Fourier, wavelet, or any other basis are not<br />

flexible enough. A Fourier basis provides a poor<br />

representation of functions well localized in time, and<br />

wavelet bases are not well adapted to represent functions<br />

whose Fourier transforms have a narrow high frequency<br />

support. Thus, a flexible decomposition technique can be<br />

considered for representing signal components whose<br />

localization in time and frequency vary widely.<br />

Matching pursuit (MP), introduced in [4], is a technique<br />

that decomposes a signal into a linear expansion<br />

of waveforms that belong to a redundant dictionary of<br />

functions. These waveforms, selected in order to best match<br />

the signal structure are selected among a dictionary of timefrequency<br />

atoms. The aim of the algorithm is to obtain a<br />

parsimonious description in order to estimate the original<br />

signal with as fewer coefficients as possible. Generally,<br />

considering a signal x and a dictionary D = { g_γ : γ ∈ Γ, ||g_γ|| = 1 }, the signal decomposition is expressed as<br />

x = Σ_{γ∈Γ} λ_γ g_γ, (1)<br />

where the decomposition coefficients λ_γ are obtained by the inner product between the signal x and the function g_γ: λ_γ = ⟨x, g_γ⟩.<br />

The MP builds up the signal decomposition one element<br />

at a time, picking up the most energy dominant component<br />

first. The MP begins by projecting the signal x on a function<br />

gγD 0<br />

∈ and computes the residue Rx = x − x, gγ g .<br />

0 γ0<br />

Thus, the Rx is orthogonal to gγ . The MP algorithm<br />

0<br />

chooses gγ∈ D such that xg , is maximum:<br />

γ<br />

0<br />

γ<br />

γ α 0<br />

γ<br />

γ ∈Γ<br />

γ<br />

0<br />

xg , ≥ sup xg , , (2)<br />

where α ∈ ( 0,1]<br />

is an optimal factor. The MP iterates this<br />

procedure by decomposing the residue. If we suppose the m-th order residue R^m x has been computed, the next iteration chooses g_γm ∈ D such that<br />

|⟨R^m x, g_γm⟩| ≥ α sup_{γ∈Γ} |⟨R^m x, g_γ⟩|, (3)<br />

and deduces R^{m+1} x by<br />

R^{m+1} x = R^m x − ⟨R^m x, g_γm⟩ g_γm. (4)<br />

Summing for m between 0 and M−1 yields<br />

x = Σ_{m=0}^{M−1} ⟨R^m x, g_γm⟩ g_γm + R^M x = Σ_{m=0}^{M−1} am g_γm + R^M x, (5)<br />

where am (m = 0, ..., M−1) are the decomposition coefficients.<br />
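The greedy selection loop of Equations (2)-(5) can be sketched over a finite dictionary. This illustrative version fixes α = 1 and works on plain Python lists of unit-norm atoms rather than the Gabor dictionary used in the paper.<br />

```python
def matching_pursuit(x, dictionary, n_iter):
    """Greedy MP sketch (Eqs. (2)-(5)) over a finite list of unit-norm
    atoms. Returns the coefficients a_m and chosen atom indices.
    """
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))

    residue = list(x)
    coeffs, chosen = [], []
    for _ in range(n_iter):
        # Pick the atom most correlated with the current residue: Eqs. (2)-(3).
        idx = max(range(len(dictionary)),
                  key=lambda k: abs(dot(residue, dictionary[k])))
        a = dot(residue, dictionary[idx])
        # Subtract the projection to form the next residue: Eq. (4).
        residue = [r - a * g for r, g in zip(residue, dictionary[idx])]
        coeffs.append(a)
        chosen.append(idx)
    return coeffs, chosen


# Orthonormal toy dictionary: MP recovers the exact expansion in 2 steps,
# picking atom 1 first (|4| > |3|), then atom 0.
D = [[1.0, 0.0], [0.0, 1.0]]
coeffs, chosen = matching_pursuit([3.0, 4.0], D, 2)
```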

There are two major factors which guarantee the success of an MP algorithm. The first one is the choice of the stopping criterion. Since the MP is an iterative decomposition,<br />

establishing the number of iterations, M, is very important<br />

for the considered application. One of the most used criteria<br />

is the choice of M such that the residual energy is smaller than a fraction ε of the signal energy:<br />

||R^{M+1} x||^2 ≤ ε ||x||^2, with M minimal. (6)<br />
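Criterion (6) amounts to picking the first iteration whose residual energy drops below a fraction ε of the signal energy; a minimal sketch (with illustrative names) is:<br />

```python
def stopping_order(x_energy, residue_energies, eps):
    """Smallest m with ||R^m x||^2 <= eps * ||x||^2, per criterion (6).

    residue_energies[m] holds ||R^m x||^2 as produced by the MP iterations
    (residue_energies[0] is the energy of the signal itself).
    """
    for m, e in enumerate(residue_energies):
        if e <= eps * x_energy:
            return m  # first residue small enough
    return len(residue_energies)
```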

This criterion is not well adapted when the signal-to-noise ratio (SNR) is relatively small [5]. In this case, the signal energy contains the noise contribution and, consequently, the correct choice of M is almost impossible.<br />

However, since in our application the signals of interest<br />

are the soccer soundtracks, we can assume that the noise is<br />

relatively small and, more importantly, its level and<br />

properties are almost the same for all signals. In these<br />

V ­ 510<br />

56<br />

conditions, we can use criterion (6), whose ratio ε is set empirically.<br />

The second factor which guarantees the efficiency of the MP is the choice of the elementary functions g_γ. Intuitively,<br />

the parameters of these functions should ensure a good<br />

matching on the signal’s time-frequency structures. A<br />

common choice is to design a function with as many<br />

degrees of freedom (i.e., control parameters) as possible. On<br />

the other hand, the time-frequency resolution is another<br />

property of interest, especially for feature extraction<br />

applications. According to these requirements, we consider<br />

for our application an elementary function defined as<br />

g_γm(t) = (1/√sm) g((t − um)/sm) e^{j(2π fm t + φm)}. (7)<br />

These atoms are obtained by dilation (sm), modulation (fm) and translation (um) of the Gaussian window<br />

g(t) = 2^{1/4} e^{−π t^2}. (8)<br />

The fourth parameter, φm, stands for the initial phase. According to this definition, inspired from [4], the atoms (called Gabor functions) are characterized by four parameters: um, sm, fm and φm. This type of elementary function makes it possible to define an adaptive time-frequency tiling, unlike the Gabor or wavelet transforms (Fig. 1).<br />
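A sampled version of the atom in Equations (7)-(8) can be generated as follows. The renormalization to unit norm after sampling is an implementation choice, not something stated in the paper, and the function name is illustrative.<br />

```python
import cmath
import math


def gabor_atom(n_samples, s, f, u, phi, fs=1.0):
    """Sampled Gabor atom per Eqs. (7)-(8): a Gaussian window dilated by s,
    translated to u, and modulated to frequency f (Hz at sample rate fs),
    with initial phase phi.
    """
    atom = []
    for n in range(n_samples):
        t = n / fs
        # Gaussian window of Eq. (8), evaluated at (t - u)/s per Eq. (7).
        g = (2 ** 0.25) * math.exp(-math.pi * ((t - u) / s) ** 2)
        atom.append((g / math.sqrt(s)) * cmath.exp(1j * (2 * math.pi * f * t + phi)))
    # Renormalize so the sampled atom has unit norm, as the dictionary requires.
    norm = math.sqrt(sum(abs(a) ** 2 for a in atom))
    return [a / norm for a in atom]
```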

Fig. 1. T-F tiling: MP versus wavelet decompositions<br />

The freedom to choose an arbitrary time-frequency tiling constitutes the main property of interest for characterization purposes. It will be used in the next section for the<br />

separation of soccer events based on the MP analysis of the<br />

corresponding soundtracks.<br />

3. CLASSIFICATION OF SOCCER HIGHLIGHTS<br />

Most of the time, what interests a soccer viewer are the highlights: goals, of course, but also missed goals and scored free-kicks and penalties. Hence, we have chosen sequences of these three types, adding a<br />

"normal game" class, which is relevant in order to<br />

differentiate an interesting sequence (from the 3 classes<br />

above) from an "uninteresting" one, in terms of highlight<br />

retrieval. This defines 4 classes, as illustrated in Fig. 2:<br />

goals, missed goals, penalties/free-kicks, normal game.<br />



Fig. 2. Video sequences isolated for soccer retrieval experiments<br />

Once the classes had been rigorously defined, the sequences were grabbed from the Internet and from TV recordings. We retained a duration of 5 seconds to analyse<br />

each highlight video sequence. Indeed, when watching these<br />

sequences, we can note that usually the crowd cheering grows during the 2 seconds before the ball crosses the goal line and continues until 3 seconds after, in most cases.<br />

All the audio soundtracks of the video sequences have been<br />

extracted in a mono, 8-bit, 44.1 kHz format, which corresponds to 220500 samples per sequence. The scheme for the feature<br />

extraction and classification of the audio soccer sequences<br />

is shown in Fig. 3 and proceeds as follows:<br />

Soccer sequences → Soundtrack extraction → MP decomposition → Dimensionality reduction → Classification<br />

Fig. 3. Scheme of the soccer event classification<br />

The first step is the extraction of the soccer sequence<br />

soundtrack. The extracted signals are inputs of the MP<br />

decomposition. Knowing that we consider N=220500<br />

samples per sequence, we set up the parameters of the elementary function dictionary as follows:<br />

- the time parameter, un, ranges from 0 to 220499;<br />

- the scale parameter, sn, ranges from 2 to ⌊log2(N)⌋ (17 in our case);<br />

- the frequency parameter, fn, ranges from 0 to 22050 Hz<br />

(half the sampling frequency). The number of frequency parameters is given by the required spectral resolution. In our application we consider 8192 values, which corresponds to a spectral resolution of 2.65 Hz.<br />

With the parameters ranging over the intervals given previously, the experimental results obtained on real data show that the stopping criterion (6) requires fewer than 2200 iterations. For this reason we limit the number of iterations to this value. The decomposition parameters are organized in a matrix structure as indicated in (9).<br />

$$\begin{bmatrix} E_1 & s_1 & f_1 & t_1 & \varphi_1 \\ E_2 & s_2 & f_2 & t_2 & \varphi_2 \\ \vdots & \vdots & \vdots & \vdots & \vdots \end{bmatrix} \quad (9)$$<br />
where each row corresponds to one iteration index and the columns hold the energy, octave (scale), frequency, time and phase parameters.<br />
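The greedy decomposition that fills this matrix can be sketched generically: at each iteration the atom best correlated with the residual is picked and subtracted, until a residual-energy rule (standing in for the paper's criterion (6)) or the iteration cap fires. This is our own illustration, not the LastWave MP routines used in the paper.<br />

```python
import numpy as np

def matching_pursuit(x, D, max_iter=2200, tol=1e-3):
    """Greedy MP over a dictionary D whose columns are unit-norm atoms."""
    residual = x.astype(float)
    rows = []                                  # one matrix row per iteration
    for _ in range(max_iter):
        corr = D.T @ residual                  # correlate atoms with residual
        k = int(np.argmax(np.abs(corr)))       # best-matching atom
        a = corr[k]                            # expansion coefficient
        residual = residual - a * D[:, k]
        rows.append((a * a, k))                # (energy, atom index)
        if np.linalg.norm(residual) < tol * np.linalg.norm(x):
            break                              # relative-residual stop rule
    return np.array(rows), residual

# toy check: orthonormal dictionary, two-atom signal -> recovered in 2 steps
D = np.eye(8)
x = 3.0 * D[:, 2] + 1.0 * D[:, 5]
rows, resid = matching_pursuit(x, D)
```

On this toy orthonormal dictionary the signal is recovered exactly in two picks, the first being the highest-energy atom.<br />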

Experimentally, we observed that these parameters have<br />
different discrimination efficiency when applied to audio<br />
signals. For example, the time parameter is difficult to<br />
use in our case since it is impossible to synchronize the<br />
crowd reaction of all sequences at a fixed sample.<br />
Therefore, only the frequency and scale parameters are<br />
used for the classification of the audio sequences, which<br />
constitutes a first step of data size reduction. Nevertheless,<br />
since we work with 2500 iterations, the number of<br />
classification parameters is about 5000. In order to reduce<br />

the dimensionality of the input data, the linear discriminant<br />
analysis (LDA) technique is applied [6]. LDA is a<br />
supervised learning projection that uses information on the<br />
within-class scatter and between-class scatter to construct a<br />
projection matrix into the reduced space. It maximizes the<br />
ratio of between-class variance to within-class variance<br />
in any particular data set, thereby guaranteeing maximal<br />
separability. As will be shown in the next section, LDA<br />
improves the classification performance compared to<br />
classification in the original space.<br />
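The LDA projection step can be sketched with scikit-learn's implementation (our choice of toolbox; the paper does not name one). With 4 classes the reduced space has at most C − 1 = 3 axes, matching the 3-D space used in Section 4. The mock feature vectors below are ours, standing in for the MP-derived features.<br />

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
# 40 mock feature vectors of length 20 (the paper's are MP-derived and
# much longer), 10 per class for 4 classes, with class-dependent means
X = rng.normal(size=(40, 20)) + np.repeat(np.arange(4), 10)[:, None]
y = np.repeat(np.arange(4), 10)

lda = LinearDiscriminantAnalysis(n_components=3)   # at most C - 1 = 3 axes
Z = lda.fit_transform(X, y)                        # reduced feature vectors
```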

Finally, using the feature vectors provided by LDA, we<br />
use the nearest neighbor classifier for the classification<br />
task [6]. This operation is performed in two phases.<br />
The first, the learning stage, consists of processing a training<br />
set of features with a priori known classes. The second step,<br />
testing, is based on the computation of the distance between<br />
a new unknown feature vector and each feature vector from<br />
the training set. The shortest distance identifies the<br />
nearest neighbor. This algorithm becomes more<br />
computationally intensive as the size of the training set<br />
grows, but the performance improves.<br />
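The two phases above reduce to a few lines for a 1-NN rule. This is an illustrative sketch with made-up feature vectors and labels; the paper's actual features come from the MP histograms, and Section 4 replaces the Euclidean distance used here by a Mahalanobis distance.<br />

```python
import numpy as np

def nn_train(features, labels):
    # learning phase: simply store the labelled training vectors
    return np.asarray(features, float), np.asarray(labels)

def nn_classify(model, x):
    # testing phase: return the label of the closest stored vector
    train_X, train_y = model
    d = np.linalg.norm(train_X - np.asarray(x, float), axis=1)
    return train_y[int(np.argmin(d))]

# toy features and class names (made up for illustration)
model = nn_train([[0, 0], [0, 1], [5, 5]], ["goal", "goal", "normal"])
```

For example, `nn_classify(model, [4.5, 5.2])` returns `"normal"`, the label of the closest stored vector.<br />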

The method proposed in this section has been applied to<br />
the classification of soccer sequences from a significant<br />
dataset. The most important results are presented in the<br />
next section.<br />

4. RESULTS<br />

In this section we present the results obtained for a data<br />
set of 47 sequences: 10 goal sequences, 9 missed goals,<br />
21 normal game sequences and 7 penalties. The sequences<br />
are decomposed with the MP algorithm, which returns the<br />
parameters illustrated in matrix (9).<br />

The main idea behind the classification process is to<br />
compare the modulation frequencies of the Gabor functions<br />
with comparable scales for each sequence. This principle<br />
has been established by comparing the histograms of the<br />

Authorized licensed use limited to: <strong>Ryerson</strong> <strong>University</strong> Library. Downloaded on July 7, 2009 at 11:16 from IEEE Xplore. Restrictions apply.<br />



frequency parameters obtained from the MP algorithm applied<br />
to the data of each class. This is illustrated in<br />
Fig. 4 for scale 13.<br />

Fig. 4. Mean frequency distributions for the 4 classes<br />

As feature parameters we use the vectors of bin centers.<br />
Empirically, we found that the best bin-center vector is of<br />
size 12. Applying the LDA algorithm to the vectors<br />
obtained for our dataset, three non-zero eigenvalues are<br />
obtained. We choose the corresponding 3 eigenvectors as a 3D<br />
projection space. The classification is then performed using<br />
the nearest neighbor (NN) method coupled with the<br />
Mahalanobis distance. The LOO (leave-one-out) cross-validation<br />
technique [7] has been used because of the<br />
reduced number of examples in the database. The results are<br />
shown in Fig. 5.<br />
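The LOO protocol with a Mahalanobis 1-NN rule can be sketched as follows. This is our own illustration on a toy 2-D set; in the paper the distance acts on the 3-D LDA features, and the exact covariance estimate used there is not specified (we estimate it on the full set, an assumption).<br />

```python
import numpy as np

def loo_accuracy(X, y):
    """Leave-one-out accuracy of a 1-NN rule with Mahalanobis distance
    (covariance estimated on the full set, an assumption of this sketch)."""
    X, y = np.asarray(X, float), np.asarray(y)
    VI = np.linalg.inv(np.cov(X, rowvar=False))    # inverse covariance
    hits = 0
    for i in range(len(X)):
        others = np.delete(X, i, axis=0)
        diffs = others - X[i]
        d2 = np.einsum('ij,jk,ik->i', diffs, VI, diffs)  # squared distances
        hits += int(np.delete(y, i)[int(np.argmin(d2))] == y[i])
    return hits / len(X)

# two tight, well-separated toy clusters -> every left-out point is
# closest to a member of its own class
X = [[0, 0], [0.1, 0], [0, 0.1], [0.1, 0.1],
     [5, 5], [5.1, 5], [5, 5.1], [5.1, 5.1]]
y = [0, 0, 0, 0, 1, 1, 1, 1]
```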

Fig. 5. Clusters given by classification in the reduced space<br />

Note that the four classes are close but properly<br />
separated in Fig. 5.a, whereas the points corresponding to<br />
penalty sequences are spread among the other classes in<br />
Fig. 5.b. This comes from the fact that all the penalty<br />
sequences chosen in the set are also scored penalties: so,<br />
although the crowd cheers are more sudden than for a goal<br />
sequence, they are very similar. Table 1 provides the<br />
classification results with and without dimensionality<br />
reduction (provided by LDA and PCA, principal<br />
component analysis).<br />

Table 1. Classification rates<br />

The classification accuracies obtained clearly show that<br />
LDA is well adapted to transform the data provided by<br />
the MP algorithm.<br />

5. CONCLUSION<br />

In this paper, we have proposed a new technique for<br />
soccer event classification based on the Matching Pursuit<br />
algorithm. The dictionary of elementary functions has been<br />
tailored to the application at hand.<br />
After MP decomposition, the feature parameters have<br />
been projected via a dimensionality reduction stage. The<br />
LDA technique, combined with the nearest neighbor method,<br />
yields better classification performance and improves the<br />
computational efficiency of the classification stage. In<br />
future work, we intend to use other parameters of the Gabor<br />
functions with a larger dataset.<br />

ACKNOWLEDGEMENTS<br />

The authors would like to thank the LastWave software<br />
developers and Karthi Umapathy of <strong>Ryerson</strong> <strong>University</strong> for<br />
providing the Matching Pursuit routines.<br />

6. REFERENCES<br />

[1] Y. Gong, L. T. Sin, C. H. Chuan, H. Zhang, and M. Sakauchi,<br />
“Automatic parsing of TV soccer programs”, Proc. ICMCS '95,<br />
Washington DC, USA, 1995.<br />
[2] K. Wan and C. Xu, “Efficient multimodal features for automatic<br />
soccer highlight generation”, Proc. 17th ICPR, 2004.<br />
[3] K. Wan and C. Xu, “Robust soccer highlight generation with a<br />
novel dominant speech feature extractor”, IEEE International<br />
Conference on Multimedia and Expo (ICME), 2004.<br />
[4] S. Mallat and Z. Zhang, “Matching pursuits with time-frequency<br />
dictionaries”, IEEE Trans. <strong>Signal</strong> Processing, vol. 41, pp. 3397-3415, Dec. 1993.<br />
[5] S. Mallat, A Wavelet Tour of Signal Processing, Academic<br />
Press, 1998.<br />
[6] R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification (2nd<br />
ed.), Wiley Interscience, 2000.<br />
[7] K. Fukunaga, Introduction to Statistical Pattern Recognition<br />
(2nd ed.), Academic Press Professional, Inc., San Diego, CA, USA,<br />
1990.<br />



SUPPORT VECTOR MACHINES BASED APPROACH FOR CHEMICAL<br />

PHOSPHORUS REMOVAL PROCESS IN WASTEWATER TREATMENT PLANT<br />

Talieh Seyed Tabatabaei<br />
Department of Electrical and Computer Engineering, <strong>Ryerson</strong> <strong>University</strong>, Toronto<br />
tseyedta@ee.ryerson.ca<br />
Tahir Farooq<br />
Department of Electrical and Computer Engineering, <strong>Ryerson</strong> <strong>University</strong>, Toronto<br />
tfarooq@ee.ryerson.ca<br />
Abstract<br />

In this research, support vector machines (SVMs) are<br />
investigated to model the uncertainty in chemical phosphorus<br />
removal processes in wastewater treatment plants. SVM is a<br />
machine-learning method based on the principle of structural<br />
risk minimization, which performs well when applied to data<br />
outside the training set. The prediction of whether or not the<br />
concentration of total phosphorus as P in the effluent will<br />
exceed the maximum allowable limit (1.0 mg/L) for a certain<br />
input is considered a supervised-learning problem. The least-<br />
squares support vector machine (LS-SVM) algorithm, which<br />
is a reformulation of standard SVMs, is used to design the<br />
classifier. The performance of radial basis function (RBF),<br />
polynomial and multi-layer perceptron (MLP) kernels has<br />
been evaluated, and a high classification rate of 88.52% was<br />
achieved using the RBF kernel.<br />

Keywords: Wastewater, phosphorus removal, SVM<br />

1. Introduction<br />

Nature has an amazing ability to cope with small amounts<br />

of water wastes and pollution, but it would be overwhelmed if<br />

we did not treat the billions of gallons of wastewater and<br />

sewage produced every day before releasing it back to the<br />

environment. Treatment plants reduce pollutants in wastewater<br />

to a level that nature can handle.<br />

Wastewater can be defined as the liquid, or water-carried,<br />
wastes removed from residences, institutions, and commercial<br />
and industrial establishments, together with such ground<br />
water, surface water, and storm water as may be present [1].<br />

Collecting, treating and reusing wastewater is<br />
receiving increasing interest these days. In addition to its<br />
aesthetic and sanitary advantages, it offers a significant<br />
financial benefit, since treated wastewater can be reused in<br />
many applications (e.g. agricultural irrigation, urban irrigation,<br />
industrial reuse, groundwater recharge, street cleaning, car<br />
washing, toilet flushing, and many more [2]).<br />

Wastewater consists of physical, chemical, and biological<br />

components. Some of the contaminants of concern in<br />

1-4244-0038-4 2006<br />
IEEE CCECE/CCGEI, Ottawa, May 2006<br />
Aziz Guergachi<br />
Department of Information Technology Management, <strong>Ryerson</strong> <strong>University</strong>, Toronto<br />
a2guerga@ee.ryerson.ca<br />
Sridhar Krishnan<br />
Department of Electrical and Computer Engineering, <strong>Ryerson</strong> <strong>University</strong>, Toronto<br />
krishnan@ee.ryerson.ca<br />

wastewater to be removed are suspended solids, biodegradable<br />
organics, pathogens, nutrients, priority pollutants, refractory<br />
organics, heavy metals, and dissolved inorganics. Nutrients<br />
(i.e. nitrogen and phosphorus) are among the most important<br />
contaminants of wastewater. Both nitrogen and<br />
phosphorus are essential nutrients for growth [1, 2]. When<br />
discharged into the aquatic environment, these nutrients can lead<br />
to the growth of undesirable aquatic life. When discharged in<br />
excessive amounts on land, they can also lead to the pollution<br />
of groundwater.<br />

Phosphorus is essential to the growth of algae and other<br />
biological organisms. The usual forms of phosphorus found in<br />
aqueous solutions include orthophosphate, polyphosphate,<br />
and organic phosphate [1, 2]. Due to the negative<br />
effects of the phosphorus present in wastewater, along with<br />
the stringent discharge limits imposed on wastewater<br />
treatment plants, there has recently been increasing demand to<br />
achieve very low effluent total phosphorus. According to the<br />
phosphorus removal requirements imposed (by the<br />
International Joint Commission's Phosphorus Management<br />
Strategies Task Force) in Ontario, the typical effluent<br />
concentration limit should be 1.0 mg/L, based on total<br />
phosphorus [3]. However, in all provinces, site-specific conditions<br />
may dictate more stringent requirements on the effluent<br />
total phosphorus limit.<br />

The process of phosphorus removal can be carried out either<br />
biologically or chemically. The data used in this paper come<br />
from the Ashbridges Bay Treatment Plant in Toronto, which uses<br />
the chemical method. Chemicals used in chemical<br />
phosphorus removal include metal salts and lime. The<br />
most commonly used metal salts are ferric chloride, ferrous<br />
chloride, and aluminum sulfate. In this treatment<br />
plant, ferrous chloride (FeCl2) is used as the chemical<br />
precipitant for phosphorus removal.<br />

The theory of chemical precipitation reactions is very<br />
complex. Many uncertainties underlie all the<br />
chemical reactions. Due to the presence of numerous other<br />
particles, other concurrent side reactions may occur in<br />
the wastewater as well [1]. All these uncertainties bring about the<br />
need for prediction and control, and therefore for some kind of<br />
intelligent system.<br />

In the last few years, numerous studies have dealt<br />
with applications of artificial neural networks and<br />
fuzzy neural networks for modeling biological nutrient<br />
removal systems [18], fuzzy-logic based control strategies for<br />
biological nitrogen removal and dynamic enhanced biological<br />
phosphorus removal [20, 21], and fuzzy control of the level of<br />
biogas in treated wastewater [19], whereas the amount of<br />
work targeting the applications of chemical processes in<br />
wastewater treatment, especially chemical phosphorus<br />
removal, has been insufficient.<br />

In this paper, a novel approach based on support vector<br />

machines (SVMs) is proposed to control and classify the<br />

quality of the final effluent of wastewater treatment plants<br />

according to the International Joint Commission (IJC)<br />

phosphorus concentration standards.<br />

The paper is organized as follows: Section 2 discusses the<br />
theory of support vector machines in both the linear and<br />
non-linear cases. Section 3 explains the data set preparation<br />
for classification. Section 4 presents the classification<br />
results and graphs, and Section 5 gives the<br />
conclusion.<br />

2. Support Vector Machines<br />

SVM was first introduced by Vapnik and co-workers, and it<br />
is such a powerful method that, in the few years since its<br />
introduction, it has outperformed most other systems in a wide<br />
variety of applications. SVM has different applications;<br />
however, it is mostly used as a binary classifier. Given a<br />
class-labeled training set, which in this work is a set of labeled<br />
feature vectors composed of input and control variables, the<br />
boundary between the two classes is learnt using SVM.<br />

2.1. Linear Support Vector Machine<br />

Consider a binary classification problem with $x_i \in \mathbb{R}^d$ as<br />
its feature vectors and $y_i \in \{-1, +1\}$ as the class labels (i.e.<br />
$(x_1, y_1), \ldots, (x_n, y_n)$ is the training set). The hyperplane<br />
which separates the two classes is<br />
$$f(x) = w^T x + b = 0 \quad (1)$$<br />
SVM chooses the hyperplane which maximizes the margin<br />
between the two classes (Figure 1) [4, 5, 6]. Thus, the<br />
hyperplane $(w, b)$ solves the optimization problem<br />
$$\min_{w,b} \ \tfrac{1}{2}\|w\|^2 \quad \text{subject to} \quad y_i(\langle w \cdot x_i \rangle + b) \ge 1, \ i = 1, \ldots, n \quad (2)$$<br />


and realizes the maximal margin hyperplane with geometric<br />
margin $\gamma = 1/\|w\|$. The primal Lagrangian is<br />
$$L(w, b, \alpha) = \tfrac{1}{2}\|w\|^2 - \sum_{i=1}^{n} \alpha_i \left[ y_i(\langle w \cdot x_i \rangle + b) - 1 \right] \quad (3)$$<br />
where $\alpha_i \ge 0$ are the Lagrange multipliers.<br />

The corresponding dual is found by differentiating with<br />
respect to $w$ and $b$:<br />
$$\max_{\alpha} \ W(\alpha) = \sum_{i=1}^{n} \alpha_i - \tfrac{1}{2} \sum_{i,j=1}^{n} y_i y_j \alpha_i \alpha_j \langle x_i \cdot x_j \rangle \quad (4)$$<br />
subject to<br />
$$\sum_{i=1}^{n} \alpha_i y_i = 0, \quad \alpha_i \ge 0, \ i = 1, \ldots, n$$<br />

But in many real-world problems the data is noisy; therefore<br />
there will in general be no linear separation. In this case,<br />
instead of a hard margin we use a soft margin (the noise-tolerant<br />
version), and slack variables, denoted by $\xi_i$, are introduced<br />
to relax the constraints [4, 5, 6].<br />

So our optimization problem becomes<br />
$$\min_{w,b} \ \tfrac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \xi_i \quad \text{subject to} \quad y_i(\langle w \cdot x_i \rangle + b) \ge 1 - \xi_i, \ \xi_i \ge 0, \ i = 1, \ldots, n \quad (5)$$<br />
where $C$ is a regularization parameter that trades off<br />
the empirical risk (reflected by the second term in<br />
(5)) against the model complexity (reflected by the first term in (5)).<br />

The dual form of this case is the same as (4) except that the<br />
constraint becomes $0 \le \alpha_i \le C$. The resulting decision function is<br />
$$f(x) = \sum_{i=1}^{N_s} y_i \alpha_i \langle x_i \cdot x \rangle + b_0 \quad (6)$$<br />
where $N_s$ is the number of support vectors.<br />
This result shows that points that are not support vectors<br />
have no influence on the solution.<br />
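The support-vector property stated above is easy to observe with an off-the-shelf solver (scikit-learn here, our choice for illustration; the toy data below is not the paper's): only the points on the margin boundaries end up in the support set.<br />

```python
import numpy as np
from sklearn.svm import SVC

# tiny separable toy problem (illustrative; not the paper's data)
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [3.0, 3.0], [3.0, 4.0], [4.0, 3.0]])
y = np.array([-1, -1, -1, 1, 1, 1])

clf = SVC(kernel="linear", C=10.0).fit(X, y)
# only the margin points are kept as support vectors; deleting any of
# the remaining points would leave the decision boundary unchanged
print(clf.support_)
```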

2.2. Non-linear Support Vector Machines<br />


In most real-world cases the data points are not<br />
linearly separable. In this case we use a non-linear<br />
operator $\varphi(\cdot)$ to map the data to a higher-dimensional space F<br />
(the feature space), where they can be classified linearly (Figure 2).<br />



Figure 1. A linear SVM classifier. Support vectors are<br />

those elements of the training set which are on the<br />

boundary hyperplanes of two classes.<br />

So the hypothesis in this case would be<br />
$$f(x) = \langle w \cdot \varphi(x) \rangle + b \quad (7)$$<br />
which is linear in terms of the mapped data $\varphi(x)$.<br />
Now we can extend all the presented optimization problems<br />
for the linear case to the transformed data in the feature<br />
space.<br />
We define the kernel function as<br />
$$K(x, y) = \langle \varphi(x) \cdot \varphi(y) \rangle \quad (8)$$<br />
where $\varphi$ is a mapping from the input space to an (inner product)<br />
feature space F.<br />
The corresponding dual form is then<br />


$$\max_{\alpha} \ W(\alpha) = \sum_{i=1}^{n} \alpha_i - \tfrac{1}{2} \sum_{i,j=1}^{n} y_i y_j \alpha_i \alpha_j K(x_i, x_j) \quad (9)$$<br />
subject to<br />
$$\sum_{i=1}^{n} \alpha_i y_i = 0, \quad \alpha_i \ge 0, \ i = 1, \ldots, n$$<br />

And our final decision rule can be expressed as<br />
$$f(x) = \sum_{i=1}^{N_s} y_i \alpha_i K(x_i, x) + b_0 \quad (10)$$<br />
where $N_s$ and $\alpha_i$ denote the number of support vectors and the<br />
non-zero Lagrange multipliers corresponding to the support<br />
vectors, respectively. Note that we do not have to know the<br />
underlying mapping function; however, it is necessary to<br />
define the kernel function. Among the different kernel<br />
functions, the most common are the polynomial, Gaussian<br />
radial basis function (RBF) and multi-layer perceptron (MLP) kernels.<br />
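The three kernels named above can be written directly. A hedged sketch: the RBF denominator is taken as σ², matching the width parameter σ used in Section 4 (some texts use 2σ² instead), and the parameter names k and θ for the MLP kernel follow the paper.<br />

```python
import numpy as np

def poly_kernel(x, z, d=3):
    return (x @ z + 1.0) ** d                  # (x^T z + 1)^d

def rbf_kernel(x, z, sigma=0.5):
    # Gaussian RBF; denominator sigma^2 (an assumption, see lead-in)
    return np.exp(-np.sum((x - z) ** 2) / sigma ** 2)

def mlp_kernel(x, z, k=1.0, theta=0.0):
    # satisfies Mercer's condition only for some (k, theta)
    return np.tanh(k * (x @ z) + theta)

x = np.array([1.0, 0.0])
z = np.array([1.0, 1.0])
```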

In LS-SVMs the inequality constraints of the SVM are replaced<br />
with equality constraints. As a result, the solution follows from<br />
solving a set of linear equations instead of the quadratic<br />


Figure 2. Mapping data from the input space to a higher-<br />
dimensional feature space by a non-linear operator $\varphi(\cdot)$, in<br />
order to classify them with a linear function<br />

programming problem, as in Vapnik's original SVM<br />
formulation, which obviously yields a faster<br />
algorithm.<br />

The primal problem of the LS-SVM is defined as<br />
$$\min_{w,b,e} \ J_p(w, b, e) = \tfrac{1}{2}\|w\|^2 + \gamma \tfrac{1}{2} \sum_{i=1}^{d} e_i^2 \quad (11)$$<br />
subject to<br />
$$y_i \left[ w^T \varphi(x_i) + b \right] = 1 - e_i, \quad i = 1, \ldots, d$$<br />


where $\gamma$ is a parameter analogous to the SVM's regularization<br />
parameter. The main characteristic of LS-SVMs is their low<br />
computational complexity compared to SVMs, without loss of<br />
quality in the solution.<br />
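The set of linear equations alluded to above is, in Suykens' formulation, $[[0, y^T], [y, \Omega + I/\gamma]][b; \alpha] = [0; 1]$ with $\Omega_{ij} = y_i y_j K(x_i, x_j)$. A small self-contained sketch (our own code, not the LS-SVMlab toolbox of [16]; the toy data is illustrative):<br />

```python
import numpy as np

def rbf_gram(A, B, sigma=1.0):
    # Gram matrix of the RBF kernel between row-sets A and B
    sq = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq / sigma ** 2)

def lssvm_train(X, y, gamma=10.0, sigma=1.0):
    # solve [[0, y^T], [y, Omega + I/gamma]] [b; alpha] = [0; 1]
    n = len(y)
    Omega = np.outer(y, y) * rbf_gram(X, X, sigma)
    A = np.zeros((n + 1, n + 1))
    A[0, 1:], A[1:, 0] = y, y
    A[1:, 1:] = Omega + np.eye(n) / gamma
    sol = np.linalg.solve(A, np.r_[0.0, np.ones(n)])
    return sol[0], sol[1:]                     # bias b, multipliers alpha

def lssvm_predict(X, y, b, alpha, Xnew, sigma=1.0):
    return np.sign(rbf_gram(Xnew, X, sigma) @ (alpha * y) + b)

# toy demo: two separable clusters (illustrative data, not the plant's)
X = np.array([[0., 0.], [0., 1.], [1., 0.], [4., 4.], [4., 5.], [5., 4.]])
y = np.array([-1., -1., -1., 1., 1., 1.])
b, alpha = lssvm_train(X, y)
```

One linear solve replaces the quadratic program, which is the source of the speed advantage claimed above.<br />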

3. Dataset Preparation and Processing<br />

The dataset used in this study was obtained from<br />

Ashbridges Bay Wastewater Treatment Plant, Toronto. This<br />

dataset consists of 123 records. Each record is an observation<br />

of the input, control and output variables. Every record<br />

represents the average values of the variables over a period of<br />

one day. The input and control variables used in this study<br />

were selected after consultation with senior plant<br />

management. Total daily volume treated, peak flow rate,<br />

carbonaceous biochemical oxygen demand (CBOD),<br />

suspended solids (SS) and total phosphorus as P in influent are<br />

used as input variables. Ferrous chloride is used as the control<br />
variable and is included in the input feature vector for the training<br />

and testing of LS-SVM and LDA classifiers. Concentration of<br />

total phosphorus as P in effluent is used as the output variable.<br />

The dataset was randomly divided into two separate subsets.<br />
One subset, with 62 examples, was used exclusively<br />
for training the algorithms, and the other, with 61 examples,<br />
was used exclusively for testing. No example from the<br />
training set was ever used during the testing phase, and vice<br />
versa. A class label $y_i \in \{-1, +1\}$ was assigned to every output<br />
value based on the threshold value of 1.0 mg/L. If the output<br />
variable exceeds the threshold, the +1 class label is assigned to<br />
the output value; otherwise the -1 class label is assigned. Class<br />
label assignment was done for both the training and testing<br />
datasets before designing the LS-SVM and LDA classification<br />
algorithms.<br />
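The preparation steps above (thresholding the output at 1.0 mg/L and a disjoint 62/61 random split) can be sketched as follows. The plant data is not public, so random mock records stand in for the real 123-row table; everything except the threshold and split sizes is our invention.<br />

```python
import numpy as np

rng = np.random.default_rng(42)
# mock stand-ins for the plant records (the real data is not public):
# 123 daily records, 6 input/control variables, one output (effluent P, mg/L)
inputs = rng.normal(size=(123, 6))
effluent_P = rng.uniform(0.2, 1.8, size=123)

labels = np.where(effluent_P > 1.0, 1, -1)     # threshold at 1.0 mg/L

idx = rng.permutation(123)                     # disjoint random split
train_idx, test_idx = idx[:62], idx[62:]       # 62 training / 61 testing
X_train, y_train = inputs[train_idx], labels[train_idx]
X_test, y_test = inputs[test_idx], labels[test_idx]

def classification_rate(y_true, y_pred):
    # figure of merit from Section 4: percent correctly classified
    return 100.0 * float(np.mean(y_true == y_pred))
```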



4. Experimental Results<br />

The objective of the LS-SVM and LDA classification<br />
algorithms is to correctly classify whether or not the<br />
concentration of total phosphorus as P in the effluent will exceed<br />
the threshold for a given set of yet-to-be-seen input patterns.<br />
The classification rate was used as the figure of merit,<br />
defined as the total number of correctly<br />
classified examples divided by the total number of examples<br />
classified, times one hundred. The results of LS-SVM<br />

classification have been obtained using three different kernel<br />
functions: the polynomial kernel, $K(x_i, x_j) = (x_i^T x_j + 1)^d$, where<br />
$d$ is the degree of the polynomial; the radial basis function<br />
kernel, $K(x_i, x_j) = \exp(-\|x_i - x_j\|^2 / \sigma^2)$, where $\sigma$ is the width<br />
of the RBF kernel; and the multilayer perceptron (MLP) kernel,<br />
$K(x_i, x_j) = \tanh(k\, x_i^T x_j + \theta)$. The MLP kernel does not satisfy<br />
Mercer's condition for all $k$ and $\theta$.<br />

Fig. 4 shows the estimated classification rate achieved by the<br />
LS-SVM classifier using the RBF kernel with kernel width $\sigma$ =<br />
0.5, 1 and 2.5. The best classification rate achieved was<br />
88.52% when $\sigma$ = 0.5 and C is between 0.5 and 0.8. A<br />
similar classification rate was achieved when $\sigma$ = 1 and C =<br />
0.5. The classification rate dropped to 86.88% when the value<br />
of $\sigma$ was changed to 2.5 and 0.1.<br />

For the polynomial kernel, a consistent classification rate of<br />
86.88% was achieved for a wide range of parameter settings.<br />
Although the polynomial kernel did not achieve as good a<br />
performance as the RBF kernel, its performance was insensitive<br />
to a very wide range of parameter settings.<br />

Fig. 5 shows the classification rate obtained with the MLP<br />
kernel with $k$ = 0.5, 1 and 2.5. The value of $\theta$ was kept<br />
constant at 1.<br />
The MLP kernel achieved its best classification rate of 86.88%<br />
for all three values of $k$, at different values of C. However,<br />
the results obtained with the MLP kernel were very sensitive to the<br />

Figure 4. Plot of LS-SVM classification rate versus<br />
regularization parameter C using the RBF kernel with<br />
$\sigma$ = 0.5, 1 and 2.5<br />


Figure 5. Plot of LS-SVM classification rate versus<br />

regularization parameter C using MLP kernel with<br />

k = 0.5, 1 and 2.5<br />

parameter settings. Hence the polynomial kernel could be a better<br />
choice than the MLP kernel.<br />

The same training and testing datasets were used to design<br />
and test the LDA classifier, and the best classification rate<br />
achieved with optimal parameter settings over the testing<br />
dataset was 68.85%. These results indicate the strong<br />
generalization ability of the LS-SVM classifier.<br />

5. Conclusion<br />

We have presented an SVM-based approach that utilizes the<br />
principle of structural risk minimization to model the<br />
uncertainty that underlies the chemical phosphorus removal<br />
process in wastewater treatment plants. A real dataset of 123<br />
examples was obtained from the Ashbridges Bay Wastewater<br />
Treatment Plant, Toronto. A classifier based on LS-SVM has<br />
been designed through supervised learning to classify whether<br />
or not the concentration of total phosphorus as P in the<br />
effluent will exceed the maximum allowable limit.<br />
The performance of different kernel functions has been evaluated;<br />
all three kernel functions performed well, and in particular<br />
the RBF kernel achieved a very promising classification rate<br />
of 88.52% over the unseen testing dataset. For comparison, the<br />
LDA classifier was also used in the study. The classification<br />
results showed that the LS-SVM based approach outperformed the<br />
LDA method.<br />

Acknowledgements<br />

We are thankful to Mark Rupke, Chris Monteith, Colin<br />

Marshall and Filemon Basa at Ashbridges Bay Treatment<br />

Plant, Toronto for providing us with valuable information.<br />



References<br />

[1] Metcalf and Eddy, Wastewater Engineering: Treatment and<br />
Reuse. New York: McGraw-Hill, 1991.<br />

[2] M. J. Hammer and M. J. Hammer Jr., Water and<br />

wastewater technology. New Jersey, Columbus: Prentice<br />

Hall, 2003.<br />

[3] N. W. Schmidtke and Assoc. Ltd. and D. I. Jenkins and<br />
Assoc. Inc., Retrofitting Municipal Wastewater Treatment<br />
Plants for Enhanced Biological Phosphorus Removal.<br />
Canada: Minister of Supply and Services Canada, 1986.<br />

[4] N. Cristianini and J. Shawe-Taylor, An Introduction to<br />
Support Vector Machines and Other Kernel-Based<br />
Methods. United Kingdom: Cambridge <strong>University</strong> Press,<br />
2000.<br />

[5] C. J. Burges, "A tutorial on support vector machines for<br />
pattern recognition," Data Mining and Knowledge<br />
Discovery, vol. 2, pp. 121-167, June, 1998.<br />

[6] I. El Naqa, Y. Yang, M. N. Wernick, N. P. Galatsanos, and<br />
R. M. Nishikawa, "A support vector machine approach for<br />
detection of microcalcifications," IEEE Trans. Med. Imag.,<br />
vol. 21, no. 12, December, 2002.<br />

[7] P. H. Chen, C. J. Lin, and B. Scholkopf, "A tutorial on<br />
ν-support vector machines," unpublished.<br />

[8] J. Salmon, S. King, and M.Osborne,"Framewise phone<br />

classification using support vector machines,"<br />

unpublished.<br />

[9] S. Z. Li and G. Guo, "Content-based audio classification<br />

and retrieval using SVM learning," unpublished.<br />

[10] K. R. Muller, S. Mika, G. Ratsch, K. Tsuda, and B.<br />

Scholkopf, "An introduction to kernel-based learning<br />

algorithm," IEEE Trans. Neural Networks, vol. 12, pp.<br />

181-201, Mar. 2001.<br />

[11] B. Scholkopf and A. J. Smola, Learning with kernels -<br />

support vector machines, regularization, optimization,<br />

and beyond. Cambridge, MA: MIT press, 2002.<br />


[12] V. Kecman, Learning and Soft Computing: Support<br />
Vector Machines, Neural Networks, and Fuzzy Logic.<br />
Cambridge, MA: MIT Press, 2001.<br />

[13] U. Jeppsson, Modelling Aspects of Wastewater Treatment<br />
Processes. Lund, Sweden: Reprocentralen, Lund <strong>University</strong>,<br />
1996.<br />

[14] J. C. Principe, N. R. Euliano, and W. C. Lefebvre,<br />

Neural and adaptive systems -fundamentals through<br />

simulation. United States of America: John Wiley & sons<br />

Inc., 1999.<br />

[15] S. Haykin, Neural Networks - a comprehensive<br />

foundation. Hamilton, ON., Canada: Prentice Hall, 1999.<br />

[16] K. Pelckmans et al, "LS-SVMlab toolbox user's guide,"<br />

unpublished.<br />

[17] J. A. K. Suykens, T. V. Gestel, J. D. Brabanter, B. D.<br />

Moor, and J. Vandewalle, Least square support vector<br />

machines. Singapore: World scientific publishing Co. Pte.<br />

Ltd., 2002.<br />

[18] D. S. Lee and J. M. Park, "Neural network modeling for<br />
on-line estimation of nutrient dynamics in a<br />
sequentially-operated batch reactor," Journal of<br />
Biotechnology, vol. 75, pp. 229-239, June, 1999.<br />

[19] O. C. Pires, C. Palma, J. C. Costa, I. Moita, M. M. Alves,<br />

and E. C. Ferreira, "Knowledge-based fuzzy system for<br />

diagnosis and control of an integrated biological<br />

wastewater treatment process," the 2nd IWA conference on<br />

instrumentation, control, and automation, June, 2005.<br />

[20] S. T. Yordanova, "Fuzzy two-level control for an aerobic<br />
wastewater treatment," Proceedings of the 2nd International<br />
IEEE Conference, vol. 1, pp. 348-352, June, 2004.<br />

[21] S. Marsili-Libelli and L. Giunti, "Fuzzy predictive control for<br />
nitrogen removal in biological wastewater treatment,"<br />
Water Science and Technology, vol. 45, pp. 37-44, June, 2002.<br />


This full text paper was peer reviewed at the direction of IEEE Communications Society subject matter experts for publication in the IEEE GLOBECOM 2005 proceedings.<br />

Data Embedding in µ-law Speech with Spread<br />

Spectrum Techniques<br />

Libo Zhang, Heping Ding<br />

Institute for Microstructural Sciences,<br />

National <strong>Research</strong> Council,<br />

Ottawa, Ontario, Canada<br />

heping.ding@nrc-cnrc.gc.ca<br />

Abstract: This paper explores data embedding in G.711 µ-law<br />
speech signals with spread spectrum techniques. Based on an<br />

optimized spread spectrum scheme, a simple but effective solution<br />

is presented for high-capacity embedding. Simulations show that<br />

the proposed scheme, when incorporated with the measure of the<br />

frequency masking effects, can achieve an embedding rate of<br />

about 100 bits per second with a 7% Bit Error Rate (BER), or<br />

1000 bps with a 10% BER.<br />

Keywords- µ-law speech, data embedding, speech coding, spread<br />

spectrum communication<br />

I. BACKGROUND<br />

Techniques to embed additional digital information<br />
imperceptibly into host signals have many applications. For<br />
example, in digital watermarking, copyright<br />
information is embedded into audio signals to<br />
protect intellectual property. In another example, shown in<br />

[1], the wide-band components are embedded into narrow-band<br />

speech signals to enhance the quality and intelligibility.<br />

The µ-law companded signal format, which is defined in<br />

ITU-T G.711, is the telephony standard in North America. For<br />

high capacity embedding in such signals, it is required to<br />

reliably transmit the embedded information, along with the host<br />

speech signal, across both the analog and digital telephony<br />

channels. Thus, the embedding should be robust against both<br />

band-pass filtering and Additive White Gaussian Noise<br />

(AWGN), which occur in normal telephony channels.<br />

In general, three conflicting criteria are used to evaluate such<br />

embedding systems. Imperceptibility means that the composite<br />

signal should be perceptually equivalent to the host signal;<br />

robustness refers to a reliable extraction even if the composite<br />

signal is degraded; and embedding rate is a measure of how<br />

much information can be embedded and transmitted. For our research in µ-law embedding, the embedding rate receives more emphasis than in other embedding applications.<br />

Little has been published on this research topic. Currently, two categories of techniques could be used for this kind of data embedding: those based on spread spectrum techniques [2] and those based on quantization-bin techniques [3]. The conventional spread spectrum techniques usually cannot achieve high embedding rates, because a long spreading sequence is required just to reduce the host impact. Reference [4] proposed a modified spread spectrum embedding algorithm that inherently suppresses the host impact; the scheme shows very high robustness when applied to digital audio watermarking.<br />

Sridhar Krishnan<br />

Electrical and Computer Engineering Department,<br />

<strong>Ryerson</strong> <strong>University</strong>,<br />

Toronto, Ontario, Canada<br />

krishnan@ee.ryerson.ca<br />

In this paper, we optimize this modified scheme for the<br />

purpose of high capacity embedding in µ-law speech signals.<br />

The rest of the paper is organized as follows. Section II presents<br />

a generalized view of spread spectrum embedding schemes,<br />

with the modified scheme and its optimization being special<br />

cases. Section III incorporates the frequency masking effect to<br />

implement the proposed scheme. Section IV presents the<br />

simulation results and Section V gives a summary.<br />

II. SPREAD SPECTRUM SCHEMES<br />

Supposing that one bipolar information bit b ∈ {−1, +1} is to be embedded into x, an N-sample time- or transform-domain sequence of the host signal, the generalized spread spectrum embedding can be expressed as<br />

y = x − β(x • w)w + αbw,  0 ≤ β ≤ 1,  (1)<br />

where y represents the composite signal; the pseudo-random spreading sequence w is of length N and zero-mean; the scalar α > 0 controls the embedding strength; β = 0 and β = 1 result in the conventional and the modified schemes, respectively; and the dot operator represents the normalized correlation of two length-N sequences, defined as<br />

u • v ≡ (1/N) Σ_{i=1}^{N} u_i v_i.  (2)<br />

Degraded by the additive noise n during transmission, the received signal can be expressed as<br />

r = y + n = x − β(x • w)w + αbw + n.  (3)<br />

The normalized correlation between the received signal and the spreading sequence can be found as<br />

c = r • w = αb + (1 − β)(x • w) + n • w.  (4)<br />

Assume that both the host signal and the noise are Gaussian, with x ~ N(0, σ_x²) and n ~ N(0, σ_n²). According to (4), the correlation is also Gaussian, with<br />

c ~ N( αb, [(1 − β)²σ_x² + σ_n²]/N ).  (5)<br />
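The embedding rule (1) and the polarity-based extraction from (4) can be sketched as follows. This is an illustrative sketch, not the authors' implementation; a bipolar ±1 spreading sequence is assumed so that w • w = 1 exactly.<br />

```python
import numpy as np

def embed_bit(x, w, b, alpha, beta):
    """Generalized spread spectrum embedding, Eq. (1); the dot operator
    is the normalized correlation of Eq. (2)."""
    return x - beta * np.mean(x * w) * w + alpha * b * w

def extract_bit(r, w):
    """Estimate the bit from the polarity of c = r . w, Eq. (4)."""
    return int(np.sign(np.mean(r * w)))

rng = np.random.default_rng(0)
N = 64
x = rng.standard_normal(N)            # host segment (sigma_x = 1)
w = rng.choice([-1.0, 1.0], size=N)   # bipolar spreading sequence, so w . w = 1
b, alpha = -1, 0.1

# beta = 1 (the modified scheme) cancels the host term completely, so the
# weak alpha*b*w term alone decides the correlation in the noise-free case
y = embed_bit(x, w, b, alpha, beta=1.0)
print(extract_bit(y, w))  # -1
```

With beta = 0 the same weak alpha would often be swamped by the host correlation (x • w), which is the poor-performance case discussed below.<br />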

IEEE Globecom 2005 2160 0-7803-9415-1/05/$20.00 © 2005 IEEE<br />

64<br />

Authorized licensed use limited to: <strong>Ryerson</strong> <strong>University</strong> Library. Downloaded on July 7, 2009 at 11:28 from IEEE Xplore. Restrictions apply.


This full text paper was peer reviewed at the direction of IEEE Communications Society subject matter experts for publication in the IEEE GLOBECOM 2005 proceedings.<br />

Thus, the embedded information bit b can be estimated by b̂ = sign(c), and the performance, in terms of Bit Error Rate (BER), is<br />

p = Q(µ_c/σ_c) = Q( √( Nα² / [(1 − β)²σ_x² + σ_n²] ) ),  (6)<br />

where Q(x) = (1/√(2π)) ∫_x^∞ e^{−u²/2} du is the tail error function.<br />
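Equation (6) is straightforward to evaluate numerically, since Q(x) can be written via the complementary error function. A small illustrative sketch (parameter values are ours, not from the paper):<br />

```python
import math

def Q(x):
    """Tail probability of the standard normal: Q(x) = 0.5*erfc(x/sqrt(2))."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def ber(N, alpha, beta, var_x, var_n):
    """Extraction BER of the generalized scheme, Eq. (6)."""
    return Q(math.sqrt(N * alpha ** 2 / ((1 - beta) ** 2 * var_x + var_n)))

# the modified scheme (beta = 1) removes the host term from the denominator,
# so its BER is far below that of the conventional scheme (beta = 0)
conventional = ber(64, 0.1, 0.0, 1.0, 0.001)
modified = ber(64, 0.1, 1.0, 1.0, 0.001)
print(conventional > modified)  # True
```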

As shown in (6), in the conventional scheme (β = 0), both the host and the external noise degrade the extraction, which results in the poor performance of this scheme. In the modified scheme (β = 1), the host impact is totally suppressed. However, the total embedding power, which determines the perceptual distortion, grows from α² to α² + σ_x²/N, as can be deduced directly from (1). In high capacity embedding, where a small N is preferred, even the minimal embedding power σ_x²/N, obtained by setting α = 0, may not be small enough to guarantee imperceptibility.<br />

In this paper, we propose to use a less-than-unity β to reduce the total embedding power to P = α² + β²σ_x²/N. The optimal β should minimize p, the extraction BER, while satisfying the power constraint that assures imperceptibility.<br />

We start with<br />

0 ≤ β ≤ 1;<br />

p = Q( √( Nα² / [(1 − β)²σ_x² + σ_n²] ) ) = Q( √( (N·P − β²σ_x²) / [(1 − β)²σ_x² + σ_n²] ) ),  (7)<br />

and, with the “embedded data to signal” and “signal to noise” ratios defined as DSR = P/σ_x² and SNR = σ_x²/σ_n², respectively, the BER in (7) can be expressed as<br />

p = Q( √( (N·DSR − β²) / [(1 − β)² + 1/SNR] ) ).  (8)<br />

Next, we want to find β*, the β that minimizes (8), or equivalently maximizes the quantity under the square root sign in (8). Since the noise n is not known at the time of embedding and normally σ_n² ≪ σ_x², the optimization is carried out with the noise term ignored:<br />

p = Q( √(N·DSR − β²) / (1 − β) ).  (9)<br />

When N·DSR ≥ 1, one can choose β = 1. When N·DSR < 1, β* can be found by letting<br />

∂/∂β [ (N·DSR − β²) / (1 − β)² ] = 2(N·DSR − β) / (1 − β)³ = 0;  (10)<br />

therefore, β* = N·DSR. To summarize, we have<br />

β* = min(N·DSR, 1).  (11)<br />

The corresponding α* is then<br />

α* = √( P − (β*)²σ_x²/N ).  (12)<br />

When N·DSR < 1, the best achievable BER with no noise considered is, according to (9),<br />

p* = Q( √( β*/(1 − β*) ) ).  (13)<br />

Equation (13) can be used to estimate the maximal embedding rate of the proposed scheme. For example, given a required BER p ≤ 3%, β* = N·DSR must be at least 0.8 as per Fig. 1 (approximate to this “no noise” case and to be discussed later). Thus, the maximum rate f_s/N is limited by f_s/N ≤ f_s·DSR/0.8 (bps), with f_s being the sampling frequency. For example, the rate limit is 100 bps when DSR = −20 dB. The limit is further decreased by the inherent noise from µ-law companding and by external noise in the telephony channel. In the sequel, the term N·DSR is called the composite embedding power for simplicity.<br />
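The closed-form choices (11) and (13) and the worked rate-limit example can be checked numerically. A sketch; the 8 kHz sampling rate is the usual telephony assumption, not stated at this point in the text:<br />

```python
import math

def Q(x):
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def optimal_beta(N, DSR):
    """Eq. (11): beta* = min(N*DSR, 1)."""
    return min(N * DSR, 1.0)

def best_ber_no_noise(beta_star):
    """Eq. (13): best achievable BER when channel noise is ignored."""
    return Q(math.sqrt(beta_star / (1.0 - beta_star)))

# the worked example from the text: p <= 3% needs beta* = N*DSR >= 0.8,
# so the rate limit is fs*DSR/0.8 = 100 bps at DSR = -20 dB and fs = 8 kHz
fs, DSR = 8000, 10.0 ** (-20 / 10)
print(round(best_ber_no_noise(0.8), 3), round(fs * DSR / 0.8))  # 0.023 100
```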

Figure 1. BER of spread spectrum embedding (SNR=30 dB)<br />

To show the improvement due to the optimization of β, (8) is plotted in Fig. 1 for SNR = 30 dB and different composite embedding powers. It can be seen that, when the power takes intermediate values, the performance can be improved significantly, e.g., from p = 18% for the conventional spread spectrum scheme to p = 3% when N·DSR = 0.8. In the case of<br />

watermarking, where a large N can be used, the composite<br />

embedding power is normally large enough such that the<br />

optimization is not necessary. However in high capacity<br />

embedding, the composite power is often smaller because N<br />




could not be very large; therefore, such optimization is<br />

necessary to achieve high capacity.<br />

By observing the optimization results in Fig. 1, we can see that, although derived with the noise ignored, β* given by (11) is still a simple and reasonable approximation for the case of SNR = 30 dB.<br />

III. MDCT DOMAIN IMPLEMENTATION<br />

As discussed in [5], frequency masking in the human auditory system refers to the masking phenomenon between two simultaneously occurring components that are close in frequency: the stronger component may make the weaker one imperceptible. A masking model uses this effect to derive a masking threshold from the signal power spectrum. The amplitude changes made by embedding are perceptually irrelevant as long as they stay under the threshold at each frequency. Thus, one can use the frequency masking effect to imperceptibly maximize the embedding power.<br />

The frequency masking effect is normally described in<br />

the Fourier frequency domain. The Modified Discrete Cosine<br />

Transform (MDCT) with 50% overlapping between successive<br />

frames can perfectly reconstruct the original signal. It was<br />

shown in [6] that MDCT coefficients can be approximated by<br />

the corresponding Fourier ones with a modulating term. This<br />

similarity indicates that a masking model based on MDCT can<br />

be borrowed from the Fourier-based model without causing too much error.<br />
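The 50%-overlap MDCT with a Princen-Bradley (sine) window can be sketched directly from its definition, and the perfect-reconstruction property is easy to verify numerically. This is an illustrative implementation, not the one used in the paper; the frame length of 128 samples (N = 64) matches the processing described below.<br />

```python
import numpy as np

def mdct_frames(x, N):
    """Analysis: 50%-overlapping 2N-sample frames, sine-windowed, then MDCT."""
    w = np.sin(np.pi / (2 * N) * (np.arange(2 * N) + 0.5))   # Princen-Bradley sine window
    n, k = np.arange(2 * N), np.arange(N)
    C = np.cos(np.pi / N * (n[:, None] + 0.5 + N / 2) * (k[None, :] + 0.5))
    return [(x[h:h + 2 * N] * w) @ C for h in range(0, len(x) - 2 * N + 1, N)]

def imdct_overlap_add(frames, N):
    """Synthesis: inverse transform each frame, window again, overlap-add."""
    w = np.sin(np.pi / (2 * N) * (np.arange(2 * N) + 0.5))
    n, k = np.arange(2 * N), np.arange(N)
    C = np.cos(np.pi / N * (n[:, None] + 0.5 + N / 2) * (k[None, :] + 0.5))
    out = np.zeros(N * (len(frames) + 1))
    for t, X in enumerate(frames):
        out[t * N: t * N + 2 * N] += (2.0 / N) * (C @ X) * w
    return out

rng = np.random.default_rng(0)
N = 64                                    # half-frame length; frames are 128 samples
x = rng.standard_normal(16 * N)
y = imdct_overlap_add(mdct_frames(x, N), N)
# interior samples are reconstructed up to float rounding; the first and last
# half-frames lack an overlap partner and are excluded from the check
print(np.max(np.abs(y[N:-N] - x[N:-N])) < 1e-8)  # True
```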

In this research, the MDCT domain is chosen for embedding<br />

and the frequency masking effect is used. Being a scaled-down<br />

version of Model 1 of Layer 3 in MPEG-1 [5], our model<br />

consists of merely 18 non-uniform critical bands – to<br />

accommodate the 0~4 kHz range only.<br />

The block diagram of embedding/extraction is shown in Fig.<br />

4. Each 128-sample frame of the µ-law signal is first expanded<br />

to 16-bit linear PCM and then transformed into MDCT<br />

coefficients.<br />
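The µ-law expansion step follows the G.711 companding characteristic. The sketch below implements the continuous law only; the standard itself quantizes a piecewise-linear approximation of this curve to 8 bits, which is the source of the quantization noise discussed later.<br />

```python
import numpy as np

MU = 255.0  # mu for North American telephony

def mu_law_compress(x):
    """Continuous mu-law companding curve (input normalized to [-1, 1])."""
    return np.sign(x) * np.log1p(MU * np.abs(x)) / np.log1p(MU)

def mu_law_expand(y):
    """Inverse of the continuous curve (the 'mu to linear expansion' step)."""
    return np.sign(y) * ((1.0 + MU) ** np.abs(y) - 1.0) / MU

x = np.linspace(-1.0, 1.0, 101)
print(np.max(np.abs(mu_law_expand(mu_law_compress(x)) - x)) < 1e-12)  # True
```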

The global masking threshold is computed from the MDCT<br />

power spectrum using the masking model discussed above. One<br />

further modification to that model is to relax the threshold in [5]<br />

by flattening the slopes of each component’s spreading function<br />

on both sides; therefore, the global threshold is raised and the<br />

embedding capacity is increased. As a result, we come up with<br />

the following two settings with different aggressiveness,<br />

• Perceptible but not annoying embedding artifacts with<br />

SDR≈17.0 dB, and;<br />

• Imperceptible embedding artifacts with SDR≈22.5 dB.<br />

Each to-be-embedded bipolar bit is spread by a<br />

pseudo-random spreading sequence of length N, which is<br />

determined by the required embedding rate, e.g., with a higher<br />

rate required, we need to embed more bits into a frame; therefore,<br />

a smaller N is adopted so that more N-sample spread sequences<br />

can be fitted into the frame. The resultant spread sequences are<br />

embedded into MDCT coefficients between 0.3 and 3.3 kHz according to (1). For each of the 18 critical bands, β* and α* are computed using (11) and (12), respectively, where P is the masking threshold in that critical band. The inverse MDCT and the µ-law compression then produce a µ-law signal that carries the embedded information but is also impaired by the µ-law quantization. The extraction is simply based on the polarity of (4), as discussed earlier.<br />

IV. SIMULATIONS<br />

The measured relationship between the BER and the<br />

embedding rate can characterize the embedding capability and<br />

the robustness of the scheme. All information sequences are at<br />

least 200 bits long and the results are averaged over 10 runs, so the BERs are actually computed from at least 2000 bits to ensure high accuracy. The telephony channel is simulated by AWGN<br />

with SNR=35 dB and 0.3~3.3 kHz band-pass filtering.<br />

Simulation results are shown in Fig. 2 and Fig. 3, for<br />

SDR≈17.0 dB and SDR≈22.5 dB, respectively. It can be seen<br />

that the optimization of β does improve the performance of both the conventional and the modified schemes. With slightly<br />

perceptible embedding artifacts, i.e., the case in Fig. 3, the<br />

proposal, with an optimal β , can achieve 100 bps with a BER<br />

less than 7%.<br />

Figure 2. Rate-Distortion at SDR=17.0 dB<br />

Figure 3. Rate-Distortion at SDR=22.5 dB<br />




V. CONCLUSIONS<br />

In this research, we explored the possibility of using spread<br />

spectrum techniques for high capacity data embedding in µ-law<br />

speech signals. Our proposal can achieve about 7% and 10%<br />

BERs at 100 and 1000 bps, respectively.<br />

We would like to make two observations here. First, the rate-distortion curves are quite flat, especially in the low embedding power case of Fig. 3: the BER decreases by less than 5% even for a large rate decrease from 1000 bps to 100 bps, as shown in both Fig. 2 and Fig. 3. In other words,<br />

increasing the spreading length N does not improve the BER<br />

significantly. Second, it is understood that the large quantization<br />

noise caused by the µ-law compression plays a major role in<br />

limiting the performance. Thus, it can be a future research topic<br />

to quantitatively study this signal-dependent noise and to find<br />

ways to compensate for its adverse impact in data embedding.<br />

ACKNOWLEDGMENT<br />

L. Zhang thanks the Institute for Microstructural Sciences, National <strong>Research</strong> Council of Canada, for its generous support while he carried out this research as a visiting researcher at the Acoustics & <strong>Signal</strong> Processing <strong>Group</strong>. He would also like to thank the Electrical and Computer Engineering Department of <strong>Ryerson</strong> <strong>University</strong> for the continuous support during his master's program.<br />

Figure 4. Block diagram of speech embedding (the µ-law speech is expanded to linear PCM, decomposed by the MDCT, analyzed by the masking model to obtain the masking threshold, embedded with the extra information, reconstructed by the inverse MDCT, and compressed back to µ-law; after channel noise, the receiver expands, decomposes, and extracts the estimated information using the spreading sequence)<br />

REFERENCES<br />

[1] H. Ding, “Backward compatible wideband voice over narrowband low-resolution media,” Acoustics <strong>Research</strong> Letters Online (http://scitation.aip.org/ARLO), vol. 6, no. 1, pp. 41-47, January 2005.<br />

[2] D. Kirovski and H. S. Malvar, “Spread spectrum watermarking of audio signals,” IEEE Transactions on <strong>Signal</strong> Processing, vol. 51, no. 4, pp. 1020-1033, April 2003.<br />

[3] J. Eggers, R. Bäuml, R. Tzschoppe and B. Girod, “Scalar Costa scheme for information embedding,” IEEE Transactions on <strong>Signal</strong> Processing, vol. 51, no. 4, pp. 1003-1019, April 2003.<br />

[4] L. Zhang, “Perceptual data embedding in audio and speech signals,” Master's thesis, <strong>Ryerson</strong> <strong>University</strong>, Toronto, September 2004.<br />

[5] T. Painter and A. Spanias, “Perceptual coding of digital audio,” Proceedings of the IEEE, vol. 88, no. 4, pp. 451-515, April 2000.<br />

[6] H. V. Azghandi and P. Kabal, “Improving perceptual coding of narrowband audio signals at low rates,” IEEE International Conference on Acoustics, Speech, and <strong>Signal</strong> Processing, vol. 2, pp. 913-916, March 1999.<br />




Proceedings of the 2005 IEEE<br />

Engineering in Medicine and Biology 27th Annual Conference<br />

Shanghai, China, September 1-4, 2005<br />

COMPARISON OF JPEG 2000 AND OTHER LOSSLESS COMPRESSION SCHEMES FOR<br />

DIGITAL MAMMOGRAMS<br />

April Khademi and Sridhar Krishnan<br />

Department of Electrical and Computer Engineering<br />

<strong>Ryerson</strong> <strong>University</strong>, Toronto, ON M5B 2K3 Canada<br />

E-mail: akhademi@ieee.org, krishnan@ee.ryerson.ca<br />

Abstract<br />

In this study, we propose JPEG 2000 as an algorithm for<br />

the compression of digital mammograms and the proposed<br />

work is the first real-time implementation of JPEG 2000 on<br />

a mammogram image database. Only the lossless compression<br />

mode of JPEG 2000 was examined to ensure that the<br />

mammogram is delivered without distortion. The performance<br />

of JPEG 2000 was compared against several other<br />

lossless coders: JPEG-LS, lossless-JPEG, adaptive Huffman,<br />

arithmetic with a zero order and a first order probability<br />

model and Lempel-Ziv Welch (LZW) with 12 and 15<br />

bit dictionaries. Each compressor was supplied the identical<br />

set of 50 mammograms, each having a resolution of<br />

8bits/pixel and dimensions of 1024×1024. Experimental<br />

results indicate JPEG 2000 and JPEG-LS provide comparable<br />

compression performance since their compression ratios<br />

differed by 0.72% and both compressors also superseded the<br />

results of the other coders. Although JPEG 2000 suffered<br />

from a slightly longer encoding and decoding delay than<br />

JPEG-LS (0.8s on average), it is still preferred for mammogram<br />

images due to the wide variety of features that aid<br />

in reliable image transmission, provide an efficient mechanism<br />

for remote access to digital libraries and contribute to<br />

fast database access.<br />

Keywords: JPEG 2000, mammogram image compression,<br />

lossless compression, medical images<br />

1. INTRODUCTION<br />

A particular technology which has proved to be a vital diagnostic<br />

tool for doctors and other healthcare workers is<br />

mammography, which provides x-ray images of the breast.<br />

Mammogram images allow the trained interpreter to detect<br />

any abnormal growths or changes within the breast tissue,<br />

which could be an indication of breast cancer [1]. Since<br />

early detection of breast cancer is the leading way to reduce<br />

mortality rates, it is imperative that the diagnosing professional<br />

has efficient means of accessing and viewing a patient’s<br />

mammogram [2].<br />

0-7803-8740-6/05/$20.00 ©2005 IEEE.<br />


By digitizing mammograms and applying a series of signal<br />

processing techniques to them, it is possible to utilize<br />

technological devices and methods to make the necessary<br />

diagnostic tools more readily available to healthcare workers,<br />

potentially speeding up the diagnosis.<br />

Since digital mammograms are used for diagnosis, high<br />

resolution images are required to ensure that even the smallest<br />

irregularities are represented. As a consequence, mammogram<br />

sizes are large, requiring significant bandwidth for transmission and substantial memory for storage. To accommodate this large file size, it is imperative to identify and make use of an optimal source encoding scheme dedicated to medical images.<br />
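As an illustration of how such a lossless comparison is typically measured (with Python stdlib coders standing in for the actual JPEG 2000/JPEG-LS implementations, and a small synthetic image standing in for a 1024×1024 mammogram):<br />

```python
import lzma
import zlib
import numpy as np

def compression_ratio(raw: bytes, compressed: bytes) -> float:
    """Ratio of original size to compressed size (higher is better)."""
    return len(raw) / len(compressed)

# a smooth synthetic 8-bit "image" stands in for a mammogram here
side = 256
yy, xx = np.mgrid[0:side, 0:side]
img = ((np.sin(xx / 20.0) + np.cos(yy / 30.0) + 2.0) * 60.0).astype(np.uint8).tobytes()

for name, packed in [("zlib", zlib.compress(img, 9)), ("lzma", lzma.compress(img))]:
    print(name, round(compression_ratio(img, packed), 2))
```

Decompressing and comparing byte-for-byte against the original would confirm the lossless property that matters for diagnostic use.<br />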

Primarily, this paper investigates JPEG 2000, the latest<br />

data compression technology, and applies it to mammogram<br />

images to provide lossless compression in a novel way.<br />

2. JPEG 2000<br />

This paper investigates the compression performance of<br />

JPEG 2000 on mammographic images and its rich feature<br />

set for a medical imaging application.<br />

Only JPEG 2000’s lossless compression mode was used<br />

since the application of interest is pertinent to mammograms<br />

that are to be used for diagnosis. For lossless compression<br />

of grayscale mammograms, JPEG 2000’s encoder and decoder<br />

are shown in Fig.1.<br />

A) Tiling: Tiling is performed to significantly reduce the<br />

computational overhead and memory requirements of some<br />

of the more demanding components within the JPEG 2000<br />

codec, since future processing is performed on the smaller<br />

tile components. This allows maximum interchange between<br />

devices with limited memory resources, like a Personal<br />

Digital Assistant (PDA), giving healthcare workers<br />

more versatility to manage, transmit and receive mammograms<br />

with little effort. Furthermore, each tile component<br />

can be extracted and reconstructed independently, permitting<br />

random access to portions of the bitstream. This is<br />

useful to doctors if a specific region within a mammogram<br />



GAUSSIAN MIXTURE MODELING USING SHORT TIME FOURIER TRANSFORM<br />

FEATURES FOR AUDIO FINGERPRINTING<br />

ABSTRACT<br />

In audio fingerprinting, an audio clip must be recognized by<br />

matching an extracted fingerprint to a database of previously<br />

computed fingerprints. The fingerprints should reduce the<br />

dimensionality of the input significantly, provide discrimination<br />

among different audio clips and, at the same time, be invariant to distorted versions of the same audio clip. In<br />

this paper, we design fingerprints addressing the above issues<br />

by modeling an audio clip by Gaussian mixture models<br />

(GMM) using a wide range of easy-to-compute short time<br />

Fourier transform features such as Shannon entropy, Renyi<br />

entropy, spectral centroid, spectral bandwidth, spectral flatness<br />

measure, spectral crest factor, and Mel-frequency cepstral<br />

coefficients. We test the robustness of the fingerprints<br />

under a large number of distortions. To make the system robust,<br />

we use some of the distorted versions of the audio for<br />

training. However, we show that the audio fingerprints modeled<br />

using GMM are not only robust to the distortions used<br />

in training but also to distortions not used in training. Using<br />

spectral centroid as feature, we obtain the highest identification<br />

rate of 99.1% with a false positive rate of 10⁻⁴.<br />

1. INTRODUCTION<br />

An audio fingerprint is a compact representation of the perceptually relevant portion of the audio content. An audio fingerprint<br />

should be able to identify audio files even if they<br />

are severely distorted by perceptual coding or common signal<br />

processing operations. The type of distortions a fingerprint<br />

should withstand depends on the application. For example,<br />

audio fingerprints designed for broadcast monitoring<br />

should withstand distortions such as time compression,<br />

dynamic range compression, and equalization. An audio fingerprinting system has two principal components: fingerprint extraction and a matching algorithm. The fingerprint requirements include computational simplicity, robustness to distortions, small size, and discrimination power over a large number of other fingerprints [1]. The matching algorithm should be efficient enough to identify an audio item from a database of hundreds of thousands of songs in a few seconds. A large number of fingerprinting schemes have been proposed; for some recent work, please see [2]-[5].<br />

Arunan Ramalingam and Sridhar Krishnan<br />

Department of Electrical and Computer Engineering<br />

<strong>Ryerson</strong> <strong>University</strong>, Toronto, ON, Canada, M5B 2K3<br />

E-mail: (aramalin)(krishnan)@ee.ryerson.ca<br />

We would like to acknowledge Micronet for their financial support.<br />

0-7803-9332-5/05/$20.00 ©2005 IEEE<br />

The overview of the proposed fingerprinting scheme is<br />

shown in Fig. 1. First the incoming audio clip is preprocessed<br />

and features are extracted from it. Then, using<br />

these features, the audio clip is modeled using Gaussian<br />

mixtures. In the training phase, the mixture models of all the<br />

audio clips are stored in the database along with the metadata<br />

information. In the identification phase, the features<br />

from an unknown audio clip are used to evaluate the likelihood<br />

of all the models in the database. Then the model<br />

that is most likely to generate the features is identified as<br />

the correct audio clip.<br />

Fig. 1. Proposed Fingerprinting System (the audio input is preprocessed and framed; features are extracted and modeled by GMM into the fingerprint database in the training phase, or scored by likelihood estimation against the database to produce the identification result in the identification phase)<br />

2. FEATURE EXTRACTION<br />

In this work, we use the following features extracted from<br />

the short time Fourier transform (STFT) of the signal for<br />

fingerprint extraction. Let F_i = f_i(u), u ∈ (0, M), be the Fourier transform of the i-th frame, where M is the index of the highest frequency band. To increase the robustness<br />

of the fingerprint, the features are not extracted on<br />



the whole spectrum but on non-overlapping logarithmically<br />

spaced bands. Let F_{i,b} = f_i(u_b), u_b ∈ (l_b, u_b), where l_b and u_b are the lower and upper edges of band b. In each frame, the following features are extracted. These features<br />

have been used successfully in audio fingerprinting [6]<br />

and music classification [7].<br />

1. Spectral Centroid (SC): The spectral centroid is the<br />

center of gravity of the magnitude spectrum of the<br />

STFT and is a measure of spectral shape and “brightness”<br />

of the spectrum. Spectral centroid is defined as<br />

SC_{i,b} = Σ_{u=l_b}^{u_b} u·|f_i(u)|² / Σ_{u=l_b}^{u_b} |f_i(u)|².  (1)<br />

2. Spectral Bandwidth (SB): The spectral bandwidth is<br />

measured as the weighted average of the squared distances between the spectral components and the spectral centroid.<br />

Spectral bandwidth is defined as<br />

SB_{i,b} = Σ_{u=l_b}^{u_b} (u − SC_{i,b})²·|f_i(u)|² / Σ_{u=l_b}^{u_b} |f_i(u)|².  (2)<br />

3. Spectral Band Energy (SBE): The spectral band energy<br />

is the energy in the frequency bands normalized<br />

by the energy in the whole spectrum. Spectral band<br />

energy is calculated as<br />

SBE_{i,b} = Σ_{u=l_b}^{u_b} |f_i(u)|² / Σ_{u=0}^{M} |f_i(u)|².  (3)<br />

4. Spectral Flatness Measure (SFM): The spectral flatness<br />

measure quantifies the flatness of the spectrum<br />

and distinguishes between noise-like and tone-like signal.<br />

Spectral flatness measure is defined as<br />

SFM_{i,b} = ( Π_{u=l_b}^{u_b} |f_i(u)|² )^{1/(u_b−l_b+1)} / ( (1/(u_b−l_b+1)) Σ_{u=l_b}^{u_b} |f_i(u)|² ).  (4)<br />

5. Spectral Crest Factor (SCF): The spectral crest factor<br />

is also a measure of the tonality of the signal. Spectral<br />

crest factor is defined as<br />

SCF_{i,b} = max_u |f_i(u)|² / ( (1/(u_b−l_b+1)) Σ_{u=l_b}^{u_b} |f_i(u)|² ).  (5)<br />

6. Shannon Entropy (SE): The Shannon entropy of a signal<br />

is a measure of its spectral distribution.<br />

Shannon entropy is defined as<br />

SE_{i,b} = Σ_{u=l_b}^{u_b} |f_i(u)| log₂ |f_i(u)|.  (6)<br />

7. Renyi Entropy (RE): The Renyi entropy of a signal is<br />

also a measure of its spectral distribution. Renyi entropy<br />

is defined as<br />

RE_{i,b} = (1/(1 − r)) log( Σ_{u=l_b}^{u_b} |f_i(u)|^r ).  (7)<br />

We used Rényi entropy of order r = 2.<br />

8. Mel-frequency Cepstral Coefficients (MFCC): MFCC<br />

are perceptually motivated features based on the STFT.<br />

After taking the log-amplitude of the magnitude spectrum,<br />

the FFT bins are grouped and smoothed according<br />

to the perceptually motivated Mel-frequency scaling.<br />

Finally, in order to decorrelate the resulting feature<br />

vectors a discrete cosine transform is performed.<br />

In this work, we used 13 coefficients since this parameterization<br />

has been shown to be quite effective for<br />

speech recognition and speaker identification [8].<br />
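Features (1)-(5) can be computed per band directly from their definitions. A sketch (the band edges `lb`, `ub` below are illustrative, not the paper's logarithmic band layout):<br />

```python
import numpy as np

def band_features(f, lb, ub):
    """Per-band spectral features from one frame's magnitude spectrum f,
    following Eqs. (1)-(5); the band covers bins lb..ub inclusive."""
    u = np.arange(lb, ub + 1)
    p = np.abs(f[lb:ub + 1]) ** 2                  # band power spectrum |f_i(u)|^2
    sc  = np.sum(u * p) / np.sum(p)                # spectral centroid, Eq. (1)
    sb  = np.sum((u - sc) ** 2 * p) / np.sum(p)    # spectral bandwidth, Eq. (2)
    sbe = np.sum(p) / np.sum(np.abs(f) ** 2)       # spectral band energy, Eq. (3)
    sfm = np.exp(np.mean(np.log(p))) / np.mean(p)  # geometric/arithmetic mean, Eq. (4)
    scf = np.max(p) / np.mean(p)                   # spectral crest factor, Eq. (5)
    return sc, sb, sbe, sfm, scf

# sanity check on a flat spectrum: the centroid sits mid-band and both
# flatness and crest factor equal 1, as expected for noise-like content
f = np.ones(32)
sc, sb, sbe, sfm, scf = band_features(f, 8, 15)
print(sc, sfm, scf)  # 11.5 1.0 1.0
```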

Let X_i be the set of features extracted for frame i; X_i can be any one of the features described above. In order to better characterize the temporal variations of the signal, the first derivatives of the features,<br />

δ_i = X_i − X_{i−1},  (8)<br />

are also included in the feature matrix. In an audio clip, successive frames are related in time. To include this time dependency, a time vector is added to the feature matrix; it is taken as an incremental counter from 0 to 1. Thus the feature matrix of the entire audio clip can be described as<br />

F′_M = [ X_1, δ_1, t_1 ; X_2, δ_2, t_2 ; … ; X_N, δ_N, t_N ],  (9)<br />

where N is the number of frames in the audio clip. Finally, the feature matrix is mean-subtracted and component-wise variance-normalized to obtain the normalized feature matrix F_M.<br />
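Assembling the feature matrix of (8)-(9) with the normalization step can be sketched as follows. Treating δ_1 = 0 is our assumption, since the boundary case is not specified in the text:<br />

```python
import numpy as np

def feature_matrix(X):
    """Clip-level feature matrix of Eq. (9): per-frame features X, their
    first differences (Eq. (8)), and a 0..1 time counter, followed by
    mean subtraction and per-column variance normalization."""
    N = len(X)
    delta = np.diff(X, axis=0, prepend=X[:1])   # delta_1 = 0 by this convention
    t = np.linspace(0.0, 1.0, N)[:, None]
    F = np.hstack([X, delta, t])
    return (F - F.mean(0)) / (F.std(0) + 1e-12)

X = np.arange(12, dtype=float).reshape(6, 2)    # 6 frames, 2 features
F = feature_matrix(X)
print(F.shape, bool(np.allclose(F.mean(0), 0)))  # (6, 5) True
```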

Authorized licensed use limited to: <strong>Ryerson</strong> <strong>University</strong> Library. Downloaded on July 7, 2009 at 11:22 from IEEE Xplore. Restrictions apply.<br />



3. GAUSSIAN MIXTURE MODELS<br />

Gaussian Mixture Models (GMM) have been successfully<br />

used in audio classification [7] and content based retrieval<br />

[9]. In this work, the technique is used to model an audio<br />

fingerprint as a probability density function (PDF), using a<br />

weighted combination of Gaussian component PDFs (mixtures).<br />

During the training phase, the GMM parameters of<br />

an audio fingerprint are estimated to maximize the probability<br />

of the audio frames present in the audio fingerprint.<br />

We use the Baum-Welch (Expectation-Maximization) algorithm<br />

to estimate the GMM parameters with initialization by<br />

k-means clustering. As the feature vectors in this work<br />
have reasonably uncorrelated components, computationally<br />
convenient diagonal covariance matrices can be used. We<br />
used GMMs with 16 mixtures. Thus, in the fingerprint extraction<br />
phase, each audio clip is modeled by a GMM. During the<br />
matching phase, the fingerprint from an unknown recording<br />
is compared with the database of pre-computed GMMs, and<br />
the GMM that gives the highest likelihood for the fingerprint<br />
is identified as the correct match.<br />
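The matching phase can be sketched as follows (a minimal diagonal-covariance GMM scorer; the parameter layout and names are illustrative, and in practice the model parameters would be fit with the Baum-Welch/EM procedure described above):<br />

```python
import numpy as np

def gmm_loglik(frames, weights, means, variances):
    """Total log-likelihood of feature frames under a diagonal-covariance GMM.

    frames:    (N, d) feature matrix of the unknown clip
    weights:   (K,) mixture weights summing to 1
    means:     (K, d) component means
    variances: (K, d) per-dimension variances
    """
    frames = np.asarray(frames, dtype=float)
    ll = np.zeros((frames.shape[0], len(weights)))
    for k, (w, m, v) in enumerate(zip(weights, means, variances)):
        # log N(x; m, diag(v)) + log w, per frame
        ll[:, k] = (np.log(w)
                    - 0.5 * np.sum(np.log(2.0 * np.pi * v))
                    - 0.5 * np.sum((frames - m) ** 2 / v, axis=1))
    # log-sum-exp over components, summed over frames
    mx = ll.max(axis=1, keepdims=True)
    return float(np.sum(mx[:, 0] + np.log(np.sum(np.exp(ll - mx), axis=1))))

def best_match(frames, models):
    """Return the id of the pre-computed model with the highest log-likelihood."""
    return max(models, key=lambda mid: gmm_loglik(frames, *models[mid]))
```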

4. RESULTS<br />

We used a database containing 250 five-second audio clips<br />

chosen from the categories of rock, pop, country, classical,<br />

and jazz. The audio clips are chosen from random portions<br />

of songs from Compact Discs.<br />

4.1. Robustness to Distortions<br />

We used several distorted versions of the audio clips to test<br />

the robustness of the proposed scheme. We used the following<br />

distorted versions in our tests.<br />

I. Compression – 1) MP3 at 32 kbps, 2) AAC at 32<br />

kbps, 3) WMA at 32 kbps, 4) Real encoding at 32<br />

kbps.<br />

II. Amplitude distortion – 1) 3:1 Compression above<br />
30 dB, 2) 3:1 Expander below 10 dB, 3) 3:1 Compression<br />
below 10 dB, 4) Limiter at 9 dB, 5) ‘Superloud’<br />

amplitude distortion, 6) Noise gate at 20 dB, 7)<br />

De-esser, 8) Nonlinear amplitude distortion.<br />

III. Frequency distortion – 1) Nonlinear bass distortion,<br />

2) Midrange frequency boost, 3) Notch Filter, 750 -<br />

1800 Hz, 4) Notch Filter 430 - 3400 Hz, 5) Telephone<br />

bandpass, 135 - 3700 Hz, 6) Bass cut, 7) Bass boost.<br />

IV. Change in pitch – 1) Lower pitch 2 - 6 %, 2) Raise<br />

pitch 2 - 6 %.<br />

V. Change in speed – 1) Linear speed increase 2 - 6%,<br />

2) Linear speed decrease 2 - 6%.<br />

VI. Resampling at 8 kHz<br />

VII. Echo addition<br />

To increase the robustness of the fingerprints, in addition<br />

to the original audio, some distorted versions of the<br />

audio are also used in training. We used the following distorted<br />

versions in our training: 1) Undistorted audio, 2) 3:1<br />
Compression above 30 dB, 3) Nonlinear amplitude distortion,<br />

4) Nonlinear bass distortion, 5) Midrange frequency<br />

boost, 6) Notch Filter, 750 - 1800 Hz, 7) Notch Filter 430<br />

- 3400 Hz, 8) Raise Pitch 1%, 9) Lower Pitch 1%. The<br />

log-likelihood of the test clips are evaluated for all the models<br />

in the database. Then the model that gives the highest<br />

log-likelihood is taken as the correct match. Table 1 shows<br />

the percentage of clips that are correctly identified for different<br />

features for distortions used in training as well as for<br />

distortions not used in training. The results show that it is<br />

not necessary to train the model for all possible distortions.<br />

By training the model to some representative distortions, we<br />

can obtain robustness to a wide variety of distortions.<br />

Table 1. Mean recognition rate (%) for distortions<br />

Feature Train Test Mean<br />

MFCC 99.0 98.5 98.7<br />

Spectral centroid 99.4 99.1 99.2<br />

Spectral bandwidth 99.4 98.9 99.1<br />

Spectral band energy 98.8 98.8 98.8<br />

Spectral flatness measure 99.4 98.6 98.9<br />

Spectral crest factor 99.2 98.6 98.8<br />

Shannon Entropy 99.4 98.8 99.0<br />

Renyi Entropy 99.4 98.9 99.0<br />

4.2. False Positive <strong>Analysis</strong><br />

In the previous section it was assumed that the test clip is<br />

present in the database. Hence the model that gives the<br />

highest log-likelihood value is identified as the correct match.<br />

However, it is possible that the test clip may not be in the<br />
database, so there should be a criterion to reject the audio<br />
clips that are not in the database. A suitable threshold<br />

for log-likelihood can be used to vary the false positive and<br />

false negative rates. The false positive and the corresponding<br />

identification rate are shown in Figs. 2 and 3. The percentage<br />

of audio clips correctly identified at different false<br />

positive rates are shown in Table 2. Among the different<br />

features used, spectral centroid gives the highest identification<br />

rate of 99.1% with a false positive rate of 10^-4. MFCC<br />
performs poorly, with an identification rate of 13%. All the<br />
features except the spectral flatness measure give an identification<br />
rate of more than 90% with a false positive rate of 10^-3.<br />
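The rejection rule described above can be sketched as (the threshold value and names are illustrative):<br />

```python
def identify(loglik_per_model, threshold):
    """Return the best-matching model id, or None if even the best
    log-likelihood falls below the rejection threshold (clip not in DB)."""
    best_id = max(loglik_per_model, key=loglik_per_model.get)
    if loglik_per_model[best_id] < threshold:
        return None  # reject: raising the threshold lowers the false positive rate
    return best_id
```

Sweeping the threshold trades false positives against false negatives, which is how curves such as those in Figs. 2 and 3 are generated.<br />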




Fig. 2. Identification rates at different false positive rates for<br />

MFCC, Spectral centroid, Spectral bandwidth, and Spectral<br />

band energy<br />

Table 2. Identification Rate at different false positive rates<br />

Feature 10^-4 10^-3 10^-2<br />

MFCC 13.5 98.4 99.3<br />

Spectral centroid 99.1 99.5 99.8<br />

Spectral bandwidth 93.2 98.4 99.3<br />

Spectral band energy 69.2 94.3 99.2<br />

Spectral flatness measure 31.8 56.4 96.6<br />

Spectral crest factor 93.0 98.4 99.3<br />

Shannon Entropy 71.1 93.9 99.4<br />

Renyi Entropy 64.0 99.3 99.7<br />

5. CONCLUSION<br />

Gaussian Mixture Models have been successfully used in<br />

many classification and identification problems in audio. In<br />

this work, we modeled audio recordings for audio fingerprinting<br />

by Gaussian mixtures using features extracted from<br />

the STFT of the signal. Even though we use some distorted<br />

samples of the audio during training, the system is robust to<br />

distortions not used in training. Using the spectral centroid as<br />
the feature, we obtain the highest identification rate of 99.1%<br />
with a false positive rate of 10^-4.<br />

6. REFERENCES<br />

[1] P. Cano, E. Batle, T. Kalker, and J. Haitsma, “A review<br />

of algorithms for audio fingerprinting,” in IEEE<br />

Workshop on Multimedia Signal Processing, December<br />
2002, pp. 169–173.<br />

[2] J. Herre, O. Hellmuth, and M. Cremer, “Scalable robust<br />
audio fingerprinting using MPEG-7 content description,”<br />
in IEEE Workshop on Multimedia Signal<br />
Processing, December 2002, pp. 165–168.<br />

Fig. 3. Identification rates at different false positive rates<br />
for Spectral flatness measure, Spectral crest factor, Shannon<br />
Entropy and Renyi Entropy<br />

[3] J. Haitsma and T. Kalker, “A highly robust audio fingerprinting<br />

system,” in Proc. of the 3rd Int. Symposium on<br />

Music Information Retrieval, October 2002, pp. 144–<br />

148.<br />

[4] V. Venkatachalam, L. Cazzanti, N. Dhillon, and<br />

M. Wells, “Automatic identification of sound recordings,”<br />

IEEE Signal Processing Magazine, vol. 21, no.<br />
2, pp. 92–99, March 2004.<br />

[5] C.J.C. Burges, J.C. Platt, and S. Jana, “Distortion<br />

discriminant analysis for audio fingerprinting,” IEEE<br />

Transactions on Speech and Audio Processing, vol. 11,<br />

no. 3, pp. 165–174, May 2003.<br />

[6] E. Allamanche, B. Fröba, J. Herre, T. Kastner,<br />
O. Hellmuth, and M. Cremer, “Content-based identification<br />
of audio material using MPEG-7 low level description,”<br />

in Proceeding of the International Symposium<br />

on Music Information Retrieval (ISMIR), Indiana,<br />

USA, October 2002.<br />

[7] G. Tzanetakis and P. Cook, “Musical genre classification<br />

of audio signals,” IEEE Trans. on Speech and Audio<br />
Processing, vol. 10, no. 5, pp. 293–302, July 2002.<br />

[8] L. R. Rabiner and B. H. Juang, Fundamentals of Speech<br />

Recognition, Prentice-Hall, Englewood Cliffs, NJ,<br />

1993.<br />

[9] D. Pye, “Content-based methods for the management of<br />

digital music,” in Proceedings of ICASSP, 2000, vol. 4,<br />

pp. 24–27.<br />



MULTIPATH MITIGATION OF GNSS CARRIER PHASE SIGNALS<br />

FOR AN ON-BOARD UNIT FOR MOBILITY PRICING<br />

Ronesh Puri, Ahmed El Kaffas, Alagan Anpalagan, Sridhar Krishnan<br />

Department of Electrical and Computer Engineering, <strong>Ryerson</strong> <strong>University</strong>, Toronto, ON, M5B 2K3<br />

rpuri | aelkaffa | alagan | krishnan @ee.ryerson.ca<br />

Bern Grush<br />

Applied Location Corporation, 34 Dodge Rd, Toronto, ON, M1N 2A7. bgrush@appliedlocation.com<br />

Abstract<br />

Inexpensive navigation-grade receivers are insufficiently<br />

accurate for the task of building a Global Navigation Satellite<br />

System [GNSS]-based parking meter for urban multipath<br />

conditions. Survey-grade instruments that demonstrate cm<br />

accuracy are inappropriate, and are two orders of magnitude<br />

too expensive, for this mass application. We identify three ways<br />

in which a digital signal processor added to a stationary,<br />

navigation-grade receiver can add considerable accuracy (in<br />

the range of 1-2 m, down from 5-10 m) for a parking meter.<br />

First, we apply a pseudo-multipath-based filter and a modified<br />

Receiver Autonomous Integrity Monitoring [RAIM]-derivative<br />

filter to the received carrier phase signals, allowing us to infer<br />

which signals are most affected by noise processes and to<br />

compute receiver position with the remaining signals for<br />

greater accuracy. Second, we take advantage of receiver<br />

stationarity to dwell on these signals for several minutes,<br />

allowing us to acquire a signal characterization metric that is<br />

more stable than might be possible with a non-stationary<br />

receiver. This is intended for non-repudiation. As a third step<br />

we will later experiment with ways to monitor the multipath<br />

behaviour of individual signals on approach to a parking event<br />

in a way that may allow us to more effectively weigh our initial<br />

signal selection criteria. Independent of these three<br />

opportunities, we also take advantage of dual GPS/Galileo<br />

receivers, a capability that we simulate in this experiment.<br />

Testing of the multipath mitigation filters described in this<br />

paper on two simulated GPS/Galileo datasets yielded<br />

reductions in the standard deviation of the position estimate<br />

that ranged from −4% to 61.6% (avg: 34.4%) when compared<br />

to the control (unfiltered) position calculation.<br />

Keywords: GPS; Galileo; GNSS; Multipath Mitigation; RAIM;<br />

Urban Canyon; Parking; Parklog; Road-Pricing.<br />

1. INTRODUCTION<br />

A number of countries seek solutions for reliable and cost<br />

effective metering for zone-based road pricing. For economic<br />

and other reasons, GNSS signals are the prime target for this<br />

solution [1-4]. An alternative to the commonly expected<br />

“tracklog” is the use of a “parklog”: a log of parking events<br />

with a minimal amount of data describing the intervening trip<br />

0-7803-8886-0/05/$20.00 ©2005 IEEE<br />

CCECE/CCGEI, Saskatoon, May 2005<br />


segments. The parklog is less data intensive, more accurate<br />
(i.e., more non-repudiable), and is a good proxy to a full<br />
tracklog in zone-pricing applications. In addition, whenever the<br />
accuracy of the endpoints of the trips (the parking events) is<br />

sufficient, this same meter could be used as a parking meter for<br />

that parking event, yielding a way to meter for any<br />

combination of road use, parking use and pay-as-you-drive<br />

insurance in a single system. The principal advantage of a<br />

three-in-one meter is the distribution of infrastructure costs<br />

over three sectors (road, parking and insurance) making it<br />

possible for a road-pricing meter to “pay for itself” in parking<br />

and insurance discounts from the motorist’s perspective [5].<br />

To enable a highly effective device, we have set a design<br />

goal of 1.5m-2m accuracy, 99% of the time in 75% of the<br />

parking lots in a city with the building density of Toronto,<br />

Canada. Even when a parking location cannot be known<br />

accurately enough to assess a fee, both road-pricing and<br />

insurance-pricing can still proceed. This gives the “parklog”<br />

the flavor of a disruptive technology, disrupting both dedicated<br />
short-range communication [DSRC] and the tracklog for road pricing.<br />

2. METHODS<br />

2.1 Multipath Mitigation<br />

Among the noise sources contributing to positioning error,<br />

multipath is the most difficult to characterize in a way that<br />

allows unambiguous mitigation. When other error sources are<br />

controlled, multipath can become the largest remaining<br />

contributor to unmodeled noise/interference. The causes and<br />

properties of this process are well described elsewhere [6-8].<br />

Of the four classes of mitigation techniques: antenna<br />

positioning, hardware compensation (antenna design), software<br />

mitigation and static antenna arrays with signal correlators [6],<br />

software mitigation is the only feasible approach for an onboard<br />

parking application. Antenna siting will seldom be<br />

optimal. Increased hardware size, complexity and expense are<br />

aesthetically, operationally and economically unacceptable,<br />

since the antenna for the target meter must be mounted in or on<br />

many millions of private vehicles.<br />

2.2 Simulating Galileo<br />

Collecting GNSS signals in densely built-up areas (“urban<br />

canyon”) often results in a diminished number of usable<br />



signals. On some occasions, when using only a single system<br />

such as GPS, there may be too few to calculate a horizontal<br />

position (a minimum of four signals is best). This accounts for<br />

the frequent loss of position lock requiring ancillary aids such<br />

as inertial navigation. In a parking application, we must rely on<br />

GNSS signals only, so that our process would frequently fail to<br />

fulfill our stated design goal without a redundant system,<br />

which in our case is the European Union’s Galileo, expected to<br />

be operational in 2008.<br />

Dual GPS/Galileo receivers are expected to improve<br />

position availability and accuracy considerably. As recently<br />

reported by Feng [9], dual receivers “will increase service<br />

coverage from 55% to 95% notably in the urban areas where<br />

most mass-market applications are developed.” The following<br />

table details the expected improvement.<br />

Scenario | Availability of 20 m 95% 2D accuracy, 28 GPS only (%) | 28 GPS + 27 Gal (%) | Accuracy and availability (satellites only), 28 GPS only (m/%) | 28 GPS + 27 Gal (m/%) | Accuracy/availability, differential, 28 GPS + 27 Gal (m/%)<br />
Open Sky | 90% | 100% | 7/95 | 4/95 | 1.5/95<br />
Suburban | 70% | 100% | 32/90 | 8/95 | 4/95<br />
Low-rise | 30% | 90% | 17/50 | 14/95 | 7/95<br />
High-rise | 15% | 80% | – | 42/90 | 25/90<br />

Table 1: Performance improvements resulting from both GPS<br />
and Galileo constellations for urban operations (adapted<br />
from [9])<br />

In our work we simulate a dual GPS/Galileo receiver by<br />

combining two sets of GPS data collected with a<br />

uBlox/ANTARIS TIM-LP receiver separated by three or more<br />

hours (i.e. not within three hours of an integer multiple of a<br />

sidereal day) so that the two visible satellite sub-constellations<br />

are essentially independent. A data sample from an actual dual<br />

receiver would exhibit at least as good a geometric distribution<br />

as would this “poor-man’s” simulation, hence we argue that<br />

this simulation technique does not unduly favor our approach.<br />
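The separation criterion above can be sketched as a small check (function and parameter names are illustrative; capture times are in seconds):<br />

```python
SIDEREAL_DAY_S = 86164.0905   # mean sidereal day in seconds

def constellations_independent(t1_s, t2_s, min_offset_h=3.0):
    """True when two capture times are far enough from an integer number of
    sidereal days apart that the two visible GPS sub-constellations are
    essentially independent (the GPS ground track repeats each sidereal day)."""
    off = abs(t2_s - t1_s) % SIDEREAL_DAY_S
    return min(off, SIDEREAL_DAY_S - off) >= min_offset_h * 3600.0
```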

Figure 1: An example of two GPS constellations viewed by a<br />
stationary receiver and separated by 3.5 hrs (SW corner, 10:25<br />
and 13:58). See also Figure 3.<br />

2.3 Software Mitigation<br />

The key assumption in software mitigation is that it is<br />

possible to infer, in near realtime, which of the pseudo-range<br />

signals available at a given moment are contributing more error<br />


than others. Extensive work in this area, Receiver Autonomous<br />

Integrity Monitoring (RAIM), focuses on real-time<br />

determination of failure of one of several SVs (space vehicles)<br />

in safety of life applications [10], This work has been extended<br />

to include multiple failures and has led to the development of<br />

related techniques to determine which signals may be more<br />

subject to multipath disturbance in a dynamic, unaided manner.<br />

Our work relies on some of these extensions.<br />

Bisnath and Langley [6] extend earlier methods to compute<br />

an inferred GPS observable they call pseudo-multipath,<br />

incorporating pseudorange multipath, tracking error and<br />

receiver noise, making it a good indicator of unmodeled error<br />

and noise for position estimation, the predominant component<br />

of which is multipath. Related work by Nayak et al. [7]<br />

develops this same measure, which they call code-carrier<br />

residual (r). We use this formulation to weigh each signal in a<br />

data sample to determine whether to include that signal in the<br />

position calculation.<br />

Misra and Bednarz [10] extend RAIM techniques to deal<br />

with multiple SV failures. Their method, referred to, in this<br />

paper, as Misra04, provides for randomly selecting numerous<br />

subsets of 6 or 7 signals from a larger set of available signals,<br />

such as would be available to a dual GPS/Galileo receiver.<br />

Pseudo-random selection is constrained to minimize dilution of<br />

precision (horizontal dilution of precision (HDOP) in our<br />

case), and repeated selection and position calculations are<br />

clustered and outliers are observed to de-weight SVs. We use<br />

this algorithmic approach to exclude noisy signals that were<br />

not filtered out by the code-carrier residual (pseudo-multipath).<br />

These two methods in combination select the least noisy<br />

signals, constrained by a requirement for a constellation subset<br />

yielding good geometry for subsequent position calculation.<br />

Merge two several-minute GPS readings, sufficiently<br />
separated in time, to simulate a dual receiver<br />
↓<br />
Drop signal(s) based on the code-carrier residual filter [6,7]<br />
↓<br />
Drop signal(s) based on the Misra04 (RAIM-derivative) filter [10]<br />
↓<br />
Compute LAT, LON using the remaining signals<br />
↓<br />
Compute the associated characterization<br />

Figure 2: The filtering and position calculation process is set<br />
up as illustrated here and detailed in the following section.<br />

The reader might question the efficacy of this degree of<br />

filtering given that the receivers are stationary. However,<br />

consider that in a complex multipath environment in which<br />

signals are collected for several minutes, the movement of the<br />

SVs, the movement of tree crowns, and passing vehicles might<br />

each affect the relative degree of multipath of each SV from<br />

moment-to-moment as it impinges on the stationary antenna.<br />

We will be exploring this further in subsequent work.<br />



3. THE PROPOSED ALGORITHM<br />

Following from the previous section, we detail each stage<br />

in the process: two filter stages, signal selection and final<br />

position calculation.<br />

Input to this process is carrier-phase data, captured every<br />

second.<br />

3.1 Pseudo-Multipath based filter<br />

For the first stage of our dual-filter method, we compute the<br />

pseudo-multipath observable, r, and its standard deviation, σ_r,<br />
for each visible SV:<br />

r = 2·d_ion − λN + ε(p) + ε(φ),<br />

where:<br />
d_ion  ionospheric delay (m)<br />
λ      wavelength of the L1 carrier (m)<br />
N      the integer cycle ambiguity (cycles)<br />
ε(p)   code noise (receiver noise + multipath) (m)<br />
ε(φ)   carrier phase noise (receiver noise + multipath) (m)<br />

The SVs are ranked in ascending order of the magnitude of<br />

σ_r. Since we suspect signals with higher variance, we simply<br />
discarded the single most suspect signal in this first<br />
experiment.<br />

The full derivation of r is developed in [7], and is described<br />

therein as containing:<br />

“twice the atmospheric error, the carrier phase ambiguity, code receiver<br />

noise, and code multipath. Carrier receiver noise and multipath can be<br />

neglected since they are very small compared to the code values. The<br />

ambiguity term is a constant if there are no cycle slips whereas the<br />

ionospheric error generally varies slowly over time. A piece-wise linear<br />

regression model can therefore be implemented to remove terms due to the<br />

ionosphere and ambiguity. Since the ionospheric error changes with time, a<br />

regression model [could be] implemented over predefined averaging<br />

intervals. … The resulting code-carrier residual, [r], will contain multipath<br />

and receiver noise and can be used for further analysis. Subtracting out the<br />

mean removes not only the integer ambiguity, but also the bias components<br />

present in all of the remaining terms. Code multipath is a nonzero mean<br />

process and this technique only isolates relative multipath effects and not<br />

the absolute multipath because the regression process removes the portion<br />

of multipath with nonzero mean.”<br />

In our application, we are using this as one of two<br />

“advisors” to help us select the signals least disturbed by<br />

multipath. Hence, the fact that this is only a relative indicator<br />

and that it also incorporates minor components of other error<br />

sources does not detract from its value as a way to identify the<br />

noisiest signals.<br />
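This first filter stage can be sketched as follows (a minimal version assuming per-SV time series of the code-carrier residual r are already available; function and variable names are illustrative):<br />

```python
import statistics

def drop_noisiest_sv(residuals):
    """First filter stage: rank SVs by the standard deviation of their
    code-carrier residual series and discard the single most suspect one.

    residuals: dict mapping SV id -> list of per-epoch residual values (m)
    Returns (discarded_sv, surviving_sv_ids).
    """
    sigma = {sv: statistics.pstdev(r) for sv, r in residuals.items()}
    worst = max(sigma, key=sigma.get)   # highest sigma_r = most multipath-suspect
    return worst, [sv for sv in residuals if sv != worst]
```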

3.2 Modified Misra04 (RAIM-derivative) filter<br />

The steps we used in our adaptation of the Misra04<br />

algorithm [10], are as follows:<br />

1. Set K as the number of SVs in view less the one<br />

rejected by the pseudo-multipath filter;<br />

2. Divide the sky into six bins as shown in Figure 3;<br />

3. Characterize each SV (space vehicle) as belonging to<br />

one of the bins, based on its elevation and azimuth;<br />

4. Select 4K subsets of SVs from the original set of SVs<br />

as follows:<br />

select one SV randomly from each of the six bins,<br />
select two more SVs from the remaining SVs, and<br />
if PDOP > 3, select one more from those remaining.<br />

5. Compute, then cluster 4K positions using these<br />

selections.<br />

6. Compute the mean of the cluster of positions;<br />

7. Compute the distance of each computed position from<br />

the mean of the cluster<br />

8. Divide the cluster into 5 concentric rings around the<br />

cluster mean each ring incremented by d=0.2M where<br />

M is the distance of the farthest position from the<br />

cluster mean. Hence the rings are at d, 2d, … 5d from<br />

the cluster mean.<br />

9. For each of the 4K positions, assign a value from 1 to<br />

5 to every contributing SV, based on the concentric<br />

ring that the position falls in.<br />

10. Sum these assigned values from each of the K SVs<br />

11. Discard the signal for the highest ranked SV<br />
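Steps 4 to 11 can be sketched as follows (a simplified version: the subset draw here is uniform rather than bin- and DOP-constrained, and `positions_for` stands in for the per-subset position solver; all names are illustrative):<br />

```python
import math
import random

def misra04_discard(svs, positions_for, n_subsets=None, rings=5):
    """Sketch of steps 4-11: repeatedly fix position from random SV subsets,
    score each SV by how far the fixes it contributed to land from the
    cluster mean, and return the most suspect SV to discard."""
    k = len(svs)
    n_subsets = n_subsets or 4 * k
    subsets, fixes = [], []
    for _ in range(n_subsets):
        subset = random.sample(svs, min(6, k))      # stand-in for bin/DOP-constrained draw
        subsets.append(subset)
        fixes.append(positions_for(subset))         # per-subset (lat, lon) solution
    mean = (sum(p[0] for p in fixes) / len(fixes),
            sum(p[1] for p in fixes) / len(fixes))  # mean of the cluster of positions
    dists = [math.hypot(p[0] - mean[0], p[1] - mean[1]) for p in fixes]
    d = (max(dists) or 1.0) / rings                 # ring width = 0.2 * max distance
    scores = {sv: 0 for sv in svs}
    for subset, dist in zip(subsets, dists):
        ring = min(rings, int(dist // d) + 1)       # concentric ring index 1..5
        for sv in subset:
            scores[sv] += ring                      # outlier fixes penalize their SVs
    return max(scores, key=scores.get)              # highest total = discard candidate
```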

Figure 3: We divided the sky into six bins (split at 40°<br />
elevation above the horizon) as described in [10]. The two<br />
symbols represent SVs from the two constellation<br />
configurations shown in Figure 1.<br />

3.3 Static Position calculation<br />


We are now left with the original, merged dataset (i.e., the<br />

dataset that simulates a dual GPS/Galileo receiver) less the<br />

signals from the two SVs that were rejected as the least<br />

trustworthy signals. We compute receiver position at each<br />

second from this filtered dataset, then compute mean and<br />

covariance for these position sets. The mean is our new<br />

position estimate for the position of the stationary receiver and<br />

the covariance matrix is an element of characterization data.<br />
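The final position step can be sketched as (a minimal version; `fixes` is an illustrative name for the per-second position solutions):<br />

```python
import numpy as np

def position_estimate(fixes):
    """Mean of the per-second (lat, lon) fixes is the position estimate;
    the covariance matrix serves as characterization data."""
    fixes = np.asarray(fixes, dtype=float)   # shape (T, 2): one fix per second
    return fixes.mean(axis=0), np.cov(fixes, rowvar=False)
```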

4. EXPERIMENTAL RESULTS<br />

In order to gauge the efficacy of our processing we will<br />

need to compare position calculations with and without our<br />

process. Since we are reading carrier phase signals with a<br />

commercial receiver (TIM-LP) prior to the application of<br />

proprietary filters to which we have no access, we are required<br />

to perform our own position calculations, both for our<br />

approach and for the control approach. This means that our<br />

position estimates may not be as accurate as those produced by<br />

the commercial receiver. However, it is the relative<br />

improvement in which we are interested.<br />



For our first test, we recorded two 15-minute data sets from<br />

a stationary receiver, 3.5 hours apart. The location was an older<br />

neighborhood, 3 or 4 meters from two 2-storey houses with a<br />

large-canopied tree about 6 meters away and other houses and<br />

mature trees somewhat further away. The effect of filtering this<br />

first GPS dataset can be seen in Figure 4.<br />

Figure 4: The larger scatter represents unfiltered position<br />

calculations, while the smaller scatter represents positions after<br />

filtering. The two ellipses represent the 3σ distance (in<br />

degrees) from the mean of each cluster.<br />

Covariance, unfiltered scatter: [ 0.3289  −0.5489 ; −0.5489  1.3408 ] × 10^-8<br />
Covariance, filtered scatter:   [ 0.0838  −0.0516 ; −0.0516  0.1977 ] × 10^-8<br />

Table 2: Covariance matrices from the two scatters in Figure 4<br />
(each element represents σ²; hence, units are degrees squared).<br />

The covariance matrices from these two scatters are shown<br />

in Table 2. By comparing the ratios of standard deviations for<br />

LAT and LON taken from these matrices we get a sense of the<br />

percentage level of reduction achieved by these filters.<br />

To illustrate with the first element, the variance in degrees<br />
LAT (σ²_LAT), we calculated the percentage change in standard<br />
deviation as:<br />

1 − (σ_LAT,filtered / σ_LAT,unfiltered)<br />

Hence the percentage change in standard deviation for LAT<br />

and LON are: 49.5% and 61.6%, respectively, representing a<br />

considerable reduction in sigma values.<br />
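These percentages follow directly from the diagonal (variance) entries of Table 2; as a quick check (the 10^-8 scale factor cancels in the ratio, and the function name is illustrative):<br />

```python
import math

def sigma_reduction(var_unfiltered, var_filtered):
    """Percentage change in standard deviation, 100 * (1 - sigma_f / sigma_u)."""
    return 100.0 * (1.0 - math.sqrt(var_filtered / var_unfiltered))

lat = sigma_reduction(0.3289, 0.0838)   # LAT variance entries from Table 2
lon = sigma_reduction(1.3408, 0.1977)   # LON variance entries from Table 2
```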

A second test, recorded similarly, with the two data subsets<br />

7 hours apart and several meters away, endured less multipath<br />

effects and provided good, but less dramatic results, shown in<br />

Figure 5.<br />

Figure 5: The second data set<br />


Covariance, unfiltered scatter: [ 0.0330  −0.0051 ; −0.0051  0.0519 ] × 10^-8<br />
Covariance, filtered scatter:   [ 0.0159  −0.0031 ; −0.0031  0.0561 ] × 10^-8<br />

Table 3: Covariance matrices from the two scatters in Figure 5.<br />

The percentage change in standard deviation values for<br />

LAT and LON are: 30.6% and -4%, respectively, representing<br />

a significant but mixed reduction in sigma values (LON did not<br />

improve, and may have worsened).<br />

5. CONCLUSIONS and FUTURE WORK<br />

We have shown that it is feasible to reduce positioning<br />

variance due to multipath in the case of a static GNSS receiver.<br />

In these two experiments the higher multipath data showed the<br />

greatest improvement; of course, much more testing is needed.<br />

By gathering signals for a modest amount of time (we propose<br />

7 to 10 minutes) and using techniques to isolate signals that<br />

contribute relatively more noise than others, and by taking<br />

advantage of the expected dual GPS/Galileo receivers, we are<br />

optimistic we can specify a processor that would be the<br />

positioning engine for a reliable in-vehicle meter for roadpricing,<br />

pay-as-you drive insurance, and most parking-pricing.<br />

For our first experiment with this approach to reduce<br />

variation in position error for a stationary GNSS receiver, we<br />

have successfully adapted and simplified two existing results<br />

from the literature. Clearly, making decisions to drop the least<br />

trustworthy signals helps, but it is also understood that which<br />

signals are best at any one moment can change, even for a<br />

stationary receiver. For this reason, we are currently exploring<br />

with good success several additional ideas. These include<br />

time-slicing the signals into numerous smaller windows,<br />

iterative removal of 0 or more SVs (rather than removal of<br />

exactly one SV per filter), dynamic thresholds, and others. We<br />

expect to be able to improve considerably on the current<br />

results.<br />


Authorized licensed use limited to: Ryerson University Library. Downloaded on July 7, 2009 at 11:06 from IEEE Xplore. Restrictions apply.<br />



A SIGNAL CLASSIFICATION APPROACH USING TIME-WIDTH VS FREQUENCY BAND<br />

SUB-ENERGY DISTRIBUTIONS<br />

Karthikeyan Umapathy<br />

Dept. of Electrical and Computer Engg.,<br />

The University of Western Ontario,<br />

London, ON N6A 5B8, Canada<br />

Email: kumapath@uwo.ca<br />

ABSTRACT<br />

Time-frequency (TF) signal decompositions provide us with ample<br />

information and extreme flexibility for signal analysis. By applying<br />

suitable processing on the TF decomposition parameters,<br />

even subtle signal characteristics can be revealed. In many real<br />

world applications, identification of these subtle differences makes<br />

a significant impact in signal analysis. Particularly in classification<br />

applications using TF approaches, there may be situations where<br />

a localized, highly discriminative signal structure is diluted due to<br />

the presence of other overlapping signal structures. To address<br />

this problem we propose a novel approach to construct multiple<br />

time-width vs frequency band mappings based on the energy decomposition<br />

pattern of the signal. These mappings are then analyzed<br />

to locate the highly discriminative features for classification.<br />

Initial results with two real-world biomedical signal databases, (1)<br />

Vibroarthrographic (VAG) signals and (2) Pathological speech signals,<br />

indicate high potential for the proposed technique.<br />

1. INTRODUCTION<br />

Time-frequency (TF) transformations have significantly contributed<br />

towards complex signal analysis and automatic classification. In<br />

classification applications using a TF approach, it is often a small<br />

area or pockets of areas in the TF plane that actually exhibit the<br />

difference between the classes of signals. Within these small areas,<br />

there may be overlapping multiple signal components with varying<br />

discriminative characteristics. The overall discriminative power of<br />

the area is normally decided by the high-energy signal components,<br />

which dilute the discriminative characteristics of lower-energy signal<br />

components. It may so happen that a highly discriminative but low-energy<br />

component is masked by a less discriminative but high-energy<br />

component. Typical biomedical signals contain a mixture of<br />

coherent and non-coherent signal structures with varying localized<br />

overlaps. Using some criteria, if we can separate these localized<br />

overlapping structures, it may lead to a better understanding of the<br />

signal, and thereby help extract highly discriminative features for classification<br />

applications.<br />

In general, all real world signals contain both coherent and<br />

non-coherent structures. Coherent structures have definite TF localization,<br />

unlike the non-coherent structures. Any iterative decomposition<br />

algorithm such as matching pursuits with TF dictionaries<br />

models the coherent structures during the initial iterations, as<br />

they correlate well with the dictionary elements. The non-coherent<br />

Thanks to NSERC for funding this research work.<br />

Sridhar Krishnan<br />

Dept. of Electrical and Computer Engg.,<br />

Ryerson University,<br />

Toronto, ON M5B 2K3, Canada<br />

Email: krishnan@ee.ryerson.ca<br />

structures on the other hand are broken into finer and finer structures<br />

until the information is diluted across the whole dictionary [1].<br />

The contribution of coherent and non-coherent structures in a signal<br />

decides the energy capture pattern of the decomposition algorithms.<br />

The authors' previous work [2] introduced a novel time-width<br />

vs frequency band mapping (constructed from the decomposition<br />

parameters) to identify the highly discriminative TF tilings<br />

between different classes of signals using the Local Discriminant Bases<br />

(LDB) algorithm. The proposed work uses a similar mapping,<br />

but splits it into multiple mappings to identify better<br />

discriminatory features.<br />

The paper is organized as follows: Section 2 covers the methodology,<br />

consisting of the adaptive time-frequency transform, multiple<br />

TFD slices, multiple sn vs fn mappings, databases, feature extraction<br />

and pattern classification. Results and discussion are given in<br />

Section 3, and conclusions in Section 4.<br />

2. METHODOLOGY<br />

2.1. Adaptive Time-frequency Transform (ATFT)<br />

The signal decomposition technique used in this work is based on<br />

the matching pursuit (MP) [1] algorithm. MP is a general framework<br />

for signal decomposition. The nature of the decomposition<br />

varies according to the dictionary of basis functions used. When<br />

a dictionary of TF functions is used, MP yields an adaptive time-frequency<br />

transformation [1]. In MP, any signal x(t) is decomposed<br />

into a linear combination of K TF functions g(t) selected<br />

from a redundant dictionary of TF functions as given by:<br />

x(t) = \sum_{n=0}^{K-1} \frac{a_n}{\sqrt{s_n}} \, g\!\left(\frac{t - p_n}{s_n}\right) \exp\{j(2\pi f_n t + \phi_n)\}, \qquad (1)<br />

where an is the expansion coefficient; the scale factor sn, also<br />

called the octave or time-width parameter, is used to control the<br />

width of the window function; and the parameter pn controls the<br />

temporal placement. The parameters fn and φn are the frequency<br />

and phase of the exponential function respectively. The signal<br />

x(t) is projected over a redundant dictionary of TF functions with<br />

all possible combinations of scaling, translation and modulation.<br />

The dictionary of TF functions can either suitably be modified or<br />

selected based on the application at hand. In our technique, we<br />

are using the Gabor dictionary (Gaussian functions) which has the<br />

best TF localization properties. At each iteration, the TF functions<br />

best correlated to the local signal structures are selected from<br />

0-7803-8874-7/05/$20.00 ©2005 IEEE V - 477<br />


ICASSP 2005<br />




the dictionary. The remaining signal, called the residue, is further<br />

decomposed in the same way at each iteration, subdividing it<br />

into TF functions.<br />
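As a toy illustration of this greedy selection-and-subtraction loop, the following Python sketch runs a few matching pursuit iterations over a small dictionary of real Gabor-like atoms. All atom parameters, the dictionary size and the test signal are chosen here for illustration only; the dictionary of Eq. (1) uses complex modulation and a much denser parameter grid.

```python
import math

def gabor_atom(length, scale, position, freq):
    """Real Gabor-like atom: Gaussian window modulating a cosine, unit norm."""
    atom = [math.exp(-math.pi * ((t - position) / scale) ** 2)
            * math.cos(2 * math.pi * freq * t) for t in range(length)]
    norm = math.sqrt(sum(v * v for v in atom)) or 1.0
    return [v / norm for v in atom]

def matching_pursuit(signal, dictionary, iterations):
    """Greedily pick the best-correlated atom and subtract its projection."""
    residue = list(signal)
    decomposition = []  # (expansion coefficient, atom index) per iteration
    for _ in range(iterations):
        best_coef, best_idx = 0.0, 0
        for idx, atom in enumerate(dictionary):
            coef = sum(r * a for r, a in zip(residue, atom))
            if abs(coef) > abs(best_coef):
                best_coef, best_idx = coef, idx
        decomposition.append((best_coef, best_idx))
        atom = dictionary[best_idx]
        residue = [r - best_coef * a for r, a in zip(residue, atom)]
    return decomposition, residue

# Tiny hypothetical dictionary: a few scales, positions and frequencies.
N = 64
dictionary = [gabor_atom(N, s, p, f)
              for s in (4, 16, 64)
              for p in (16, 32, 48)
              for f in (0.1, 0.25, 0.4)]
signal = [math.cos(2 * math.pi * 0.25 * t) for t in range(N)]
decomp, residue = matching_pursuit(signal, dictionary, 5)
```

Because each selected atom has unit norm, every iteration removes exactly the projected energy, so the residue energy is non-increasing, which is the property the energy capture pattern below relies on.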

2.2. Multiple TFD slices<br />

As explained in Section 1, in the initial iterations, the ATFT algorithm<br />

captures the coherent signal structures which have correlated<br />

TF dictionary elements; then, as the number of iterations grows,<br />

it tries to model the non-coherent structures by breaking them<br />

into finer and finer pieces until the information is diluted across the whole dictionary.<br />

The energy capture pattern can be extracted from the normalized<br />

decomposition parameter an. In order to explain how this energy<br />

capture pattern can be utilized to extract overlapping signal structures,<br />

let us take an example of a synthetic signal y(t) which is<br />

composed of a sinusoid, two chirps and random noise. The signal<br />

y(t) is given by:<br />

y(t) = w1 s(t) + w2 c1(t) + w3 c2(t) + w4 r(t) (2)<br />

where s(t) represents the sinusoid at approximately Fs/4, c1(t) is a<br />

linear chirp with increasing frequency cutting the sinusoid, c2(t) is<br />

another linear chirp with decreasing frequency again cutting both<br />

the sinusoid and c1(t). r(t) represents the random noise. The<br />

weight factors w1, w2, w3 and w4 are 1, 0.1, 0.01 and 0.001 respectively. We performed<br />

the ATFT decomposition (1000 iterations) of y(t) using a<br />

Gabor dictionary. Figures 3(a) and 3(b) show y(t) in time domain<br />

and TF domain (a spectrogram is used in order to show all three<br />

components at the same time). Here we deliberately introduced energy<br />

differences between the components so as to demonstrate the<br />

significance of the energy capture pattern. Most of the time, the first<br />

few iterations capture significant amount of signal energy (coherent<br />

structures). Thereafter with the increase in the number of iterations<br />

we move from modeling coherent structures to non-coherent<br />

structures. The energy capture pattern of the ATFT decomposition<br />

for y(t) is shown in Fig. 2 (the first 50 iterations). The curve<br />

represents the normalized energy captured per iteration. We can<br />

see the energy captured per iteration drops as we move along the<br />

iterations. In this work, as an example, we split the energy capture<br />

pattern into four parts: (E1) the number of iterations over which<br />

the energy captured per iteration drops to 10% of its initial value<br />

(initial value= 1), (E2) the number of iterations between 10% of<br />

Fig. 1. Time-width vs frequency band mapping: the full mapping ME5 (frequency bands F1–F4 vs time-widths s1 . . . sn) is split into four mappings ME1–ME4 such that ME5 = ME1 + ME2 + ME3 + ME4.<br />

initial value and 1% of initial value, (E3) the number of iterations<br />

between 1% of initial value and 0.1% of initial value, and (E4) the<br />

number of iterations from 0.1% of the initial value to the end of<br />

decomposition.<br />
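This four-way split can be sketched in Python as follows. The thresholds follow the 10%, 1% and 0.1% levels described above, while the geometrically decaying curve is only a stand-in for a real ATFT energy capture pattern.

```python
def split_energy_pattern(energy_per_iter, thresholds=(0.10, 0.01, 0.001)):
    """Partition iteration indices into stages E1..E4 by the fraction of the
    initial (normalized) captured energy remaining at each iteration."""
    initial = energy_per_iter[0]
    stages = [[] for _ in range(len(thresholds) + 1)]
    for n, e in enumerate(energy_per_iter):
        frac = e / initial
        # Count how many thresholds the fraction has fallen below: 0 -> E1 ... 3 -> E4.
        stage = sum(1 for th in thresholds if frac < th)
        stages[stage].append(n)
    return stages  # [E1, E2, E3, E4] as lists of iteration indices

# Stand-in for the normalized energy capture curve: geometric decay over 50 iterations.
curve = [0.7 ** n for n in range(50)]
E1, E2, E3, E4 = split_energy_pattern(curve)
```

Each stage's index list is then used to gather the TF functions selected at those iterations, yielding the four TFD slices.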

Fig. 2. Energy capture pattern of the sample signal y(t): normalized energy captured per iteration (log scale) over the first 50 iterations, partitioned into E1, E2, E3 and E4 at the 0.1, 0.01 and 0.001 levels.<br />

Following the energy capture pattern we accumulate the TF<br />

functions into the above explained four parts (E1-4). For this example,<br />

we had 5 TF functions for E1, 9 TF functions for E2, 16 TF<br />

functions for E3 and 970 TF functions for E4. The number of TF<br />

functions gives an idea that almost 99% of the signal energy<br />

needs only 30 (E1 to E3) TF functions, whereas the remaining 1% of the<br />

signal energy (mostly noise-like structures) needs 970 TF functions<br />

or more. Using these 4 sets of TF functions we construct 4<br />

different TFDs, i.e., splitting the original TFD of y(t) into 4 TFDs<br />

based on the energy capture pattern. The corresponding 4 TFDs<br />

are shown in Figs. 3(c), 3(d), 3(e) and 3(f). If we closely look at<br />

the TFDs, we can see that the TFD in Fig. 3(c) shows the sinusoid<br />

s(t) alone, the TFD in Fig. 3(d) shows the disappearing sinusoid,<br />

the TFD in Fig. 3(e) shows the evolving chirp c1(t) signal from the<br />

sinusoid background, and the TFD in Fig. 3(f) shows a stronger<br />

but noisy chirp c1(t), a faint evolving chirp c2(t) and the random<br />

noise. It is obvious to see that TFDs 3(c) to 3(f) are better individual<br />

representations of the signal components than the combined<br />

TFD 3(b). In this example if it so happens that one of the components<br />

that was masked by the overlapping strong component is<br />



Fig. 3. (a) Sample signal y(t) (amplitude vs time samples), (b) TFD of the sample signal (normalized frequency vs time samples), (c) TFD of the sample signal with TF functions of E1, (d) TFD with TF functions of E2, (e) TFD with TF functions of E3, and (f) TFD of the residue signal.<br />

the discriminator that we are looking for, then the proposed technique<br />

of generating multiple TF mapping using the energy capture<br />

pattern will be of immense help. Here it should be noted that the<br />

energy split shown in this example is not the best to show all the<br />

components individually and separately. This is just to give an<br />

idea about the possibility of using the energy capture pattern for<br />

removing overlapping structures in complex situations. Also this<br />

approach may not work in all situations unless there are hidden<br />

signal structures either with (a) different energy contributions or<br />

(b) different contributions from coherent and non-coherent structures<br />

or both (a) and (b). Extending this same concept of multiple<br />

TF mappings, we now apply it to a novel time-width vs frequency<br />

band mapping, as explained in Section 2.3.<br />

2.3. Multiple sn vs fn mappings<br />

In order to analyze them effectively for classification applications, the<br />

ATFT signal decomposition parameters need to be rearranged in a<br />

pseudo dictionary format. There are five parameters as explained<br />

in Section 2.1 viz. an, sn, fn, pn and φn that represent the index<br />

of each dictionary element. After a signal is decomposed<br />

into TF functions, we group the TF functions with the time-width<br />

parameter sn on the X axis and the fn parameter on the Y axis.<br />

To reduce the computational complexity, instead of using<br />

all possible values of the fn parameter we break the frequency<br />

range into M bands only, whereas sn takes all possible values<br />

(2^1 to 2^14) depending on the length of the signal. Each combination of<br />


sn with one of the M frequency bands forms a cell which contains<br />

the cumulative normalized energy of all the TF functions falling<br />

in that particular combination of sn and frequency band. The left<br />

side of the Fig. 1 shows a sample time-width vs frequency band<br />

mapping. In the proposed work we used 4 frequency bands, which<br />

means we transform the decomposition parameters of a signal into<br />

a 14 time-width (sn) vs 4 frequency band mapping.<br />
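A minimal Python sketch of accumulating the cells, assuming each TF function has been reduced to a (normalized energy, scale index, center frequency) triple; the atom list and the uniform band edges used here are hypothetical.

```python
def build_mapping(atoms, num_bands=4, num_scales=14):
    """Accumulate normalized atom energies into (frequency band, scale) cells.
    Each atom is (energy, scale_index, freq) with scale_index in 1..14
    (for time-widths 2^1..2^14) and normalized freq in [0, 0.5)."""
    cells = [[0.0] * num_scales for _ in range(num_bands)]
    band_width = 0.5 / num_bands  # uniform bands over the normalized range
    for energy, scale_idx, freq in atoms:
        band = min(int(freq / band_width), num_bands - 1)
        cells[band][scale_idx - 1] += energy
    return cells  # cells[band][scale] = cumulative normalized energy

# Hypothetical decomposition parameters: (energy, scale index, frequency).
atoms = [(0.6, 8, 0.26), (0.25, 4, 0.05), (0.1, 12, 0.26), (0.05, 2, 0.45)]
mapping = build_mapping(atoms)
```

Slicing the atom list by the E1–E4 stages before calling this function yields the multiple mappings ME1–ME4; summing the unsliced list yields ME5.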

From this time-width vs frequency band mapping we can readily<br />

obtain the energy distribution of the signal in terms of the time-width<br />

and frequency band (center frequency) decomposition parameters.<br />

Depending upon the application, one can choose, say,<br />

K cells that cover an area corresponding to a certain<br />

amount of signal energy. This area will provide the sn and fn<br />

ranges which are significant for that particular application. The<br />

area can be arrived at by averaging the time-width vs frequency band<br />

mappings of N sample signals. For classification applications this can be<br />

done using LDB, as demonstrated in the authors' previous work [2].<br />

Considering the benefits of multiple TFD slices in signal analysis<br />

as explained in Section 2.2, instead of using one time-width vs frequency<br />

band mapping that covers all the signal energy, we slice it<br />

into L time-width vs frequency band mappings as shown in Fig. 1<br />

(L = 4). These L sliced time-width vs frequency band mappings<br />

are expected to separate out the overlapping energy distribution of<br />

the TF functions based on the energy capture pattern and thereby<br />

enhance the discriminatory power of the cells. We performed classification<br />

on two biomedical signal databases to verify the effectiveness<br />

of the proposed technique of splitting the time-width vs<br />

frequency band mapping.<br />

2.4. Databases<br />

(1) Vibroarthrographic (VAG) signals: These are the vibration signals<br />

emitted from the human knee joints during an active movement<br />

of the leg, and can be used to detect early joint degeneration<br />

or defects. Extensive work [3] has been done using a time-frequency<br />

approach in analyzing these signals. A few important<br />

characteristics of the VAG signals which make them difficult to<br />

analyze are as follows: (i) Highly non-stationary, (ii) Varying frequency<br />

dynamics, and (iii) multi-component nature. The database<br />

consists of 36 signals with 19 normal and 17 abnormal signals.<br />

(2) Pathological speech signals: These are speech signals recorded<br />

from the pathological and normal talkers in a sound-attenuating<br />

booth at the Massachusetts Eye and Ear Infirmary. All signals<br />

were sampled at 25 kHz. The signals were the first sentence of<br />

the rainbow passage, 'when the sunlight strikes raindrops in the<br />

air, they act like a prism and form a rainbow’. More details on the<br />

classification of this database can be found in the authors' previous<br />

work [4]. The database used in this study consists of 30 signals<br />

with 15 normal and 15 pathological signals.<br />

2.5. Feature Extraction and Pattern Classification<br />

Signals from both databases were decomposed using the ATFT<br />

algorithm (5000 iterations) as explained in Section 2.1. For each<br />

signal, 4 time-width vs frequency band mappings were created using<br />

the decomposition parameters. The energy split used for generating<br />

these 4 mappings were same as the one used in the example<br />

of synthetic signals (E1-4). In these 4 mappings, each row<br />

of the mapping represents the signal energy distribution over all<br />

time-widths for a particular band of frequencies. Let us name the<br />

mappings as ME1, ME2, ME3 and ME4 and the frequency bands<br />

Authorized licensed use limited to: <strong>Ryerson</strong> <strong>University</strong> Library. Downloaded on July 7, 2009 at 11:11 from IEEE Xplore. Restrictions apply.


as F1, F2, F3 and F4 as shown in Fig. 1. Now for each combination<br />

of MEx and Fx we extract P × 14 energy values from<br />

the cells as a feature matrix, where P is the number of signals in<br />

the database. From the 16 combinations of MEx and Fx, only<br />

non-zero feature matrices are used for classification. In order to<br />

compare the results with the original non-split time-width vs frequency<br />

mapping (let it be ME5), another set of 4 feature matrices<br />

was generated using the same procedure. When tested with the<br />

Ho-Kashyap [5] algorithm, most of these 20 combinations (MEx<br />

and Fx) for both the databases favored non-linear separability to<br />

achieve maximum classification accuracies. However, as the main<br />

focus of the proposed technique is to demonstrate the relative improvement<br />

in discrimination between the split and non-split time-width<br />

vs frequency mappings, we restrict our analysis to a linear<br />

classifier. The extracted features were fed to a Linear Discriminant<br />

Analysis (LDA) based classifier using SPSS [6]. The classification<br />

accuracy was validated using the leave-one-out method, which is<br />

known to provide a least-biased estimate.<br />
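The leave-one-out protocol can be sketched as below. The paper uses an SPSS-based LDA classifier; this illustrative Python stand-in uses a simple nearest-class-mean linear rule instead, and the 2-D feature vectors are made up for the example.

```python
def nearest_mean_classify(train, labels, sample):
    """Assign `sample` to the class whose training mean is closest
    (a simple linear stand-in for the LDA classifier used in the paper)."""
    best_cls, best_dist = None, float("inf")
    for cls in sorted(set(labels)):
        members = [x for x, y in zip(train, labels) if y == cls]
        mean = [sum(col) / len(members) for col in zip(*members)]
        dist = sum((a - b) ** 2 for a, b in zip(sample, mean))
        if dist < best_dist:
            best_cls, best_dist = cls, dist
    return best_cls

def leave_one_out_accuracy(features, labels):
    """Hold out each signal once, train on the rest, and count correct calls."""
    correct = 0
    for i in range(len(features)):
        train = features[:i] + features[i + 1:]
        train_labels = labels[:i] + labels[i + 1:]
        if nearest_mean_classify(train, train_labels, features[i]) == labels[i]:
            correct += 1
    return correct / len(features)

# Hypothetical 2-D feature vectors for two well-separated classes.
features = [[0.1, 0.2], [0.2, 0.1], [0.15, 0.25],
            [0.9, 0.8], [0.8, 0.9], [0.85, 0.75]]
labels = ["normal"] * 3 + ["abnormal"] * 3
accuracy = leave_one_out_accuracy(features, labels)
```

Because every signal is classified by a model that never saw it, the resulting accuracy is a nearly unbiased estimate even for the small databases used here.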

3. RESULTS AND DISCUSSION<br />

A two-stage classification was performed for the VAG database.<br />

In the first stage, we performed a two-group classification, classifying<br />

the VAG signals into normal and abnormal. Table 1 shows<br />

the highest classification accuracy achieved out of the 20 combinations<br />

of MEx and Fx. We observed an overall classification accuracy<br />

of 88.9% using leave-one-out (Cross. V) based LDA for the<br />

combination of ME4 and F3. This is higher than the classification<br />

accuracies reported by existing works for this database. There is<br />

no difference in the classification accuracy compared with the<br />

combination of non-split ME5F3. This is because F3 is non-zero<br />

only for ME4, which means there is no overlap in F3. So effectively,<br />

ME4F3 and ME5F3 were the same. The results also gave a<br />

clue that the discriminatory information between normal and abnormal<br />

lies in F3.<br />

Table 1. Two-group classification accuracy achieved for the VAG database. Cross.V = leave-one-out method LDA; % = percentage of classification.<br />

Method    Groups     Normal  Abnormal  Total<br />
Cross.V   Normal       15        4       19<br />
          Abnormal      0       17       17<br />
%         Normal      78.9     21.1     100<br />
          Abnormal      0      100      100<br />

We now performed the second stage of classification on the 17<br />

abnormal VAG signals. The abnormal VAG signals in the database<br />

are from different kinds of knee pathologies. Chondromalacia patella<br />

(CMP) [3] is one of the pathologies which has four categories (I,<br />

II, III and IV) of grading based on the severity. It is a difficult<br />

task to classify them by their gradings using the VAG signals. Out<br />

of the 17 abnormal signals, 10 were CMP signals. We performed<br />

a three-group classification on these 10 signals, viz. grade (I and<br />

II), grade (II and III) and grade (III and IV). We observed a perfect<br />

classification of 100% using leave-one-out based LDA for the<br />

combination of ME2 and F1. None of the other combinations, including<br />

the non-split ME5F1 could achieve 100% classification.<br />

This result shows that splitting the time-width vs frequency<br />

band mappings does enhance the discriminatory power, and<br />

also indicates that the discriminatory features for CMP lie in the ME2<br />

and F1 mapping.<br />


Similarly, we performed a 2-group classification (normal and<br />

pathological) for the pathological speech database. Table 2 shows<br />

the highest classification accuracy achieved out of the 20 combinations<br />

of MEx and Fx. An overall classification accuracy of<br />

93.3% was achieved using the leave-one-out based LDA. The reported<br />

classification is for the combination of ME1F1 and non-split<br />

ME5F1, in which case the classification accuracy<br />

remains the same with or without splitting the time-width vs<br />

frequency mapping. However, the results give a clue that the discriminatory<br />

information lies in ME1 and F1.<br />

Table 2. Two-group classification accuracy achieved for the pathological speech database. Cross.V = leave-one-out method LDA; % = percentage of classification.<br />

Method    Groups        Normal  Pathological  Total<br />
Cross.V   Normal          13         2          15<br />
          Pathological     0        15          15<br />
%         Normal         86.7      13.3        100<br />
          Pathological     0       100         100<br />

4. CONCLUSIONS<br />

Enhancing the discriminatory power of the TF representations using<br />

a TFD splitting approach was proposed. The technique was explained<br />

using a synthetic signal and two real world signal databases.<br />

Using the technique on the VAG database showed a significant<br />

improvement in the sub classification of abnormal signals. Although<br />

the results are inconclusive for the real world databases,<br />

this approach may be better suited to identifying finer discriminatory<br />

features within global classifications. Adaptively choosing the energy<br />

split might improve the significance of the proposed technique.<br />

Future work involves arriving at a suitable energy split<br />

ratio based on the nature of the signal, increasing the number of frequency<br />

bands, and extracting visual features by treating the time-width vs<br />

frequency mapping as an image.<br />

5. REFERENCES<br />

[1] S. G. Mallat and Z. Zhang, “Matching pursuits with time-frequency<br />

dictionaries,” IEEE Trans. Signal Processing, vol.<br />

41, no. 12, pp. 3397–3415, 1993.<br />

[2] K. Umapathy, S. Krishnan, and A. Das, “Sub-dictionary selection<br />

using local discriminant bases algorithm for signal classification,”<br />

in Proceedings of the IEEE Canadian Conference on<br />

Electrical and Computer Engineering, Niagara Falls, Canada,<br />

May 2004, pp. 2001–2004.<br />

[3] S. Krishnan, “Adaptive signal processing techniques for analysis<br />

of knee joint vibroarthrographic signals,” Ph.D. dissertation,<br />

University of Calgary, June 1999.<br />

[4] K. Umapathy, S. Krishnan, V. Parsa, and D. Jamieson, “Time-frequency<br />

modeling and classification of pathological voices,”<br />

in Proceedings of IEEE Engineering in Medicine and Biology<br />

Society (EMBS) 2002 Conference, Houston, Texas, USA, Oct<br />

2002, pp. 116–117.<br />

[5] M. H. Hassoun and J. Song, “Adaptive Ho-Kashyap rules for<br />

perceptron training,” IEEE Trans. on Neural Networks, vol. 3,<br />

no. 1, pp. 51–61, 1992.<br />

[6] SPSS Inc., “SPSS Advanced statistics user’s guide,” in User<br />

manual, SPSS Inc., Chicago, IL, 1990.<br />



INDEXING OF NFL VIDEO USING MPEG-7 DESCRIPTORS AND MFCC FEATURES<br />

Syed G. Quadri, Sridhar Krishnan and Ling Guan<br />

Dept. of Electrical and Computer Engineering, Ryerson University<br />

Toronto, Canada, M5B 2K3<br />

{squadri,krishnan,lguan}@ee.ryerson.ca<br />

ABSTRACT<br />

In this paper, we propose an application system to classify<br />

American football (NFL) video shots into 4 categories,<br />

namely: Pass plays, Run plays, Field Goal/Extra Point plays<br />

(FG/XP) and Kickoff/Punt plays (K/P). The proposed system<br />

consists of two stages. The first stage is responsible for<br />

play event localization and the latter stage is responsible for<br />

feature mapping and classification. For play event localization<br />

we have proposed an algorithm that uses MPEG-7 motion<br />

activity descriptor and mean of the magnitudes of motion<br />

vectors, in a collaborative manner to detect the starting<br />

point of a play event within a video shot with 83% accuracy.<br />

The indexing and classification stage uses MPEG-7 motion<br />

and audio descriptors along with Mel Frequency Cepstrum<br />

Coefficients (MFCC) features to classify the events into 4<br />

categories using Fisher’s LDA. We obtain an indexing accuracy<br />

of 92.5% using the leave-one-out classification technique<br />

on a database of 200 video shots taken from 4 different<br />

games obtained from 4 different networks.<br />

1. INTRODUCTION<br />

The concept of On-Demand entertainment and programming<br />

is fast becoming a reality with the popularity of digital TV<br />

channels. Nearly every professional sports league and team<br />

in North America has a digital channel boasting On-Demand<br />

programming and statistics. But the reality is that it takes<br />

nearly three to four hours in post-production work to prepare<br />

the highlights for a game. For example, on NFL Sunday<br />

Ticket you get Highlights-On-Demand on Monday morning<br />

for the games played on Sunday. In order to minimize<br />

this delay, we need a system that can analyze the contents<br />

of the broadcast and derive the semantics from the input.<br />

These semantics can be made available to the users<br />

for querying in order to create a true On-Demand experience.<br />

Recently a lot of research has been conducted on automating<br />

the process of indexing and annotating the sports<br />

video streams. Nearly all the major sports have been used<br />

to test the indexing and retrieval systems. One of the major<br />

projects working in generating semantic sports video an-<br />

notations is the ASSAVID project. As detailed in [1], this<br />

project focuses on developing a system that can categorize<br />

different types of sports and provide users with an interface<br />

to query events in a particular sport.<br />

In [2], Miyauchi et al. used audio, textual and visual<br />

information to classify NFL video into events like touchdowns<br />

and field goals. In [3], Lazarescu et al. classified<br />

different types of formations within NFL games using the<br />

natural language commentary from the game, the geometrical<br />

information about the play and the domain knowledge.<br />

In [4], Nitta et al. used closed-caption text and audio-visual<br />

information to classify plays into 3 categories namely:<br />

scrimmage, FG/XP and K/P.<br />

All of the works mentioned above rely on domain knowledge<br />

to classify different high level concepts within American<br />

football. On the other hand, we propose a system that<br />

classifies recurring events of the game without using any domain<br />

knowledge. These recurring events are the most basic<br />

components of the game. By classifying these basic components<br />

first we can look for higher concepts contained within<br />

each of the basic events and thus generate a hierarchical<br />

graph of concepts which varies from low level to high level.<br />

In this work we focus on utilizing existing standard descriptors<br />

of MPEG-7 as the basic feature set. In [5], the authors<br />

have proposed applications for generating summary highlights<br />

in sports domain using MPEG-7 motion descriptor,<br />

but to our knowledge no one has used MPEG-7 audio and<br />

motion descriptors to index recurring events in the American<br />

football domain.<br />

Section 2 details the algorithm proposed for localization<br />

of play events within NFL video shots, along with<br />

an analysis of the performance of the algorithm. Section<br />

3 provides details on the feature set used for indexing and<br />

the classification scheme utilized. Section 4 presents the<br />

results of the classification scheme, and Section 5 provides<br />

concluding remarks and future directions.<br />

2. LOCALIZATION OF PLAY EVENT<br />

Sports have a very well-defined structure. They have a set<br />

of rules that must be followed in order for the game to be<br />



Figure 1. Motion vector magnitudes for various plays<br />

played properly. Many sports such as golf, baseball, bowling<br />

and American football have a requirement that the team<br />

or players must be in a distinctive position before each play.<br />

In golf, the player positions himself by the ball in order to<br />

hit it in a certain direction. Likewise in American football<br />

the two teams first line up face to face before the ball is<br />

snapped to begin the play. The common theme among all<br />

these sports is that before the play starts, the level of motion<br />

activity in the video is lower compared to when the play<br />

has started. This distinction in the motion activity is utilized<br />

in the proposed algorithm to segment play events from<br />

non-play events. Figure 1 shows the magnitude of motion<br />

vectors in different types of NFL plays.<br />

2.1. Proposed Play Event Detection Algorithm<br />

The primary objective of the algorithm is to detect the key<br />

frame that can be used as the starting point of the play event<br />

in the shot. The end point of the play event is not extracted,<br />

as in most American football video shots containing play<br />

events, the shot usually terminates at the end of the play.<br />

In order to extract the intensity of motion descriptor,<br />

MPEG-1 video motion vectors are used. Only the motion<br />

vectors from the P frames are analyzed in order to speed<br />

up the processing time. In MPEG-7 the motion activity descriptor<br />

represents the standard deviation of motion vector<br />

magnitudes within a frame. The intensity of motion activity<br />

descriptor along with the mean of the motion vector magnitudes<br />

is used collaboratively in the algorithm to detect the<br />

starting point of the play event. An analysis of 20 video<br />

shots selected from each category was conducted to estimate<br />

the thresholds for the mean and standard deviation of<br />

motion vectors. The following steps detail the algorithm:<br />

Step 1: Find a P frame with a mean value of 4 or higher.<br />
Step 2: Determine the gradient of the mean values within a<br />
window (3 or 4 adjacent frames).<br />
Step 3: If the gradients are all positive, mark the frame as a possible<br />
starting point; else go back to Step 1.<br />
Step 4: If the intensity of motion descriptor has a value of 2<br />
or higher, return the frame number as the starting point.<br />
Step 5: Otherwise, determine the gradient of the standard<br />
deviation values within the same window (3 or 4 adjacent frames).<br />
Step 6: If the gradients are all positive, return the frame number<br />
as the starting point; else go back to Step 1.<br />
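The six steps above can be sketched as follows. The per-P-frame means, standard deviations, and quantized intensity-of-motion values are assumed to be precomputed; the function and variable names are illustrative, not from the paper.

```python
def detect_play_start(means, stds, intensity, window=3):
    """Return the index of the first P frame satisfying Steps 1-6, or None.

    means/stds: per-P-frame motion-vector mean and standard deviation;
    intensity: quantized MPEG-7 intensity-of-motion value per P frame.
    """
    n = len(means)
    for i in range(n):
        # Step 1: find a P frame with a mean value of 4 or higher
        if means[i] < 4 or i + window >= n:
            continue
        # Steps 2-3: gradients of the mean within the window must all be positive
        if not all(means[j + 1] > means[j] for j in range(i, i + window)):
            continue
        # Step 4: accept if the intensity of motion descriptor is 2 or higher
        if intensity[i] >= 2:
            return i
        # Steps 5-6: otherwise require rising standard deviation in the window
        if all(stds[j + 1] > stds[j] for j in range(i, i + window)):
            return i
    return None
```
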

2.2. Play Event Detection Algorithm Evaluation<br />

The above algorithm was tested on the American football<br />

video shot database which consists of 200 video shots taken<br />

from 4 different games and 4 different networks. In order<br />

to measure the performance of the algorithm, we have to<br />

establish the ground truth about the starting point of the play<br />

event within each video shot. This was accomplished by<br />

having an observer manually index the frame number which<br />

best represented the start point of the play event.<br />

Results were compared by computing the delta between<br />
the ground-truth frame number and the frame number<br />
estimated by the algorithm. This delta then needed to be<br />
interpreted in the actual time domain; that is, we need to<br />
evaluate whether the algorithm estimates a starting point too<br />
early or only after a certain amount of delay.<br />

Since MPEG-1 video has a frame rate of 30 frames/sec,<br />

building a histogram whose bin size was 30 frames would<br />

give a general idea of how far apart the estimated frame numbers<br />
were from the ground truth in the actual time domain. Figure<br />
2 shows the histogram of the number of shots within<br />
each time unit. Negative time units represent early detection<br />
and positive time units represent delayed detection.<br />

From Figure 2 we can see that the algorithm detects the<br />

starting points of the play with 83% accuracy. That is, 166 of<br />
the 200 video shots in the database had their starting points<br />
detected within ±1 second of the original starting point.<br />

The accuracy of the algorithm can be increased to 86.5% by<br />

increasing the window size from 3 frames to 4 frames. But<br />

this change in window size has its own side effect. By increasing<br />

the window size we are looking for motion activity<br />

being sustained for a longer period of time, which means<br />

we will get more shots with delayed detection.<br />
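The evaluation just described can be sketched as a minimal illustration: shot deltas are binned into one-second (30-frame) units, with negative units meaning early detection and positive units delayed detection. The function name is a stand-in, not from the paper.

```python
from collections import Counter

def evaluate_detection(ground_truth, estimated, fps=30):
    """Bin start-frame deltas into whole-second units and report the
    fraction of shots detected within +/- one second."""
    deltas = [e - g for g, e in zip(ground_truth, estimated)]
    histogram = Counter(round(d / fps) for d in deltas)   # time-unit bins
    accuracy = sum(abs(d) <= fps for d in deltas) / len(deltas)
    return histogram, accuracy
```
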

3. INDEXING AND CLASSIFICATION<br />

One of the biggest application areas for MPEG-7 is multimedia<br />

indexing and retrieval. Since the introduction of the<br />
MPEG-7 standard, there has been significant research effort<br />
put into developing applications based on descriptors from<br />
MPEG-7, but to date there have been only a few applications<br />
that utilize MPEG-7 descriptors for sports video indexing<br />
and retrieval. The application we are proposing is a first<br />



Figure 2. Performance of the algorithm in actual time<br />

in the American football domain, which utilizes MPEG-7<br />

motion and audio descriptors along with MFCC features.<br />

In the American football domain, visual or motion features<br />
play a significantly dominant role in discriminating between<br />
different types of plays, as evident from Figure 1. Therefore<br />
first we evaluate the efficacy of using motion descriptors<br />

for an American football video indexing system and then<br />

we evaluate the changes in system performance by adding<br />

audio descriptors and MFCC features.<br />

3.1. MPEG-7 motion features<br />

According to the MPEG-7 description [6], the standard deviation<br />
of the magnitude of motion vectors forms the intensity<br />
of motion descriptor. The descriptor takes on values<br />
of 1 through 5, a low value meaning low intensity of<br />
motion. Experiments using 5 levels showed that<br />
most of the motion descriptors were quantized into 2 or 3<br />
levels. Thus, to provide better motion activity resolution, the<br />
descriptor was quantized into 12 levels. Similarly, according<br />
to the MPEG-7 description, the dominant direction descriptor is<br />

calculated by quantizing the angles of the motion vectors<br />

into 8 levels. In this work the same 8 quantization levels<br />

were used to define the dominant direction descriptor.<br />

A 2D feature map was created by combining the two<br />

motion activity descriptors. The motivation behind this was<br />

to create a feature set that can model both the intensity of<br />

motion and the direction of motion, thus discriminating between<br />

high intensity motion in upward direction versus high<br />

intensity motion in lateral direction. As can be seen from<br />

Figure 3, the feature map provides a unique representation<br />

of only 12 × 8 dimensions for both the intensity and direction<br />

of motion. In the feature map, blue colour corresponds<br />

to low values and red colour corresponds to high values.<br />
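The 12 × 8 feature map described above can be built as a simple 2-D histogram. The inputs are assumed to be already-quantized per-frame levels (0–11 for intensity, 0–7 for direction); the names here are illustrative.

```python
def motion_feature_map(intensity_levels, direction_levels,
                       n_intensity=12, n_direction=8):
    """Accumulate quantized (intensity, direction) pairs into a 12x8 map."""
    fmap = [[0] * n_direction for _ in range(n_intensity)]
    for i, d in zip(intensity_levels, direction_levels):
        fmap[i][d] += 1
    return fmap
```
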


Figure 3. Motion feature map<br />

3.2. MPEG-7 audio features<br />

The motivation for using audio descriptors is that<br />
most sports have a certain vocabulary associated<br />

with each event. Almost all the announcers will utilize some<br />

of the vocabulary to describe similar events. Therefore we<br />

wanted a compact representation of audio characteristics to<br />

describe the general tone and pitch of the announcer. The<br />

objective is to analyze the similarity in the spoken sound<br />

between similar events.<br />

We used 3 MPEG-7 basic spectral audio features, namely:<br />

Audio Spectrum Envelope (ASE), Audio Spectrum Centroid<br />

(ASC) and Audio Spectrum Flatness (ASF) to achieve<br />

our objective. The ASE descriptor represents the power<br />

spectrum of an audio signal and can be calculated by taking<br />

the Fourier transform (FFT) of the audio signal which<br />

is windowed using a Hamming window with an overlap of<br />

50% between adjacent windows.<br />

The ASC descriptor represents the center of gravity of<br />

the power spectrum. It is calculated by summing the frequency-weighted<br />
energy in each bin of the FFT spectrum and dividing it<br />
by the total energy in the frame, as shown below:<br />

ASC(l) = ( Σ_{k=0}^{K−1} k · |P(l,k)|² ) / ( Σ_{k=0}^{K−1} |P(l,k)|² ),   (1)<br />

where k is the frequency bin index. The descriptor shows<br />
which frequencies dominate the spectrum.<br />

The ASF descriptor represents the overall tonal component<br />

in the power spectrum of the audio signal. It is calculated<br />
by dividing the geometric mean of the power spectrum of the<br />
audio frame by its arithmetic mean, as<br />
shown by the equation:<br />

ASF(l) = ( Π_{k=0}^{K−1} |P(l,k)|² )^{1/N} / ( (1/N) Σ_{k=0}^{K−1} |P(l,k)|² ),   (2)<br />

where k is the frequency bin index and N is the size of the<br />
short-time Fourier transform window.<br />
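Equations (1) and (2) can be computed numerically as below. `power` stands for the per-bin power-spectrum values |P(l,k)|² of one frame; as a simplification, the flatness exponent here uses the number of bins in place of the window size N.

```python
import math

def audio_spectrum_centroid(power):
    """Equation (1): frequency-weighted power divided by total power."""
    total = sum(power)
    return sum(k * p for k, p in enumerate(power)) / total

def audio_spectrum_flatness(power):
    """Equation (2): geometric mean over arithmetic mean of the power
    spectrum; close to 1 for a flat (noise-like) spectrum."""
    n = len(power)
    geometric = math.exp(sum(math.log(p) for p in power) / n)
    arithmetic = sum(power) / n
    return geometric / arithmetic
```

A flat spectrum yields a flatness near 1, while a peaked (tonal) spectrum yields a value well below 1, which is what makes ASF useful as a tonality measure.
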



All the above descriptors were quantized into 10 levels,<br />

thus providing a feature set of 30 dimensions.<br />

3.3. MFCC features<br />

Most of the video shots contain a lot of<br />
crowd noise, and since we want to extract the perceived rhythm<br />
and sound of the spoken content, we needed a feature that<br />
can model human hearing and also works well under<br />
noisy conditions. MFCC has been used extensively in<br />
speech recognition systems as it emphasizes the frequencies<br />
that are more perceptible to the human ear.<br />

First the audio file is pre-processed in order to remove<br />

the silent segments. Then 13 MFCC coefficients are extracted<br />

for each segment. Each of the segments has 50%<br />
overlap, and thus there is a lot of redundancy between adjacent<br />

MFCC values. In order to reduce the dimension of the<br />

matrix, the MFCC values are passed to a feature reduction<br />

stage. The MFCC features are reduced to a 12 × 64 matrix.<br />
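The reduction step can be sketched as time-averaging the overlapping MFCC frames down to a fixed number of columns. The paper does not specify its reduction method, so the averaging below, and the function name, are assumptions made for illustration.

```python
def reduce_mfcc(mfcc_frames, out_frames=64):
    """Average a (T x n_coeff) list of MFCC vectors down to out_frames rows."""
    T = len(mfcc_frames)
    n_coeff = len(mfcc_frames[0])
    step = T / out_frames
    reduced = []
    for b in range(out_frames):
        start = int(b * step)
        stop = max(int((b + 1) * step), start + 1)   # at least one frame
        chunk = mfcc_frames[start:stop]
        reduced.append([sum(row[c] for row in chunk) / len(chunk)
                        for c in range(n_coeff)])
    return reduced
```
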

4. EXPERIMENTAL RESULTS<br />

In order to evaluate the efficacy of the feature set, we used<br />
a simple classification scheme, Fisher’s Linear Discriminant<br />
<strong>Analysis</strong> (LDA). In a specific sense, LDA also<br />
commonly refers to techniques in which a transformation<br />
is applied in order to maximize between-class separability and<br />
minimize within-class variability. LDA works on the feature<br />
set with no prior assumptions about the nature of the<br />
data set. It computes a weight vector w which, when<br />
multiplied by the input feature vector x, generates discriminant<br />
functions gi(x). For a C-class problem we define<br />
C discriminant functions g1(x)...gC(x). The feature vector<br />
x is assigned to the class whose discriminant function yields<br />
the largest value.<br />
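The decision rule just described amounts to evaluating C linear discriminant functions and taking the argmax. The weights below are illustrative stand-ins, not trained values.

```python
def classify(x, weights, biases):
    """Assign x to the class whose g_i(x) = w_i . x + b_i is largest."""
    scores = [sum(wi * xi for wi, xi in zip(w, x)) + b
              for w, b in zip(weights, biases)]
    return max(range(len(scores)), key=scores.__getitem__)
```
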

The test database consists of 200 video shots with durations<br />

varying from 5 seconds to about 25 seconds. In the<br />

database there are 88 pass plays, 67 run plays and 45 kicking<br />

plays. A total of 8 different teams were used to create<br />

the database from 4 different networks. This variety in the<br />

database ensured that the sample space of our work was diverse<br />

and included all major broadcasters.<br />

Table 1 shows the indexing results of using MPEG-7<br />

motion and audio descriptors along with MFCC features.<br />

5. CONCLUSIONS<br />

In this paper we have proposed a system with two main<br />

components. The first component finds the starting points<br />

of play events within a video shot. The second component<br />

is responsible for indexing and classification of events in<br />

the American football domain. Both the components of the<br />

system utilize MPEG-7 motion descriptors, while MPEG-7<br />


Play Events | MPEG-7 motion | MPEG-7 motion+audio | MPEG-7 motion+audio+MFCC<br />
Pass | 79.5% | 85.2% | 94.3%<br />
Run | 92.5% | 91.0% | 89.6%<br />
FG/XP | 87.5% | 87.5% | 93.8%<br />
K/P | 65.5% | 82.8% | 93.1%<br />
Overall | 82.5% | 87.0% | 92.5%<br />
Table 1. Classification Performance Summary<br />

audio and MFCC features are added to enhance the indexing<br />

capabilities of the system.<br />

Although there is no baseline to compare our results<br />
against, somewhat similar works reported in indexing and<br />
retrieval of American football events [2][3][4] have shown<br />
indexing accuracies of 84%, 81% and 84% respectively. In<br />
this work we obtained a classification accuracy of 82.5%<br />
using MPEG-7 motion features alone, while the above-<br />
mentioned works used multiple modalities. By using multiple<br />
modalities, our system is able to index the events into 4<br />
categories with 92.5% accuracy.<br />

6. REFERENCES<br />

[1] W.J. Christmas, B. Levienaise-Obadia, J. Kittler,<br />
K. Messer and D. Koubaroulis, “Generation of semantic<br />
cues for sports video annotation,” in Proc. of IEEE<br />
Intl. Conf. on Image Processing.<br />

[2] N. Babaguchi, S. Miyauchi, A. Hirano and T. Kitahashi,<br />
“Collaborative multimedia analysis for detecting semantical<br />
events from broadcasted sports video,” in<br />
Proc. of IEEE 16th Intl. Conf. on Pattern Recognition.<br />

[3] M. Lazarescu, S. Venkatesh, G. West and T. Caelli, “On<br />
the automated interpretation and indexing of American<br />
football,” in IEEE Intl. Conf. on Multimedia Computing<br />
and Systems.<br />

[4] N. Nitta, N. Babaguchi and T. Kitahashi, “Extracting<br />
actors, actions and events from sports video - a fundamental<br />
approach to story tracking,” in Proc. of IEEE<br />
Intl. Conf. on Pattern Recognition.<br />

[5] Z. Xiong, R. Radhakrishnan and A. Divakaran, “Generation<br />
of sports highlights using motion activity in combination<br />
with a common audio feature extraction framework,”<br />
in Proc. of IEEE Intl. Conf. on Image Processing.<br />

[6] B.S. Manjunath, P. Salembier and T. Sikora, Introduction<br />
to MPEG-7: Multimedia Content Description Interface,<br />
John Wiley and Sons, England, UK, 2002.<br />



2004 International Conference on <strong>Signal</strong> Processing &amp; Communications (SPCOM)<br />

AUDIO SIGNAL FEATURE EXTRACTION AND CLASSIFICATION USING<br />

LOCAL DISCRIMINANT BASES<br />

Karthikeyan Umapathy, Raveendra K. Rao<br />

Dept. of Electrical and Computer Engg.<br />

The <strong>University</strong> of Western Ontario<br />

London, ON, Canada N6A 5B9<br />

Email: kumapath@uwo.ca, rkrao@eng.uwo.ca<br />

ABSTRACT<br />

Automatic classification of audio signals is an interesting<br />
and challenging task. With the rapid growth of multimedia<br />
content over the Internet, intelligent content-based audio and<br />
video retrieval techniques are required to perform efficient<br />
searches over vast databases. Classification schemes form the<br />
basis of such content-based retrieval systems. In this paper<br />

we propose an audio classification scheme using Local Dis-<br />

criminant Bases (LDB) algorithm. The audio signals were<br />

decomposed using wavelet packets and the high discrimi-<br />

natory nodes were selected using the LDB algorithm. Two<br />

different dissimilarity measures were used to select the LDB<br />

nodes and to extract features from them. The features were<br />

fed to a Linear Discriminant <strong>Analysis</strong> based classifier for<br />

a six group (Rock, Classical, Country, Folk, Jazz and Pop)<br />

and a four group (Rock, Classical, Country and Folk) classification.<br />
Overall classification accuracies as high as 77%<br />
and 88% were achieved for the six and four group classifications<br />
respectively, using a database of 170 audio signals.<br />

1. INTRODUCTION<br />

Over the years many existing techniques [1-4] have addressed<br />
the problem of classification and content-based retrieval<br />

of audio signals. The general methodology of audio<br />

classification involves extracting discriminatory features<br />

from the audio data and feeding them to a pattern classifier.<br />

The features can be extracted either directly from the<br />

time domain or from a transformation domain depending<br />

upon the choice of signal analysis tool. Some of the audio<br />
features that have been successfully used for audio classification<br />
include mel-frequency cepstral coefficients (MFCC)<br />
[3], spectral similarity, timbral texture [3], band periodicity<br />
[2], zero crossing rate [2], entropy [5] and octaves [6].<br />

Some techniques generate a pattern from the features and use<br />
it for classification by the degree of correlation. Other<br />
techniques use the numerical values of the features with statistical<br />

classification packages.<br />

0-7803-8674-4/04/$20.00 ©2004 IEEE 457<br />

Sridhar Krishnan<br />

Dept. of Electrical and Computer Engg.<br />

<strong>Ryerson</strong> <strong>University</strong><br />

Toronto, ON, Canada M5B 2K3<br />

Email: krishnan @ee.ryerson.ca<br />

Audio signals are highly non-stationary in nature and<br />

the best way to analyze them is to use a joint time-frequency<br />

(TF) approach. The previous works [5,6] of the authors<br />

have demonstrated the success of the TF approach in audio classification.<br />
In [5], audio features such as entropy, centroid,<br />

centroid ratio, bandwidth, silence ratio, energy ratio, fre-<br />

quency location of minimum and maximum energy were<br />

extracted from the spectrogram of the audio signals. These<br />

features were fed to a Linear Discriminant <strong>Analysis</strong> (LDA)<br />

based classifier to perform a six group classification. An<br />

overall classification accuracy of 93% was reported with a<br />

database of 143 audio signals. In [6], the distribution values<br />

of the TF decomposition parameter 'octave' over 3 bands of<br />

frequencies were used as the audio features and a similar six<br />

group classification was performed with a database of 170<br />

audio signals. An overall classification accuracy of 97.6%<br />

was reported.<br />

In order to perform efficient TF analysis on the signals<br />

for classification purposes, it is essential to locate the sub-<br />

spaces on the TF plane that demonstrate high discrimination<br />

between different classes of the signals. Once the target sub-<br />

spaces are identified, it is easier to extract relevant features<br />

for classification. In the proposed work we use the Local Discriminant<br />
Bases (LDB) algorithm [7] with wavelet packet<br />

bases to identify these target subspaces in the TF plane to<br />

classify the audio signals. The optimal choice of LDBs de-<br />

pends on the nature of the dataset and the dissimilarity mea-<br />

sures used to distinguish between classes. A combination<br />

of multiple dissimilarity measures can be used to achieve<br />

high classification accuracies. Features were extracted from<br />

the basis vectors of the LDB nodes and fed to a LDA based<br />

classifier for a six (Rock, Classical, Country, Folk, Jazz and<br />

Pop) and four (Rock, Classical, Country and Folk) group<br />

classification. The paper is organized as follows: Section 2<br />

covers the methodology, comprising the LDB algorithm, LDB selection<br />

process, feature extraction and pattern classification. Re-<br />

sults and Discussions are covered in Section 3, and Conclu-<br />

sions in Section 4.<br />



2. METHODOLOGY<br />

2.1. Local Discriminant Bases Algorithm<br />

In the LDB [7] algorithm with wavelet packet bases, a set<br />
of training signals x_i^p for all P classes is decomposed into<br />
a full tree structure of order N. The indexes p and i represent<br />
the pth signal class and the ith signal in the training set of the<br />
pth class. We restrict our analysis to binary wavelet packet<br />
trees [8]. Let Ω_{0,0} be a vector space in R^{2^N} corresponding to<br />
the root node of the parent tree. Then at each level the vector<br />
space is split into two mutually orthogonal subspaces given<br />
by Ω_{j,k} = Ω_{j+1,2k} ⊕ Ω_{j+1,2k+1}, where j indicates the level<br />
of the tree and k represents the node index in level j, given<br />
by k = 0, 1, ..., 2^j − 1. This process repeats till level J,<br />
giving rise to 2^J mutually orthogonal subspaces. Each node<br />
(j, k) contains a set of basis vectors B_{j,k},<br />
where 2^N corresponds to the length of the signal. Then the<br />
signal x_i^p can be represented by a set of coefficients as:<br />
x_i^p = Σ_{k=0}^{2^J−1} Σ_l a_{J,k,l} B_{J,k,l}.<br />
Basically the signal x_i^p is decomposed into 2^J subspaces<br />
with a_{J,k,l} coefficients in each subspace. With the training<br />
signals decomposed into wavelet packet coefficients, we<br />
need to define a dissimilarity measure D in the vector<br />

space so as to identify those subspaces, which have larger<br />

statistical distance between classes. This dissimilarity measure<br />
is used in an iterative manner to prune the tree in such<br />
a way that a node is split only if the cumulative discriminative<br />
measure of the children nodes is greater than that of the parent<br />
node. The resulting tree contains the most significant<br />
LDBs, from which a set of K significant LDBs is selected<br />

to construct the final tree. The testing set signals are then<br />

expanded using this tree and features are extracted from the<br />

respective basis vectors for classification.<br />
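The binary subspace splitting above can be sketched directly: node (j, k) splits into children (j+1, 2k) and (j+1, 2k+1), so a full tree of depth J ends in 2^J mutually orthogonal subspaces. The helper names are illustrative.

```python
def children(j, k):
    """Children of wavelet-packet node (j, k) in a binary tree."""
    return (j + 1, 2 * k), (j + 1, 2 * k + 1)

def nodes_at_level(J):
    """All node indices at depth J of the full binary decomposition tree."""
    nodes = [(0, 0)]
    for _ in range(J):
        nodes = [c for n in nodes for c in children(*n)]
    return nodes
```
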

2.2. LDB selection process<br />

In the proposed method, we use a modified LDB approach.<br />

Instead of using a single dissimilarity measure, we use two<br />

dissimilarity measures (D1 and D2) to arrive at the final tree<br />

structure. Using multiple dissimilarity measures provides<br />

more feature dimensions for classification. Especially for<br />

complex data sets like music signals, a single dissimilarity<br />

measure may not be able to capture all the characteristic information<br />

about its class. Also instead of the selective splitting<br />

of the nodes, which basically helps in removing the redundancy<br />

in the LDB selection, we used all the nodes from the<br />

full decomposition tree. The redundancy within the final set<br />

of LDBs was later removed in the feature evaluation process.<br />

The goal is to identify those nodes (LDB) from the full<br />

wavelet packet tree which demonstrate high discriminatory<br />

values between all the classes for a given dissimilarity mea-<br />

sure D,. If there are say P classes then the dissimilarity<br />

measure was computed by taking 2 classes at a time i.e.<br />

PC2 combinations, where C stands for the mathematical<br />

operation of combinations. The nodes which show rela-<br />

tively higher discriminatory power compared to all the other<br />

nodes in each of the PC2 combinations were chosen as<br />

LDBs for that particular combination. The LDB nodes are<br />

then sorted by their discriminatory power and the first Q<br />

LDBs were chosen for further processing. This is repeated<br />

for T trials using different audio signals for each of the<br />

classes. All the Q x T LD3s for each of the PC2 com-<br />

bination over the T trials were analyzed for number of oc-<br />

currences. The first 2 highly occurring LDBs for each of<br />

the PCz combinations was chosen as the best LDBs for that<br />

particular combination of classes. So after all the trials we<br />

will have 2PCz LDBs, from which we choose the first 10<br />

highly occurring LDBs over all the combinations. At the<br />

end of this selection process we wil1 have 10 LDBs in the<br />

wavelet packet tree that demonstrate relatively high discrim-<br />

inatory behavior among all the combination of P classes.<br />

In other words, these nodes demonstrate high statistical distance<br />
between all the P classes.<br />
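The occurrence-counting selection above can be sketched with a simple voting scheme. The `trial_rankings` structure — a map from each pairwise class combination to its per-trial ranked node lists — is an assumption made for illustration.

```python
from collections import Counter

def select_ldbs(trial_rankings, Q=5, per_pair=2, final=10):
    """Vote for nodes: pool the top-Q nodes of every trial, keep the
    `per_pair` most frequent per class combination, then return the
    `final` most frequent nodes over all combinations."""
    pooled = []
    for trials in trial_rankings.values():
        counts = Counter(node for ranking in trials for node in ranking[:Q])
        pooled.extend(node for node, _ in counts.most_common(per_pair))
    return [node for node, _ in Counter(pooled).most_common(final)]
```
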

In this study the following values were chosen: P =<br />
6 and 4, Q = 5, and T = 10. We also tested the database<br />
with a few variations of wavelets (db, coif, sym) and observed<br />
the sym4 wavelet to provide better discrimination between the<br />

classes. Hence, the results presented in this study are based<br />

on the sym4 wavelet packet decompositions. As we also<br />

used two different dissimilarity measures in selecting the<br />

LDBs to enhance the classification accuracy, at the end of<br />

the LDB selection process we will have 2 x 10 LDBs using<br />

the two dissimilarity measures. These 20 LDBs can be used<br />

to construct a composite wavelet packet tree which is used<br />

to decompose the testing set and extract features as will be<br />

explained in Section 2.3.<br />

2.2.1. Dissimilarity measures<br />


The first dissimilarity measure D1 is the difference in the<br />

normalized energy between the corresponding nodes of the<br />

training signals from one of the PC2 combinations of classes.<br />

This gives the difference in the energy distribution of the<br />

signals on the TF plane. Audio signals exhibit different<br />

TF energy distribution patterns based on their composition.<br />

Hence this measure is expected to approximately reveal the<br />

different energy concentration locations on the TF plane for<br />

different types of audio signals.<br />

D1(j,k) = | E^p_{j,k} − E^q_{j,k} |,   (3)<br />
where E^p_{j,k} and E^q_{j,k} are the normalized energies of the corresponding<br />
nodes for the PC2 combination signals. Fig. 1<br />
shows a sample LDB tree obtained using the dissimilarity<br />
measure D1.<br />
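Measure D1 can be sketched as below, under the assumption that each node's energy is normalized by the signal's total coefficient energy before the class-wise difference is taken; the function names are illustrative.

```python
def normalized_node_energies(node_coeffs):
    """Map each node to its share of the signal's total coefficient energy."""
    energies = {n: sum(c * c for c in cs) for n, cs in node_coeffs.items()}
    total = sum(energies.values())
    return {n: e / total for n, e in energies.items()}

def d1(node_coeffs_p, node_coeffs_q):
    """Per-node difference in normalized energy between two class signals."""
    e_p = normalized_node_energies(node_coeffs_p)
    e_q = normalized_node_energies(node_coeffs_q)
    return {n: abs(e_p[n] - e_q[n]) for n in e_p}
```
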

The discriminant measure D2 is a measure estimating<br />
the randomness or non-stationarity of the basis vectors.<br />

It is computed as the set of variances along the segments of<br />

the basis vector coefficients. The ratio of this variance mea-<br />

sure between the signals from each of the PC2 combination<br />

of classes indicate the amount of deviation observed in the<br />

non-stationarity between the classes. One of the important<br />

characteristics of an audio signal is its time varying signal<br />

structures. This variability in time-varying signal structures<br />

can be approximated using the discriminant measure D2.<br />

where p and q are the index of the L segments obtained by<br />

segmenting the basis vectors at node (j, k) for one of the<br />

PC2 combination of classes. Fig. 2 shows a sample LDB<br />
tree obtained using the dissimilarity measure D2.<br />
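The segment-variance computation underlying D2 can be sketched as follows. The exact way the per-segment variances of the two classes are combined is an assumption based on the description above (a per-segment ratio is used here); the function names are illustrative.

```python
def segment_variances(coeffs, n_segments):
    """Variances over equal-length segments of a node's basis coefficients."""
    seg_len = len(coeffs) // n_segments
    variances = []
    for s in range(n_segments):
        seg = coeffs[s * seg_len:(s + 1) * seg_len]
        mean = sum(seg) / len(seg)
        variances.append(sum((c - mean) ** 2 for c in seg) / len(seg))
    return variances

def d2(coeffs_p, coeffs_q, n_segments):
    """Assumed form: ratio of per-segment variances between two classes."""
    vp = segment_variances(coeffs_p, n_segments)
    vq = segment_variances(coeffs_q, n_segments)
    return [p / q for p, q in zip(vp, vq)]
```
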

Fig. 1. A sample LDB tree obtained using the dissimilarity<br />
measure D1 and sym4 wavelet.<br />

2.3. Feature extraction<br />

An audio database consisting of 24 Rock, 35 Classical, 31<br />

Country, 21 Jazz, 34 Folk and 25 Pop signals (a total of 170<br />

signals) was used in this study. Each of the signals in this<br />
database was extracted from an individual music CD. All the<br />
samples were 5 seconds long, sampled at 44.1 kHz. After<br />
the LDBs were selected as described in the previous section,<br />
a composite wavelet packet tree was constructed using<br />

all the 20 LDBs. The signals from the audio database were<br />


Fig. 2. A sample LDB tree obtained using the dissimilarity<br />

measure D2 and sym4 wavelet.<br />

decomposed using this composite wavelet packet tree. The<br />

basis vectors from each of the LDB nodes of this wavelet<br />
packet tree can be directly used as features. However, considering<br />
the dimensions of the basis vectors, we extract the<br />
discriminatory values using the same dissimilarity measures<br />
(D1 and D2) on the LDB nodes and use them as features. So<br />
for each audio signal we will have 10 features using each of<br />
the dissimilarity measures. In total we will have 20 features<br />

for each signal. The combination of these 20 features was<br />
evaluated for significance in class separability. A<br />
wrapper approach was used to select the highly discriminative<br />
feature set. In the wrapper approach the features are either<br />
added or removed sequentially one by one and the classification<br />
accuracy is computed using the classifier. The set<br />
of features which provided the minimum classification error was<br />
chosen as the optimal feature set. The resulting set of<br />
features was fed to a pattern classifier as explained in<br />
the next section.<br />
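The wrapper selection can be sketched as a sequential forward search; `error_fn` is assumed to train and evaluate the classifier on a candidate feature subset and return its error.

```python
def forward_select(n_features, error_fn):
    """Add features one at a time while classification error keeps dropping."""
    selected, best_err = [], float("inf")
    remaining = list(range(n_features))
    while remaining:
        # try adding each remaining feature; keep the best candidate
        err, feat = min((error_fn(selected + [f]), f) for f in remaining)
        if err >= best_err:
            break                      # no improvement: stop
        selected.append(feat)
        remaining.remove(feat)
        best_err = err
    return selected, best_err
```
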


2.4. Pattern Classification<br />

The motivation for the pattern classification is to automat-<br />

ically group signals of same characteristics using the dis-<br />

criminatory features derived as explained in the previous<br />

section. Pattern classification was carried out using an LDA-<br />
based classifier. In LDA, the feature vectors derived as explained<br />
above were transformed into canonical discriminant<br />
functions [9] such as<br />
f = u1·b1 + u2·b2 + ... + un·bn + a,   (5)<br />
where {u} is the set of highly discriminative features, {b}<br />
and a are the coefficients and the constant respectively. Using<br />

the discriminant scores and the prior probability values of<br />




each group, the posterior probabilities of each sample oc-<br />

curring in each of the groups are computed. The sample is<br />

then assigned to the group with the highest posterior proba-<br />

bility.<br />

The classification accuracy was estimated using the leave-one-out<br />
method, which is known to provide a least-biased estimate<br />
[10]. In the leave-one-out method, one sample is excluded<br />

from the dataset and the classifier is trained with the remain-<br />

ing samples. Then the excluded signal is used as the test<br />

data and the classification accuracy is determined. This is<br />

repeated for all samples of the dataset. Since each signal<br />

is excluded from the training set in turn, the independence<br />

between the test and training sets is maintained.<br />
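The leave-one-out procedure can be sketched generically; a 1-nearest-neighbour rule stands in for the LDA classifier purely for illustration.

```python
def loo_accuracy(X, y, train_fn):
    """Hold each sample out in turn, train on the rest, score the hold-out."""
    correct = 0
    for i in range(len(X)):
        predict = train_fn(X[:i] + X[i + 1:], y[:i] + y[i + 1:])
        correct += predict(X[i]) == y[i]
    return correct / len(X)

def nearest_neighbour(X_train, y_train):
    """Toy classifier: label of the closest training sample."""
    def predict(x):
        dists = [sum((a - b) ** 2 for a, b in zip(row, x)) for row in X_train]
        return y_train[dists.index(min(dists))]
    return predict
```
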

3. RESULTS AND DISCUSSIONS<br />

All the signals from the audio database were decomposed<br />
using the LDB wavelet packet tree and features were extracted<br />
as explained in Sections 2.2 and 2.3. The<br />
features were then fed to the LDA-based classifier. The classification<br />
accuracies were computed and verified using the<br />
leave-one-out method. Table 1 shows the classification accuracies<br />
achieved for a six group classification. An overall<br />
classification accuracy of 77% using regular LDA and 65%<br />
using the leave-one-out method were achieved. From the table<br />

it can be observed that the Classical and Country classes<br />

were classified more accurately (94% and 84%) followed<br />

by the Rock and Folk (79% and 79%). We observe that<br />

the classes Jazz and Pop have significant overlap with other<br />

classes, showing a poor classification accuracy. Fig. 3 shows<br />

the scatter plot of the 6 group classification. The cluster-<br />

ing behaviour of the classes can be observed. Rock, Classi-<br />

cal, Country and Folk classes show distinct clusters whereas<br />

the Jazz and Pop overlap with the clusters of Classical and<br />

Country. It is hard to perceptually arrive at a clear boundary<br />

between music types. There always exists a natural overlap<br />

between similar types of music signals. However, the poor<br />

classification of Jazz and Pop in our case may be attributed<br />

to the insufficient and less discriminatory clues (features)<br />

used in this study. Fine tuning and adding more dissimilar-<br />

ity measures can improve the overall classification accuracy.<br />

As we observed significant overlap from the Jazz and<br />

Pop classes, we performed a second classification using only<br />

4 groups (124 signals), removing Jazz and Pop. This was<br />

done to assess the performance of the classifier with the re-<br />

maining 4 groups. As expected, the overall classification ac-<br />

curacy improved from 77% to 88% for the regular LDA and<br />

from 65% to 82% using the leave-one-out method as shown in Ta-<br />

ble 2. Fig. 4 shows a clearer clustering behavior of the 4<br />

classes. The results obtained are from our initial testing of<br />

the proposed technique. The authors' previous work [6] using<br />

a different TF approach has provided better classification<br />

accuracies, however with double the size of the reported<br />

Method: Original<br />

Gr | Ro | Cl | CO | Ja | Fo | Po | CA%<br />

Ro | 19 |  0 |  5 |  0 |  0 |  0 | 79.2<br />

Cl |  0 | 33 |  0 |  1 |  0 |  1 | 94.3<br />

Table 1. Six group classification results. Method: Origi-<br />

nal - Regular linear discriminant analysis, Cross - validated<br />

- Linear discriminant analysis with leave-one-out method,<br />

CA% - Classification accuracy rate, Gr-<strong>Group</strong>s, Ro-Rock,<br />

Cl-Classical, CO-Country, Fo-Folk, Ja-Jazz, Po-Pop.<br />


Fig. 3. Six groups scatter plot with the first two canonical<br />

discriminant functions<br />

features in this work. Also restricting the final significant<br />

LDBs to 10 from the set of 2PC2 controls the classification<br />

accuracy.<br />

4. CONCLUSIONS<br />

A novel LDB based audio classification scheme was pre-<br />

sented. High classification accuracies were achieved using<br />

the proposed methodology. Initial results suggest significant<br />

potential for LDB based audio classification. Simple dis-<br />

similarity measures like node energy and non-stationarity<br />

index performed well in identifying the discriminatory nodes<br />




Gr | Ro | Cl | CO | Fo | CA%<br />

CO |  – |  – | 28 |  – | 90.3  (Original)<br />

Cl |  0 | 32 |  3 |  0 | 91.4  (Cross-validated)<br />

CO |  1 |  4 | 26 |  0 | 83.9  (Cross-validated)<br />

Table 2. Four group classification results. Method: Origi-<br />

nal - Regular linear discriminant analysis, Cross - validated<br />

- Linear discriminant analysis with leave-one-out method,<br />

CA% - Classification accuracy rate, Gr-<strong>Group</strong>s, Ro-Rock,<br />

Cl-Classical, CO-Country, Fo-Folk.<br />


Fig. 4. Four groups scatter plot with the first two canonical<br />

discriminant functions<br />

between the music classes. Future work involves improving<br />

the LDB selection process, arriving at an optimal number of<br />

LDBs for a given classification problem, and including more<br />

dissimilarity measures for audio classification.<br />

5. ACKNOWLEDGEMENTS<br />

The authors thank NSERC for funding this<br />

project. The authors also acknowledge the contributions of<br />

Andre Chang, a research assistant in the <strong>Signal</strong> <strong>Analysis</strong><br />

and <strong>Research</strong> (<strong>SAR</strong>) group at <strong>Ryerson</strong> <strong>University</strong>, Toronto,<br />

Canada.<br />


6. REFERENCES<br />

[1] H. G. Kim, N. Moreau, and T. Sikora, “Audio clas-<br />

sification based on MPEG-7 spectral basis representa-<br />

tions,” IEEE Transactions on Circuits and Systems for<br />

Video Technology, vol. 14, no. 5, pp. 716-725, May<br />

2004.<br />

[2] Lie Lu and Hong-Jiang Zhang, “Content analysis for<br />

audio classification and segmentation,” IEEE Trans-<br />

actions on Speech and Audio Processing, vol. 10, no.<br />

7, pp. 504-516, Oct 2002.<br />

[3] George Tzanetakis and Perry Cook, “Music genre<br />

classification of audio signals,” IEEE Transactions on<br />

Speech and Audio Processing, vol. 10, no. 5, pp. 293-<br />

302, July 2002.<br />

[4] G. Guo and S. Z. Li, “Content-based audio classifica-<br />

tion and retrieval by support vector machines,” IEEE<br />

Transactions on Neural Networks, vol. 14, no. 1, pp.<br />

209-215, Jan 2003.<br />

[5] S. Esmaili, S. Krishnan, and K. Raahemifar, “Con-<br />

tent based audio classification and retrieval using joint<br />

time-frequency analysis,” in IEEE International Con-<br />

ference on Acoustics, Speech and Signal Processing<br />

(ICASSP), May 2004, pp. V-665-668.<br />

[6] K. Umapathy, S. Krishnan, and S. Jimaa, “Multi-<br />

group classification of audio signals using time-<br />

frequency parameters,” IEEE Transactions on Mul-<br />

timedia, in press.<br />

[7] N. Saito and R. R. Coifman, “Local discriminant<br />

bases and their applications,” Journal of Mathemat-<br />

ical Imaging and Vision, vol. 5, no. 4, pp. 337-358,<br />

1995.<br />

[8] Stephane Mallat, A Wavelet Tour of Signal Processing,<br />

Academic Press, San Diego, CA, 1998.<br />

[9] SPSS Inc., “SPSS advanced statistics user's guide,” in<br />

User manual, SPSS Inc., Chicago, IL, 1990.<br />

[10] K. Fukunaga, Introduction to Statistical Pattern<br />

Recognition, Academic Press, Inc., San Diego, CA,<br />

1990.<br />




A NOVEL ROBUST IMAGE WATERMARKING USING A CHIRP<br />

BASED TECHNIQUE<br />

Arunan Ramalingam and Sridhar Krishnan<br />

Department of Electrical and Computer Engineering,<br />

<strong>Ryerson</strong> <strong>University</strong>, Toronto, Ontario, Canada M5B 2K3<br />

Email: (aramalin)(krishnan) @ee.ryerson.ca<br />

Abstract<br />

In this study, we propose a new spread spectrum im-<br />

age watermarking algorithm that embeds linear chirps as<br />

watermark messages. The slopes of the chirps on the time-<br />

frequency (TF) plane represent watermark messages such<br />

that each slope corresponds to a different message. We<br />

extract the watermark message using a line detection al-<br />

gorithm based on the Hough-Radon transform (HRT). The<br />

HRT detects the directional elements that satisfy a paramet-<br />

ric constraint in the image of a TF plane. The proposed<br />

method not only detects the presence of the watermark, but also<br />

extracts the embedded watermark bits and ensures the mes-<br />

sage is received correctly. The robustness of the proposed<br />

scheme has been evaluated using common image processing<br />

techniques such as JPEG compression and image cropping,<br />

and we found the maximum bit error rate to be 19.03%,<br />

which is zero after postprocessing using HRT.<br />

Keywords: Image Watermarking, Spread Spectrum, Data<br />

Hiding, Hough-Radon Transform, Chirp Modulation.<br />

1. INTRODUCTION<br />

The huge success of the Internet allows for the trans-<br />

mission, wide distribution, and access of electronic data in<br />

an effortless manner. Content providers are faced with the<br />

challenge of how to protect their electronic data. One of<br />

the possible solutions in that area is digital watermarking, which<br />

is added to multimedia content by embedding an imper-<br />

ceptible and statistically undetectable signature. Thereby,<br />

multimedia data creators and distributors are able to prove<br />

ownership of intellectual property rights without preventing<br />

other individuals from copying the multimedia content itself. In<br />

this study, we propose a novel chirp based watermarking<br />

scheme [1] for images that embeds linear chirps as water-<br />

mark messages. Different chirp rates, i.e., slopes on the<br />

time-frequency (TF) plane, represent watermark messages<br />

such that each slope corresponds to a different message. The<br />

narrowband watermark messages are spread with a water-<br />

mark key (PN sequence) across a wider range of frequen-<br />

CCECE 2004 - CCGEI 2004, Niagara Falls, May/mai 2004<br />

0-7803-8253-6/04/$17.00 @ 2004 IEEE<br />


cies before embedding. The resulting wideband noise is<br />

added to the perceptually significant regions of the origi-<br />

nal image. We use the block-based discrete cosine trans-<br />

form (DCT) scheme for inserting the watermark. As a re-<br />

sult of image manipulations some message bits extracted by<br />

the detector may be in error, potentially resulting in the de-<br />

tection of the wrong watermark message. Our motivation<br />

for the proposed image watermarking algorithm is to detect<br />

the presence of the watermark, extract the embedded wa-<br />

termark message bits and decide on the watermark message<br />

even if some bits are received in error. As chirps are repre-<br />

sented as lines in a TF plane, a line detection algorithm such<br />

as the HRT has been applied to extract the watermark messages<br />

successfully.<br />
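A linear chirp whose TF slope carries the message, and its sign-quantized version, can be generated as sketched below. The parameter values and function names are illustrative assumptions, not the authors' code.

```python
import math

def linear_chirp(f0, f1, fs, n):
    """Sample a linear chirp whose instantaneous frequency sweeps from f0 to
    f1 Hz over n samples at sampling rate fs. The sweep rate (the slope of
    the line on the TF plane) is what encodes the message."""
    duration = n / fs
    rate = (f1 - f0) / duration  # chirp rate in Hz/s = TF-plane slope
    return [math.sin(2 * math.pi * (f0 * t + 0.5 * rate * t * t))
            for t in (i / fs for i in range(n))]

def quantize_sign(m):
    """Quantize chirp samples to +/-1 by sign, giving the bit sequence
    m^q that is actually embedded."""
    return [1 if v >= 0 else -1 for v in m]

# Illustrative values: 176 message bits, frequencies within [0, 500] Hz at fs = 1 kHz.
m = linear_chirp(f0=50, f1=400, fs=1000, n=176)
mq = quantize_sign(m)
```

Different (f0, f1) pairs give different slopes, and hence different watermark messages.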

2. WATERMARK EMBEDDING<br />

Let m be a normalized chirp function that represents the<br />

watermark message to be embedded into the original image.<br />

m takes continuous values in the interval [-1, 1], and needs<br />

to be quantized for the detection of each embedded bit. mq<br />

is the quantized version of m formed according to the sign<br />

of the sample values of m, taking values -1 and 1. Let m_k^q<br />

represent a watermark message bit to be embedded into the<br />

image. Each bit m_k^q is spread with a cyclically shifted version<br />

p_k of a binary PN sequence with a chip length of N and<br />

summed together to generate the wideband noise vector w:<br />

w = Σ_{k=0}^{M} m_k^q p_k, (1)<br />

where M is the number of watermark message bits in mq.<br />

There is always a trade-off between<br />

the embedded data size and the robustness of the algorithm: as<br />

the PN length decreases, the algorithm is able to add<br />

more bits into the host image but the detection of the hidden<br />

bits and resistance to different attacks are decreased. The<br />

wideband noise vector w formed is added to the image in<br />

perceptually significant regions to ensure robustness of the<br />

watermark against attacks. The length of w and hence the<br />

Authorized licensed use limited to: <strong>Ryerson</strong> <strong>University</strong> Library. Downloaded on July 7, 2009 at 10:55 from IEEE Xplore. Restrictions apply.<br />

93


number of watermark bits that can be embedded depends on<br />

the perceptual entropy of the image.<br />
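The spreading step of Eq. (1) — each bit modulates a cyclically shifted PN sequence and the results are summed — can be sketched as follows. The toy chip length is an illustrative assumption (the paper finds at least 10000 samples are needed in practice), as are all names.

```python
import random

def spread(bits, pn, chip_len):
    """Spread each message bit over a cyclically shifted PN sequence and
    sum the results, i.e. w = sum_k m_k * shift(pn, k), as in Eq. (1)."""
    w = [0] * chip_len
    for k, b in enumerate(bits):
        s = k % chip_len
        shifted = pn[s:] + pn[:s]  # cyclic shift of the PN sequence by k
        for i in range(chip_len):
            w[i] += b * shifted[i]
    return w

random.seed(0)
pn = [random.choice((-1, 1)) for _ in range(64)]  # toy chip length
w = spread([1, -1, 1], pn, 64)
```

The resulting wideband vector w is what gets added to the perceptually significant DCT coefficients of the host image.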

To embed the watermark in the image, we can utilize<br />

the models that describe the masking characteristics of the<br />

human visual system [2]. Among such models, we use<br />

the model based on the just noticeable difference (JND)<br />

paradigm [3]. A set of JNDs is associated with a particu-<br />

lar invertible transform T. Given that a multimedia signal<br />

is transformed using T, the JNDs provide an upper bound<br />

on the extent that each of the coefficients can be perturbed<br />

without causing perceptual changes to the signal quality.<br />

The set of signal- and transform-dependent JNDs can be de-<br />

rived using complex analytic models or through experimen-<br />

tation. The JND paradigm is widely used in image com-<br />

pression, and image watermarking applications. We use the<br />

JNDs to determine the perceptually significant regions and<br />

also to find the perceptual entropy of the image. In this work<br />

we use one such model based on the DCT.<br />

DCT Based Model<br />

We use the model proposed by Watson [4] that has been<br />

applied to JPEG coding. In this method, the original image<br />

is decomposed into non-overlapping 8x8 blocks, and the<br />

DCT is performed independently for every block of data.<br />

Let the original image pixels be represented as x_{i,j,b}, where<br />

Fig. 1. Watermark embedding scheme.<br />

i and j represent the pixel elements in block b, and X_{u,v,b}<br />

denotes the DCT coefficient for the basis function located<br />

at position u, v of block b. A frequency thresh-<br />

old value is derived for each DCT basis function, which in<br />

this case results in an 8x8 matrix of t_{u,v} threshold values.<br />

These threshold values are determined for various view-<br />

ing conditions by Peterson et al. [5]. The visual model<br />

we used is for a minimum viewing distance of four picture<br />

heights and a D65 monitor white point. Watson further re-<br />


fines this model by adding a luminance sensitivity and con-<br />

trast masking component. Luminance sensitivity threshold<br />

is estimated by the formula<br />

t^l_{u,v,b} = t_{u,v} (X_{0,0,b} / X̄_{0,0})^a, (2)<br />

where XO,O,b is the DC coefficient of the DCT for block b,<br />

X̄_{0,0} is the DC coefficient corresponding to the mean luminance<br />

of the display, and a is a parameter which controls the<br />

degree of luminance sensitivity. The authors in [5] suggest<br />

setting the value of a to 0.649. Given a DCT coefficient and<br />

a corresponding threshold value derived from the viewing<br />

conditions and local luminance masking, a contrast masking<br />

threshold is derived as<br />

t^c_{u,v,b} = max[ t^l_{u,v,b}, |X_{u,v,b}|^{w_{u,v}} (t^l_{u,v,b})^{1 - w_{u,v}} ], (3)<br />

where w_{u,v} is a number between zero and one, and is empir-<br />

ically derived as 0.7 by the authors in [5]. The watermark<br />

embedding scheme is based on the model proposed in [6].<br />

The watermark encoder for the DCT scheme is described as<br />

X*_{u,v,b} = X_{u,v,b} + t^c_{u,v,b} w_{u,v,b}  if X_{u,v,b} > t^c_{u,v,b};  otherwise X*_{u,v,b} = X_{u,v,b}, (4)<br />

where X_{u,v,b} refers to the DCT coefficients, X*_{u,v,b} refers to<br />

the watermarked DCT coefficients, w_{u,v,b} is obtained from<br />

the wideband noise vector w, and t^c_{u,v,b} is the computed<br />

JND calculated from the visual model described in Eq. (3).<br />

Fig. 1 shows the block diagram of the described watermark<br />

encoding scheme.<br />
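The thresholds of Eqs. (2) and (3) and the JND-limited embedding step can be sketched as below. The numeric values and function names are illustrative assumptions; the embedding rule follows the perceptually-significant-coefficient idea described in the text, not necessarily the exact form used in [6].

```python
def luminance_threshold(t_uv, dc_b, dc_mean, a=0.649):
    """Eq. (2): scale the frequency threshold t_{u,v} by the block's DC
    level relative to the display's mean luminance."""
    return t_uv * (dc_b / dc_mean) ** a

def contrast_threshold(t_l, coeff, w=0.7):
    """Eq. (3): contrast-masking threshold derived from the luminance
    threshold and the coefficient magnitude (w = 0.7 per [5])."""
    return max(t_l, (abs(coeff) ** w) * (t_l ** (1 - w)))

def embed(coeff, t_c, noise):
    """Perturb a DCT coefficient by at most its JND, and only when the
    coefficient is perceptually significant (larger than the JND)."""
    return coeff + t_c * noise if abs(coeff) > t_c else coeff

# Illustrative numbers: one coefficient of one 8x8 block.
t_l = luminance_threshold(t_uv=1.0, dc_b=120.0, dc_mean=128.0)
t_c = contrast_threshold(t_l, coeff=30.0)
x_marked = embed(30.0, t_c, noise=1)
```

Bounding the perturbation by the JND is what keeps the watermark imperceptible while placing it in the perceptually significant regions.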

3. WATERMARK DETECTION<br />

The received image may be different from the water-<br />

marked image due to some intentional or unintentional im-<br />

age processing operations such as lossy compression, shift-<br />

ing and downsampling. Fig. 2 shows the block diagram of<br />

the described watermark decoding scheme. The detection<br />

scheme for the DCT-based watermarking can be expressed<br />

as<br />

w̃_{u,v,b} = (X̃*_{u,v,b} − X_{u,v,b}) / t^c_{u,v,b}, (6)<br />

where X̃*_{u,v,b} are the coefficients of the received watermarked<br />

image, and w̃ is the received wideband noise vector. The re-<br />

ceived wideband noise vector can be expressed as<br />

Authorized licensed use limited to: <strong>Ryerson</strong> <strong>University</strong> Library. Downloaded on July 7, 2009 at 10:55 from IEEE Xplore. Restrictions apply.<br />

w̃ = w + n, (7)


where n is the distortion component resulting from hostile<br />

image manipulations and is modeled as a zero-mean ran-<br />

dom vector uncorrelated with the PN sequence. We use the<br />

watermark key, i.e., the appropriately circular shifted PN<br />

sequence pk to despread %, and integrate the resulting se-<br />

quence to generate a test statistic (%, pr). The sign of the<br />

expected value of the statistic depends only on the emhed-<br />

ded watermark hit mi. Hence the watermark hits can be<br />

estimated using the decision rule:<br />

m̂_k = sgn( ⟨w̃, p_k⟩ ).<br />

Fig. 4. Line detection using HRT.<br />
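The despreading correlation and sign decision described above can be sketched as follows; the tiny PN sequence and the added noise are illustrative assumptions.

```python
def despread_bit(w_recv, pn, shift):
    """Correlate the received wideband vector with the cyclically shifted
    PN sequence and decide the bit from the sign of the test statistic."""
    n = len(pn)
    s = shift % n
    shifted = pn[s:] + pn[:s]
    stat = sum(a * b for a, b in zip(w_recv, shifted))
    return 1 if stat >= 0 else -1

# Round trip: embed one bit, add noise uncorrelated with the key, recover it.
pn = [1, -1, 1, 1, -1, -1, 1, -1]
w = [-1 * c for c in pn]                            # bit m = -1 spread over pn
noisy = [c + 0.3 * ((-1) ** i) for i, c in enumerate(w)]
bit = despread_bit(noisy, pn, shift=0)
```

Because the distortion n is modeled as uncorrelated with the PN sequence, it averages out in the correlation and the bit's sign survives.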

(WV) and the Hough space of the linear chirp received at<br />

a bit error rate of 19.03%. The prominence of the global<br />

maximum in the HRT space provides an indication of the<br />

presence of chirps in the TFD, thereby leading to successful<br />

watermark detection.)<br />

4. RESULTS AND DISCUSSION<br />

We evaluated the proposed scheme using eight differ-<br />

ent images of size 512x512. The sampling frequency f_s<br />

of the watermarks equals 1 kHz. Hence the initial and final<br />

frequencies, f_{0b} and f_{1b}, of the linear chirps representing<br />

all watermark messages are constrained to [0, 500] Hz. We<br />

embed these chirps into the images for a chip length of N,<br />

which depends on the perceptual entropy of the image. We<br />

experimentally found that for reliable detection of chirp un-<br />

der various image processing attacks, the chip length should<br />

be at least 10000 samples. If the image can support more than<br />

10000 samples, then multiple chirps can be embedded and<br />

the payload can be increased. In our tests, we used a sin-<br />

gle watermark sequence having 176 message bits. To mea-<br />

sure the robustness of the watermarking algorithm, we per-<br />

formed the following difficult image manipulation tests: (i)<br />

JPEG Compression, (ii) JPEG Compression and Cropping<br />

(1/4 Original), (iii) JPEG Compression and Cropping (1/16<br />

Original). The JPEG compression is performed with dif-<br />

ferent quality factors Q; a higher value of Q indicates better image<br />

quality. These tests are performed on the watermarked im-<br />

ages to simulate the image processing attacks and the water-<br />

mark message bits are extracted as described in Section 3.<br />

During all these robustness tests, we assume that the image<br />

and the PN sequence are synchronized. Figs. 5 - 6 show the<br />

bit error rate (BER) obtained for JPEG compression with<br />

quality factors 60 and 20 respectively, and with watermark<br />





Fig. 5. BER (in %) for JPEG compression with quality<br />

factor 60.<br />

message length of 176 bits. The extracted bits are localized<br />

in the TF domain using WV. Although some of the bits are<br />

received in error, HRT is able to detect the presence of chirp<br />

and estimate the parameters of the chirp for all the simula-<br />

tion results reported in the study.<br />
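The BER reported in Figs. 5 and 6 is simply the fraction of extracted bits that differ from the embedded ones; a minimal sketch, with made-up bit vectors:

```python
def bit_error_rate(sent, received):
    """Fraction (in %) of message bits that differ between the embedded
    and the extracted watermark."""
    errors = sum(1 for s, r in zip(sent, received) if s != r)
    return 100.0 * errors / len(sent)

sent = [1, -1, 1, 1, -1, 1, -1, -1]
received = [1, -1, -1, 1, -1, 1, 1, -1]  # two bits flipped by the attack
ber = bit_error_rate(sent, received)      # -> 25.0
```

Even at a nonzero BER like the 19.03% worst case reported here, the HRT line-detection step can still recover the chirp's slope, and hence the message.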

5. CONCLUSION<br />

In this paper, we proposed a novel image watermark-<br />

ing algorithm that embeds linear chirps as watermark mes-<br />

sages. The watermark message is added to the perceptually<br />

significant regions of the image to ensure robustness of the<br />

watermark to common image processing attacks. The algo-<br />

rithm is able to extract the watermark message even if some<br />

of the bits received are in error. A line detection algorithm<br />

based on the HRT detects the slope of the watermark mes-<br />

sage in the image of the TF plane of the chirp signal. The<br />

HRT provides error correcting capability and can be effi-<br />

ciently implemented as it operates on small images of the<br />

TF plane. Our studies confirm the robustness of the algo-<br />

rithm to image compression and cropping attacks. We are<br />

currently working on expanding our robustness tests and de-<br />

veloping a complete analytical model for the algorithm.<br />

Acknowledgements<br />

We would like to acknowledge Micronet for their finan-<br />

cial support.<br />


Fig. 6. BER (in %) for JPEG compression with quality<br />

factor 20.<br />

References<br />

[1] S. Erkucuk, S. Krishnan and M. Zeytinoglu, “Ro-<br />

bust Audio Watermarking Using a Chirp Based Tech-<br />

nique,” IEEE Intl. Conf. on Multimedia and Expo, vol.<br />

2, pp. 513-516, 2002.<br />

[2] M. S. Sanders and E. J. McCormick, Human Factors<br />

in Engineering and Design, McGraw-Hill, New York,<br />

7th edition, 1993.<br />

[3] N. Jayant, J. Johnston, and R. Safranek, “<strong>Signal</strong> Com-<br />

pression Based Models of Human Perception,” Pro-<br />

ceedings of the IEEE, vol. 81, pp. 1385-1422, October<br />

1993.<br />

[4] A. B. Watson, “DCT quantization matrices visually op-<br />

timized for individual images,” Proc. SPIE Conf. Hu-<br />

man Vision, Visual Processing, and Digital Display,<br />

vol. 1913, pp. 202-216, February 1993.<br />

[5] H. A. Peterson, A. J. Ahumada and A. B. Watson,<br />

“Improved detection model for DCT coefficient quan-<br />

tization,” Proc. SPIE Conf. Human Vision, Visual Pro-<br />

cessing, and Digital Display, vol. 1913, pp. 191-201,<br />

February 1993.<br />

[6] C. I. Podilchuk and W. Zeng, “Image-Adaptive Water-<br />

marking Using Visual Models,” IEEE Journal on Se-<br />

lected Areas in Communications, vol. 16, pp. 525-<br />

539, May 1998.<br />

[7] R. M. Rangayyan and S. Krishnan, “Feature identifica-<br />

tion in the time-frequency plane by using the Hough-<br />

Radon transform,” Pattern Recognition, vol. 34,<br />

pp. 1147-1158, 2001.<br />




A Novel Way of Lossless Compression of Digital Mammograms<br />

Using Grammar Codes<br />

Xiaoli Li, Sridhar Krishnan and Ngok-Wah Ma<br />

Department of Electrical and Computer Engineering<br />

<strong>Ryerson</strong> <strong>University</strong>, Toronto, ONT M5B 2K3, CANADA<br />

Phone: 416-979-5000 ext.6086 Fax: 416-979-5280<br />

Abstract-Breast cancer is the most common cancer among women<br />

in Canada. Despite slight declines in mortality rates over the past<br />

decade for women with breast cancer, one in nine Canadian<br />

women will develop breast cancer in her lifetime; one in 25<br />

Canadian women will die from this disease. Digital mammograms<br />

(X-rays of the breast) may allow better cancer diagnosis and have<br />

the ability to be transmitted electronically around the world. The<br />

problem is that mammograms are large images with little<br />

correlation detail. Therefore, for a physician to diagnose diseases<br />

correctly even through the communication networks, gaining<br />

higher compression to save bandwidth without any data loss<br />

becomes a challenging issue. Among the traditional lossless<br />

compression algorithms such as Huffman, Lempel-Ziv and<br />

Arithmetic, Lempel-Ziv and Arithmetic source coding techniques<br />

have better performance than Huffman on digital mammograms.<br />

In order to achieve better compression ratios we investigate the<br />

newly developed Grammar-based source code for medical image<br />

compression such as mammograms. In this Grammar-based code,<br />

the original data (image) is first transformed into a context free<br />

grammar, from which the original data sequence can be fully<br />

reconstructed by performing parallel and recursive substitutions,<br />

and an arithmetic coding algorithm is then used to compress the<br />

context-free grammar or the corresponding sequence of parsed<br />

phrases. We tested the grammar-based coding technique on digital<br />

mammograms obtained from the Mammographic Image <strong>Analysis</strong><br />

Society (MIAS). The result shows the newly developed grammar<br />

code performs better than the traditional lossless coding schemes.<br />

In general, the grammar-based lossless compression algorithm<br />

seems to be a promising technique for teleradiology applications.<br />

Keywords—Arithmetic coding, grammar-based codes,<br />

mammography, compression ratio.<br />

I. INTRODUCTION<br />

In this paper, we investigate a novel lossless source coding<br />

technique called the grammar code for lossless compression of<br />

mammograms. Universal source coding theory aims at designing<br />

data compression algorithms, whose performance is<br />

asymptotically optimal for a class of sources.<br />

To put things into perspective, let us first review briefly,<br />

from the information-theoretic point of view, the existing<br />

universal lossless data compression algorithms. So far, the most<br />

widely used universal lossless compression algorithms are<br />

arithmetic coding algorithms, Lempel-Ziv algorithms, and their<br />

variants. Arithmetic coding algorithms and their variants are<br />

statistical model-based algorithms. To use an arithmetic coding<br />

algorithm to encode a data sequence, a statistical model is either<br />

built dynamically during the encoding process, or assumed to<br />

exist in advance. Several approaches have been proposed in the<br />

CCECE 2004- CCGEI 2004, Niagara Falls, Maylmai 2004<br />

0-7803-8253-6/04/$17.00 02004 IEEE<br />



literature to build the statistical model dynamically. Typically, in<br />

all these methods, the next symbol in the data sequence is<br />

predicted by a proper context and coded by the corresponding<br />

estimated conditional probability. Arithmetic coding algorithms<br />

and their variants are universal only with respect to the class of<br />

Markov sources with order less than some designed parameter<br />

value. Note that in arithmetic coding, the original data sequence is<br />

encoded letter by letter. In contrast, no statistical model is used in<br />

Lempel-Ziv algorithms and their variants. During the encoding<br />

process, the original data sequence is parsed into non-overlapping,<br />

variable-length phrases according to some kind of string matching<br />

mechanism, and then encoded phrase by phrase. Each parsed<br />

phrase is either distinct or replicated with the number of<br />

repetitions less than or equal to the size of the source alphabet.<br />

Phrases are encoded in terms of their positions in a dictionary or<br />

database. Lempel-Ziv algorithms are universal with respect to a<br />

class of sources which is broader than the class of Markov sources<br />

of bounded order; the incremental parsing Lempel-Ziv algorithm<br />

[5] is universal for the class of stationary, ergodic sources.<br />
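The incremental (LZ78-style) parsing described above — non-overlapping, variable-length phrases where each new phrase is essentially distinct — can be illustrated with a minimal sketch (illustrative only, not the exact algorithm of [5]):

```python
def lz78_parse(seq):
    """Incremental parsing: each new phrase is the shortest prefix of the
    remaining input not yet in the dictionary, so every parsed phrase
    (except possibly the last) is distinct."""
    dictionary, phrases, current = set(), [], ""
    for ch in seq:
        current += ch
        if current not in dictionary:
            dictionary.add(current)
            phrases.append(current)
            current = ""
    if current:  # a trailing phrase may repeat an existing entry
        phrases.append(current)
    return phrases

phrases = lz78_parse("ababbabb")  # -> ['a', 'b', 'ab', 'ba', 'bb']
```

The distinctness of the phrases is precisely why, as noted below, applying arithmetic coding on top of Lempel-Ziv phrases gains little, whereas grammar-based parsing allows unbounded phrase repetition.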

Other universal compression algorithms include the dynamic<br />

Huffman algorithm [6], the move-to-front coding scheme [7] [8]<br />

[9], and some two-stage compression algorithms with codebook<br />

transmission [10] [11]. These algorithms are either inferior to<br />

arithmetic coding algorithms and Lempel-Ziv algorithms, or too<br />

complicated to implement.<br />

The class of grammar-based codes is broad enough to<br />

include block codes, Lempel-Ziv types of codes, multilevel<br />

pattern matching (MPM) grammar-based codes [2], and other<br />

codes as special cases. It has been proved in [1] that if a grammar-<br />

based code transforms each data sequence into an irreducible<br />

context-free grammar, then the grammar-based code is universal<br />

for the class of stationary ergodic sources. (For the definition of<br />

grammar-based codes and irreducible context free grammars,<br />

please see Section II.) Each irreducible context-free grammar also<br />

gives rise to a nonoverlapping, variable-length parsing of the data<br />

sequence it represents. Unlike the parsing in Lempel-Ziv<br />

algorithms, however, there is no upper bound on the number of<br />

repetitions of each parsed phrase. More repetitions of each parsed<br />

phrase imply that now there is room for arithmetic coding, which<br />

operates on phrases instead of letters. (In Lempel-Ziv algorithms,<br />

there is not much gain from applying arithmetic coding to parsed<br />

phrases since each parsed phrase is either distinct or replicated<br />

with the number of repetitions less than or equal to the size of the<br />

source alphabet.)<br />

In Section II, we review how context-free grammars are<br />

used to represent a sequence x, and refer the reader to the articles that<br />

explain how the reduction rules are used for designing grammar<br />

transforms, and how to efficiently encode grammars. In Section<br />

III, we describe how we implemented the new algorithm for real<br />

cases and the compression performances of the grammar code and<br />

other traditional lossless compression techniques for<br />



mammographic images. We also discuss what the advantage and<br />

disadvantage of the new algorithm are and why it is a promising<br />

algorithm after surmounting a few problems.<br />

II. REVIEW OF THE NEW UNIVERSAL CONTEXT-FREE<br />

LOSSLESS DATA COMPRESSION ALGORITHM BASED<br />

ON A GREEDY CONTEXT-FREE SEQUENTIAL<br />

GRAMMAR TRANSFORM<br />

The purpose of this section is to briefly review the new<br />

grammar-based code we applied so that this paper is self-contained.<br />

For the detailed description of the grammar-based<br />

codes, please refer to [3].<br />

Let A be our source alphabet with cardinality greater than or<br />

equal to 2. Let A+ be the set of all finite strings of positive length<br />

from A. |x| denotes the length of x. To avoid possible confusion, a<br />

sequence from A is sometimes called an A-sequence. Let x ∈ A+<br />

be a sequence to be compressed.<br />

A grammar-based code has the structure shown in Fig. 1.<br />

The original data sequence x is first transformed into a context-<br />

free grammar (or simply a grammar) G from which x can be fully<br />

recovered, and then G is compressed indirectly by using a zero-<br />

order arithmetic code. Before bringing in the grammar transform,<br />

we begin with explaining how context-free grammars are used to<br />

represent sequences x in A+.<br />

Figure 1: Structure of a grammar-based code.<br />
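The grammar-transform stage of Fig. 1 can be illustrated with a simple pair-replacement (Re-Pair-style) transform plus the parallel substitution that recovers the original sequence. This is a stand-in for the greedy sequential transform of [3], not the authors' exact algorithm, and all names are assumptions.

```python
def pair_grammar(seq):
    """Build a context-free grammar for seq by repeatedly replacing the
    most frequent adjacent pair with a fresh variable (Re-Pair style).
    Repeated phrases map to the same variable, which is what later makes
    arithmetic coding over phrases effective."""
    rules, symbols, next_var = {}, list(seq), 0
    while True:
        counts = {}
        for a, b in zip(symbols, symbols[1:]):
            counts[(a, b)] = counts.get((a, b), 0) + 1
        pair = max(counts, key=counts.get, default=None)
        if pair is None or counts[pair] < 2:
            break
        var = f"s{next_var}"
        next_var += 1
        rules[var] = pair
        out, i = [], 0
        while i < len(symbols):  # replace non-overlapping occurrences
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(var)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        symbols = out
    return symbols, rules

def expand(symbols, rules):
    """Recursive substitution recovers the original sequence from G."""
    out = []
    for s in symbols:
        out.extend(expand(rules[s], rules) if s in rules else [s])
    return out

start, rules = pair_grammar("abababab")
```

Here "abababab" compresses to the start string [s1, s1] with rules s0 → ab and s1 → s0 s0, and `expand` reconstructs the input exactly.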

Fix a countable set S = {s0, s1, s2, ...} of symbols, disjoint from<br />

A. Symbols in S will be called variables; symbols in A will be<br />

called terminal symbols. For any j ≥ 1, let S(j) = {s0, s1, s2, ..., s_{j-1}}.<br />

For our purpose, a context-free grammar G is a mapping from S(j)<br />

to (S(j) ∪ A)+ for some j ≥ 1. The set S(j) will be called the variable<br />

set of G and, to be specific, the elements of S(j) shall sometimes be<br />

called G-variables. For the purpose of data compression, we<br />

are interested only in grammars G for which the parallel<br />

replacement procedure terminates after finitely many steps and<br />

every G-variable s_i (i < j)


III. IMPLEMENTATION<br />

As we have presented in section 11, the new lossless<br />

grammar-based compression code is accomplished by taking the<br />

following three steps:<br />

i) Define a size-on-demand variable set of G and ensure each G-<br />

variable is distinct from source symbols;<br />

ii) Convert the source sequence x into an irreducible context-free<br />

grammar by applying a greedy grammar transform which adopts<br />

reduction rules in some order [3];<br />

iii)Based on the grammar transform, use one of three universal<br />

lossless data compression algorithms which are sequential<br />

algorithm, improved sequential algorithm, and hierarchical<br />

algorithm, to compress the irreducible grammar. All these<br />

algorithms combine the power of arithmetic coding with that of<br />

string matching. We define the size |S| of S as the number of G-variables in S. In our work, we fixed the size |S| of S, then operated the irreducible grammar transform, and finally applied the hierarchical data compression algorithm [3]. The rest of this section mainly describes how we implemented the new lossless algorithm, and presents the compression performance of our implementation on mammographic images in three categories, from small size (35x5) and middle size (200x150) to large size (1024x1024). Each category has 30 images.<br />
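As an illustration of step ii), the following toy transform of ours greedily rewrites the most frequent adjacent pair as a new G-variable until no pair repeats or the variable budget is spent. It is a simplified stand-in for the irreducible grammar transform of [3], not the authors' implementation.

```python
from collections import Counter

def pair_grammar(seq, max_vars):
    """Greedy toy grammar transform: repeatedly rewrite the most
    frequent adjacent pair as a new G-variable (simplified stand-in
    for the irreducible grammar transform of [3])."""
    seq = list(seq)
    rules = {}
    for i in range(max_vars):
        counts = Counter(zip(seq, seq[1:]))
        if not counts:
            break
        pair, n = counts.most_common(1)[0]
        if n < 2:                 # no repeated pair left: irreducible
            break
        var = "s%d" % (i + 1)
        rules[var] = list(pair)
        out, j = [], 0
        while j < len(seq):       # replace every occurrence of the pair
            if j + 1 < len(seq) and (seq[j], seq[j + 1]) == pair:
                out.append(var)
                j += 2
            else:
                out.append(seq[j])
                j += 1
        seq = out
    rules["s0"] = seq             # start rule generates the rest
    return rules

G = pair_grammar("abababab", max_vars=35)
print(G)  # {'s1': ['a', 'b'], 's2': ['s1', 's1'], 's0': ['s2', 's2']}
```

The `max_vars` budget mirrors the fixed size |S| discussed above: once it is exhausted, the remaining sequence stays in the start rule uncompressed.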

To obtain a higher compression rate, we directly transformed the MIAS image text into a grammar G without converting each text into a binary stream. However, the elements of the image text are not identical in length compared to their binary forms. So that the image can be recovered successfully by decoding later on, we embedded a specific separator symbol between pixel values to distinguish every two neighbors before starting the grammar transform.<br />
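A minimal sketch of that separator idea (the separator character and helper names below are our own choices for illustration):

```python
SEP = "|"  # an illustrative separator symbol, distinct from all pixel digits

def embed(pixels):
    """Join variable-length pixel values with a separator so every
    two neighbors can be told apart before the grammar transform."""
    return SEP.join(str(p) for p in pixels)

def recover(text):
    """Inverse mapping, used after grammar decoding."""
    return [int(t) for t in text.split(SEP)]

row = [174, 175, 9, 180]
print(embed(row))                 # 174|175|9|180
assert recover(embed(row)) == row
```

Without the separator, the decoded text "1749" could mean the pixels (174, 9) or (17, 49); the separator makes the split unambiguous at the cost of one extra symbol per pixel.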

As noted in [1][3], the G-variables that represent distinct production rules are distinct. The size |S| of S depends on the image size once the new irreducible grammar transform is applied. Since each production rule's left member is a unique G-variable, its right member is drawn from (S(j) ∪ A)+, and string matching is used heavily by the new grammar transform, as many G-variables must be available as are needed. However, the number of visible symbols in the C language, in which we simulated the grammar code, is limited: at most 75 symbols are available. This limitation could be overcome in C, but only with more sophisticated programming. Therefore, we adopted two schemes. Although neither scheme matches the theory exactly, both helped us to study the new lossless grammar-based code in depth and verify its feasibility. One scheme allows reusing the 75 (or fewer) G-variables to encode the remaining data sequence; as a result, a complete image is represented by several parallel grammars, as shown in Figure 3. The other scheme uses only 75 G-variables to convert an image into an irreducible grammar in a single pass.<br />

In the first scheme, we used 75 and 35 G-variables, respectively, to encode the 30 middle-size (200x150) mammographic images, and used 35 G-variables to encode the 30 large-size (1024x1024) mammographic images. From the study, we found that the more variables we used, the more processing time we needed for converting a source image to an irreducible<br />


grammar G. For example, using 75 G-variables consumed 15 minutes and 37 seconds on a GNU/Linux workstation to encode a medium-size image, while using 35 G-variables took only 2 minutes and 4 seconds on average. In medical applications, and especially for mammograms, where real-time compression is not an issue, computation time can be sacrificed to some extent; for regular images, however, the computation time of the grammar code must be taken into account. Another observation from the study of scheme 1 is that its compression rate is better than the Huffman, Lempel-Ziv, and arithmetic algorithms in some cases, but not significantly so, because the scheme destroys the correlation of the source image as a whole. Figure 4 displays this conclusion. We also compared the compression rates of using 75 G-variables and 35 G-variables: they are 2.643:1 and 2.639:1, respectively. The compression gain of using 75 G-variables is very limited compared to using 35 G-variables, while 75 G-variables took much longer to process, as described above. Clearly, in scheme 1, using 35 G-variables is good enough for encoding source images. Therefore, to save time, we did not test the 1024x1024 mammographic images with 75 G-variables.<br />

In the second scheme, we compared the average compression rate of the grammar code over 30 small (35x5) mammographic images with the Huffman, Lempel-Ziv, and arithmetic algorithms. Its compression rate is much greater than that of the three traditional techniques, as displayed in Figure 5. While this scheme is impractical for the compression of large images, since 75 G-variables are not enough for that purpose, the result does demonstrate that if we let |S| be big enough to completely represent a large image, the compression performance will conform to the theory and to Result 2 described in Section II. However, we should be aware of the time consumption involved in processing large images with the grammar code.<br />

Figure 2: A sample of the mammographic images. (a) The original image; (b) the image after grammar decoding.<br />

[Figure content: grayscale pixel values, e.g. rows 174 175 173 173 174 / 177 180 182 180 175 / 176 176 174 173 175 ..., partitioned into several parallel grammars.]<br />

Figure 3: The image represented by several grammars.<br />

Authorized licensed use limited to: <strong>Ryerson</strong> <strong>University</strong> Library. Downloaded on July 7, 2009 at 10:54 from IEEE Xplore. Restrictions apply.<br />




Figure 4: The compression performance of the techniques over 30 1024x1024 digital mammographic images.<br />


Figure 5: The compression performance of the techniques over 30 35x5 digital mammographic images.<br />

For transmitting mammograms over a network, overcoming the variable-set requirement of the grammar-based code will provide high compression performance.<br />

IV. CONCLUSIONS<br />

For decades, researchers have kept looking for more effective lossless compression techniques for critical and large images, MIAS images for example, to be transmitted across the Internet without any data loss. The new lossless compression<br />




grammar-based code attracted our attention and prompted us to verify whether it is a promising code. Our simulation results show that it can achieve higher compression ratios for large images than the Huffman, Lempel-Ziv, and arithmetic algorithms.<br />
Assuming that the number of symbols available as variables of G is unbounded, we can compress large images exactly as the new universal lossless grammar-based code was originally designed. But this involves more complicated processing and long compression times; these two are the main obstacles to applying the grammar-based code in practice.<br />

REFERENCES<br />

[1] J. C. Kieffer and E.-H. Yang, "Grammar-based codes: A new class of universal lossless source codes," IEEE Trans. Inform. Theory, vol. 46, no. 3, pp. 737-754, May 2000.<br />
[2] J. C. Kieffer, E.-H. Yang, G. Nelson, and P. Cosman, "Universal lossless compression via multilevel pattern matching," IEEE Trans. Inform. Theory, vol. 46, pp. 1227-1245, July 2000.<br />
[3] E.-H. Yang and J. C. Kieffer, "Efficient universal lossless data compression algorithms based on a greedy sequential grammar transform - Part one: Without context models," IEEE Trans. Inform. Theory, vol. 46, pp. 755-777, May 2000.<br />
[4] N. Abramson, Information Theory and Coding. New York: McGraw-Hill, 1963.<br />
[5] J. Ziv and A. Lempel, "Compression of individual sequences via variable-rate coding," IEEE Trans. Inform. Theory, vol. IT-24, pp. 530-536, 1978.<br />
[6] R. G. Gallager, "Variations on a theme by Huffman," IEEE Trans. Inform. Theory, vol. IT-24, pp. 668-674, 1978.<br />
[7] J. Bentley, D. Sleator, R. Tarjan, and V. K. Wei, "A locally adaptive data compression scheme," Commun. Assoc. Comput. Mach., vol. 29, pp. 320-330, 1986.<br />
[8] P. Elias, "Interval and recency rank source coding: Two on-line adaptive variable length schemes," IEEE Trans. Inform. Theory, vol. IT-33, pp. 1-15, 1987.<br />
[9] B. Y. Ryabko, "Data compression by means of a 'book stack'," Probl. Inform. Transm., vol. 16, no. 4, pp. 16-21, 1980.<br />
[10] D. L. Neuhoff and P. C. Shields, "Simplistic universal coding," IEEE Trans. Inform. Theory, vol. 44, pp. 778-781, Mar. 1998.<br />
[11] D. S. Ornstein and P. C. Shields, "Universal almost sure data compression," Ann. Probab., vol. 18, pp. 441-452, 1990.<br />


CONTENT BASED AUDIO CLASSIFICATION AND RETRIEVAL USING JOINT<br />

TIME-FREQUENCY ANALYSIS<br />

S. Esmaili, S. Krishnan and K. Raahemifar<br />

Multimedia Information and <strong>Signal</strong> <strong>Analysis</strong> <strong>Research</strong> (MI<strong>SAR</strong>) Laboratories<br />

Department of Electrical and Computer Engineering<br />

<strong>Ryerson</strong> <strong>University</strong>, Toronto, Ontario, Canada<br />

e-mail: (sesmaili)(krishnan)(kraahemi)@ee.ryerson.ca<br />

ABSTRACT<br />

In this paper, we present an audio classification and retrieval technique<br />

that exploits the non-stationary behavior of music signals<br />

and extracts features that characterize their spectral change over<br />

time. Audio classification provides a solution to incorrect and inefficient<br />

manual labelling of audio files on computers by allowing<br />

users to extract music files based on content similarity rather than<br />

labels. In our technique, classification is performed using time-frequency analysis, and sounds are classified into six music groups consisting of rock, classical, country, folk, jazz and pop. For each 5-second<br />

music segment, the features that are extracted include entropy, centroid,<br />

centroid ratio, bandwidth, silence ratio, energy ratio, and<br />

location of minimum and maximum energy. Using a database<br />

of 143 signals, a set of 10 time-frequency features are extracted<br />

and an accuracy of classification of around 93% using regular linear<br />

discriminant analysis or 92.3% using leave one out method is<br />

achieved.<br />

1. INTRODUCTION<br />

With the abundance of personal computers, advances in high speed<br />

modems operating at 100 Mbps and GUI based peer-to-peer (P2P)<br />

file-sharing systems that make it simple for individuals without<br />

much computer knowledge to download their favorite music, there<br />

has been an increase of digitized music available on the Internet<br />

and on personal computers. As such, there is also a rising need<br />

to manage and efficiently search the large number of multimedia<br />

databases available online which is difficult using text searches<br />

alone. Current multimedia databases are indexed based on song<br />

title or artist name which requires manual entry and improper indexing<br />

could result in incorrect searches. A more effective content-based retrieval system analyzes audio signals, selects and extracts<br />

dominant perceptual features and classifies the music based<br />

on these features. Stronger features provide a higher degree of<br />

separation between classes and thereby a higher classification accuracy.<br />

The aim is to make music search engines as effective as<br />

text-based ones and this is examined further in this paper.<br />

In recent years, there have been many works on audio classification<br />

with various perceptual features and several classification<br />

algorithms. In one of the pioneer works done on audio classification<br />

and later commercialized into the “Muscle Fish” project, Wold<br />

et al [1] extracted an N dimensional vector consisting of several<br />

acoustical features such as loudness, pitch, brightness, bandwidth, and harmonicity from each sound. (Thanks to Micronet and NSERC for funding.) A Euclidean (Mahalanobis)<br />

distance is then calculated between the input sound feature vector<br />

and the existing models in the database. Using the nearest neighbor<br />

(NN) rule, the signal is grouped into the class with the minimal<br />

Euclidean distance.<br />

In a similar work to that of [1], Liu et al [2] extract 13 different<br />

audio features to separate audio clips into different scene classes<br />

such as advertisement, basketball, football, news and weather. Features<br />

consist of volume distribution, pitch contour, bandwidth, frequency<br />

centroid and energy. A neural network classifier with a<br />

one-class-in-one network (OCON) structure is used and an overall<br />

classification rate of 88% is achieved. Artificial neural networks<br />

(ANN) are effective in detecting complex nonlinear relationships<br />

while requiring little formal training. However, their process is<br />

computationally expensive and more importantly, the relation between<br />

the input and output variables is defined in a black box<br />

model that has no analytical basis. In terms of audio classification<br />

this means that it is difficult to deduce which acoustical features<br />

are significant in classifying each type of sound [1].<br />

In a different technique, Lu and Hankinson [3] used a rule-based heuristic classification method to classify an audio signal<br />

into speech, music and noise. For each feature, a threshold is set<br />

to determine the segment type and the feature set includes silence<br />

ratio, centroid, harmonicity and pitch. Since the feature threshold<br />

must change for different audio inputs, this type of classifier is<br />

tedious and not ideal. A classification rate of 75% for speech, and<br />

89% for music is reported.<br />

Lu et al [4] proposed support vector machines (SVMs) as an<br />

alternative to current classification methods. Using a kernel-based<br />

SVM increases the classification rate by separating nonlinear cases.<br />

Here, a nonlinear kernel function maps the data to a high dimensional<br />

feature space where the data is linearly separable. The authors<br />

use a combination of a rule-based classifier and a kernel<br />

based SVM to distinguish between 5 different audio classes including<br />

silence, music, background sound, pure speech and non-pure speech. Their feature set includes features similar to those<br />

reported in [1] and [5], such as MFCCs, zero-crossing rate (ZCR),<br />

short time energy (STE), sub-band powers, brightness, and bandwidth<br />

with some new features such as spectral flux (SF), band periodicity<br />

(BP), and noise-frame-ratio (NFR). An average classification<br />

accuracy of around 90% is achieved.<br />

In the majority of the previous work in this area, audio is examined<br />

in either the time or frequency domain where it is assumed<br />

that the signals are wide sense stationary. In reality, sounds are<br />

non-stationary and multi-component signals consisting of series<br />



of sinusoids with harmonically related frequencies. Our algorithm<br />

considers the short-time Fourier transform (STFT) of an audio signal<br />

to extract parameters that will be used to classify signals. Our<br />

retrieval technique is less computationally intensive than those that<br />

use ANN, SVM, or Hidden Markov Models (HMM). Also, the<br />

efficiency of features can be examined which is not feasible in<br />

ANNs. Note that while HMM can be used to examine spectral<br />

change over time, past works have shown that HMM needs to be<br />

coupled with external features such as Cepstral or perceptual features<br />

to be efficient [6]. Finally, our method also offers the added<br />

improvement that it is not specific to certain audio files and can<br />

be applied without adjusting the algorithm or thresholds such as in<br />

rule-based models.<br />

Our work on content-based audio classification is presented<br />

as follows. Section 2 presents the application of time-frequency<br />

analysis to feature selection and analysis for audio classification.<br />

In Section 3 we present our classification results for the system and<br />

our conclusions are provided in Section 4.<br />

2. METHODOLOGY<br />

2.1. Short-time Fourier transform (STFT) algorithm<br />

Since speech and audio signals have spectral characteristics that<br />

vary over time, they require a non-stationary signal model such<br />

as the STFT to describe them. Ultimately, we would like to imitate<br />

the capability of the ear and provide simultaneous information<br />

about time and frequency of the music. STFT uses a sliding window<br />

to compute the Fourier transform thereby providing an estimate<br />

of the “local frequency” at a given time. The STFT of a signal<br />

x[n] is given by,<br />

STFT(n, f) = Σ_{m=-∞}^{∞} x[n + m] w[m] e^{-j2πfm},   (1)<br />

where w[m] is the window function and the spectrogram of x is<br />

defined as SPEC(n, f) =|STFT(n, f)| 2 . For a given signal<br />

x, SPEC(n, f)∆n∆f represents the energy in the time interval<br />

[n, n +∆n] in the frequency band [f,f +∆f]. In STFT analysis,<br />

we can improve the frequency resolution by decreasing the<br />

spectral width ∆f at the expense of increasing the temporal width<br />

∆n (poor time resolution). Also the shape of the window w[n]<br />

is important as a window with a sharp cutoff will introduce artificial<br />

discontinuities. Hanning windows are mainly used in audio<br />

classification techniques as they reduce spectral leakage.<br />
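The spectrogram described above can be sketched in a few lines of numpy (our illustration; the paper gives no code). A Hanning window with 50% overlap is applied per frame before the DFT:

```python
import numpy as np

def spectrogram(x, win=1024):
    """SPEC(n, f) = |STFT(n, f)|^2 with a Hanning window and 50% overlap.
    Rows index time frames n, columns frequency bins f."""
    w = np.hanning(win)
    hop = win // 2                    # 50% overlap
    frames = [x[i:i + win] * w
              for i in range(0, len(x) - win + 1, hop)]
    stft = np.fft.rfft(np.asarray(frames), axis=1)
    return np.abs(stft) ** 2

fs = 44100
t = np.arange(fs) / fs                # 1 s test tone at 1 kHz
spec = spectrogram(np.sin(2 * np.pi * 1000 * t))
peak_bin = spec.mean(axis=0).argmax()
print(peak_bin)                       # bin nearest 1 kHz at ~43 Hz resolution
```

With a 1024-sample window at 44.1 kHz, each bin spans roughly 43 Hz, which is the frequency-resolution/time-resolution trade-off discussed above.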

2.2. Audio feature extraction<br />

The set of features extracted are critical as they need to be strong<br />

enough to clearly separate the classes of signals. This procedure<br />

requires perceptual features that model the human auditory system.<br />

Discriminating music from speech is less complex than between<br />

different classes of music. The latter may only require a<br />

small number of features such as zero crossing rate or energy envelope<br />

and since the spectral characteristics are not very similar,<br />

high accuracy rates are achieved.<br />

In this paper, we examine the similarities of 143 audio signals<br />

and classify them under six different genres. Each audio signal<br />

is 5 seconds, mono-channel, 16 bits per sample and sampled at<br />

44.1 kHz. The length of the audio samples was chosen to be 5<br />

seconds in relevance with the human neurological behavior which<br />

(1)<br />

<br />

102<br />

was examined by Perrot et al in [7]. They found that human beings<br />

require at least 3 second excerpts to identify different musical<br />

genres with a 70% accuracy rate while the accuracy decreases to<br />

53% for a 250 ms excerpt.<br />

We start by transforming our audio signal into a spectrogram<br />

with a window size of 1024 samples which corresponds to about<br />

23 ms at 44.1 kHz. This window size is similar to that used in<br />

[4] and [8]. A Hanning window with 50% overlap is used and the<br />

DFT is calculated in each window. The audio features extracted<br />

from the two-dimensional time-frequency distribution (TFD) are<br />

explained below.<br />

2.2.1. Entropy<br />

The entropy of a signal is a measure of its spectral distribution<br />

and portrays the noise-like or tone-like behavior of the signal. The<br />

entropy of a signal in time frame n can be calculated as:<br />

H(n) = - Σ_{f=0}^{Fm} P_f(TFD(n, f)) log2 P_f(TFD(n, f)),   (2)<br />
where<br />
P_f(TFD(n, f)) = TFD(n, f) / Σ_{f=0}^{Fm} TFD(n, f).   (3)<br />

Here, TFD(n, f) represents the energy of the signal at time frame<br />

n and frequency index f (it is equivalent to SPEC(n, f) defined<br />

in Section 2.1). Also, Fm refers to the maximum frequency.<br />
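The entropy feature translates directly into numpy (a sketch of ours; rows of `tfd` are time frames n, columns are frequency bins f):

```python
import numpy as np

def frame_entropy(tfd):
    """Spectral entropy (bits) of each time frame of a TFD."""
    p = tfd / tfd.sum(axis=1, keepdims=True)   # normalize each frame
    p = np.where(p > 0, p, 1.0)                # 0 * log2(0) -> 0
    return -(p * np.log2(p)).sum(axis=1)

flat = np.ones((4, 128))                # equiprobable bins -> max entropy
print(frame_entropy(flat))              # [7. 7. 7. 7.]  (log2 128)
tonal = np.zeros((1, 128))
tonal[0, 5] = 1.0                       # all energy in one bin
print(frame_entropy(tonal))             # [0.]
```

The two extremes mirror the noise-like versus tone-like behavior described above: a flat spectrum reaches the maximum log2 L bits, while a single spectral line has zero entropy.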

Consider the case where there are L number of frequency bins.<br />

Then the maximum entropy in time window n is log 2 L which occurs<br />

if the frequency bins are equiprobable. First, we examined<br />

the entropies of 3 different types of signals. These signals were<br />

analyzed using 128 frequency bins, implying that the maximum<br />

entropy is 7 bits. The first signal consisted of a single sine wave,<br />

at a sampling frequency of 1 kHz. In this case, the mean entropy<br />

was 1.24 bits and the standard deviation 5.636 × 10^-6. Next<br />

we considered the vowel “a” (a signal component with harmonic<br />

structure) and its entropy was calculated to be 2.84 bits with a standard<br />

deviation of 0.1. Finally, we considered white Gaussian noise<br />

and its mean entropy was 6.38 bits with a standard deviation of<br />

0.06. As we expected, the sine wave had the lowest entropy and a<br />

standard deviation of almost zero while white noise had the largest<br />

entropy (approaching maximum) with a larger standard deviation.<br />

From our database of music signals, we found that entropy<br />

was a dominant feature in classifying particularly rock or folk music.<br />

As shown in Figure 1a, rock signals possessed the highest<br />

entropy followed closely by folk music while classical, country,<br />

jazz and pop had low entropies. Figure 1b shows the distribution<br />

of entropy for rock music compared to classical. As can be seen,<br />

the entropy ranges for the two types of signals are quite different.<br />

In order to determine the strength of entropy from a different perspective,<br />

a receiver operating curve (ROC) was plotted. The ROC<br />

curve is a two dimensional measure of classification performance.<br />

The area under this curve measures discrimination, or the ability<br />

of a feature to correctly classify signals. An area of 1.0 represents<br />

a perfect test; where an area of 0.5 or less shows the feature is<br />

not useful in discrimination of that class. Rock, folk, jazz, classical,<br />

country and pop music had ROC areas of 0.933, 0.808, 0.644,<br />

0.337, 0.294, and 0.145 respectively. These results show that although<br />

entropy is a strong feature, further features are required to<br />

improve classification.<br />
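The ROC area for a single feature can be estimated by pairwise comparison of in-class and out-of-class feature values (our sketch; the entropy values below are made up for illustration, not taken from the paper's database):

```python
def roc_area(pos, neg):
    """Probability that a random in-class sample scores above a random
    out-of-class sample; this equals the area under the ROC curve."""
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

rock_entropy = [4.1, 4.3, 3.9, 4.4]          # illustrative values only
other_entropy = [2.1, 2.6, 3.0, 4.0, 1.9]
print(roc_area(rock_entropy, other_entropy))  # 0.95
```

An area near 1.0 means the feature almost always ranks the target class above the rest, matching the interpretation of the 0.933 area reported for rock.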

Authorized licensed use limited to: <strong>Ryerson</strong> <strong>University</strong> Library. Downloaded on July 7, 2009 at 11:09 from IEEE Xplore. Restrictions apply.



Fig. 1. Comparison of entropy values. (a) Results for different genres; (b) distribution for classical and rock.<br />

2.2.2. Energy ratio<br />

The rate of change in the spectral energy over time was measured as the mean ratio of the total energy in a frequency sub-band to that of the previous time window, E[ Σ_{f=f_lower}^{f_upper} TFD(n, f) / Σ_{f=f_lower}^{f_upper} TFD(n-1, f) ]. This was examined in three different sub-bands [0, 5 kHz], [5, 10 kHz], [10 kHz,<br />

Fm]. However, it was found empirically that the energy ratio in<br />

mid and high frequency bands did not improve the classification.<br />

This is probably because most energy activity in audio signals is<br />

in the low frequency band. Therefore, only the mean of energy in<br />

the low-band was used in our feature set.<br />
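A minimal numpy sketch of the low-band energy-ratio feature described above (the bin bound is our own estimate for a 1024-point window at 44.1 kHz, where 5 kHz falls near bin 116):

```python
import numpy as np

def low_band_energy_ratio(tfd, lo_bin=0, hi_bin=116):
    """Mean ratio of low-band energy in frame n to frame n-1."""
    band = tfd[:, lo_bin:hi_bin].sum(axis=1)   # sub-band energy per frame
    return float(np.mean(band[1:] / band[:-1]))

# Three frames whose total energy doubles each step -> mean ratio 2.0
tfd = np.vstack([np.full(512, 1.0), np.full(512, 2.0), np.full(512, 4.0)])
print(low_band_energy_ratio(tfd))  # 2.0
```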

The frequency location with the lowest energy component was<br />

also computed. Although an estimate of the mean can be calculated<br />

from the frequency domain, it was included in our feature set<br />

as it improved the classification rate by 5%. In fact, using the mean<br />

and standard deviation of the location of minimum energy provided<br />

100% classification rates for classifying country, folk and<br />

jazz music but low classification rates for the other three genres.<br />

When examining the histogram of the location of minimum energy<br />

for our database of signals (Figure 2), the frequency spread<br />

was smaller for country (21.4-21.5 kHz), folk (21.45-21.85 kHz),<br />

jazz (21.36-21.51 kHz), and wider for pop (18.1-21.5 kHz), classical (15.5-21.5 kHz) and rock (20-21.6 kHz).<br />

2.2.3. Brightness<br />

The brightness of a signal also referred to as its frequency centroid,<br />

shows the weighted midpoint of the energy distribution in a given<br />

frame. It is defined by:<br />

fi(n) = Σ_{f=0}^{Fm} f · TFD(n, f) / Σ_{f=0}^{Fm} TFD(n, f).   (4)<br />

The brightness feature could also be seen as the instantaneous<br />

mean frequency parameter, a typical non-stationary feature of a<br />

signal. The frequency centroid of the audio signal in the low frequency<br />

range (0-5KHz) is also examined as most of the frequency<br />

content of audio signals is concentrated in low frequency.<br />

In addition, the mean of centroid ratio to previous window is<br />

a useful feature as it measures the spectral change over time. We<br />

found that rock, folk, pop and country music signals had the largest<br />



Fig. 2. Distribution of location of minimum energy<br />

change in centroid frequency over time while classical and jazz<br />

signals had the lowest change. This is expected as classical and<br />

jazz music generally have less activity over time compared to the<br />

other 4 genres.<br />

2.2.4. Bandwidth<br />

Bandwidth is the magnitude-weighted average of the difference<br />

between the signal’s spectral components and centroid. It can be<br />

defined as:<br />

B(n) = sqrt( Σ_{f=0}^{Fm} (f - fi(n))^2 · TFD(n, f) / Σ_{f=0}^{Fm} TFD(n, f) ).   (5)<br />

Effectively, it shows the spectral shape and the spread of energy<br />

relative to the centroid, therefore it is also a non-stationary feature.<br />

For instance, a sine wave without noise has zero bandwidth.<br />
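The brightness and bandwidth features can be written out in numpy (our sketch; `tfd` rows are frames, `f` is the bin index, and we use the conventional squared deviation under the square root):

```python
import numpy as np

def brightness(tfd):
    """Frequency centroid per frame: energy-weighted mean bin."""
    f = np.arange(tfd.shape[1])
    return (f * tfd).sum(axis=1) / tfd.sum(axis=1)

def bandwidth(tfd):
    """Energy spread around the centroid per frame."""
    f = np.arange(tfd.shape[1])
    dev = (f - brightness(tfd)[:, None]) ** 2
    return np.sqrt((dev * tfd).sum(axis=1) / tfd.sum(axis=1))

tone = np.zeros((1, 64))
tone[0, 10] = 1.0                 # pure tone in bin 10
print(brightness(tone))           # [10.]
print(bandwidth(tone))            # [0.]  -- a noiseless sine has zero bandwidth
```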

2.2.5. Silence ratio<br />

Silence ratio is the number of silent time window frames with total<br />

energy less than 0.01. This threshold is set empirically. Note that<br />

this feature could also be extracted from the time domain.<br />

Bandwidth, brightness and silence ratio have been proven to<br />

be effective in previous audio classification papers including [1, 2]<br />

although an STFT approach showing the rate of change to previous<br />

windows has not been used.<br />

3. AUDIO CLASSIFICATION<br />

Using the above analysis, the 10 features extracted for each sample<br />

included mean and standard deviation of centroid frequency, mean<br />

centroid (low-frequency range), mean of centroid ratio to previous<br />

window, mean bandwidth, silence ratio, mean and standard deviation<br />

of the frequency location with the lowest energy, mean and<br />

standard deviation of entropy. Note that mean and variance of a<br />

feature are calculated over the entire time window. Once the features<br />

are extracted for the 143 audio signals, linear discriminant<br />

analysis (LDA) is then applied using SPSS software [9], to predict<br />

group classification of cases. This type of analysis tries to<br />




Fig. 3. All-groups scatter plot with the first two canonical discriminant<br />

functions<br />

find a linear combination of those extracted features that best separate<br />

the group of cases. To represent this linear combination, a<br />

discrimination function is formed using the extracted features as<br />

discrimination variables and can be expressed as:<br />


L = b1x1 + b2x2 + ... + b10x10 + c,   (6)<br />

where b1..b10 are the coefficients, c is a constant and x1..x10 are<br />

the values of the extracted features. This technique finds the first<br />

function that separates the groups as much as possible and then<br />

finds further functions that improve the separation and are uncorrelated<br />

to previous ones. The number of functions is determined<br />

by the number of predictors or features and the number of groups<br />

available.<br />

Using Fisher’s coefficients and prior probabilities of each group,<br />

a scatterplot (Figure 3) is created showing the discriminant scores<br />

of the cases on two discriminant functions. This plot shows the<br />

separation between different cases. Songs are categorized into six<br />

groups (rock, classical, country, folk, jazz and pop) and the confusion<br />

matrix depicted in Table 1 shows the classification performance.<br />

Using the original LDA, 93.0% of all original grouped<br />

cases are correctly classified with folk music having the lowest<br />

rate. A more accurate estimate is obtained through the cross-validated method, where a portion of the cases belongs to the learning sample and the remaining cases belong to the test sample. In the leave-one-out<br />

method used, each case is classified by the functions derived<br />

by all cases except that one. This method yields a 92.3%<br />

classification rate revealing the discrimination strength of our feature<br />

set.<br />
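The leave-one-out procedure described above can be sketched as follows; a nearest-centroid rule stands in for the LDA classifier, and the synthetic two-class data is an assumption for illustration.<br />

```python
import numpy as np

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (20, 5)), rng.normal(3, 1, (20, 5))])
y = np.array([0] * 20 + [1] * 20)

correct = 0
for i in range(len(X)):
    # Hold out case i; train on everything else (leave-one-out)
    mask = np.ones(len(X), dtype=bool)
    mask[i] = False
    Xtr, ytr = X[mask], y[mask]
    # Nearest-centroid stand-in for the discriminant classifier
    centroids = np.array([Xtr[ytr == k].mean(axis=0) for k in (0, 1)])
    pred = np.argmin(np.linalg.norm(centroids - X[i], axis=1))
    correct += int(pred == y[i])

loo_rate = correct / len(X)
print(f"leave-one-out rate: {loo_rate:.3f}")
```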

4. CONCLUSIONS<br />

In this paper, we examined a technique where features used to classify<br />

music signals are derived directly from the time-frequency domain.<br />

Using six different genres for classification, we have shown<br />

that high accuracy rates can be obtained using features that reflect<br />
the non-stationary properties of audio signals and depict<br />
their spectral, energy and entropy changes over time. In addition<br />

to the success rate, our algorithms have low computational complexity<br />

compared to other techniques and they offer versatility as<br />


Method Type RO CL CO FO JA PO CA%<br />

1. Original RO 14 0 0 2 0 0 87.5<br />

CL 0 30 0 0 0 1 96.8<br />

CO 0 0 15 0 0 1 93.8<br />

FO 2 0 1 27 1 1 84.4<br />

JA 0 0 0 1 15 0 93.8<br />

PO 0 0 0 0 0 32 100<br />

Overall 93.0<br />

2. Cross- RO 14 0 0 2 0 0 87.5<br />

Validated CL 0 30 0 0 0 1 96.8<br />

CO 0 0 15 0 0 1 93.8<br />

FO 2 0 1 26 1 2 81.3<br />

JA 0 0 0 1 15 0 93.8<br />

PO 0 0 0 0 0 32 100<br />

Overall 92.3<br />

Table 1. Classification results. Method: Original - linear discriminant<br />
analysis; Cross-validated - linear discriminant analysis with the<br />
leave-one-out method (RO-Rock, CL-Classical, CO-Country, FO-Folk, JA-Jazz,<br />
PO-Pop, CA% - classification accuracy rate)<br />

they can be applied to any audio signal without alteration. Further<br />

work will include optimization of window size in the TF domain<br />

as well as examining other classification methods such as minimum<br />

classification error (MCE) to improve classification rate for<br />

a larger database of signals.<br />

5. REFERENCES<br />

[1] E. Wold, T. Blum, D. Keislar, and J. Wheaton, “Content-based<br />

classification, search, and retrieval of audio,” IEEE Multimedia,<br />

pp. 27–36, 1996.<br />

[2] Z. Liu, J. Huang, Y. Wang, and T. Chen, “Audio feature extraction<br />

and analysis for scene classification,” in IEEE Workshop<br />

on Multimedia <strong>Signal</strong> Processing, June 1997, pp. 343–<br />

348.<br />

[3] G. Lu and T. Hankinson, “A technique towards automatic audio<br />

classification and retrieval,” in Fourth International Conference<br />

on <strong>Signal</strong> Processing, Beijing, China, October 1998,<br />

pp. 1142–1145.<br />

[4] L. Lu, H. Zhang, and S. Li, “Content-based audio classification<br />

and segmentation by using support vector machines,”<br />

ACM Multimedia Systems Journal, vol. 8, no. 6, pp. 482–<br />

492, March 2003.<br />

[5] J. Foote, “Content-based retrieval of music and audio,” in<br />

Multimedia Storage and Archiving Systems II, Proc. of SPIE,<br />

1997, pp. 138–147.<br />

[6] T. Zhang and C. Kuo, “Hierarchical classification of audio<br />

data for archiving and retrieving,” in Proc. ICASSP, March<br />

1999, pp. 3001–3004.<br />

[7] D. Perrot and R. O. Gjerdingen, “Scanning the dial: An exploration<br />

of factors in the identification of musical style,” Proceedings<br />

of the 1999 Society for Music Perception and Cognition,<br />

p. 88, 1999.<br />

[8] G. Tzanetakis and P. Cook, “Music genre classification of<br />

audio signals,” IEEE Transactions on Speech and Audio Processing,<br />

vol. 10, no. 5, pp. 293–302, July 2002.<br />

[9] SPSS Inc., “SPSS advanced statistics user’s guide,” in User<br />

manual, SPSS Inc., Chicago, IL, 1990.<br />

Authorized licensed use limited to: <strong>Ryerson</strong> <strong>University</strong> Library. Downloaded on July 7, 2009 at 11:09 from IEEE Xplore. Restrictions apply.


MODIFIED LOCAL DISCRIMINANT BASES AND ITS APPLICATIONS IN SIGNAL<br />

CLASSIFICATION<br />

Karthikeyan Umapathy<br />

Dept. of Electrical and Computer Engg.,<br />

The <strong>University</strong> of Western Ontario,<br />

London, ON N6A 5B8, Canada<br />

ABSTRACT<br />

One of the major challenges in classification problems based<br />

on the signal decomposition approach is to identify the right basis<br />

function and its derivatives that can provide optimal features to<br />

distinguish the classes. With the vast amount of available libraries<br />

of orthonormal bases, it is hard to select an optimal set of basis<br />

functions for a specific dataset. To address this problem, pruning<br />

algorithms based on certain selection criteria is needed. Local<br />

Discriminant Bases (LDB) algorithm is one such algorithm, which<br />

efficiently selects a set of significant basis functions from the library<br />

of orthonormal bases based on certain defined dissimilarity<br />

measure. The selection of this dissimilarity measure is critical as<br />

they indirectly contribute to the performance accuracy of the LDB<br />

algorithm. In this paper, we study the impact of the dissimilarity<br />

measures on the performance of the LDB algorithm with two classification<br />

examples. The two biomedical signal databases used are<br />

1. Vibroarthographic signals (VAG) - 89 signals with 51 normal<br />

and 38 abnormal, and 2. Pathological speech signals - 100 signals<br />

with 50 normal and 50 pathological. Classification accuracies<br />

of 76.4% with VAG database and 96% with pathological speech<br />

databases were obtained. This modified method of signal analysis<br />

using LDB has shown its powerfulness in analyzing non-stationary<br />

signals.<br />

1. INTRODUCTION<br />

The Local Discriminant Bases (LDB) [1] algorithm has recently been<br />
used successfully in many classification problems. The optimal<br />
choice of LDBs for a given dataset is driven by the nature of<br />
the dataset and the dissimilarity measures [2] used to distinguish<br />
between classes. The choice of the dissimilarity measure for a particular<br />
dataset depends on knowledge of the data, computational<br />
complexity, and the classification accuracy requirements. For example,<br />
probabilistic dissimilarity measures such as relative entropy<br />
need prior knowledge of the dataset distribution, whose accuracy<br />
depends on the size of the data; on the other hand, simple dissimilarity<br />
measures such as the Euclidean distance are suitable only for numeric<br />
data sets. A combination of multiple dissimilarity measures with<br />
varying complexity can be used to achieve high classification accuracies.<br />

In this paper we analyze two biomedical signal databases using<br />
the LDB algorithm with three different dissimilarity measures. The<br />
LDB algorithm is based on wavelet packet decompositions<br />

with three different wavelets, namely Daubechies (db4), Coiflet (cf4)<br />

Thanks to NSERC for funding this research work.<br />

Sridhar Krishnan<br />

Dept. of Electrical and Computer Engg.,<br />

<strong>Ryerson</strong> <strong>University</strong>,<br />

Toronto, ON M5B 2K3, Canada<br />

and Symlet (sy4) [3]. This gives us 9 different combinations for<br />

each of the databases. A two group (class1 and class2) classification<br />

was performed for the 9 combinations. A linear discriminant<br />
analysis (LDA) based classifier was used to compute the classification<br />

accuracies. The classification accuracies were verified<br />

using the leave-one-out method [4]. The paper is organized as<br />

follows: In Section 2 on Methodology, Local Discriminant Bases<br />

algorithm, dissimilarity measures, feature extraction and pattern<br />

classification are covered. Results and discussions are covered in<br />

Section 3, and Conclusions in Section 4.<br />

2. METHODOLOGY<br />

2.1. Local Discriminant Bases Algorithm<br />

In the LDB [1] algorithm with wavelet packet bases, a set of training<br />
signals x_i^c for all C classes is decomposed into a full tree<br />
structure of order N. We restrict our analysis to binary wavelet<br />
packet trees. Let Ω0,0 be a vector space in R^n corresponding to<br />
node 0 of the parent tree. Then at each level the vector space<br />
is split into two mutually orthogonal subspaces, given by Ωj,k =<br />
Ωj+1,2k ⊕ Ωj+1,2k+1, where j indicates the level of the tree and k<br />
represents the node index in level j, given by k = 0, ..., 2^j − 1.<br />

This process repeats until level J, giving rise to 2^J mutually<br />
orthogonal subspaces. Our goal is to select the set of subspaces<br />
that provides maximum discriminant information between<br />
the classes of signals. Each node k contains a set of basis vectors<br />

Bj,k = [wj,k,l], l = 0, ..., 2^(n0−j) − 1, where 2^n0 corresponds to the length of<br />
the signal. Then the signal xi can be represented by a set of coefficients<br />
c as:<br />

xi = Σj,k,l cj,k,l wj,k,l, (1)<br />

Basically the signal xi is decomposed into 2 J subspaces with<br />

cj,k,l coefficients in each subspace. With the training signals decomposed<br />

into wavelet packet coefficients we need to define a dissimilarity<br />

measure (Dn) in the vector space so as to identify those<br />

subspaces, which have larger statistical distance between classes.<br />

This dissimilarity measure is used in an iterative manner to prune<br />

the tree in such a way that only a node is split if the cumulative discriminative<br />

measure of the children nodes is greater than the parent<br />

node. The resulting tree contains the most significant LDBs,<br />

from which a set of K significant LDBs are selected to construct<br />

the final tree. The testing set signals are then expanded using this<br />

tree and features are extracted from the respective basis vectors for<br />

classification.<br />
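A minimal sketch of the wavelet packet machinery described above, using a hand-rolled Haar analysis step in place of the paper's db4/cf4/sy4 filters; the toy signals, their frequencies, and the depth J = 3 are illustrative assumptions.<br />

```python
import numpy as np

def haar_step(x):
    # One orthonormal Haar analysis step: approximation and detail halves
    a = (x[0::2] + x[1::2]) / np.sqrt(2)
    d = (x[0::2] - x[1::2]) / np.sqrt(2)
    return a, d

def wp_nodes(x, J):
    """Full binary wavelet packet tree: {(j, k): coefficients}."""
    nodes = {(0, 0): np.asarray(x, dtype=float)}
    for j in range(J):
        for k in range(2 ** j):
            a, d = haar_step(nodes[(j, k)])
            nodes[(j + 1, 2 * k)] = a        # subspace Ω(j+1, 2k)
            nodes[(j + 1, 2 * k + 1)] = d    # subspace Ω(j+1, 2k+1)
    return nodes

# A toy two-class training pair: low-frequency vs high-frequency signals
n = 64
t = np.arange(n)
class1 = np.sin(2 * np.pi * 2 * t / n)     # slow oscillation
class2 = np.sin(2 * np.pi * 14 * t / n)    # fast oscillation

J = 3
n1 = wp_nodes(class1, J)
n2 = wp_nodes(class2, J)

# Rank the 2**J leaf subspaces by an energy-difference dissimilarity
diff = {k: abs(np.sum(n1[(J, k)] ** 2) - np.sum(n2[(J, k)] ** 2))
        for k in range(2 ** J)}
best = max(diff, key=diff.get)
print("most discriminative leaf node:", (J, best))
```

Because each Haar step is orthonormal, the leaf energies sum to the signal energy, so the ranking compares complete energy maps of the two classes.<br />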



In the proposed method we use a similar approach with some<br />

modification. Instead of the selective splitting of the nodes, which<br />

basically helps in removing the redundancy in the LDB selection,<br />

we used all the nodes from the full decomposition tree and ranked<br />

them in decreasing order of their dissimilarity measure values between<br />

classes. The first 5 nodes that exhibit high dissimilarity<br />

measure values between the classes are selected for each trial.<br />

Among these nodes, based on the frequency of occurrence in all<br />

the trials, the 5 most frequently occurring significant LDBs are selected. The<br />

redundancy within these 5 LDBs is later removed in the feature<br />

evaluation process in the LDA classifier. This is basically done<br />

to reduce the computational complexity of the LDB algorithm implementation.<br />

The whole process is repeated for three different<br />

wavelets (db4, cf4 and sy4), and the wavelet that provides the maximum<br />
dissimilarity measures among all the tested wavelets is chosen<br />
as the best basis for expansion.<br />

2.2. Databases<br />

2.2.1. Vibroarthrographic (VAG) signals<br />

These are the vibration signals emitted from the human knee joints<br />

during an active movement of the leg. The VAG signals can be<br />

used to detect the early joint degeneration or knee defects that<br />

are reflected in knee movements. Extensive work [5] has been<br />

done using time-frequency approach in classifying these signals<br />

into multiple groups. Few important characteristics of the VAG<br />

signals which make them difficult to analyze are as follows: (i)<br />

Highly non-stationary in nature, (ii) Varying frequency dynamics,<br />

and (iii) Multi-component signal. The database consists of 89 signals<br />

with 51 normal and 38 abnormal signals. A normal and an<br />

abnormal VAG signal are shown in Fig. 1a.<br />

2.2.2. Pathological speech signals<br />

These are speech signals recorded from the pathological and normal<br />

talkers in a sound-proof booth at the Massachusetts Eye and<br />

Ear Infirmary. The normal talkers exhibited no abnormal vocal<br />

characteristics and indicated no history of voice disorders. All signals<br />

were sampled at 25 kHz. The signals were the first sentence<br />

of the rainbow passage, ’when the sunlight strikes rain drops in<br />

the air, they act like a prism and form a rainbow’, as spoken by<br />

the subjects. More details about the database and the classification<br />

problem can be found in the authors' previous work [6]. The database<br />

consists of 100 signals with 50 normal and 50 abnormal signals. A<br />

normal and pathological speech signal are shown in Fig. 1b.<br />

Fig. 1. An example of normal and abnormal/pathological signals for both the databases: (a) normal and abnormal VAG signals; (b) normal and pathological speech signals (amplitude in arbitrary units versus time samples).<br />


2.3. Dissimilarity measures<br />

In this study we used three different dissimilarity measures and<br />

performed a two group (class1 and class2) classification on the<br />

databases. In general, most biomedical signals can be characterized<br />
by one or more of the following: (i) their average energy<br />
distribution pattern over frequency bands, (ii) event-based temporal<br />
structures, (iii) periodicity, and (iv) the amount of randomness.<br />
These rationales were used in arriving at the following dissimilarity<br />
measures.<br />

The first dissimilarity measure D1 is the difference in the normalized<br />

energy between the corresponding nodes of the training<br />

signals from class1 and class2. This gives the difference in the<br />

energy distribution of the signals on the time-frequency plane.<br />

D1 = E¹j,k − E²j,k, (2)<br />

where E¹j,k and E²j,k are the normalized energies of the corresponding<br />
nodes for class1 and class2 signals.<br />

The second dissimilarity measure D2 is the correlation index<br />

between the basis vectors at corresponding nodes. This measure<br />

emphasizes those nodes that can detect the difference in the temporal<br />

characteristics of the signals between class1 and class2.<br />

D2 = ⟨Bj,k, Fj,k⟩, (3)<br />

where B and F are the corresponding basis vectors of class1 and<br />
class2 at node (j, k).<br />

The dissimilarity measure D3 estimates the randomness<br />
or non-stationarity of the basis vectors. It is computed<br />
from the set of variances along the segments of the basis vector coefficients.<br />
The ratio of this variance measure between the signals<br />
from class1 and class2 indicates the amount of deviation observed<br />
in the non-stationarity between the classes:<br />

D3 = var(var(p)j,k) / var(var(q)j,k), (4)<br />

where p and q index the L segments obtained by segmenting<br />
the basis vectors at node (j, k) for class1 and class2, respectively.<br />
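The three measures can be sketched directly on a node's coefficient vectors; the random stand-in vectors, their lengths, and the L = 8 segmentation below are illustrative assumptions.<br />

```python
import numpy as np

rng = np.random.default_rng(2)

# Stand-ins for the basis/coefficient vectors of one node (j, k)
B = rng.normal(0, 1.0, 64)   # class1 node vector
F = rng.normal(0, 2.0, 64)   # class2 node vector (larger energy/variance)

# D1: difference in normalized node energy (Eq. 2)
E1 = np.sum(B ** 2) / len(B)
E2 = np.sum(F ** 2) / len(F)
D1 = E1 - E2

# D2: correlation index (inner product) between the node vectors (Eq. 3)
D2 = float(B @ F)

# D3: ratio of variances of segment-wise variances (Eq. 4), L segments
L = 8

def segvar(v):
    # Variance of each of the L equal-length segments of v
    return np.var(v.reshape(L, -1), axis=1)

D3 = np.var(segvar(B)) / np.var(segvar(F))
print(D1, D2, D3)
```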

2.4. Feature extraction<br />

Once the LDB nodes for each of the three dissimilarity measures<br />

are identified using the training sets (in our study 10 randomly selected<br />

signals for each class were used to form the training set) as<br />

explained in Section 2.1, all the 89 VAG signals and the 100 pathological<br />

speech signals were decomposed using the corresponding<br />

sets of LDB tree structures. Figs. 2 and 3 show the sample<br />

LDB tree structure obtained for the VAG and pathological speech<br />

databases respectively.<br />

The basis vectors from each of the nodes (LDBs) can be directly<br />

used as feature vectors; however, considering the dimension<br />

of the basis vectors, we extract the same features from the basis<br />

vectors of LDBs using the dissimilarity measures (D1, D2, and<br />

D3) [1]. That is, from each of the LDB nodes of the corresponding<br />

tree structures, the normalized node energy, correlation index<br />

and the variance measure were calculated. In short, each signal in<br />

the database is used to compute 15 features, 5 from each dissimilarity<br />

measure. For the correlation index calculation, we use a<br />
randomly chosen normal signal as a template to correlate with the<br />

signals from respective test databases. The above procedure was<br />



Tree decomposition nodes: (0,0); (1,0), (1,1); (2,0), (2,1), (2,2), (2,3); (3,0), (3,1), (3,6), (3,7); (4,12), (4,13); (5,24), (5,25).<br />

Fig. 2. A sample LDB tree decomposition for VAG database (db4<br />

wavelet and D3 dissimilarity measure)<br />

Tree decomposition nodes: (0,0); (1,0), (1,1); (2,0), (2,1); (3,0), (3,1); (4,0), (4,1), (4,2), (4,3); (5,2), (5,3), (5,6), (5,7).<br />

Fig. 3. A sample LDB tree decomposition for pathological speech<br />

database (cf4 wavelet and D3 dissimilarity measure)<br />

repeated for all three wavelets. So, in total, for each wavelet, a<br />
set of 15 features was extracted from each of the signals in<br />
the test database.<br />

Figs. 4 and 5 demonstrate the feature space with the first two<br />

dominant features of the VAG and pathological speech database<br />

respectively. From the figures of the feature space plots, the discriminatory<br />

boundaries can be visualized between the signals of<br />

class1 and class2. These extracted features were then fed to a linear<br />

discriminant based classifier, as explained in the next section.<br />

2.5. Pattern Classification<br />

The motivation for the pattern classification is to automatically<br />

group signals of same characteristics using the discriminatory features<br />

derived as explained in the previous section. Pattern classification<br />

was carried out by linear discriminant analysis (LDA) technique<br />

using the SPSS software [7]. In discriminant analysis, the<br />

feature vector derived as explained above were transformed into<br />

canonical discriminant functions such as<br />

f = x1b1 + x2b2 + ... + x42b42 + a, (5)<br />

where {x} is the set of features, and {b} and a are the coefficients and constant, respectively, estimated and derived using Fisher's linear discriminant functions [7].<br />

Fig. 4. Feature space with the first two dominant features - VAG database, db4 wavelet.<br />

Fig. 5. Feature space with the first two dominant features - Pathological speech database, cf4 wavelet.<br />

Using the chi-square distances<br />

and the prior probabilities of each group, classification<br />
is performed to assign each sample to one of the groups.<br />
The classification accuracy was estimated using the leave-one-out<br />
method, which is known to provide a least-biased estimate [4]. In<br />
the leave-one-out method, one sample is excluded from the dataset<br />
and the classifier is trained with the remaining samples. The<br />
excluded signal is then used as the test data and the classification accuracy<br />
is determined. This is repeated for all samples of the dataset.<br />
Since each signal is excluded from the training set in turn, the independence<br />
between the test and training sets is maintained.<br />

3. RESULTS AND DISCUSSIONS<br />

All the signals from both the databases were decomposed using<br />

their corresponding LDB tree structures. Features were extracted<br />

as explained in Section 2.4 and fed to the LDA based classifier.<br />

Classification accuracies were computed for the 9 combinations<br />

of the wavelet and the dissimilarity measures as shown in Table<br />



Wavelet LDA type D1 D2 D3<br />

db4 Regular 65 64 67<br />

Cross.V 61 57 64<br />

cf4 Regular 70 61 61<br />

Cross.V 65 57 48<br />

sy4 Regular 67 63 57<br />

Cross.V 61 60 45<br />

Table 1. Classification table for VAG database. Regular - Normal<br />

LDA, Cross.V - Leave-one-out method LDA, Classification<br />

accuracies are in percentage (%)<br />

Wavelet LDA type D1 D2 D3<br />

db4 Regular 84 64 77<br />

Cross.V 84 60 72<br />

cf4 Regular 85 52 92<br />

Cross.V 84 37 91<br />

sy4 Regular 87 53 86<br />

Cross.V 84 32 84<br />

Table 2. Classification table for pathological speech database.<br />

Regular - Normal LDA, Cross.V - Leave-one-out method LDA,<br />

Classification accuracies are in percentage (%)<br />

1 and Table 2 for both the databases. It can be observed from<br />
Table 1 that, although there are small variations, on average<br />
all three dissimilarity measures perform equally well for the<br />
VAG database. However, from Table 2 for the pathological speech<br />
database, it can be seen that the dissimilarity measures D1 and<br />
D3 provide high classification accuracies, whereas D2 performs<br />
poorly. Overall, for the VAG database, we observe that the db4 wavelet<br />
in combination with all three dissimilarity measures provides<br />
the highest classification accuracy. Similarly, for the pathological<br />
speech database, we observe that the cf4 wavelet in combination with D1 and<br />
D3 provides the highest classification accuracy. Using these combinations,<br />
we computed the highest possible classification accuracies<br />
for both the databases, as shown in Table 3 and Table 4.<br />

For the VAG database, an overall classification accuracy of<br />
78.7% using regular LDA and 76.4% using the leave-one-out method<br />
was achieved. This is higher than the classification accuracy<br />
reported in [5]. For the pathological speech database, an overall classification<br />
accuracy of 97% using regular LDA and 96% using the leave-one-out<br />
method was achieved. This is higher than the<br />
classification accuracy reported in [6]. The above results demonstrate the<br />
performance optimization of the LDB algorithm using the right<br />
choice and combination of the dissimilarity measures to achieve<br />
high classification accuracies for non-stationary signal analysis.<br />

4. CONCLUSIONS<br />

The importance of the dissimilarity measure in the performance<br />

optimization of the LDB algorithm was discussed with two classification<br />

examples. Classification accuracies were analyzed for<br />

different combinations of wavelets and the dissimilarity measures.<br />

Improvement in the classification accuracies by using a combination<br />

of multiple dissimilarity measures was demonstrated. High<br />

classification accuracies were achieved for the databases under<br />

study, thus proving the success of the modified LDB in analyzing<br />
non-stationary signals. Future work involves automating<br />
the choice of dissimilarity measures based on the nature of the<br />
databases and applications.<br />

Method Groups Normal Abnormal Total<br />
Regular Normal 39 12 51<br />
Abnormal 7 31 38<br />
% Normal 76.5 23.5 100<br />
Abnormal 18.4 81.6 100<br />
Cross.V Normal 39 12 51<br />
Abnormal 9 29 38<br />
% Normal 76.5 23.5 100<br />
Abnormal 23.7 76.3 100<br />
Table 3. Highest classification accuracy achieved for the VAG database (db4 wavelet and selective combination of D1, D2 and D3). Regular - Normal LDA, Cross.V - Leave-one-out method LDA, % = percentage of classification<br />

Method Groups Normal Pathological Total<br />
Regular Normal 48 2 50<br />
Pathological 1 49 50<br />
% Normal 96 4 100<br />
Pathological 2 98 100<br />
Cross.V Normal 48 2 50<br />
Pathological 2 48 50<br />
% Normal 96 4 100<br />
Pathological 4 96 100<br />
Table 4. Highest classification accuracy achieved for the pathological speech database (cf4 wavelet and combined D1 and D3). Regular - Normal LDA, Cross.V - Leave-one-out method LDA, % = percentage of classification<br />

5. REFERENCES<br />

[1] N. Saito and R. R. Coifman, “Local discriminant bases and<br />

their applications,” Journal of Mathematical Imaging and Vision,<br />

vol. 5, no. 4, pp. 337–358, 1995.<br />

[2] Andrew Webb, Statistical Pattern Recognition, Wiley, West<br />

Sussex, England, 2002.<br />

[3] Stephane Mallat, A wavelet tour of signal processing, Academic<br />

press, San Diego, CA, 1998.<br />

[4] K. Fukunaga, Introduction to Statistical Pattern Recognition,<br />

Academic Press, Inc., San Diego, CA, 1990.<br />

[5] S. Krishnan, “Adaptive signal processing techniques for analysis<br />

of knee joint vibroarthrographic signals,” in Ph.D dissertation,<br />

<strong>University</strong> of Calgary, June 1999.<br />

[6] K. Umapathy, S. Krishnan, V. Parsa, and D. Jamieson, “Discrimination<br />

of pathological voices using an adaptive timefrequency<br />

approach,” in Proceedings of ICASSP 2002 IEEE<br />

International conference on Acoustics, Speech and <strong>Signal</strong><br />

Processing, Orlando, USA, May 2002, pp. IV 3853–3855.<br />

[7] SPSS Inc., “SPSS Advanced statistics user’s guide,” in User<br />

manual, SPSS Inc., Chicago, IL, 1990.<br />



RADIO OVER MULTIMODE FIBER FOR WIRELESS ACCESS<br />

Roland Yuen Xavier N. Fernando Sridhar Krishnan<br />

<strong>Ryerson</strong> <strong>University</strong>, Toronto, Ontario, Canada<br />

ryuen@ee.ryerson.ca, xavier@ieee.org, krishnan@ee.ryerson.ca<br />

Abstract<br />

A radio over fiber link is a promising technology for<br />
antenna remoting applications. Typically, the radio<br />
over fiber link employs a single mode fiber, but the<br />
signal power at the remote antenna is very small. The<br />
main reason is the large power loss in the E/O and O/E<br />
converters. However, the coupling efficiency of an E/O converter<br />
can be improved with multimode fiber (MMF),<br />
so we propose a ROF link that uses a vertical-cavity<br />
surface-emitting laser with a graded-index MMF to<br />
transport optical signals. A multimode fiber has a larger<br />
core radius than an SMF, which<br />
allows more optical power to be coupled into the fiber. With<br />
simple butt-coupling techniques, the coupling efficiency<br />
can be 90%, and this simplicity reduces the cost of<br />
the link. Normally, the MMF is used in short-distance<br />
digital applications with a bandwidth-distance product<br />
of about 500 MHz·km, so it is suited to local area picocells.<br />
Our approach is to transmit passband signals<br />
such as QPSK and FSK through the ROF link. Our<br />
simulation shows that a 900 MHz carrier can be transported<br />
through a link 1.22 km long. In this paper, we investigate<br />
the feasibility of using an MMF for antenna<br />
remoting in local area picocells and examine the trade-off<br />
between coupling efficiency and bandwidth.<br />

Keywords: Radio over fiber; multimode fiber; remote<br />
antenna; coupling efficiency.<br />

1. INTRODUCTION<br />

A radio over fiber (ROF) link is used in remote antenna<br />
applications to distribute signals for microcell<br />
or picocell base stations (BSs). In the remote antenna<br />
application, the downlink RF signals are distributed<br />
from a central base station (CBS) to many BSs, known<br />
as radio access points (RAPs), through fibers. The uplink<br />
signals received at the RAPs are sent back to the<br />
CBS for any signal processing. A RAP is much more<br />
cost effective to deploy than a normal BS because it<br />
mostly consists of simple devices: an<br />
E/O converter, an O/E converter, and an amplifier. The<br />
cost of signal processing in a CBS is shared among<br />

CCECE 2004 - CCGEI 2004, Niagara Falls, May/mai 2004<br />

0-7803-8253-6/04/$17.00 @ 2004 IEEE<br />


many RAPs. In addition to the lower cost advantage, a<br />
smaller cell size coverage reduces the near-far effect and<br />
relaxes the battery requirement on mobile receivers.<br />

Although the fiber is a reliable medium with low<br />
attenuation (0.5 dB/km at 1550 nm), a challenge still<br />
exists in the large loss due to E/O and O/E conversion [1].<br />
In this paper, we propose to employ multimode fiber<br />
(MMF) to increase the coupling efficiency, which reduces<br />
the E/O conversion loss. However, MMF has<br />
limited bandwidth, largely due to modal dispersion.<br />

In this paper, we discuss two topics: the<br />
downlink architecture of the ROF link, and the tradeoff<br />
between power and bandwidth in the remote antenna<br />
application.<br />


2. THE RADIO OVER FIBER LINK<br />

Figure 1: Radio over fiber link in remote antenna application<br />

The radio over fiber (ROF) links in the remote antenna<br />
application are illustrated in Figure 1. The central base<br />
station (CBS) and the radio access points (RAPs) are<br />
connected through two fibers, which transport the uplink<br />
and downlink signals. The RAPs act as remote<br />
antennas that receive and transmit signals to mobile<br />
users, whereas the CBS collects signals from the RAPs<br />
for processing and distributes signals to all the RAPs.<br />

The downlink of the ROF can be divided into an op-<br />

tical channel and a wireless channel denoted by ROF<br />

and Air respectively in Figure 2. When a signal s(t)<br />

goes through the optical channel, it is attenuated by<br />



a loss of Lopt. After the optical channel, the signal<br />
is boosted by a gain Gel, and later in the wireless channel it<br />
is further attenuated by a path loss Lel. Noise nopt(t)<br />
is added to the signal in the optical channel, and noise<br />
nel(t) is added in the wireless channel. Finally, the quality<br />
of the received signal r(t) is determined from the<br />
signal-to-noise ratio (SNR) at the mobile user.<br />

Figure 2: Downlink block diagram of radio over fiber<br />

link<br />

2.1 Optical Channel<br />

The optical channel of the ROF link that uses a<br />
multimode fiber (MMF) is illustrated in Figure 3. It<br />
consists of an optical source, a fiber, and a photodetector.<br />


Figure 3: The optical channel<br />

The signal s(t) from the CBS can be in any form,<br />
such as QPSK, 16-PSK, or FSK. In mobile communication,<br />
the signal usually has a bandwidth of less than 2 MHz.<br />
The signal directly modulates a laser, which is biased to<br />
minimize nonlinearity and clipping distortion. The biased<br />
signal is given as<br />

sbias(t) = [1 + m s(t)] (1)<br />

where m is the optical modulation index.<br />

The impulse response of a MMF can be generalized<br />
as a Gaussian response [2] with respect to optical power,<br />
given as<br />

hmmf(t) = (1 / (σ √(2π))) exp( −(t − T)² / (2σ²) ) (2)<br />

where T is the delay of the channel and σ is the standard<br />
deviation of the impulse response, which increases linearly<br />
with the link distance: the longer the link, the more<br />
apparent the modal dispersion effect.<br />
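The Gaussian model of Eq. (2) can be exercised numerically. The sketch below is illustrative only (the helper names and the 0.5 ns/km default are our assumptions, the default taken from the value used in Section 4): it evaluates the magnitude of the MMF frequency response implied by a Gaussian impulse response whose standard deviation grows linearly with distance, and the resulting 3-dB bandwidth.

```python
import math

def mmf_impulse_std(dist_km, sigma_ns_per_km=0.5):
    """Standard deviation of the Gaussian impulse response in seconds;
    it grows linearly with the link distance."""
    return sigma_ns_per_km * 1e-9 * dist_km

def mmf_response(f_hz, dist_km, sigma_ns_per_km=0.5):
    """|H(f)| for a unit-area Gaussian impulse response:
    H(f) = exp(-2 * (pi * sigma * f)^2)."""
    sigma = mmf_impulse_std(dist_km, sigma_ns_per_km)
    return math.exp(-2.0 * (math.pi * sigma * f_hz) ** 2)

def mmf_3db_bandwidth_hz(dist_km, sigma_ns_per_km=0.5):
    """Frequency at which the optical-power response falls to 1/2 (3 dB)."""
    sigma = mmf_impulse_std(dist_km, sigma_ns_per_km)
    return math.sqrt(math.log(2.0) / 2.0) / (math.pi * sigma)
```

Because σ scales linearly with distance, the 3-dB bandwidth halves when the link length doubles, which is the bandwidth-distance tradeoff discussed in Section 3.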


Besides the distortion from modal dispersion, there<br />
is noise in the optical channel, combined here into a<br />
single term nopt(t). The output photocurrent is<br />
given as<br />

i(t) = (ℜ Ps / Lopt) [1 + m s(t)] * hmmf(t) + nopt(t) (3)<br />

where Ps is the average optical power emitted from the<br />

laser diode, Lopt is the loss in the optical channel, and<br />

hmmf(t) is the impulse response of a MMF.<br />

The Lopt includes the losses from the E/O and O/E<br />
conversion, the fiber attenuation, the connectors, and the<br />
matching of the transmitter and the receiver. In [1],<br />
the electrical loss in dB is given as<br />

Lopt,dB = −20 log(ℜ Gm) + 10 log(Zin / Zout) + 2(2 lc + α d) (4)<br />

and in linear form,<br />

Lopt = 10^(Lopt,dB / 10), (5)<br />

where ℜ is the responsivity of the photodetector in<br />
mA/mW, Gm is the modulation gain of the optical<br />
source in mW/mA, Zin is the impedance of the laser,<br />
Zout is the impedance of the optical receiver, lc is the<br />
optical connector loss, α is the attenuation per km of the<br />
fiber, and d is the distance of the link in km. In the above<br />
expression, the E/O and O/E conversion loss, connector<br />
loss, and fiber attenuation are doubled because they are<br />
losses in the optical domain. The modulation gain<br />
Gm is the coupling efficiency that accounts for the Fresnel<br />
loss and the misalignment loss; it reflects the quality<br />
of the coupling technique.<br />
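Equation (4) and its linear form can be sketched directly in code. This is an illustrative helper (the function and parameter names are ours, not the paper's), following the convention above that optical-domain losses count twice in the electrical domain:

```python
import math

def optical_link_loss_db(resp_ma_per_mw, gm_mw_per_ma, z_in, z_out,
                         conn_loss_db, atten_db_per_km, dist_km):
    """Electrical loss of the optical channel, Eq. (4).  The doubled
    terms are optical-domain losses, counted twice after square-law
    photodetection."""
    conversion = -20.0 * math.log10(resp_ma_per_mw * gm_mw_per_ma)
    matching = 10.0 * math.log10(z_in / z_out)
    optical = 2.0 * (2.0 * conn_loss_db + atten_db_per_km * dist_km)
    return conversion + matching + optical

def db_to_linear(loss_db):
    """Eq. (5): convert the loss from dB to a linear ratio."""
    return 10.0 ** (loss_db / 10.0)
```

With the Section 4 values (ℜ = 0.75 mA/mW, Gm = 0.80 mW/mA, matched impedances, 2 dB connectors, 2.5 dB/km fiber), a 1 km link gives about 17.4 dB of electrical loss.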

2.2 Optical <strong>Signal</strong> to Noise Ratio<br />


To evaluate the performance of an optical link, the<br />

optical signal to noise ratio (OSNR) is needed. It is<br />

evaluated at the output of the optical receiver. The<br />

OSNR can be expressed as follows,<br />

OSNR = m² Ps² ⟨s²(t)⟩ / ( Lopt ⟨n²opt(t)⟩ ) (6)<br />

The nopt(t) is the noise induced in the optical channel,<br />
and it is assumed to be additive white Gaussian noise.<br />
The noise consists of the relative intensity noise nRIN(t)<br />
from the laser, the shot noise nshot(t) from the<br />
photodetector, and the thermal noise nth(t) from the<br />
receiver electronics. In [1], the total noise power of<br />
the signal is given as<br />

⟨n²opt(t)⟩ = ( RIN Ip² + 2 q Ip + 4 kB T / RL ) B (7)<br />

where RIN is the relative intensity noise parameter, Ip is<br />
the average photocurrent, q is the electron charge, kB is<br />
Boltzmann's constant, T is the temperature, RL is the<br />
load resistance, and B is the noise bandwidth.<br />


3. COMPARISON OF MULTIMODE<br />

FIBER AND SINGLE MODE FIBER<br />

To reduce the penalty from the E/O conversion,<br />
multimode fiber (MMF) is used. The proposed system<br />
combines a vertical-cavity surface-emitting laser (VCSEL)<br />
with a graded-index MMF. In [3], the authors found<br />
that the coupling efficiency into a graded-index MMF<br />
strongly depends on the active laser diameter, the index<br />
guiding of the laser, and the transverse mode emission<br />
spectrum of the laser. The coupling efficiency also<br />
depends on the coupling technique. However, better<br />
coupling efficiency comes with a tradeoff in the bandwidth<br />
of the radio over fiber link.<br />

Physically, the MMF has a larger core diameter of<br />
50-200 μm, compared to 8-12 μm for the single mode<br />
fiber (SMF). In addition, the MMF has a higher numerical<br />
aperture, in the range of 0.19-0.30. A higher numerical<br />
aperture means a larger acceptance angle, which allows<br />
more optical power to be coupled into the fiber. These<br />
physical characteristics of the SMF and MMF are found<br />
in [4]. Moreover, a typical VCSEL has an active diameter<br />
of 16-20 μm [3], which is smaller than the core diameter<br />
of a MMF but larger than that of a SMF. Thus, a MMF<br />
can physically better capture the optical power emitted<br />
from a laser.<br />

It has been reported in [3] that the coupling efficiency<br />
can exceed 90%. This is achieved by butt-coupling a<br />
graded-index MMF to a weakly index-guided,<br />
proton-implanted VCSEL. The typical coupling efficiency,<br />
however, lies in the 70%-80% range.<br />

In contrast, the SMF has a typical coupling efficiency<br />
in the 40%-70% range [5]. Various coupling techniques<br />
have been proposed and evaluated in terms of their<br />
coupling efficiency and fabrication complexity. They can<br />
be generalized into butt coupling, lens coupling, and<br />
pigtail coupling. Butt coupling is usually used for MMF:<br />
the fiber is placed as close to the laser facet as possible,<br />
which gives good coupling efficiency and increases the<br />
misalignment tolerance [6]. Moreover, this technique is<br />
relatively easy to fabricate. However, it is not suitable<br />
for SMF because of the small core diameter and low<br />
numerical aperture of the SMF. In practice, more complex<br />
techniques such as lens coupling and pigtail coupling are<br />
used with SMF. In lens coupling, one or more lenses are<br />
placed between the laser facet and the optical fiber [5];<br />
this improves the coupling efficiency to more than 50%.<br />
However, it is hard to fabricate a lens suitable for the<br />
SMF, so the pigtail coupling technique is used instead:<br />
the laser is first coupled to a MMF, and then the MMF<br />
to the SMF. The extra coupling stage introduces<br />
additional coupling loss, but it is easier to fabricate [7].<br />
From the discussion above, the MMF clearly offers better<br />
coupling efficiency at lower cost.<br />

On the other hand, the bandwidth of the MMF is<br />
significantly less than that of the SMF. It is widely<br />
reported that, for digital systems, SMF has a bandwidth<br />
in the GHz·km range while MMF has a bandwidth of<br />
about 500 MHz·km. However, the MMF is sufficient for<br />
local picocells with short distances and bit rates in the<br />
low Mbps range, as demonstrated in the next section.<br />


4. NUMERICAL DISCUSSION<br />

In this section, simulation of the downlink transmission<br />
from the central base station to the radio access point is<br />
discussed. A vertical-cavity surface-emitting laser and a<br />
graded-index multimode fiber (MMF) are assumed for<br />
the radio over fiber link. The laser operates at 850 nm<br />
and emits 1 mW of optical power. We assume the same<br />
butt-coupling technique as in [3]. We also assume a<br />
relatively large Gm = 0.80 mW/mA. The responsivity ℜ<br />
of the optical receiver is 0.75 mA/mW. The σ of the<br />
MMF impulse response (2) is 0.5 ns/km [2] and the delay<br />
T is 30. The fiber attenuation is 2.5 dB/km and the<br />
connector loss is 2 dB. The system is assumed to be<br />
perfectly matched, so there is no matching loss. Noise is<br />
added according to (7) for a bandwidth of 2 MHz, a<br />
relative intensity noise parameter RIN of −155 dB/Hz, a<br />
50 Ω load resistance, and a temperature of 278 K. The<br />
optical signal to noise ratio (OSNR) of the link is<br />
calculated according to (6).<br />

Figure 4 shows four OSNR curves as a function of the<br />
ROF link distance under various channel and signal<br />
conditions. The topmost curve is the OSNR of a SMF,<br />
and the remaining curves are the OSNR of a graded-index<br />
MMF at various carrier frequencies. The dispersion of<br />
the MMF has a significant impact on the OSNR of the<br />
link. With a carrier frequency of 900 MHz at a distance<br />
of 1 km, there is about a 30 dB loss in OSNR compared<br />
to a SMF, and as the link distance increases the loss<br />
grows even faster. However, the application considered is<br />
a short-haul link. For a carrier frequency of 900 MHz,<br />
the ROF link can still support up to 1.22 km with an<br />
OSNR better than 10 dB. For 1200 MHz and 1500 MHz<br />
carriers, the link supports 910 m and 740 m respectively.<br />
Figure 5 shows the same OSNR curves, but with the<br />
noise bandwidth increased to 10 MHz. All the OSNR<br />
curves decrease by about 7 dB, and the distance a link<br />
can support decreases accordingly.<br />
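The OSNR trends of Figures 4 and 5 can be reproduced qualitatively with a short script. This is a hedged sketch, not the authors' simulation: Eq. (4) supplies the channel loss, the Gaussian response of Eq. (2) is evaluated at the RF carrier, and a textbook RIN + shot + thermal model (our assumption for Eq. (7), each term scaling with the noise bandwidth B) supplies the noise; absolute values will differ from the paper's figures.

```python
import math

Q_E = 1.602e-19    # electron charge (C)
K_B = 1.381e-23    # Boltzmann constant (J/K)

def osnr_db(dist_km, carrier_hz, noise_bw_hz,
            ps_w=1e-3, m=0.5, resp=0.75, gm=0.80,
            conn_loss_db=2.0, atten_db_per_km=2.5, sigma_ns_per_km=0.5,
            rin_db_per_hz=-155.0, r_load=50.0, temp_k=278.0):
    """OSNR of the multimode downlink (dB), Eq. (6) with a dispersion
    roll-off factor; modulation index m is an assumed value."""
    # Eq. (4) with matched impedances (no matching loss)
    loss_db = (-20.0 * math.log10(resp * gm)
               + 2.0 * (2.0 * conn_loss_db + atten_db_per_km * dist_km))
    loss = 10.0 ** (loss_db / 10.0)
    # Gaussian MMF response of Eq. (2) evaluated at the RF carrier
    sigma = sigma_ns_per_km * 1e-9 * dist_km
    h = math.exp(-2.0 * (math.pi * sigma * carrier_hz) ** 2)
    # mean photocurrent (responsivity in mA/mW is numerically A/W)
    i_avg = resp * ps_w
    # Eq. (6) numerator with <s^2(t)> = 1/2, including dispersion roll-off
    signal = 0.5 * (m * i_avg * h) ** 2 / loss
    # assumed form of Eq. (7): RIN + shot + thermal, each scaling with B
    rin = 10.0 ** (rin_db_per_hz / 10.0) * i_avg ** 2
    shot = 2.0 * Q_E * i_avg
    thermal = 4.0 * K_B * temp_k / r_load
    noise = (rin + shot + thermal) * noise_bw_hz
    return 10.0 * math.log10(signal / noise)
```

One check is exact regardless of the assumed constants: raising the noise bandwidth from 2 MHz to 10 MHz lowers every curve by 10 log10(5) ≈ 7 dB, matching the shift between Figures 4 and 5.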




Figure 4: OSNR versus distance of multimode ROF<br />

links with 2 MHz noise bandwidth<br />


Figure 5: OSNR versus distance of multimode ROF<br />

links with 10 MHz noise bandwidth<br />


5. CONCLUSION<br />

In this paper, we have investigated a radio over fiber<br />
(ROF) link that employs a graded-index multimode fiber<br />
(MMF) with a vertical-cavity surface-emitting laser to<br />
increase the coupling efficiency. A coupling efficiency of<br />
90% can be achieved with butt coupling, which is<br />
relatively simple for the MMF. With complexity<br />
translating into system cost, this makes the proposed<br />
system attractive. However, there is a tradeoff in<br />
bandwidth. For a 900 MHz carrier, the ROF link<br />
distance is restricted to 1.22 km with an optical signal to<br />
noise ratio better than 10 dB. For the remote antenna<br />
application of local picocells, this configuration of ROF<br />
is sufficient.<br />

References<br />

[1] X. N. Fernando and A. Anpalagan, “On the design of<br />
optical fiber based wireless access systems...”,<br />
International Conference on Communications, 2004, to<br />
be presented.<br />

[2] K. Azadet, E. F. Haratsch, H. Kim, F. Saibi, J. H.<br />
Saunders, M. Shaffer, L. Song, and M.-L. Yu,<br />
“Equalization and FEC techniques for optical<br />
transceivers”, IEEE Journal of Solid-State Circuits,<br />
vol. 37, no. 3, pp. 317-327, March 2002.<br />

[3] J. Heinrich, E. Zeeb, and K. J. Ebeling, “Butt-coupling<br />
efficiency of VCSELs into multimode fibers”, IEEE<br />
Photonics Technology Letters, vol. 9, no. 12, pp.<br />
1555-1557, Dec. 1997.<br />

[4] G. Keiser, Optical Fiber Communications, McGraw-Hill,<br />
Boston, MA, 2000.<br />

[5] J. M. Senior, Optical Fiber Communications: Principles<br />
and Practice, Prentice Hall, second edition, 1992.<br />

[6] J. A. Hiltunen, K. Kautio, J.-T. Makinen, P. Karioja,<br />
and S. Kauppinen, “Passive multimode<br />
fiber-to-edge-emitting laser alignment based on a<br />
multilayer LTCC substrate”, in Proc. 52nd Electronic<br />
Components and Technology Conference, IEEE, May<br />
2002, pp. 815-820.<br />

[7] L. A. Reith and P. W. Shumate, “Coupling sensitivity<br />
of an edge-emitting LED to single-mode fiber”, Journal<br />
of Lightwave Technology, vol. LT-5, no. 1, pp. 29-34,<br />
January 1987.<br />


SUB-DICTIONARY SELECTION USING LOCAL DISCRIMINANT BASES<br />

ALGORITHM FOR SIGNAL CLASSIFICATION<br />

Karthikeyan Umapathy and Anindya Das<br />

Dept. of Electrical and Computer Eng.,<br />

The <strong>University</strong> of Western Ontario,<br />

London, Ontario, CANADA.<br />

Email: kumapath@uwo.ca<br />

Abstract<br />

In signal decompositions using over-complere. redundant timefrequency<br />

(TF) dictionaries, oren it is challenging to restrict the<br />

dictionary to a sub-dictionary tailored for specific applications.<br />

In the proposed technique we used a similar appmach as Local<br />

Discriminant Bases Algorithm (mB) to select optimal TF subdictionaries<br />

for signal classification applications. A novel timewidth<br />

versus frequency band mapping was generated for each of<br />

the signal class. These mappings of different classes were compared<br />

using a discriminant measure ro arrive at a sub-dicrionary.<br />

This sub-dictionary was then used for decomposing the testing<br />

set signals, followed by fearure exrraction and classification. Two<br />

highly non-stationary bio-medical databases I . Vibroarrhrographic<br />

signals (89 signals, 51 normal and 38 abnormal) 2. Pathological<br />

speech darabase (103 signals, 50 normal and 50 pathological)<br />

were rested. Classification accuracies as high as 74.2% and 92%<br />

wem achieved respectively. Due Io the sub-dictionary appmach,<br />

appmximately a 40% reduction in signal decomposition time was<br />

observed for the tested databases.<br />

Keywords: timerfrequenq, sub-dictionary, matching pursuit, lo-<br />

cal discriminant bases, discriminanr measure<br />

1. INTRODUCTION<br />

Time-frequency (TF) transformations have significantly<br />
contributed to the area of automatic signal classification.<br />
TF transforms help us understand signals better and<br />
thereby extract strong clues, or features, for<br />
classification. Even though the complete TF plane<br />
contains details about the signals, in classification<br />
applications it is often a small area, or pockets of areas,<br />
in the TF plane that actually exhibits the difference<br />
between the classes of signals. The success of a<br />
classification application depends on how well these<br />
target areas can be identified and analyzed in the TF<br />
plane. Once the target areas are identified, it is easier to<br />
zoom into them by performing time- and<br />
frequency-localized decompositions to extract relevant<br />
features for classification.<br />

Pruning algorithms such as the Local Discriminant<br />
Bases (LDB) algorithm [1] were introduced to identify<br />
the target subspaces in the TF plane that exhibit high<br />
discrimination values between signal classes. However,<br />
most of the existing LDB literature deals only with<br />
dictionaries of orthonormal bases (wavelet packets).<br />
Considering the various advantages [2] of using<br />
redundant dictionaries for flexible signal representations,<br />
the proposed technique uses an adaptive time-frequency<br />
transformation (ATFT) based on the matching pursuit<br />
algorithm. The nature of the ATFT based on<br />

CCECE 2004 - CCGEI 2004, Niagara Falls, May/mai 2004<br />
0-7803-8253-6/04/$17.00 © 2004 IEEE<br />

Sridhar Krishnan<br />

Dept. of Electrical and Computer Eng.,<br />

<strong>Ryerson</strong> <strong>University</strong>,<br />

Toronto, Ontario, CANADA.<br />

Email: krishnan@ee.ryerson.ca<br />

matching pursuit is different from the wavelet packet<br />
transform (unlike the wavelet/wavelet packet transform,<br />
the scale and frequency parameters are not related in<br />
ATFT); hence the LDB approach to identifying the<br />
target subspace has to be modified before it can be<br />
applied to the matching pursuit based ATFT<br />
decomposition.<br />

In this paper we demonstrate the process of selecting a<br />
sub-dictionary (subspace) from a redundant dictionary,<br />
using an approach similar to LDB, for classification<br />
applications. The selected sub-dictionaries were then<br />
used to decompose two biomedical signal databases onto<br />
a localized TF plane. Features were extracted and<br />
classification was performed. The paper is organized as<br />
follows: Section II covers the methodology, consisting of<br />
ATFT, LDB, feature extraction and pattern<br />
classification. Results and conclusions are given in<br />
Section III.<br />

2. METHODOLOGY<br />

2.1. Adaptive Time-frequency Transform<br />

The signal decomposition technique used in this work is<br />
based on the matching pursuit (MP) [2] algorithm. MP<br />
is a general framework for signal decomposition; the<br />
nature of the decomposition varies according to the<br />
dictionary of basis functions used. When a dictionary of<br />
TF functions is used, MP yields an adaptive<br />
time-frequency transformation [2]. In MP, any signal<br />
x(t) is decomposed into a linear combination of TF<br />
functions gγn(t) selected from a redundant dictionary of<br />
TF functions:<br />

x(t) = Σn an gγn(t) (1)<br />

where<br />

gγn(t) = (1/√sn) g((t − pn)/sn) exp(j2π(fn t + φn)) (2)<br />

and an are the expansion coefficients. The scale factor<br />
sn, also called the octave or time-width parameter, is<br />
used to control the width of the window function, and<br />
the parameter pn controls the temporal placement. The<br />
parameters fn and φn are the frequency and phase of<br />
the exponential function respectively. The signal x(t) is<br />
projected over a redundant dictionary of TF functions<br />
with all possible combinations of scalings, translations<br />
and modulations. The dictionary of TF functions can be<br />
suitably modified or selected based on the application at<br />
hand. In our technique, we use the Gabor dictionary<br />
(Gaussian functions), which has the best TF localization<br />
properties. At each iteration, the TF function best<br />
correlated with the local signal structures is selected<br />
from the dictionary. The remaining signal, called the<br />
residue, is further decomposed in the same way at each<br />
iteration, subdividing the signal into TF functions.<br />

Theoretically, when using a redundant dictionary, the<br />
decomposition parameters an, sn, fn, pn and φn can<br />
take any values within their ranges. However, in the<br />
practical discrete implementation used in this work, sn<br />
can vary in powers of 2 from 2¹ to 2¹⁴, fn can vary<br />
from 0 to Fs/2 (Fs is the sampling frequency), pn can<br />
vary from 0 to the signal size, and φn can vary from 0<br />
to 1. The possible values taken by these parameters can<br />
be restricted to construct a sub-dictionary; in Section 2.3<br />
we demonstrate how these parameters can be restricted<br />
to a localized area in the TF plane for classification<br />
applications.<br />
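The greedy selection loop of MP can be sketched in a few lines. The following is a toy illustration over a small, explicitly enumerated dictionary of real-valued Gabor atoms (the function names and the dictionary layout are our choices, not the faster discretized implementation used in the paper):

```python
import numpy as np

def gabor_atom(n, s, p, f, phi=0.0):
    """Real unit-norm Gabor atom: Gaussian window of time-width s,
    centred at sample p, modulated to normalised frequency f."""
    t = np.arange(n)
    g = np.exp(-np.pi * ((t - p) / s) ** 2) * np.cos(2 * np.pi * f * t + phi)
    return g / np.linalg.norm(g)

def matching_pursuit(x, atoms, n_iter=10):
    """Greedy MP: pick the atom best correlated with the residue,
    subtract its projection, and repeat on the new residue."""
    residue = np.asarray(x, dtype=float).copy()
    expansion = []
    for _ in range(n_iter):
        corr = atoms @ residue               # all inner products at once
        best = int(np.argmax(np.abs(corr)))
        a = float(corr[best])                # expansion coefficient a_n
        residue -= a * atoms[best]
        expansion.append((best, a))
    return expansion, residue
```

Restricting which (sn, fn) pairs are enumerated when the atoms array is built is exactly the sub-dictionary mechanism of Section 2.3.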

2.2. Local Discriminant Bases Algorithm<br />

In the LDB [1] algorithm (using wavelet packet bases), a<br />
set of training signals xi from all C classes is<br />
decomposed into a full tree structure of order N. We<br />
restrict our explanation to binary wavelet packet trees.<br />
Let Ω0,0 be a vector space in Rⁿ corresponding to node<br />
0 of the parent tree. At each level, the vector space is<br />
split into two mutually orthogonal subspaces, given by<br />
Ωj,k = Ωj+1,2k ⊕ Ωj+1,2k+1, where j indicates the level<br />
of the tree and k represents the node index in level j,<br />
given by k = 0, ..., 2^j − 1. This process repeats until<br />
level J, giving rise to 2^J mutually orthogonal<br />
subspaces. The goal is to select the set of subspaces that<br />
provides maximum discriminant information between the<br />
classes of signals. Each node k contains a set of basis<br />
vectors Bj,k = {wj,k,l}, l = 0, ..., 2^(n−j) − 1, where 2ⁿ<br />
corresponds to the length of the signal. The signals xi<br />
can then be represented by a set of coefficients c as:<br />

xi = Σj,k,l cj,k,l wj,k,l . (3)<br />

The time index of the signals xi has been dropped for<br />
notational convenience. Basically, the signals xi are<br />
decomposed into 2^J subspaces with cj,k,l coefficients in<br />
each subspace. The subspaces which exhibit high values<br />
of the discriminant measure D are then used to expand<br />
the testing set signals, and features are extracted for<br />
classification.<br />

Unlike the wavelet packet decomposition, in ATFT the<br />
scale and frequency parameters are not explicitly<br />
related. The LDB strategy of splitting a subspace (node)<br />
to obtain children subspaces (nodes) does not apply to<br />
ATFT. In ATFT, any scale can occur in combination<br />
with any frequency (restricted only by the uncertainty<br />
principle), giving it extreme flexibility to obtain any<br />
local TF resolution. So we have to adapt the LDB<br />
approach before it can be applied to ATFT.<br />

2.3. Sub-dictionary Selection Process<br />

As we will be performing a two-group classification on<br />
the given datasets, the following sub-dictionary selection<br />
process is explained for a two-group classification of<br />
signals (class A and class B). Coarse TF decompositions<br />
were performed on the training sets of both classes of<br />
signals. Coarse TF decompositions can be achieved by<br />
controlling the step size of the decomposition<br />
parameters. In the proposed technique, out of the<br />
possible sn values (2¹ to 2¹⁴), we force the<br />
decomposition to select only the scales s1 = 2², s2 = 2⁶,<br />
s3 = 2¹⁰ and s4 = 2¹⁴. Similarly, we group the fn<br />
parameters into the frequency bands f1 = 0 to Fs/8,<br />
f2 = Fs/8 to Fs/4, f3 = Fs/4 to 3Fs/8 and f4 = 3Fs/8<br />
to Fs/2. As we choose to completely cover the frequency<br />
range in 4 bands without gaps, we allow the<br />
decomposition to choose fn from the complete range 0<br />
to Fs/2; later, in processing the decomposition<br />
parameters, we group them into the four frequency<br />
bands. With all the training signals decomposed using<br />
the restricted sn values, we group the decomposition<br />
parameters in combinations of (s1, s2, s3, s4) and<br />
(f1, f2, f3, f4). In total we can thus group them into 16<br />
cells, as shown in Fig. 1. The cells in the respective<br />
time-width versus frequency band mappings are<br />
numbered A1 to A16 and B1 to B16.<br />

Here it should be noted that the time-width axis in<br />
Fig. 1 does not correspond to time but to scale<br />
(time-width). During the decomposition process, any<br />
scale parameter can occur at any time position, so<br />
arranging the scale parameters from low to high does<br />
not mean they occur in that order in real time. Once we<br />
obtain this time-width versus frequency band mapping<br />
for the training set of signals of both classes, we average<br />
the mappings to get an averaged time-width versus<br />
frequency band mapping for each class of signals.<br />

In order to identify the cells which demonstrate high<br />
discriminant values between the classes, we use an<br />
approach similar to LDB. We define a discriminant<br />
measure D which is used to compare the corresponding<br />
cells in the time-width versus frequency band mappings.<br />
Unlike LDB with orthonormal bases, where the set of<br />
basis functions and their variations are limited and<br />
fixed, in ATFT the variations can theoretically be<br />
limitless (restricted only by the uncertainty principle).<br />
In other words, the TF tiling (TF resolution) is fixed for<br />
a particular scale function of wavelets/wavelet packets,<br />
although their position in the TF plane can be altered<br />
(wavelet packets). In ATFT it is difficult to assign a<br />
fixed subspace shape or size based only on the scale<br />
parameter or the frequency parameter; hence we choose<br />
both scale and frequency to assign a subspace on the TF<br />
plane. However, this cannot be generalized, as the<br />
combinations of scale and frequency can be limitless<br />
(restricted only by the uncertainty principle) depending<br />
on the signal structures.<br />

In the proposed technique we use the normalized<br />
cumulative energy difference between the cells as the<br />
discriminant measure D, given by:<br />

D(EA, EB) = |EA − EB| (4)<br />

and<br />

E = (1/Es) Σi ai² (5)<br />

where E is the normalized cumulative energy of the TF<br />
functions in a cell, ai is the energy coefficient of the i-th<br />
TF function, the sum runs over the k TF functions<br />
grouped in a cell, and Es is the total decomposed energy<br />
of the signal.<br />

We compare these cumulative energies of the<br />
corresponding cells and compute D, and sort the cells in<br />
decreasing order of D. The cells which yield high values<br />
of D exhibit significant differences between the classes;<br />
this indicates that the target space for classification lies<br />
within these cells. Fig. 1 pictorially explains the way we<br />
compare the corresponding cells using D; as an example,<br />
it shows a possible outcome with 5 cells (cross-hatched)<br />
identified as the most discriminative cells between the<br />
classes. These 5<br />
tibed as the highly discriminative cells between classes. These 5



cells are chosen as the first five cells when sorted in<br />
decreasing order of their discriminant values. We then<br />
identify the range covering all these cells along both the<br />
time-width and frequency band axes. In the given<br />
example, shown with dotted lines in Fig. 1, we choose<br />
the frequency band axis ranging from 0 to 3Fs/8 and<br />
the time-width ranging from s1 to s2. Once these ranges<br />
are identified, we restrict the redundant dictionary and<br />
construct a sub-dictionary with these time-width and<br />
frequency ranges. The testing set signals are then<br />
decomposed using this sub-dictionary, enabling them to<br />
zoom into only the target space in the TF plane. In the<br />
decomposition of the testing set signals using the<br />
sub-dictionary, we allow the decomposition to proceed in<br />
fine steps of time-width and frequency. This targeted<br />
decomposition yields parameters that contain highly<br />
discriminatory information between the classes. Features<br />
are extracted from these decomposition parameters and<br />
fed to a Linear Discriminant <strong>Analysis</strong> (LDA)<br />
based classifier, as explained in subsequent sections.<br />
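The cell-energy mapping and the discriminant ranking can be sketched as follows. This is an illustrative implementation of Eqs. (4)-(5) as we read them (the binning helpers, argument names, and bin-edge conventions are our assumptions, not the authors' code):

```python
import numpy as np

def cell_energy_map(coeffs, log2_scales, freqs, scale_edges, freq_edges,
                    total_energy):
    """Normalised cumulative energy per (time-width, frequency-band)
    cell, Eq. (5): E_cell = sum(a_i^2) / E_signal over the TF
    functions whose parameters fall inside the cell."""
    s_bin = np.digitize(log2_scales, scale_edges)
    f_bin = np.digitize(freqs, freq_edges)
    emap = np.zeros((len(scale_edges) + 1, len(freq_edges) + 1))
    for a, sb, fb in zip(coeffs, s_bin, f_bin):
        emap[sb, fb] += a * a
    return emap / total_energy

def top_discriminant_cells(emap_a, emap_b, top=5):
    """Rank cells by D = |E_A - E_B| (Eq. (4)) and return the best."""
    d = np.abs(emap_a - emap_b)
    flat = np.argsort(d, axis=None)[::-1][:top]
    return [tuple(np.unravel_index(i, d.shape)) for i in flat]
```

The cells returned by top_discriminant_cells define the time-width and frequency ranges from which the restricted sub-dictionary is then built.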


Fig. 2. Sub-dictionary selection for VAG (a) and Pathological<br />

speech signals (b).<br />

2.4. Feature Extraction and Pattern Classification<br />

Fig. 1. Sub-dictionary selection process<br />

We use the following two highly non-stationary databases for<br />

testing our proposed technique: 1. vibroarthrographic<br />
(VAG) signals and 2. pathological speech signals.<br />
Vibroarthrographic signals are the signals emitted from<br />
the human knee joints during an active movement of the<br />
leg; more details of this database can be found in [3].<br />
The pathological speech database contains speech signals<br />
from both normal and pathological talkers; more details<br />
of this database can be found in [4].<br />

As explained in Section 2.3, we obtained sub-dictionaries<br />
for both the VAG and pathological speech databases. In<br />
performing the coarse TF decomposition on the training<br />
set, a faster version of the ATFT algorithm [5] was used<br />
with 2000 iterations. Ten randomly selected signals from<br />
each class, of both the VAG and pathological speech<br />
databases, were used as the training set. We used the<br />
first 3 most discriminating cells in arriving at the<br />
time-width and frequency ranges. Figs. 2(a) and 2(b)<br />
show the ranges (cross-hatched) obtained for time-width<br />
and frequency for the VAG and pathological speech<br />
databases respectively. For the VAG database, based on<br />
the chosen cells, the time-width varies from 2² to 2¹⁰<br />
and the frequency varies from 0 to Fs/4. For the<br />
pathological speech database, the time-width varies from<br />
2⁶ to 2¹⁴ and the frequency varies from 0 to Fs/8. All<br />
89 VAG signals and 100 pathological speech signals were<br />
decomposed using their corresponding sub-dictionaries<br />
with the regular ATFT algorithm, using fine steps of<br />
time-width within the range of the sub-dictionary. The<br />
iterations were limited to 1000 for both the VAG and<br />
pathological speech signals, as we are only interested in<br />
the discriminative subspace in the TF plane and do not<br />
require a complete decomposition of the signal. Due to<br />
the sub-dictionary approach, the decomposition times<br />
were reduced by approximately 40% compared to using<br />
a full-range redundant dictionary; the reduction in<br />
decomposition time depends on how small the<br />
sub-dictionary is. The decomposition parameters were<br />
analyzed and the following features were extracted.<br />
1. Number of TF functions (F1cn): this feature is the<br />
number of TF functions falling into each of the cells<br />
covering the same area as the highly discriminative cells<br />
that were used to construct the sub-dictionary.<br />



2. Cumulative energy of the cells (F2cn): we compute<br />
the cumulative energy contained in each of the cells that<br />
were used to compute F1cn. It should be noted here<br />
that, as we are using fine steps of time-width in the<br />
decomposition of the testing signals, we have a larger<br />
number of cells covering the same range of the<br />
sub-dictionary. For example, in arriving at the<br />
sub-dictionary of the VAG signals we identified the cells<br />
corresponding to the 3 time-widths 2², 2⁶ and 2¹⁰.<br />
However, in decomposing the testing signals we used<br />
fine steps of time-width, which means the time-width<br />
step size is reduced from 4 to 1, so we now have 9 cells<br />
covering the same time-width range.<br />

Both the above explained feature vectors (Flcn and F2c,)<br />

were evaluated for their discriminant power and only 9 out the total<br />

18 features (both feature vector put together) were selected for the<br />

purpose of classiPcation. This selected 9 features contained fea-<br />

tures from both feature vectors (FIG,, and FZc,). The motiva-<br />

tion for the pattern classiPcation is to automatically group signals<br />

of same characteristics using the discriminatory features derived.<br />

In LDA, the feature vectors derived as explained above were transformed into canonical discriminant functions of the form<br />

f = u1 b1 + u2 b2 + ... + u9 b9 + a,   (6)<br />

where {u} is the set of features, and {b} and a are the coefficients and the constant, respectively, estimated using Fisher's linear discriminant functions [6]. Using the chi-square distances and the prior probabilities of each group, classification is performed by assigning each sample to one of the groups. The classification accuracy was estimated using the leave-one-out method, which is known to provide a least-bias estimate [7].<br />
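The discriminant classification with leave-one-out validation might be sketched as follows. The paper used SPSS; this numpy stand-in implements a plain Fisher linear discriminant on synthetic features, so the data and group separation are illustrative assumptions:

```python
import numpy as np

def fisher_lda_fit(X, y):
    """Fisher's linear discriminant: f(x) = b.x + a, sign gives the group."""
    m0, m1 = X[y == 0].mean(axis=0), X[y == 1].mean(axis=0)
    Sw = np.cov(X[y == 0].T) + np.cov(X[y == 1].T)   # within-class scatter
    b = np.linalg.solve(Sw, m1 - m0)                 # the {b} coefficients
    a = -0.5 * b @ (m0 + m1)                         # constant: threshold midway
    return b, a

# Toy stand-in for the 9 selected F1/F2 features of two signal groups.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 1.0, (30, 9)),        # "normal" group
               rng.normal(2.0, 1.0, (30, 9))])       # "abnormal" group
y = np.array([0] * 30 + [1] * 30)

# Leave-one-out: train on all samples but one, classify the held-out sample.
hits = 0
for i in range(len(y)):
    mask = np.arange(len(y)) != i
    b, a = fisher_lda_fit(X[mask], y[mask])
    hits += int((X[i] @ b + a > 0) == (y[i] == 1))
print(hits / len(y))        # near 1.0 for these well-separated toy groups
```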

Table 1. Classification accuracy achieved for the VAG database. Regular = normal LDA, Cross.V = leave-one-out LDA, % = percentage of classification.<br />

Method | Groups | Normal | Abnormal | Total<br />
Regular | Normal | 35 | 16 | 51<br />
| Abnormal | 9 | 29 | 38<br />
% | Normal | 68.6 | 31.4 | 100<br />
| Abnormal | 23.7 | 76.3 | 100<br />
Cross.V % | Abnormal | - | 81.6 | -<br />

Table 2. Classification accuracy achieved for the pathological speech database. Regular = normal LDA, Cross.V = leave-one-out LDA, % = percentage of classification.<br />

Method | <strong>Group</strong>s | Normal | Pathological | Total<br />
Regular | Normal | 48 | 2 | 50<br />
| Pathological | 6 | 44 | 50<br />
% | Normal | 96 | 4 | 100<br />
| Pathological | 12 | 88 | 100<br />
Cross.V | Normal | 48 | 2 | 50<br />
| Pathological | 6 | 44 | 50<br />
% | Normal | 96 | 4 | 100<br />
| Pathological | 12 | 88 | 100<br />


- 2004 -<br />


3. RESULTS AND CONCLUSIONS<br />

The paper describes a novel way of constructing a target-specific sub-dictionary from a redundant dictionary for classification applications. High classification accuracies were achieved with approximately 40% reduction in decomposition time. Two highly non-stationary biomedical databases were used to demonstrate the performance of the proposed technique.<br />

Features were extracted as explained in Section 2.4 for all 89 VAG signals and the 100 pathological speech signals. They were fed to an LDA-based classifier using the SPSS software. Classification was performed and the results are given in Tables 1 and 2. For the VAG database, an overall classification accuracy of 74.2% using regular LDA and 70% using the leave-one-out method was achieved. This is higher than the classification accuracy reported in [3]. For the pathological speech database, an overall classification accuracy of 92% was achieved with both regular LDA and the leave-one-out method. This is higher than the classification accuracy reported in [4]. However, the classification accuracies for both databases are not higher than the authors' recent work with LDB-based classification (74.2% vs. 78.6% and 92% vs. 97%). The results obtained with the proposed technique can be justified considering the following facts: (1) the novelty involved in the sub-dictionary construction; (2) at the time of writing, only Gaussian basis functions had been tested with the databases; (3) the reduced decomposition times; and (4) the simple features. Future work involves refining the proposed technique to include more bases and optimizing the targeted decompositions to yield higher classification accuracies than those reported.<br />

Acknowledgements<br />

The authors thankfully acknowledge NSERC for funding this project. The authors also acknowledge the LastWave software package group.<br />

References<br />

[1] N. Saito and R. R. Coifman, "Local discriminant bases and their applications," Journal of Mathematical Imaging and Vision, vol. 5, no. 4, pp. 337-358, 1995.<br />

[2] S. G. Mallat and Z. Zhang, "Matching pursuits with time-frequency dictionaries," IEEE Trans. <strong>Signal</strong> Processing, vol. 41, no. 12, pp. 3397-3415, 1993.<br />

[3] S. Krishnan, "Adaptive signal processing techniques for analysis of knee joint vibroarthrographic signals," Ph.D. dissertation, <strong>University</strong> of Calgary, June 1999.<br />

[4] K. Umapathy, S. Krishnan, V. Parsa, and D. Jamieson, "Discrimination of pathological voices using an adaptive time-frequency approach," in Proceedings of ICASSP 2002, IEEE International Conference on Acoustics, Speech and Signal Processing, Orlando, USA, May 2002, pp. IV-3853-3855.<br />

[5] R. Gribonval, "Fast matching pursuit with a multiscale dictionary of Gaussian chirps," IEEE Transactions on <strong>Signal</strong> Processing, vol. 49, no. 5, May 2001.<br />

[6] SPSS Inc., "SPSS Advanced Statistics User's Guide," user manual, SPSS Inc., Chicago, IL, 1990.<br />

[7] K. Fukunaga, Introduction to Statistical Pattern Recognition, Academic Press, Inc., San Diego, CA, 1990.


Proceedings of the 25th Annual International Conference of the IEEE EMBS<br />

Cancun, Mexico, September 17-21, 2003<br />

Ultrasound Backscatter <strong>Signal</strong> Characterization and Classification Using<br />

Autoregressive Modeling and Machine Learning Algorithms<br />

Noushin R. Farnoud¹, Michael Kolios¹,²<br />

Co-author: Sridhar Krishnan¹<br />

¹Department of Electrical Engineering, <strong>Ryerson</strong> <strong>University</strong>, Toronto, Canada<br />

²Department of Mathematics, Physics and Computer Science, <strong>Ryerson</strong> <strong>University</strong>, Toronto, Canada<br />

Abstract- This research explores the possibility of monitoring apoptosis and classifying clusters of apoptotic cells based on the changes in ultrasound backscatter signals from the tissues. The backscatter from normal and apoptotic cells, acquired using a high frequency ultrasound instrument, is modeled through an autoregressive (AR) modeling technique. The proper model order is calculated by tracking the error criteria in the reconstruction of the original signal. The AR model coefficients, which are assumed to contain the main statistical features of the signal, are passed as the input to linear and nonlinear machine classifiers (Fisher Linear Discriminant, Conditional Gaussian Classifier, Naive Bayes Classifier and Neural Networks with nonlinear activation functions). In addition, an adaptive signal segmentation method (Least Squares Lattice Filter) is used to separate the data from layers of different cell types into stationary parts ready for modeling and classification.<br />

Keywords-Apoptosis, Ultrasound Backscatter<br />

I. INTRODUCTION<br />

High frequency ultrasound (US) has been shown to detect the structural changes cells and tissues undergo during cell death. <strong>Research</strong> has shown that the ultrasound backscatter signals from apoptotic¹ acute myeloid leukemia (AML) cells differ in intensity and frequency spectrum as a result of the change in size, spatial distribution and acoustic impedance of the scattering sources within the cell [1] (Fig. 1). Therefore, we assume that pulse-echo data from different cell types contain distinguishable statistical regularities. In this work we attempt to classify normal and apoptotic cancerous cells by tracking the statistics of the ultrasound backscatter signals from tissues, using the autoregressive (AR) method for time-series modeling of ultrasound signals.<br />

II. METHODOLOGY<br />

A. Autoregressive (AR) Modeling of US signals<br />

Biomedical signals contain large quantities of data. Moreover, these data usually contain redundancies which make processing and analysis more difficult. In such situations, signal modeling may help to remove the<br />

¹ Apoptosis is a genetically determined destruction of cells from within, due to activation of a stimulus or removal of a suppressing agent or stimulus.<br />

0-7803-7789-3/03/$17.00 ©2003 IEEE<br />


Fig. 1. (a) H&E² stains of normal cells; (b) H&E stains of apoptotic cells.<br />

irrelevant information carried by the signal, and simplifies classification and segmentation by using a reduced number of model parameters. Autoregressive (AR) modeling is widely used for speech and biomedical signal processing [2-4]. This model is linear and has been used successfully for high-resolution spectral estimation [5]. An AR model is defined by the difference equation:<br />

x(n) = -Σ_{k=1}^{p} a_k x(n-k) + e(n),   (1)<br />

where x(n) is a wide-sense stationary³ AR process, {a_k} represents the AR coefficients, e(n) is white Gaussian noise and p is the model order, which determines the error criterion. In Section C, we will present a way to estimate this error and reduce it by choosing the proper model order (p).<br />
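Eq. (1) can be illustrated with a toy AR(2) process whose coefficients are then re-estimated by least squares. The coefficient values and the simulated signal are illustrative assumptions, not the paper's ultrasound data:

```python
import numpy as np

# Simulate a toy AR(2) process per Eq. (1) and re-estimate {a_k} by
# least squares.
rng = np.random.default_rng(0)
a_true = np.array([-0.75, 0.5])       # x(n) = 0.75 x(n-1) - 0.5 x(n-2) + e(n)
N, p = 5000, 2
x = np.zeros(N)
e = rng.normal(0.0, 1.0, N)
for n in range(p, N):
    x[n] = -a_true @ x[n - p:n][::-1] + e[n]   # past samples, newest first

# Linear prediction problem: x(n) ~ -sum_k a_k x(n-k).
X = np.column_stack([x[p - k:N - k] for k in range(1, p + 1)])
a_hat = -np.linalg.lstsq(X, x[p:], rcond=None)[0]
print(np.round(a_hat, 2))             # close to a_true
```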

B. Data Acquisition<br />

AML cells were grown in suspension and exposed to the chemotherapeutic cisplatin to induce apoptosis. Pellets were made by swing-bucket centrifugation. Details on the biological procedure can be found elsewhere (Czarnota et al.) [6]. A 20 MHz f2.35 or 40 MHz f2 transducer (VisualSonics⁴) was used to image the pellets of normal and apoptotic cells. RF backscatter data were digitized at 500 MHz and stored for later analysis. In one experiment, layers of normal and apoptotic cells were created to emulate a clinical situation.<br />

C. Choosing the proper Model Order<br />

The modeling order (p) controls the error associated with the AR signal approximation. This parameter<br />

² Hematoxylin and Eosin.<br />

³ A stochastic process is called wide-sense stationary (WSS) if its mean is constant and its autocorrelation depends only on the time difference.<br />

⁴ www.visualsonics.com<br />



determines the number of previous samples used to model the original signal. A small model order ignores the main statistical properties of the original signal, while a large model order results in modeling the noise associated with the data, and over-fitting⁵ occurs. A very common method for estimating the proper model order is the Akaike Information Criterion (AIC) [7], although applying this method would be very difficult in our work due to the nature of US signals. Instead, we used the following parameters, based on the statistics of the reconstructed signal and its frequency content at different model orders, to determine the best modeling order.<br />

a) Ensemble Reconstruction Error<br />

The error (2) shows the total difference between the original and reconstructed signals in the frequency domain using the AR modeling technique:<br />

x̂(n) = -Σ_{k=1}^{p} a_k x(n-k),<br />

E = Σ_{n=1}^{N} |f(n) - f̂(n)|,   (2)<br />

where x̂(n) is the approximated signal based on AR modeling with order p, N is the total number of samples within an individual RF line, and f and f̂ represent the FFTs of the original and estimated signals, respectively.<br />
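A minimal sketch of the ensemble reconstruction error, assuming a least-squares AR fit and a synthetic narrowband signal in place of a real RF line:

```python
import numpy as np

def ar_fit(x, p):
    """Least-squares AR(p) coefficients for x(n) = -sum_k a_k x(n-k) + e(n)."""
    X = np.column_stack([x[p - k:len(x) - k] for k in range(1, p + 1)])
    return -np.linalg.lstsq(X, x[p:], rcond=None)[0]

def ensemble_error(x, p):
    """Summed absolute difference of the FFTs of x and its AR reconstruction."""
    a = ar_fit(x, p)
    x_hat = np.zeros_like(x)
    for n in range(p, len(x)):
        x_hat[n] = -a @ x[n - p:n][::-1]
    return np.abs(np.fft.fft(x) - np.fft.fft(x_hat)).sum()

# Toy narrowband signal standing in for an RF line (not real US data).
rng = np.random.default_rng(0)
n = np.arange(2048)
x = np.sin(0.2 * np.pi * n) + 0.1 * rng.normal(size=n.size)

errors = {p: ensemble_error(x, p) for p in (1, 5, 15, 40)}
print(errors[15] < errors[1])     # -> True: the error shrinks with model order
```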

b) Model Noise (error) Variance<br />

The AR process is the output of an all-pole filter driven by white noise e(n). This noise, which is also our modeling error, can be viewed as the output of the prediction error filter A(z), as shown in Fig. 2, where x(n) is the original signal and A(z) is the transfer function of the AR model.<br />

Fig. 2. Block diagram of the AR process (model error).<br />

Therefore we expect that, after estimating the AR coefficients of our model, if we apply a filter as shown in Fig. 2 with the estimated AR coefficients in A(z), the filter output e(n) will be white Gaussian noise. We can verify this by estimating the variance of the output of such a filter and its auto-correlation (which jumps to one at zero lag and remains approximately zero otherwise).<br />
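The whiteness check can be sketched by filtering a toy AR(2) signal through its prediction-error filter A(z) and inspecting the residual's normalized autocorrelation (synthetic data, not US signals):

```python
import numpy as np

# Filter a toy AR(2) signal through its prediction-error filter
# A(z) = 1 + a1 z^-1 + a2 z^-2 and check that the residual is white.
rng = np.random.default_rng(0)
a = np.array([-0.75, 0.5])
N = 5000
x = np.zeros(N)
for n in range(2, N):
    x[n] = -(a[0] * x[n - 1] + a[1] * x[n - 2]) + rng.normal()

e = np.convolve(x, np.r_[1.0, a])[:N]     # residual e(n) = A(z) x(n)
# Normalized autocorrelation at lags 0..3: ~1 at zero lag, ~0 elsewhere.
rho = [float(e[k:] @ e[:N - k]) / float(e @ e) for k in range(4)]
print(round(rho[0], 2))                   # -> 1.0
print(all(abs(r) < 0.1 for r in rho[1:])) # -> True
```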

D. <strong>Signal</strong> Segmentation<br />

The classification methods we discussed were based on<br />

US backscatter from pure apoptotic and normal cell pellets.<br />

⁵ When the model does well on training data but poorly on test data.<br />


In patient imaging, the data are acquired from tissues which contain different layers, or layers with different mixtures of normal and apoptotic cells. The probabilistic behavior of the backscattered US signal from these cells makes the signal non-stationary⁶. This non-stationarity is important from the point of view of AR modeling, as this method is applicable only if the signal is stationary⁷. Therefore we must use signal segmentation algorithms to break the signal acquired from tissues into stationary segments and classify each segment separately. Segmentation algorithms can be classified into fixed⁸ [8] and adaptive [2, 9-11]. Adaptive segmentation algorithms rely on tracking the statistical changes in the signal (such as mean and variance) to set a breaking boundary. We used this method for US signals due to its accuracy, modularity and ease of testing [2].<br />

E. Adaptive <strong>Signal</strong> Segmentation: Recursive Least Squares Lattice Filter (RLSL)<br />

In adaptive segmentation, the segment length changes dynamically according to the statistical changes in the signal. The main idea of using the RLSL filter is to achieve fast convergence by using forward and backward filters. The parameter which expresses the statistical change in the signal is called the convergence factor, γ_m(n). The convergence factor provides the connecting link between the different sets of a priori and a posteriori estimation errors in this algorithm and<br />

is defined by<br />

γ_m(n) = γ_{m-1}(n) - b_{m-1}²(n) / B_{m-1}(n),   (3)<br />

where m is the order of the lattice filter, γ_m(n) is the convergence factor at time sample n in the mth stage of the lattice, and b_{m-1}(n) and B_{m-1}(n) are the backward prediction error and its power at that stage [2].<br />
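A toy sketch of this recursion, assuming the textbook form gamma_m(n) = gamma_{m-1}(n) - b_{m-1}(n)^2 / B_{m-1}(n); the lattice errors below are synthetic numbers, not the output of a full RLSL implementation:

```python
# Conversion-factor recursion with synthetic backward errors.
def convergence_factor(b, B):
    """b[m], B[m]: backward prediction error and its power at stage m."""
    gamma = 1.0                     # gamma_0(n) = 1 by definition
    for bm, Bm in zip(b, B):
        gamma -= bm ** 2 / Bm
    return gamma

# Stationary data: backward errors small relative to their power.
print(round(convergence_factor(b=[0.1, 0.1], B=[1.0, 1.0]), 2))   # -> 0.98
# A statistical jump inflates the errors and gamma drops toward zero,
# which is what flags a segment boundary.
print(round(convergence_factor(b=[0.7, 0.5], B=[1.0, 1.0]), 2))   # -> 0.26
```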

IV. RESULTS<br />

a) Model Order Determination for Autoregressive (AR)<br />

Modeling of US signals<br />

Using the error criteria explained in Section C, we calculated the error associated with the frequency content of the reconstructed and original US signals, averaged over 30 normal and apoptotic sample RF lines, respectively (Fig. 3). Matlab (version 6.5) was used for all the calculations. Also, as explained in Section D, we found the variance of the<br />

⁶ The statistics of a non-stationary process vary with respect to any translation along the time axis.<br />

⁷ We have determined that US signals from normal and apoptotic cells are quasi-stationary.<br />

⁸ Fixed segmentation algorithms are widely used for speech signal processing.<br />



estimated noise generated as the output of a filter with the estimated AR coefficients in its transfer function and the original signal as its input. The result of averaging the variance of this noise over 30 samples is shown in Fig. 4. These graphs indicate that model order 15 (p=15) is a good choice of AR modeling order for high frequency US backscatter signals, as we do not see much improvement in the ensemble error (the ratio of the error between model orders 15 and 40 is 2.6, in comparison to 2.9e5 between model orders 1 and 15). Furthermore, the variance of the estimated model noise does not change dramatically after this model order. To verify this result, we modeled a US backscatter signal with order 15, reconstructed the signal with the estimated AR coefficients and found the auto-correlation of the model error⁹ (noise). As depicted in Fig. 5, this auto-correlation indicates the similarity of the estimated error to white noise. Therefore we used AR modeling with order 15 for US backscatter signals in the rest of this paper.<br />

Fig. 3: Average ensemble error between the FFTs of the estimated and original US signals (30 samples of normal and apoptotic signals).<br />

Fig. 4: Average variance of the estimated model noise based on the estimated AR coefficients (30 samples).<br />

⁹ This error was assumed to be the absolute difference between the original and reconstructed signals.<br />


Fig. 5: Auto-correlation of the estimated model error (noise).<br />

b) Ultrasound <strong>Signal</strong> Classification<br />

Algorithm | Normal Accuracy | Apoptotic Accuracy<br />
Conditional Gaussian Classifier¹⁰ | 40% | 46%<br />
Naive Bayes Classifier | 60% | 71%<br />
Fisher's Linear Discriminant | 98% | 64%<br />
Neural Network¹¹ (sigmoid activation) | 93.8% | 99%<br />
Neural Network (tanh activation) | 95.5% | 99%<br />

These results show the ability of neural networks with non-linear activation functions (in both hidden and output layers) to classify US signals from normal and apoptotic cells. We are still investigating the advantages and disadvantages of each approach.<br />

c) Ultrasound <strong>Signal</strong> Segmentation<br />

Fig. 5 shows the RLSL algorithm applied to a layered Normal-Apoptotic-Normal cell pellet, with the apoptotic layer located between samples 800 and 1500. As long as the input data are stationary, the convergence factor remains in the same range, but when it drops below a<br />

¹⁰ The priors for each class were set equally (p=0.5).<br />

¹¹ The network was trained using 50,000 iterations.<br />



threshold¹², it indicates a sudden change in the statistical properties of the signal, which is set as the segment boundary.<br />

Fig. 5. (a) Original signal from a 3-layer Normal-Apoptotic-Normal cell pellet. (b) Convergence factor as a parameter to detect the layer (stationarity) boundaries.<br />

These figures indicate that the RLSL algorithm can detect the sudden changes in the signal due to the different statistical properties of the normal and apoptotic layers, and can therefore adaptively find their corresponding boundaries in a US backscatter signal. While in Fig. 5(a) the difference is evident, in clinical situations it is anticipated that a small percentage of apoptotic cells will be surrounded by normal cells.<br />

V. CONCLUSION<br />

The best model order for the AR technique applied to US signals was found to be p=15. The accuracy of different classifiers was studied, and it was found that non-linear neural networks were the most successful in classification. Because actual clinical data from patients include US backscatter from layers and mixtures of cells, a method for<br />

¹² The threshold in this work is set by visual inspection (however, in the future it will be extracted from the signal based on its statistical properties).<br />


differentiating these layers was presented, which enables AR modeling to be applicable to US signals.<br />

ACKNOWLEDGMENT<br />

We thank Dr. Michael Sherar and the Ontario Cancer Institute of the Princess Margaret Hospital for their support, Anoja Giles for helping us with the biological work, and Dr. Gregory Czarnota for his scientific input. Noushin R. Farnoud would also like to thank Dr. Sam Roweis of the Computer Science Department of the <strong>University</strong> of Toronto for his help with the machine learning algorithms.<br />

REFERENCES<br />

[1] MC. Kolios, GJ. Czarnota, M. Lee, JW. Hunt, MD. Sherar, Ultrasonic spectral parameter characterization of apoptosis, Ultrasound Med. & Biol., 28(5): 589-597, May 2002.<br />

[2] S. Krishnan, Adaptive Filtering, Modeling, and Classification of Knee Joint Vibroarthrographic <strong>Signal</strong>s, Master's Thesis, <strong>University</strong> of Calgary, Alberta, Canada, 1996.<br />

[3] F. Towfiq, C.W. Barnes, E.J. Pisa, Tissue classification based on autoregressive models for ultrasound pulse echo data, Acta Electronica, 1984, (26): 95-110.<br />

[4] A. Nair, BD. Kuban, N. Obuchowski, DG. Vince, Assessing spectral algorithms to predict atherosclerotic plaque composition with normalized and raw intravascular ultrasound, Ultrasound in Med. & Biol., 27(10): 1319-1331, 2001.<br />

[5] M. Akay, Time Frequency and Wavelets in Biomedical <strong>Signal</strong> Processing, Piscataway, NJ: IEEE Press, 1998: 123-135.<br />

[6] GJ. Czarnota, MC. Kolios, J. Abraham, M. Portnoy, FP. Ottensmeyer, JW. Hunt, MD. Sherar, Ultrasound imaging of apoptosis: high-resolution non-invasive monitoring of programmed cell death in vitro, in situ and in vivo, Br J Cancer, 1999 Oct; 81(3): 520-527.<br />

[7] Y. Sakamoto, M. Ishiguro, G. Kitagawa, Akaike Information Criterion Statistics, D. Reidel Publishing Company, KTK Scientific Publishers, Tokyo, ISBN 90-277-2253-6, November 1986.<br />

[8] J.D. Markel, A.H. Gray, Jr., Linear Prediction of Speech, Springer-Verlag, New York, NY, 1976.<br />

[9] D. Michael, J. Houchin, Automatic EEG analysis: A segmentation procedure based on the autocorrelation function, Electroenceph. Clinical Neurophysiology, (46): 232-235, 1979.<br />

[10] G. Bodenstein, H.M. Praetorius, Feature extraction from the electroencephalogram by adaptive segmentation, Proceedings of the IEEE, 65(5): 642-652, May 1977.<br />

[11] H.M. Praetorius, G. Bodenstein, O.D. Creutzfeldt, Adaptive segmentation of EEG records: A new approach to automatic EEG analysis, Electroencephalography and Clinical Neurophysiology, vol. 42, pp. 84-91, 1977.<br />

[12] T. Mitchell, Machine Learning, McGraw Hill, 1997.<br />

[13] C.D. Nugent, J.A. Lopez, A.E. Smith, Prediction Models in Design of Neural Network based ECG classifiers, BMC Medical Informatics and Decision Making, 2001.<br />

[14] S. Chakrabarti, N. Bindal, Robust Radar Target Classifier Using Artificial Neural Networks, IEEE Transactions on Neural Networks, 6(3), May 1995.<br />

[15] D. Docampo, Intelligent Methods in <strong>Signal</strong> Processing and Communications, Birkhauser, Boston, 1997.<br />

[16] D.M. Skapura, Building Neural Networks: Algorithms, Applications, and Programming Techniques, ACM Press, 1998.<br />

[17] J.A. Freeman, D.M. Skapura, Building Neural Networks, ACM Press, 1998.


ROBUST AUDIO WATERMARKING USING A CHIRP BASED TECHNIQUE<br />

Serhat Erkucuk, Sridhar Krishnan and Mehmet Zeytinoglu<br />

Department of Electrical and Computer Engineering<br />

<strong>Ryerson</strong> <strong>University</strong>, Toronto, ON M5B 2K3 Canada<br />

e-mail: {serkucuk, krishnan, mzeytin}@ee.ryerson.ca<br />

ABSTRACT<br />

In this study, we propose a new spread spectrum audio wa-<br />

termarking algorithm that embeds linear chirps as water-<br />

mark messages. Different chirp rates, i.e., slopes on the<br />

time-frequency (TF) plane, represent watermark messages<br />

such that each slope corresponds to a different message.<br />

We extract the watermark message using a line detection al-<br />

gorithm based on the Hough-Radon transform (HRT). The<br />

HRT detects the directional elements that satisfy a paramet-<br />

ric constraint in the image of a TF plane. The proposed<br />

method not only detects the presence of watermark, but also<br />

extracts the embedded watermark bits and ensures the mes-<br />

sage is received correctly. The results show that the HRT de-<br />

tects the embedded watermark message even after common<br />

signal processing operations such as MPEG audio coding,<br />

resampling, lowpass filtering and amplitude re-scaling.<br />

1. INTRODUCTION<br />

In recent years, the digital format has become the standard<br />

for the representation of multimedia content. Today’s tech-<br />

nology allows the copying and redistribution of multime-<br />

dia content over the Internet at a very low or no cost. This<br />

has become a serious threat for multimedia content owners.<br />

Therefore, there is significant interest to protect copyright<br />

ownership of multimedia content (audio, image, and video).<br />

Watermarking is the process of embedding additional data<br />

into the host signal for copyright ownership. The embed-<br />

ded data characterizes the owner of the data and should be<br />

extracted to prove the ownership. Besides copyright protec-<br />

tion, watermarking may be used for data monitoring, finger-<br />

printing, and observing content manipulations. All water-<br />

marking techniques should satisfy a set of requirements [1].<br />

In particular, the embedded watermark should be: (i) imper-<br />

ceptible, (ii) undetectable to prevent unauthorized removal,<br />

(iii) resistant to all signal manipulations, and (iv) extractable<br />

to prove ownership. Before the proposed technique is made<br />

public, all the above requirements should be met.<br />

This work was supported by NSERC and Micronet.<br />

The watermarking literature describes two classes of watermarking schemes. The first class of techniques, called one-bit watermarks [2], only detects the presence of the watermark rather than extracting it [3, 4, 5]. The second class of techniques detects and extracts the embedded watermark message [6, 7, 8]. If b bits represent the embedded watermark message, the detector has the task of correctly identifying the watermark message from an alphabet of 2^b<br />

possible watermark messages. As a result of signal manipu-<br />

lations some message bits extracted by the detector may be<br />

in error potentially resulting in the detection of the wrong<br />

watermark message. Our motivation for the proposed au-<br />

dio watermarking algorithm is to detect the presence of the<br />

watermark, extract the embedded watermark message bits<br />

and decide on the watermark message even if some bits are<br />

received in error. To achieve this goal, we embed linear<br />

chirps as watermark messages. Different chirp rates, i.e.,<br />

slopes on the TF plane, represent watermark messages such<br />

that each slope corresponds to a different message. The nar-<br />

rowband watermark messages are spread with a watermark<br />

key (binary PN sequence) across a wider range of frequen-<br />

cies before embedding. The resulting wideband noise is<br />

perceptually shaped and added to the original signal. The<br />

original and watermarked signals exhibit no perceptual dif-<br />

ferences. At the receiver a line detection algorithm based<br />

on the Hough-Radon transform (HRT) detects the slope of<br />

the extracted chirp in the image of the TF plane, even at<br />

discontinuities corresponding to bit errors.<br />

2. WATERMARK EMBEDDING<br />

Let x = [x(0) x(1) ...]^T be the audio signal, which we first divide into N-sample long blocks. We use the notation x_k = [x(kN) ... x((k+1)N-1)]^T to represent the samples of the kth audio block. Let m be a normalized chirp function that represents the watermark message to be embedded into the original signal. m takes continuous values in the interval [-1, 1], and needs to be quantized for the detection of each embedded bit. m_q is the quantized version of m, formed according to the sign of the sample values of m and taking values -1 and 1. Let m_k represent a watermark<br />

0-7803-7965-9/03/$17.00 ©2003 IEEE, ICME 2003<br />



message bit to be embedded into the kth audio block. We embed one watermark bit into each block. Each bit is spread with a binary PN sequence p_k with a chip length of N to generate the wideband noise vector w_k. We need to perceptually shape w_k before adding it to each block. To make w_k imperceptible, the amplitude level of the wideband noise should be attenuated to 0.5 percent of the dynamic range of the host signal [9]. Let w'_k = [w'(kN) ... w'((k+1)N-1)]^T be the signal-dependent wideband noise generated from w_k such that<br />

w'_k(n) = α w_k(n) |x_k(n)|,   (1)<br />

where α is the embedding strength. The high frequency band of the wideband noise sequence w'_k will not be robust to compression and lowpass filtering. Therefore, we generate the frequency-limited noise w''_k by lowpass filtering w'_k with a cut-off frequency of 10 percent (i.e., 2.2 kHz) of the maximum audio frequency, which represents the part of the signal spectrum with significant energy content. After limiting the maximum frequency of the wideband noise to 2.2 kHz, we found that α = 0.3 (independent of the audio signal) is sufficient to achieve imperceptibility. This value of α differs from what is used in [9] because we embed a frequency-limited noise rather than a wideband noise. The resulting watermarked audio signal block y_k equals<br />

y_k = x_k + w''_k.   (2)<br />

We repeat the process for each block until we embed all the bits in m_q. The resulting watermarked signal y is perceptually the same as the original signal x. Figure 1 provides an overview of the proposed watermark embedding scheme.<br />

Fig. 1. Watermark embedding scheme.<br />
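The embedding steps above can be sketched as follows. The block length, chirp parameters and per-block PN generation are illustrative assumptions, and the lowpass stage that produces the frequency-limited noise is omitted for brevity:

```python
import numpy as np

# Toy spread-spectrum embedding of a quantized chirp message.
rng = np.random.default_rng(0)
N, K = 1024, 64                            # samples per block, number of blocks
x = rng.normal(0.0, 0.2, N * K)            # stand-in for the host audio signal

t = np.linspace(0.0, 1.0, K)
m = np.cos(2 * np.pi * (2 * t + 10 * t ** 2))   # linear chirp message in [-1, 1]
mq = np.sign(m)                            # quantized bits, +/- 1
alpha = 0.3                                # embedding strength

y = x.copy()
for k in range(K):
    pk = rng.choice([-1.0, 1.0], N)        # binary PN spreading sequence (the key)
    blk = slice(k * N, (k + 1) * N)
    w = mq[k] * pk                         # one message bit spread over the block
    y[blk] += alpha * w * np.abs(x[blk])   # cf. Eq. (1): signal-dependent shaping
print(np.max(np.abs(y - x)) < 1.0)         # -> True: the perturbation stays small
```

In practice the receiver must regenerate the same PN sequences p_k from the shared watermark key in order to despread each block.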

3. WATERMARK DETECTION<br />

Under ideal signal conditions, the received signal will be identical to the transmitted sequence y_k. In Section 4, we will relax this condition and investigate the proposed watermarking scheme under the assumption that the received signal differs from y_k as a result of various signal processing operations. Assuming ideal signal conditions and perfect synchronization of the signal and the PN sequence, we first lowpass filter the received signal y_k to 2.2 kHz. Let<br />

y''_k represent the output of the lowpass filter in the receiver. Since w''_k is band-limited to 2.2 kHz, we can express y''_k as:<br />

y''_k = x'_k + w''_k,   (3)<br />


where x''_k is the audio signal component at the output of the lowpass filter. We then use the watermark key, i.e., the PN sequence p_k, to despread y''_k and integrate the resulting sequence to recover m_k, the embedded message bit for that block. Let ⟨y''_k, p_k⟩ be the output of this integration operation, where ⟨·,·⟩ represents the inner product operation.<br />
<br />
Fig. 2. Watermark bit detection scheme.<br />

Under the assumption that x_k is a zero-mean sequence which is statistically independent of p_k, we can approximate the expected value of ⟨y''_k, p_k⟩ by the expression<br />

E{⟨y''_k, p_k⟩} ≈ αβ m_k Σ_{n=0}^{N−1} |s_k(n)|, (4)<br />

where β is a positive constant resulting from the filtering operations. Therefore, the extracted message bit m̂_k, which estimates m_k, can be based on the test statistic ⟨y''_k, p_k⟩ such that<br />
<br />
m̂_k = sgn(⟨y''_k, p_k⟩). (5)<br />
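Under these assumptions the per-block bit decision reduces to a despread-and-sign operation. A hypothetical sketch (reusing an FFT brick-wall `lowpass` as a stand-in for the receiver filter; names are illustrative, not the authors' code):

```python
import numpy as np

def lowpass(sig, cutoff_hz, fs):
    """Brick-wall FFT lowpass used as a stand-in receiver filter."""
    spec = np.fft.rfft(sig)
    freqs = np.fft.rfftfreq(len(sig), d=1.0 / fs)
    spec[freqs > cutoff_hz] = 0.0
    return np.fft.irfft(spec, n=len(sig))

def detect_bit(y_k, p_k, fs, cutoff_hz=2200.0):
    """Recover one block's message bit from the test statistic <y''_k, p_k>."""
    y2_k = lowpass(y_k, cutoff_hz, fs)   # limit to the watermark band
    stat = np.dot(y2_k, p_k)             # despread and integrate
    return 1 if stat >= 0 else -1
```

Because the host audio is roughly uncorrelated with the PN sequence, the correlation statistic is dominated by the embedded term and its sign recovers the bit.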

To achieve improved watermark extraction performance we postprocess the extracted message bits using a time-frequency distribution (TFD). After all message bits are extracted, we construct the TFD (spectrogram) of the extracted bit sequence. The TFD of a chirp watermark message is a line with variable slope. A line detection algorithm based on the HRT then detects the presence of the line and estimates its parameters. This second stage, consisting of the TFD and the HRT, functions as an error-correcting technique and significantly increases the robustness of the proposed watermarking scheme.<br />
<br />
Fig. 3. Detection of the watermark message.<br />

The HRT is an efficient tool to detect energy-varying directional chirps [10]. It treats the TFD as an image, where each pixel value corresponds to the energy present at a particular time and frequency [10]. The Radon transform (RT) computes the projections of different angles of an image (TFD) or two-dimensional data distribution f(u, v) measured as line integrals along ray paths [11]:<br />

Authorized licensed use limited to: <strong>Ryerson</strong> <strong>University</strong> Library. Downloaded on July 7, 2009 at 11:03 from IEEE Xplore. Restrictions apply.


R(p, θ) = ∫∫ f(u, v) δ(p − (u cos θ + v sin θ)) du dv, (6)<br />
<br />
where θ is the angle of the ray path of integration, p is the distance of the ray path from the center of the image and δ is the Dirac delta function. Equation (6) represents integration of f(u, v) along the line p = u cos θ + v sin θ.<br />
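A discrete version of this projection can be sketched directly from the definition. The following is a hypothetical brute-force accumulator, closer to a gray-level Hough vote than to a fast Radon implementation; function and variable names are illustrative:

```python
import numpy as np

def hough_radon(img, n_theta=90, n_rho=101):
    """Accumulate gray-level line integrals R(rho, theta) over an image.

    Each pixel votes with its intensity for every (rho, theta) line
    passing through it: rho = u*cos(theta) + v*sin(theta), with (u, v)
    measured from the image center.
    """
    h, w = img.shape
    thetas = np.linspace(0.0, np.pi, n_theta, endpoint=False)
    rho_max = 0.5 * np.hypot(h, w)
    acc = np.zeros((n_rho, n_theta))
    vs, us = np.mgrid[0:h, 0:w]
    u = us - w / 2.0   # horizontal offset from image center
    v = vs - h / 2.0   # vertical offset from image center
    for j, th in enumerate(thetas):
        rho = u * np.cos(th) + v * np.sin(th)
        idx = np.round((rho + rho_max) / (2 * rho_max) * (n_rho - 1)).astype(int)
        np.add.at(acc[:, j], idx.ravel(), img.ravel())
    return acc, thetas, np.linspace(-rho_max, rho_max, n_rho)
```

A straight bright line in the image concentrates its votes into a single (rho, theta) cell, so the accumulator peak gives the line's orientation and offset, which is the property the watermark detector relies on.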

The Hough transform (HT) is a pattern-recognition method that calculates the number of image points satisfying a parametric constraint [12]. The HT can be applied only to binary images. However, TFDs can be gray-level images with intensity levels corresponding to energy values. The HRT method is the combination of the HT and the RT. It has the advantage over the HT that it can be applied to gray-level images to detect energy-varying chirp components [10]. The HRT method can also detect lines with discontinuities. This characteristic allows the extraction of the watermark message even if some of the watermark bits are incorrectly detected.<br />
<br />
Fig. 4. Line detection using HRT.<br />

Once the points on the two-dimensional data distribution f(u, v) (in this case the probability density function of the TFD) that satisfy directional parametric constraints are found, the presence of the chirp, i.e., the watermark, is decided. If a watermark is present, the slope of the chirp determines the watermark message.<br />

4. ROBUSTNESS TESTS AND DISCUSSIONS<br />

We evaluated the proposed scheme using 5 different audio signals (S1, …, S5) sampled at 44.1 kHz. Due to the limited resolution of the spectrogram, watermark messages are modulated as linear chirp functions with initial and final frequencies in one of the 17 frequency bands of 30 Hz bandwidth in the [0-510] Hz range. This approach allowed us to use a message alphabet with 289 possible watermark messages. We embedded these messages into audio signals of 40 second duration for a chip length of 10000, and into audio signals of 20 second duration for a chip length of 5000. Hence, each audio signal contains 176 embedded message bits. To measure the robustness of the system, we performed the following tests: (i) T1: MP3 compression with bit rate<br />


128 kbps, (ii) T2: MP3 compression with bit rate 80 kbps, (iii) T3: lowpass filtering to 4 kHz, (iv) T4: resampling at different sampling rates (22.05 kHz and 11.025 kHz), and (v) T5: amplitude scaling. We use the notation T0 to refer to watermark detection without signal manipulation. Therefore, the results corresponding to T0 serve as a reference.<br />

After the watermark-embedded signal y goes through a signal manipulation process, the message bits are extracted using the detection scheme described in Section 3. During all the robustness tests, we assumed that the audio signal and the PN sequence are synchronized. Tables 1 and 2 show the bit error rate (BER) results expressed as a percentage of the total number of message bits (176) for the two chip lengths and for each signal manipulation operation. Extracted bits<br />

Table 1. BER (in percentage) for N = 10000.<br />

Audio Sample | T0/T4/T5 | T1 | T2 | T3<br />
S4 | 3.98 | 3.98 | 3.98 | 10.80<br />
S5 | 3.98 | 3.98 | 4.55 | 9.66<br />
<br />
Table 2. BER (in percentage) for N = 5000.<br />

are localized on the TF plane using a spectrogram generated by a fixed-window-length short-time Fourier transform. Although some bits are received in error (even in the case of no signal manipulation), the HRT correctly detected the slope of the chirp functions in the image of the TF plane and successfully extracted the embedded watermark messages, thus providing error-correction capability. In the simulations reported in this study we detected all the embedded watermark messages correctly. Figure 5 shows the TFDs of the message bits embedded with chip length 5000 and extracted after various signal manipulations for the audio signal S5 of 20 second duration.<br />

The definition of the watermark message as a linear chirp function limits the data payload. We can increase the data payload by using any of the following techniques: (1) embedding watermark messages in shorter signal segments; (2) selecting the initial and final frequencies for the watermark messages from a wider frequency band than the current [0-510] Hz band; (3) narrowing the 30 Hz decision bands using higher TF resolution. We can improve TF<br />



Fig. 5. TFDs of the embedded and extracted bits.<br />

resolution by using an adaptive TF representation of the signal based on longer windows for slowly varying functions and shorter windows for quickly varying functions. However, any of the above techniques can potentially degrade the detectability of the watermark. We are currently investigating the potential of these techniques for increasing the data payload and their impact on the performance of the proposed watermark detection scheme.<br />

To test the robustness of the HRT with respect to large discontinuities, we corrupted the signal by adding white Gaussian noise of 5 second duration starting at the 15 second mark of a 40 second long audio sample. During this interval the watermark bit detection scheme incorrectly detected a significant number of bits (BER = 50%). Yet, the HRT successfully detected the slope of the linear chirp at the discontinuity and extracted the message.<br />

The initial robustness tests for additive noise, multiple watermarks, and multiple attacks resulted in small BERs; these scenarios will be evaluated further.<br />

5. CONCLUSIONS<br />

In this paper, we proposed a new audio watermarking algorithm that extracts the watermark message even if some of the message bits are extracted in error. A line detection algorithm based on the HRT detects the slope of the watermark message in the image of the TF plane of the signal. The HRT provides error-correcting capability and can be efficiently implemented as it operates on small images of the TF plane. Initial studies confirm that the proposed algorithm achieves robustness with respect to common signal manipulations. The current implementation yields a modest data payload. However, the use of higher-resolution TFDs promises to increase the data payload while retaining all the desirable characteristics of the proposed watermarking algorithm. We are currently working on the synchronization problem and the expansion of the robustness tests.<br />

6. REFERENCES<br />

[1] M. Arnold, "Audio watermarking: Features, applications and algorithms," IEEE Intl. Conf. on Multimedia and Expo, vol. 2, pp. 1013-1016, 2000.<br />
[2] I.J. Cox, M.L. Miller and J.A. Bloom, Digital Watermarking, San Diego, Academic Press, 2002.<br />
[3] S. Lee and Y. Ho, "Digital audio watermarking in the cepstrum domain," IEEE Trans. on Consumer Electronics, vol. 46, no. 3, pp. 744-750, August 2000.<br />
[4] P. Bassia, I. Pitas and N. Nikolaidis, "Robust audio watermarking in the time domain," IEEE Trans. on Multimedia, vol. 3, no. 2, June 2001.<br />
[5] D. Kirovski and H. Malvar, "Spread-spectrum audio watermarking: Requirements, applications, and limitations," IEEE Fourth Workshop on Multimedia Signal Processing, pp. 219-224, October 2001.<br />
[6] M.D. Swanson, B. Zhu and A.H. Tewfik, "Current state of the art, challenges and future directions for audio watermarking," IEEE Intl. Conf. on Multimedia Computing and Systems, vol. 1, pp. 19-24, 1999.<br />
[7] W.N. Lie and L.C. Chang, "Robust high quality time-domain audio watermarking subject to psychoacoustic masking," IEEE Intl. Symp. on Circuits and Systems, vol. 2, pp. 45-48, 2001.<br />
[8] J.W. Seok and J.W. Hong, "Audio watermarking for copyright protection of digital audio data," Electronics Letters, vol. 37, no. 1, pp. 60-61, Jan. 2001.<br />
[9] W. Bender, D. Gruhl, N. Morimoto and A. Lu, "Techniques for data hiding," IBM Systems Journal, vol. 35, nos. 3 & 4, pp. 313-336, 1996.<br />
[10] R.M. Rangayyan and S. Krishnan, "Feature identification in the time-frequency plane by using the Hough-Radon transform," Pattern Recognition, vol. 34, pp. 1147-1158, 2001.<br />
[11] G.T. Herman, Image Reconstruction from Projections: The Fundamentals of Computerized Tomography, New York, Academic Press, 1980.<br />
[12] R.O. Duda and P.E. Hart, "Use of the Hough transform to detect lines and curves in pictures," Communications of the ACM, 15(1): 11-15, January 1972.<br />



TIME-FREQUENCY FILTERING OF INTERFERENCES IN SPREAD SPECTRUM<br />

COMMUNICATIONS<br />

ABSTRACT<br />

A novel technique to excise single- and multi-component chirp-like interferences in direct sequence spread spectrum communications is proposed. The received signal is decomposed into its time-frequency (TF) functions using an adaptive signal decomposition algorithm and the TF functions are mapped onto the TF plane. The TF plane is optimized and treated as an image, and the interference represented in the TF plane is detected using the Hough-Radon transform (HRT). Simulation results with synthetic models have shown successful performance for the excision of linear and non-linear chirp interferences. The method has shown immunity to both white noise and multiple interferences even under very low SNR conditions of -10 dB.<br />

Keywords: interference excision, spread-spectrum communications, adaptive signal decomposition, Hough-Radon transform, time-frequency filtering<br />

1. INTRODUCTION<br />

Serhat Erkucuk and Sridhar Krishnan<br />
<br />
Department of Electrical and Computer Engineering<br />
<br />
Ryerson University, Toronto, ON M5B 2K3, Canada<br />
<br />
e-mail: (serkucuk)(krishnan)@ee.ryerson.ca<br />

In spread spectrum communications, the message signal is modulated and spread over a wider bandwidth with a pseudo-noise (PN) code also known at the receiver, and is transmitted over the channel. The bandwidth increase of the transmitted signal yields a processing gain, defined as the ratio of the bandwidth of the transmitted signal to the bandwidth of the message signal. Although the processing gain provides a high degree of interference suppression, there is a trade-off between increasing the processing gain and the available frequency spectrum. In the case of a high interference-to-signal ratio (ISR), a spread spectrum system with a limited processing gain may not be able to suppress the interference. Therefore, excising the interference prior to despreading the received signal is necessary to increase the performance of the system [1].<br />

In this study, we will evaluate the proposed interference excision algorithm using the direct sequence spread spectrum (DSSS) system, one of the most widely used spread spectrum techniques [1]. In DSSS, m_k, the kth bit of the<br />
<br />
This work was supported by NSERC and Micronet.<br />

0-7803-7946-2/03/$17.00 ©2003 IEEE<br />

message signal m(t), is multiplied with a PN code p(t), where each message bit occurs every T_m seconds and the PN code bits every T_p seconds. The processing gain, i.e., the length of the PN code, is therefore L = T_m/T_p, where T_m >> T_p. During the transmission of the modulated signal, additive white Gaussian noise n(t) and interference i(t) are added to the signal in the channel, and the following signal is received:<br />
<br />
r(t) = m_k p(t) + n(t) + i(t). (1)<br />

At the receiver, the received signal r(t) is synchronized and correlated with the same PN code p(t) and the estimate of the message bit m̂_k is made:<br />
<br />
m̂_k = ⟨r(t), p(t)⟩ = m_k ⟨p(t), p(t)⟩ + ⟨n(t), p(t)⟩ + ⟨i(t), p(t)⟩. (2)<br />
<br />
As seen in the above equation, despreading of the received signal recovers the message signal, while spreading the noise and the interference. The decision is made on the polarity of m̂_k. If the ratio of the interference power to the signal power is so large that the processing gain cannot suppress the interference, then the estimate of the message bit, m̂_k, may be wrong. Therefore the interference should be suppressed prior to despreading.<br />
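The spread-despread cycle of Eqs. (1) and (2) can be illustrated with a toy baseband model. This is a hypothetical NumPy sketch with illustrative names; it uses rectangular chips and one bit per PN period, and is not the authors' simulation code:

```python
import numpy as np

def spread(bits, pn):
    """Spread each +/-1 message bit by the full PN chip sequence."""
    return np.concatenate([b * pn for b in bits])

def despread(received, pn):
    """Correlate each chip block with the PN code and decide on polarity (Eq. 2)."""
    L = len(pn)
    blocks = received.reshape(-1, L)
    stats = blocks @ pn            # <r, p> per message bit
    return np.where(stats >= 0, 1, -1)

rng = np.random.default_rng(0)
pn = rng.choice([-1.0, 1.0], size=128)       # processing gain L = 128
bits = np.array([1, -1, 1, 1, -1])
tx = spread(bits, pn)
noise = 0.5 * rng.standard_normal(tx.size)   # channel noise n(t)
interf = 2.0 * np.cos(2 * np.pi * 0.05 * np.arange(tx.size))  # narrowband i(t)
rx = tx + noise + interf
```

With this moderate interference level the correlator's signal term (of magnitude L) dominates the spread noise and interference terms, so the bits are recovered; a much stronger interference would overwhelm the processing gain, which is the case the excision algorithm targets.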

Some excision techniques such as adaptive notch filtering, decision-directed adaptive filtering, and analog-to-digital conversion techniques are commonly used to suppress narrowband interferences in DSSS [1]. However, if the interference has a narrow instantaneous bandwidth in a wideband frequency range, as chirps do, time-frequency (TF) methods perform well in localizing the interference [2]. Several techniques have been proposed to suppress the interference using time-frequency distributions (TFDs) of the signal [3, 4, 5]. TFDs localize any interference in both the time and frequency domains [2], and are ideally suited for interference excision. The commonly used TFDs suffer from a trade-off between TF resolution and cross-term suppression.<br />

Authorized licensed use limited to: Ryerson University Library. Downloaded on July 7, 2009 at 09:49 from IEEE Xplore. Restrictions apply.<br />
<br />
In this paper, we focus on a new excision technique based on constructing a positive TFD [6, 7, 8] of the received spread spectrum signal using an adaptive signal decomposition technique [9]. The block diagram of the proposed interference excision algorithm is shown in Figure 1. By decomposing a signal into components, the interaction between components can be avoided, and the TFD constructed by combining the TFDs of the individual components is free of cross-terms. Also, by using Gaussian functions as bases for decomposition, a high TF resolution of interference signals can be achieved. By treating the TF plane as an image, the interference patterns can be detected by using the image analysis technique of the Hough-Radon transform (HRT). Curves with mathematical equations can be easily detected by transforming the shapes in the TF image into the Hough domain (also known as the parametric domain) and searching for dominant peaks (maximum values). The co-ordinates of the dominant peaks provide the parameters of the shape. Interferences are reconstructed by suitably thresholding the corresponding energy values in the TF plane and subtracted from the received spread spectrum signal.<br />

The paper is organized as follows: in Section 2, the construction of the TF image is explained. The HRT theory for detection of chirps in the TF image is explained in Section 3. In Section 4, the performance of the proposed system is evaluated in terms of ISR, bit error rate (BER) and average chip error rate. The paper is concluded in Section 5.<br />

Figure 1: Interference excision algorithm.<br />

2. CONSTRUCTION OF TIME-FREQUENCY<br />

IMAGE<br />

The adaptive signal decomposition algorithm we use to decompose the signal into its TF functions is the matching pursuit (MP) algorithm [9]. In MP, the received signal r(t) is decomposed into a linear combination of TF functions g_γn(t) selected from an overcomplete dictionary of TF functions. The signal r(t) can be represented as<br />
<br />
r(t) = Σ_{n=0}^{∞} a_n g_γn(t), (3)<br />
<br />
where<br />
<br />
g_γn(t) = (1/√s_n) g((t − p_n)/s_n) exp[j(2π f_n t + φ_n)], (4)<br />
<br />
and a_n are the expansion coefficients. The scale factor s_n controls the width of the window function and p_n is the temporal placement coefficient. The parameters f_n and φ_n represent the frequency and the phase of the exponential function, respectively. The signal r(t) is projected onto an overcomplete dictionary of TF functions with all possible window sizes, frequencies and temporal placements. At each iteration, the best-correlated function is selected from the dictionary and the remainder of the signal, called the residue, is further decomposed using the same iteration procedure. For our application, we use the Gabor dictionary consisting of Gaussian functions. Gaussian functions satisfy the minimum time-bandwidth product and represent the signal on the TF plane with an optimal time-frequency resolution [2]. After M iterations, the signal r(t) can be represented as<br />

r(t) = Σ_{n=0}^{M−1} ⟨R^n r, g_γn(t)⟩ g_γn(t) + R^M r, (5)<br />
<br />
where R^n r represents the residue of the signal r(t) after n iterations. The first term in Eqn. 5 represents the first M Gaussian functions matching the signal best (we will refer to the first term as r'(t)) and the second term (referred to as r''(t)) represents the residue of the signal r(t). In order<br />

for the signal to be fully decomposed, the iteration process should continue until all the energy of the residue signal is consumed. In this study, we are interested in modeling the interferences with power higher than the power of the transmitted signal. The unmodeled part of the interference is suppressed by the processing gain. Therefore, to reduce the computational load, we stop the iterations when the power of the residue signal r''(t) becomes less than the expected power of the interference-free signal. After the signal decomposition is achieved, the TFD W(t, ω) may be constructed by taking the Wigner-Ville distribution (WVD) [2] of the Gaussian functions represented in r'(t):<br />

W(t, ω) = Σ_{n=0}^{M−1} |⟨R^n r, g_γn(t)⟩|² W_{gγn}(t, ω) + Σ_{n=0}^{M−1} Σ_{m=0, m≠n}^{M−1} ⟨R^n r, g_γn(t)⟩ ⟨R^m r, g_γm(t)⟩ W_{gγn, gγm}(t, ω), (6)<br />

where W_{gγn}(t, ω) is the WVD of the Gaussian window function. The double sum corresponds to the cross-terms of the WVD and should be rejected in order to obtain a cross-term free energy distribution of r'(t) in the TF plane [9]. Therefore the resulting TFD is given by the first term of W(t, ω), which we denote by W'(t, ω). W'(t, ω) is a positive and cross-term free distribution, but it does not satisfy the marginal properties<br />

∫ W'(t, ω) dω ≠ |r'(t)|² and ∫ W'(t, ω) dt ≠ |R'(ω)|², (7)<br />

required in order to be a proper TFD for feature identification applications. The TFD W'(t, ω) may be modified to satisfy the marginal requirements and still preserve its important properties. The cross-entropy minimization method can be used to optimize the TFD [8]. The resulting TFD is a true probability density function and can be used for feature identification [6]. Let us denote the optimized TFD by W''(t, ω).<br />

Authorized licensed use limited to: <strong>Ryerson</strong> <strong>University</strong> Library. Downloaded on July 7, 2009 at 09:49 from IEEE Xplore. Restrictions apply.


3. INTERFERENCE DETECTION: HOUGH AND RADON TRANSFORM<br />

The combined Hough and Radon transform (HRT) is an efficient tool to detect energy-varying directional chirps [10]. In the HRT, the TFD is treated as an image, where each pixel value corresponds to the energy present at a particular time and frequency. For convenience, we will refer to the gray-level image of the optimized TFD W''(t, ω) as f(u, v). The Radon transform (RT) computes the projections of different angles of an image or two-dimensional data distribution f(u, v) measured as line integrals along ray paths. The RT can be expressed as<br />
The RT can be expressed as<br />

R = 11: f(u, u)6(p - ( UCOS~ + UsinO)) dudu, (8)<br />

where 8 is the angle of the ray path of integration, p is the<br />

distance of the ray path from the center of the image and<br />

6 is the Dirac delta function. The equation represents integration<br />

of f (U, U) along the line p = U cos 8 + U sin 8.<br />

The Hough transform (HT) is a paitem-recognition method<br />

that calculates the number of image points that satisfy a<br />

parametric constraint (quadratic interferences are modeled<br />

as second order equations as in [lo]). The HT can be applied<br />

to binary images only. The advantage of the combined<br />

HRT over HT is that it can be applied to gray-level images<br />

where we can detect energy varying chirp components as<br />

well. Once the points on the two-dimensional data distribution<br />

f (U, U) that satisfy directional parametric constraints<br />

are found, we transform the parameters to the TF domain<br />

and threshold the energy values of the TF functions corresponding<br />

to the directional interference on the TF plane. As<br />

illustrated in Figure I, the estimate ofthe interference ;(t) is<br />

reconstructed and subtracted from the received spread spectrum<br />

signal.<br />

4. EXPERIMENTAL RESULTS<br />

In our simulations, we used 128 chips per message bit for spreading the message signal and assumed the channel to be non-dispersive. We considered synthetic linear, quadratic, and multiple (linear and quadratic) chirps as the interference sources. We initially evaluated the bit error rates (BERs) resulting from the presence of constant-amplitude linear and quadratic chirps that sweep the entire frequency band of the spread spectrum signal, for different ISRs in the range of [0, 50] dB. We assumed the SNR to be 10 dB for each case. When the ISR was below 10 dB, the system was able to despread the interference so that no bit errors occurred at the receiver. For ISRs in the range of [10, 50] dB, we suppressed the single and multiple interferences using the proposed excision algorithm before despreading. Multiple interferences included a linear and a quadratic chirp in the same TF domain. We recorded no bit errors after the excision of single and multiple interferences. We repeated the same process<br />


for different SNR values in the range of [-10, 10] dB and also recorded no bit errors. One of the main reasons for this is the accurate TF representation of interferences in the adaptive TF plane and their successful detection and filtering by the HRT. A similar observation was made by Bultan et al. in [11], where they represent linear interferences with good TF localization using adaptive chirplet decomposition. However, they do not report any results on the excision of quadratic and multiple interferences. Other TFD-based methods reported bit errors for similar excision conditions [3, 4, 5]. Since interferences with different power levels were successfully removed from the signal resulting in no bit errors, we evaluated our system by calculating the percentage of chips received in error for various SNR values. Figures 2 and 3 show the simulation results for ISR values of 40 and 5 dB, respectively. The first ISR value is chosen as 40 dB because the system gives around 50% BER (the case when the system cannot reject any part of the interference) when there is no excision.<br />

Figure 2: Probability of chips in error for ISR=40 dB.<br />

The second ISR value is chosen as 5 dB, where the system can reject the interference without pre-processing prior to despreading. In some of the proposed systems, excision of an interference with low power degrades the performance of the system [3], whereas our system substantially improves the chip error rate. For illustration purposes, TFDs of (i) the SS signal with a single interference (ISR = 5 dB), (ii) the detected interference, and (iii) the interference-suppressed SS signal are shown in Figure 4. The experimental results show that the proposed technique can be successfully used for excision of single and multiple-component chirp-like interferences using adaptive TFDs and the HRT, whereas Bultan et al. [11] focus only on suppression of linear chirps and Amin uses different kernels for different interferences [3].<br />



Figure 3: Probability of chips in error for ISR=5 dB.<br />

Figure 4: TFDs of (i) SS signal with a linear interference (chirp), (ii) estimate of the interference, (iii) interference-filtered SS signal.<br />

5. CONCLUSIONS<br />

A new technique is introduced for the excision of frequency-modulated interferences in spread spectrum communications. The localization of the interference is provided by an adaptive signal decomposition algorithm using Gaussian functions as bases and rejecting the cross WVDs of the Gaussian functions. Therefore, single and multiple time-varying FM interferences are represented with good resolution on the TF plane. The interference is then detected using a line detection algorithm, the HRT. The estimated interference is subtracted from the signal prior to despreading. The simulation results for the proposed algorithm showed no bit errors after suppressing the interference for different ISR values, even under very low SNR conditions of -10 dB. The performance of the system is evaluated by calculating the received chips in error before and after interference suppression. The proposed technique improves the performance of the system by reducing the number of chips received in error after excision of the interference in both cases, when the ISR is low or high. This technique can be used for any kind of chirp-like interference suppression with high accuracy.<br />

6. REFERENCES<br />

[1] J.D. Laster and J.H. Reed, "Interference rejection in digital wireless communications," IEEE Signal Processing Mag., pp. 37-62, May 1997.<br />
[2] L. Cohen, "Time-frequency distributions - A review," Proc. IEEE, vol. 77, pp. 941-981, 1989.<br />
[3] M.G. Amin, "Interference mitigation in spread spectrum communication systems using time-frequency distributions," IEEE Trans. Signal Processing, vol. 45, no. 1, pp. 90-101, Jan. 1997.<br />
[4] S. Barbarossa and A. Scaglione, "Adaptive time-varying cancellation of wideband interferences in spread-spectrum communications based on time-frequency distributions," IEEE Trans. Signal Processing, vol. 47, no. 4, pp. 957-965, Apr. 1999.<br />
[5] X. Ouyang and M.G. Amin, "Short-time Fourier transform receiver for nonstationary interference excision in direct sequence spread spectrum communications," IEEE Trans. Signal Processing, vol. 49, no. 4, pp. 851-863, Apr. 2001.<br />
[6] S. Krishnan, "Adaptive Signal Processing Techniques for Analysis of Knee Joint Vibroarthrographic Signals," Ph.D. Thesis, University of Calgary, June 1999.<br />
[7] L. Cohen and T. Posch, "Positive time-frequency distribution functions," IEEE Trans. Acoust., Speech, Signal Processing, vol. ASSP-33, no. 1, pp. 31-38, 1985.<br />
[8] P.J. Loughlin, J.W. Pitton and L.E. Atlas, "Construction of positive time-frequency distributions," IEEE Trans. Signal Processing, vol. 42, no. 10, pp. 2697-2705, Oct. 1994.<br />
[9] S.G. Mallat and Z. Zhang, "Matching pursuits with time-frequency dictionaries," IEEE Trans. Signal Processing, 41(12): 3397-3415, 1993.<br />
[10] R.M. Rangayyan and S. Krishnan, "Feature identification in the time-frequency plane by using the Hough-Radon transform," Pattern Recognition, vol. 34, pp. 1147-1158, 2001.<br />
[11] A. Bultan and A.N. Akansu, "A novel time-frequency exciser in spread spectrum communications for chirp-like interference," Proc. ICASSP-1998, pp. 3265-3268, 1998.<br />

Authorized licensed use limited to: <strong>Ryerson</strong> <strong>University</strong> Library. Downloaded on July 7, 2009 at 09:49 from IEEE Xplore. Restrictions apply.


A GENERAL PERCEPTUAL TOOL FOR EVALUATION OF AUDIO CODECS<br />

Karthikeyan Umapathy<br />

Dept. of Electrical and Computer Eng.,<br />

The <strong>University</strong> of Western Ontario,<br />

London, Ontario, CANADA.<br />

Email: kumapath@uwo.ca<br />

Abstract<br />

Subjective evaluation forms an important part of any<br />
research work where the feedback and perception of the general<br />
public or a trained set of specialists are mandatory. Many<br />
audio and video coding techniques have emerged to tackle<br />
the bandwidth problems imposed by the Internet with data<br />
compression schemes of either lossless or perceptually<br />
lossless quality. In order to evaluate the performance of<br />
these techniques, a Mean Opinion Score (MOS) test has to<br />
be performed with a wide variety of subjects. In this paper we<br />
present a MOS tool developed to evaluate audio codecs<br />
in both controlled and uncontrolled listening environments.<br />
The technique is based on the International Telecommunication<br />
Union - Radiocommunication sector (ITU-R) standard.<br />
This novel approach of performing distributed listening<br />
tests in uncontrolled environments will help researchers<br />
collect substantial feedback and perform statistical analysis<br />
of an audio codec's performance in an efficient manner,<br />
particularly for Internet-driven applications. Results of perceptual<br />
evaluation of 8 sample audio files of different varieties<br />
with an adaptive time-frequency transform (ATFT)<br />
audio codec indicate the ease, independence, and effectiveness<br />
of performing MOS studies with the proposed<br />
technique.<br />

Keywords: mean opinion score (MOS), subjective evalua-<br />

tion, multimedia, audio coding, listening experiments.<br />

1. INTRODUCTION<br />

Subjective evaluation of audio quality is needed to assess<br />
the performance of audio codecs. Even though there are<br />
objective measures such as signal-to-noise ratio (SNR), total<br />
harmonic distortion (THD), and noise-to-mask ratio [1],<br />
they do not give a true evaluation of an audio codec,<br />
particularly one that uses a lossy scheme, as many<br />
existing well-known audio codecs do. For example, in a<br />
perceptual coder SNR is lost, yet the audio quality is<br />
claimed to be perceptually distortionless. In this case the<br />
SNR measure may not give a correct evaluation of the<br />
coder's performance.<br />

Thanks to Minanet and NSERC organizations for funding this project.<br />

CCECE 2003 - CCGEI 2003, Montreal, May/mai 2003<br />

0-7803-7781-8/03/$17.00 © 2003 IEEE<br />

Sridhar Krishnan and Garabet Sinanian<br />

Dept. of Electrical and Computer Eng.,<br />

<strong>Ryerson</strong> <strong>University</strong>,<br />

Toronto, Ontario, CANADA.<br />

Email: (krishnan)(gsinania)@ee.ryerson.ca<br />

The proposed technique uses the subjective evaluation<br />
method recommended by the International Telecommunication<br />
Union - Radiocommunication sector (ITU-R) standards,<br />
called "double blind triple stimulus with hidden<br />
reference" [1]. In this method listeners are provided<br />
with three choices A, B and C for each sample under test. A<br />
is the reference/original signal; B and C are assigned to be<br />
either the reference/original signal or the compressed signal<br />
under test. The selection of the reference or compressed signal<br />
for B and C is completely randomized. Figure 1 explains<br />
the choices A, B and C. For each sample audio signal, subjects<br />
listen to reference signal A, and compare B and C with<br />
A. After each comparison of B with A, and C with A,<br />
they grade the quality of the B and C signals with respect<br />
to A on 5 levels as shown in Table 1. Both the listener and<br />
the test administrator are unaware of which signals B and C<br />
contain, so the test is double-blind, and since<br />
three stimuli are provided it is called the double-blind triple<br />
stimulus method.<br />

Fig. 1. Block diagram explaining MOS choices A, B, and<br />
C.<br />

Audio Quality   | Level of Distortion<br />
Excellent       | Imperceptible<br />
Good            | Just perceptible but not annoying<br />
Fair            | Perceptible and slightly annoying<br />
Poor            | Annoying but not objectionable<br />
Unsatisfactory  | Very annoying and objectionable<br />

Table 1. Description of the ratings used in the Mean Opinion<br />
Score.<br />
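The trial setup above can be sketched as follows; this is a hypothetical illustration of the double-blind triple stimulus assignment and the Table 1 grades, not the authors' actual tool (helper names and file names are made up).<br />

```python
import random

# Table 1 grades: 5 (Imperceptible) down to 1 (Very annoying).
GRADES = {
    5: "Imperceptible",
    4: "Just perceptible but not annoying",
    3: "Perceptible and slightly annoying",
    2: "Annoying but not objectionable",
    1: "Very annoying and objectionable",
}

def make_trial(reference, coded, rng=random):
    """A is always the reference; B and C carry the reference and the
    coded signal in a randomized, hidden order (hidden reference)."""
    pair = [("ref", reference), ("coded", coded)]
    rng.shuffle(pair)
    return {"A": ("ref", reference), "B": pair[0], "C": pair[1]}

trial = make_trial("orig.wav", "atft.wav", rng=random.Random(0))
# Exactly one of B and C is the hidden reference.
hidden = [k for k in ("B", "C") if trial[k][0] == "ref"]
assert len(hidden) == 1
```

Because neither the listener nor the person running the trial sees which of B and C holds the coded signal, grading both against A implements the double-blind comparison described above.<br />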



In this paper, we propose a subjective evaluation scheme<br />

in line with the above explained "double blind triple stim-<br />

ulus with hidden reference" for an adaptive time-frequency<br />

transform (ATFT) based audio codec. The paper is organized<br />
as follows: Section 2 briefly introduces the ATFT coder and<br />
describes the measurement procedure, and Section 3 covers<br />
results and discussions.<br />

2. METHODOLOGY<br />

2.1. Adaptive Time-frequency Transform<br />

(ATFT) codec<br />

The ATFT audio codec is based on the matching pursuit<br />
(MP) [2] algorithm, where any signal $z(t)$ is decomposed<br />
into a linear combination of TF functions $g_{\gamma_n}(t)$ selected from<br />
a redundant dictionary of TF functions:<br />

$z(t) = \sum_{n=0}^{\infty} a_n \, g_{\gamma_n}(t)$,  (1)<br />

where<br />

$g_{\gamma_n}(t) = \frac{1}{\sqrt{s_n}} \, g\!\left(\frac{t - p_n}{s_n}\right) \exp\left(j(2\pi f_n t + \phi_n)\right)$,  (2)<br />

and $a_n$ are the expansion coefficients. The scale factor $s_n$, also<br />
called the octave parameter, is used to control the width<br />
of the window function, and the parameter $p_n$ controls the<br />
temporal placement. The parameters $f_n$ and $\phi_n$ are the frequency<br />
and phase of the exponential function respectively.<br />

The signal $z(t)$ is projected over a redundant dictionary<br />
of TF functions with all possible combinations of scalings,<br />
translations and modulations. The dictionary of TF functions<br />
can be suitably modified or selected based on<br />
the application at hand. In our technique, we use the<br />
Gabor dictionary (Gaussian functions), which has the best<br />
TF localization properties [3]. At each iteration, the TF<br />
function best correlated with the local signal structures is<br />
selected from the dictionary. The remaining signal, called the<br />
residue, is further decomposed in the same way at each iteration,<br />
subdividing it into TF functions. After M iterations, the<br />
signal $z(t)$ can be expressed as<br />

$z(t) = \sum_{n=0}^{M-1} \langle R^n z, g_{\gamma_n} \rangle \, g_{\gamma_n}(t) + R^M z(t)$,  (3)<br />

where the first part of $z(t)$ is the decomposed TF functions<br />
up to M iterations, and the second part is the residue, which<br />
will be decomposed in subsequent iterations. This process<br />
is repeated until the total energy of the signal is decomposed.<br />

The decomposition parameters ($s_n$, $p_n$, $f_n$, $\phi_n$ and<br />
$a_n$) are further processed by applying energy thresholding<br />
and perceptual filtering followed by quantization to obtain<br />
a compact representation of the audio signal. More details<br />
on the ATFT audio coding technique can be found in some of<br />
our earlier works [4, 5]. The overall block diagram of the<br />
ATFT codec is shown in Figure 4. Two versions (standard<br />
and fast) of the MP-algorithm-based ATFT codec were evaluated<br />
using the proposed subjective evaluation technique. A<br />
separate subjective evaluation of the perceptual model and<br />
quantization stage of the ATFT codec is also included.<br />
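The iterative decomposition of Eqs. (1) and (3) can be illustrated with a minimal matching-pursuit sketch; the toy Gabor-like dictionary and all parameter values below are illustrative assumptions, not the codec's actual dictionary.<br />

```python
import numpy as np

# Toy Gabor-like atom: Gaussian window at position u, scale s,
# modulated by a cosine at normalized frequency f, unit-normalized.
def gabor_atom(N, s, u, f):
    t = np.arange(N)
    g = np.exp(-np.pi * ((t - u) / s) ** 2) * np.cos(2 * np.pi * f * t)
    return g / np.linalg.norm(g)

def matching_pursuit(x, atoms, M):
    """At each iteration, subtract the best-correlated atom from the
    residue: R^{n+1}x = R^n x - <R^n x, g> g (cf. Eq. (3))."""
    residue = x.astype(float).copy()
    coeffs = []
    for _ in range(M):
        corr = [atoms[k] @ residue for k in range(len(atoms))]
        k = int(np.argmax(np.abs(corr)))
        a = corr[k]
        residue -= a * atoms[k]
        coeffs.append((k, a))
    return coeffs, residue

N = 128
dictionary = [gabor_atom(N, s, u, f)
              for s in (8, 32) for u in (32, 96) for f in (0.05, 0.2)]
x = 2.0 * dictionary[3] + 0.5 * dictionary[6]
coeffs, res = matching_pursuit(x, dictionary, M=2)
# The residue energy decreases with each iteration.
assert np.linalg.norm(res) < np.linalg.norm(x)
```

The real codec searches over all scalings, translations and modulations of a Gaussian window; this sketch only enumerates a handful of atoms to show the greedy selection step.<br />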

2.2. Measurement procedure<br />

Evaluation of any audio coding technique involves performing<br />
subjective evaluation of the compressed audio quality.<br />
The standard procedure to obtain quantitative and qualitative<br />
data about a coder's performance is to perform<br />
mean opinion score (MOS) studies.<br />

2.2.1. MOS in controlled environment. The experimental<br />
setup consisted of a Pentium III PC with Windows 2000.<br />
Eight sample stereo signals were played through a Creative<br />
Sound Blaster card to a professional high-quality headset<br />
(Sennheiser) with a fixed volume output. A graphical user<br />
interface (GUI) was designed as shown in Figure 2. Three<br />
stimuli A, B, and C are provided as explained in Section<br />
1. Listeners are allowed to do the tests by themselves under<br />
the supervision of the research team. Ratings are automatically<br />
recorded in a report file as the listener proceeds through<br />
all 8 stereo samples. The listener is allowed to<br />
advance to the next sample only after he/she grades the current<br />
sample. Twenty listeners (randomly selected) who gave<br />
consent participated in the MOS studies. Once<br />
testing was finished for all the subjects, the average MOS<br />
scores were computed for each sample. Table 2 shows the<br />
average MOS values obtained for the 8 signals. Figure 3<br />
shows the distribution of the MOS scores for each of the<br />
eight sample signals. It can be observed from Table 2<br />
that classical-like audio signals such as Harp, Harpsichord,<br />
Piano, and Tubularbell received higher MOS scores compared<br />
to the rock-like signals (Acdc, Deflep) and signals with voice<br />
segments (Enya, Visit).<br />


Table 2. Average MOS for 20 listeners.<br />
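The per-sample averaging behind Table 2 is a simple arithmetic mean over listeners; a small sketch with made-up grades (the ratings below are illustrative, not the study's data):<br />

```python
# Grades are on the 1-5 scale of Table 1; one list per sample,
# one entry per listener (values invented for illustration).
ratings = {
    "Harp": [5, 5, 4, 5],
    "Acdc": [3, 4, 3, 3],
}

def mean_opinion_score(grades):
    # MOS for one sample = arithmetic mean of all listener grades.
    return sum(grades) / len(grades)

mos = {name: mean_opinion_score(g) for name, g in ratings.items()}
assert mos["Harp"] == 4.75
```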



Fig. 2. Snapshot of the GUI used for MOS studies<br />


Fig. 3. Distribution of the MOS scores for 20 listeners<br />

In order to evaluate the performance of the developed<br />

perceptual model and the quantization stage of the ATFT<br />

codec, a second MOS study was conducted with 5 listeners.<br />

The procedure was repeated but the choices A, B and C were<br />

assigned as shown in Figure 4. The output of the TF decom-<br />

position (TF modeling stage) forms the input to the percep-<br />

tual filtering module hence the reference A was assigned to<br />

the reconstructed signal at the output of the TF modeling<br />

stage. Similarly choice B was assigned to the reconstructed<br />

signal at the output of the perceptual filtering module and C<br />

to the reconstructed signal at the output of the quantization<br />

stage. Listeners were asked to rate the choices B (percep-<br />

tual filtering output) and C (quantization stage output) with<br />

the reference A (TF modeling output) on a scale of 1 to 5 as<br />

explained in Section 1.<br />

The results were averaged over the five listeners and are given<br />
in Table 3. From Table 3 it can be observed that, on average,<br />
MOS scores of 4.6 and 4.3 were achieved for the perceptual<br />
filtering stage and the quantization stage respectively. The<br />
MOS scores indicate that the perceptual filtering technique<br />
proposed in the ATFT codec performs exceedingly well<br />
on the eight sample signals, and the noise introduced in the<br />
process of quantization affects the output quality minimally.<br />

Sample        PFO   QO<br />
Deflep<br />
Enya          4.2<br />
Harp          4.2<br />
Harpsichord<br />
Piano<br />
Visit<br />
Average<br />

Table 3. Average MOS for PFO and QO. PFO - Perceptually<br />
filtered output; QO - Quantization output.<br />

The whole ATFT audio coding process was also evaluated<br />
with a faster version of the MP algorithm [6]. The faster<br />
version of the MP technique is based upon selecting a set of<br />
best correlated TF functions at each iteration, as opposed to<br />
the one function selected per iteration in the standard<br />
MP. MOS were obtained by testing with 9 subjects, and<br />
the results are given in Table 4.<br />

Sample        Average MOS<br />
Acdc          3.8<br />
Deflep        3.2<br />
Enya          3.7<br />
Harpsichord<br />
Tubularbell   3.8<br />
Visit         2.9<br />

Table 4. Average MOS for 9 listeners of the ATFT codec<br />
with the faster algorithm.<br />

2.2.2. MOS in uncontrolled environment. As most<br />
audio compression formats are aimed at use over the Internet,<br />
it is essential to evaluate the audio codec performance in an<br />
uncontrolled environment using the Internet. The distributed<br />
MOS will give the true performance rating of the audio<br />
quality in terms of acceptance level in an average Internet<br />
listener environment. Many variables are involved in this<br />
MOS approach, such as the quality of the audio hardware,<br />
the audio playback software and the playback volume. However,<br />
the average MOS results will justify the suitability of a<br />
media format over the Internet, as it is tested in a more flexible<br />
environment with a variety of Internet listeners.<br />

A web-based MOS, as shown in Figure 5, was developed<br />
using standard web design tools. Similar to the<br />
standard MOS procedure, a consent form is displayed<br />
when the listener visits the main web page. After accepting<br />
the conditions, the web page is redirected to the actual<br />

MOS testing page. Listeners are provided with three stimuli<br />

Fig. 4. Overall block diagram of the ATFT codec (wideband<br />
audio input, TF modeling, thresholding, perceptual masking<br />
and channel processing stages).<br />

Fig. 5. Snapshot of the GUI used for web based MOS studies.<br />

A, B and C as explained in Section 1. The interactive<br />

web page contains a form to receive the name of the listener<br />

and his rating of each sample. A validation is performed<br />

such that only after entering the name and rating the audio<br />

sample, the listener will be able to navigate to the next sam-<br />

ple. When the listeners select the stimuli, the audio samples<br />

are played using the media players associated with their re-<br />

spective web browsers. At this point in time, all our testing<br />
on the web uses the *.au format. This means the<br />
ATFT audio codec output is converted to the *.au format<br />
for MOS purposes. The standard *.wav and *.au formats,<br />
at a 44.1 kHz sampling rate with 16-bit resolution, are<br />
considered nearly lossless and a gold standard for CD-quality<br />
music. Hence converting the ATFT codec output to<br />
one of these formats should not affect the quality or the MOS<br />
obtained. Once all the samples are rated, the results are<br />
appended to a data file on the main server with the username.<br />
Scripts written in Perl handle the processing of data and<br />
the redirection of web pages.<br />
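The paper's server-side scripts are written in Perl; the following Python sketch shows the equivalent validate-and-append step (the file name and record layout are assumptions, not the actual scripts):<br />

```python
import os
import tempfile

def record_rating(path, username, sample, grade):
    """Append one rating only if the form fields are complete and the
    grade is on the 1-5 MOS scale (mirrors the page's validation)."""
    if not username or not sample:
        raise ValueError("name and sample are required")
    if grade not in (1, 2, 3, 4, 5):
        raise ValueError("grade must be 1-5")
    with open(path, "a") as f:
        f.write(f"{username}\t{sample}\t{grade}\n")

# One line is appended per validated rating.
path = os.path.join(tempfile.mkdtemp(), "mos_results.dat")
record_rating(path, "listener01", "Harp", 5)
record_rating(path, "listener01", "Enya", 4)
with open(path) as f:
    assert len(f.readlines()) == 2
```

Rejecting incomplete submissions before appending is what lets the listener advance only after rating the current sample.<br />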

3. RESULTS AND CONCLUSIONS<br />

The MOS study on the ATFT codec was performed on eight<br />
stereo sample signals in the following modes: 1. Controlled<br />
(a. with the standard ATFT algorithm, b. with the fast ATFT<br />
algorithm, and c. evaluation of the perceptual and quantization<br />
stages with the standard ATFT algorithm); 2. Uncontrolled<br />
(web-based MOS).<br />

Tables 2, 3 and 4 and Figure 3 show the significance<br />
of the proposed MOS study in the evaluation of<br />
audio coders. A broad and clear understanding of the output<br />
audio quality of the codec can be obtained with respect<br />
to (1) the types of signal on which the codec performs well or poorly,<br />
(2) the speed of the algorithm versus output quality and (3)<br />
block-based evaluation of individual parts of the coder. Detailed<br />
subjective testing using the web-based MOS will be<br />
performed to obtain statistically significant results in evaluating<br />
coder performances.<br />

The advantages of web-based MOS studies, such as the<br />
ease of recruiting subjects with diverse music backgrounds,<br />
effectiveness in data/feedback collection, machine and environmental<br />
flexibility, and the ubiquitous availability of personal<br />
computers, will make it an attractive tool for evaluating<br />
the performance of next-generation media compression<br />
techniques over the Internet.<br />

References<br />

[1] Thomas Ryden, "Using listening tests to assess audio codecs,"<br />
in Collected Papers on Digital Audio Bit-Rate Reduction,<br />
AES, 1996, pp. 115-125.<br />

[2] Stephane Mallat, A Wavelet Tour of <strong>Signal</strong> Processing,<br />
Academic Press, San Diego, CA, 1998.<br />

[3] L. Cohen, "Time-frequency distributions - a review,"<br />
Proceedings of the IEEE, vol. 77, no. 7, pp. 941-981, 1989.<br />

[4] Karthikeyan Umapathy and Sridhar Krishnan, "Joint time-frequency<br />
coding of audio signals," in 5th WSES International<br />
Multiconference on CSCC (Circuits, Systems, Communications<br />
and Computers), Crete, Greece, July 2001, pp. 32-36.<br />

[5] Karthikeyan Umapathy and Sridhar Krishnan, "Low bit-rate<br />
coding of wideband audio signals," in Proceedings of the IASTED<br />
International Conference - SPPRA (<strong>Signal</strong> Processing, Pattern<br />
Recognition and Applications), Rhodes, Greece, July 2001,<br />
pp. 101-105.<br />

[6] R. Gribonval, "Fast matching pursuit with multiscale dictionary<br />
of Gaussian chirps," IEEE Transactions on <strong>Signal</strong> Processing,<br />
vol. 49, no. 5, May 2001.<br />



Non-Stationary Noise Cancellation in Infrared Wireless Receivers<br />

Sridhar Krishnan, Xavier Fernando and Hongbo Sun<br />

Department of Electrical and Computer Engineering, <strong>Ryerson</strong> <strong>University</strong>,<br />

Toronto, Ontario, Canada<br />

(krishnan)(fernando)(hsun)@ee.ryerson.ca<br />

Abstract<br />

Infrared is attracting much attention for indoor<br />
wireless access due to its enormous bandwidth,<br />
inherent privacy and low cost. Intensity modulated,<br />
directly detected infrared schemes do not experience<br />
multipath fading. However, ambient noise due to<br />
artificial lighting has been the major concern in<br />
indoor infrared wireless systems. Conventionally,<br />
static or low frequency noise due to conventional light<br />
sources is removed using optical high pass filters.<br />
Nonetheless, interference from fluorescent lights<br />
equipped with electronic ballasts has periodic<br />
interference components up to 1 MHz and cannot be<br />
filtered easily. In this paper, soft DSP filters are<br />
proposed to cancel the harmonics, ambient noise, and<br />
uncorrelated signal structures. Non-stationary noise is<br />
cancelled with an adaptive denoising filter, and a<br />
comb filter cancels interference from the electronic<br />
ballasts. Adaptive soft filters have the advantage that<br />
they can easily be updated to track variations in<br />
noise characteristics. Simulation results show<br />
promising improvement in noise cancellation even<br />
under very low and varied SNR and noise source<br />
conditions.<br />

Keywords: infrared wireless, denoising, non-stationary<br />
signals, comb filters, adaptive filters, wavelet-packets.<br />

1. INTRODUCTION<br />

Wireless communications has entered a new<br />
phase. With each added application, the demand for<br />
real-time, wideband wireless services increases. The<br />
overcrowded radio spectrum is simply unable to cope<br />
with all the demand. Infrared, on the other hand,<br />
is a promising new medium for wireless applications,<br />
especially indoors. Considering the fact that the<br />
need for wideband multimedia-type access is much<br />
higher indoors than outdoors, infrared is an<br />
excellent choice. It has abundant untapped bandwidth<br />
that is freely available. Optical wireless techniques<br />
enjoy increased focus worldwide. The Wi-Fi (IEEE<br />
802.11) standard specifies infrared as a physical layer<br />
option.<br />

Optical energy is inherently confined within a room<br />

cavity resulting in inherent privacy. Therefore, the<br />

same infrared wavelength can be used in adjacent<br />

CCECE 2003 - CCGEI 2003, Montréal, May/mai 2003<br />

0-7803-7781-8/03/$17.00 © 2003 IEEE<br />

rooms, allowing device and frequency reusability.<br />

Furthermore, with Intensity Modulated Directly<br />
Detected (IM/DD) optical schemes, there is no<br />
multipath fading. Fading may degrade the signal<br />
strength by up to 30 dB in comparable radio systems.<br />
However, ambient noise due to artificial lighting has<br />
been the major concern in indoor infrared wireless<br />
systems [1]. This background light induces a white<br />
Gaussian shot noise that is 20 to 40 dB stronger than the<br />
signal-induced shot noise. Furthermore, modern<br />
fluorescent lights with electronic ballasts generate<br />
switching noise up to 1 MHz, which introduces a much<br />
more serious impairment. At times, the receiver thermal<br />
noise becomes dominant. The time-varying wireless<br />
channel determines the weights of the other noise sources. As<br />
a result, infrared wireless receivers experience high<br />
levels of non-stationary noise.<br />

The objective of this paper is to develop signal<br />

processing algorithms to combat the high power non-<br />

stationary noise. Adaptive filters based on the least<br />

mean squares (LMS) and wavelet-packet based filters<br />

effectively cancel the noise in a non-stationary<br />

environment. A higher order comb filter cancels the<br />

periodic noise from the electronic ballast.<br />

2. NOISE AT INFRARED RECEIVERS<br />

Infrared receivers face the challenge of a variety<br />
of noise sources, the details of which are shown in<br />
Fig. 1. There will be thermal noise from the electronic<br />
devices. This can be modeled as white Gaussian noise,<br />
and is relatively easy to tackle.<br />

Indoor infrared transmission systems are affected by<br />
interference induced by natural and artificial ambient<br />
light. The noise is directly proportional to the amount<br />
of light incident on the photo-detector and is therefore a<br />
function of the average optical power. The shot noise is<br />
due to the mean received infrared power and the ambient<br />
light. However, the ambient-light-induced shot<br />
noise typically has a power 20 to 40 dB greater<br />
than the signal shot noise [2]. Therefore, the signal-induced<br />
shot noise can be neglected. The ambient-induced<br />
shot noise can be considered Gaussian and<br />
nearly white.<br />



Natural ambient light noise is caused by sunlight. It<br />

can be considered steady with slow intensity variations<br />

in time. Artificial ambient light comes from several<br />

light sources: incandescent lamps, fluorescent lamps<br />

driven by conventional ballasts and fluorescent lamps<br />

geared by electronic ballasts. The use of fixed optical<br />

filters reduces out of band ambient light noise. The<br />

steady background irradiance produced by natural and<br />
artificial light sources is usually characterized by a<br />
direct current induced at the receiver photodiode that is<br />
directly proportional to the irradiance. This current is<br />
referred to as the background noise current.<br />

The interfering signal produced by incandescent lamps<br />
is an almost perfect sinusoid with a frequency of 100<br />
Hz. In addition to the 100 Hz sinusoid, only the first<br />
harmonics (up to 2 kHz) carry a significant amount of<br />
energy, and for frequencies higher than 800 Hz, all<br />
components are 60 dB below the fundamental. So an<br />
electrical high pass filter can eliminate this<br />
interference without much signal degradation.<br />
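The high-pass step can be sketched with a simple one-pole digital high-pass filter; the sample rate, pole location and test signals below are illustrative assumptions, not the system's actual parameters.<br />

```python
import numpy as np

# One-pole high-pass: H(z) = (1 - z^-1) / (1 - alpha z^-1).
# With alpha = 0.9 at fs = 1 MHz the cutoff is ~16 kHz, so the slow
# 100 Hz lamp interference is strongly attenuated while a fast data
# signal passes.
def highpass(x, alpha=0.9):
    y = np.zeros_like(x)
    for n in range(1, len(x)):
        y[n] = x[n] - x[n - 1] + alpha * y[n - 1]
    return y

fs = 1_000_000                                   # assumed sample rate
t = np.arange(4096) / fs
lamp = np.sin(2 * np.pi * 100 * t)               # 100 Hz interference
data = np.sign(np.sin(2 * np.pi * 50_000 * t))   # 50 kHz square wave
out = highpass(lamp + data)
# By linearity, the lamp component surviving in `out` is highpass(lamp),
# which is small compared to the unit-amplitude interference.
residual_lamp = out - highpass(data)
assert np.max(np.abs(residual_lamp)) < 0.05
```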

Fluorescent lamps equipped with conventional<br />
ballasts, driven at a power-line frequency of 50 or 60 Hz,<br />
induce interference at harmonics up to 50 kHz. This<br />
can also be eliminated by a careful choice of modulation<br />
scheme, to ensure there are no low frequency<br />
components, and through electrical high pass filtering.<br />

For fluorescent lamps equipped with electronic<br />
ballasts, the ballast modulation frequency itself is 37.5<br />
kHz. Therefore, interference harmonics extending up<br />
to 1 MHz are introduced. These cannot be easily<br />
filtered. In the case of interference overlapping the signal<br />
spectrum, sophisticated digital signal processing<br />
algorithms need to be developed, and this is the focus<br />
of this paper.<br />

3. METHODOLOGY<br />

The spectrum produced by an electronic-ballast-driven<br />

lamp consists of low and high frequency regions. The<br />

low frequency region resembles the spectrum of a<br />

conventional fluorescent lamp while the high<br />

frequency region is attributable to the electronic<br />

ballast. A deterministic expression that models the<br />

interfering signal at the output of the photodiode is given in [2],<br />
where R is the photodiode responsivity (A/W), Pm is<br />
the mean optical power of the interfering signal, K1 =<br />
5.9, K2 = 2.1, {b}, {a} and {d} are constants, and the<br />
frequency corresponding to the lamp type is fh = 37.5<br />
kHz.<br />

Fig. 1. Block diagram of the noise removal technique at<br />
infrared wireless receivers (noise sources: fluorescent lamps<br />
with conventional and electronic ballasts, thermal white<br />
noise, and shot noise due to the sun).<br />

The even harmonics of 37.5 kHz correspond to<br />
75 kHz, 150 kHz, 225 kHz, 300 kHz, 375 kHz, 450<br />
kHz, 525 kHz, 600 kHz, 675 kHz, 750 kHz, and 825<br />
kHz.<br />
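A comb filter tuned to the 37.5 kHz fundamental nulls all of these harmonics at once. The sketch below assumes a sample rate chosen so the fundamental period is an integer number of samples (an illustrative choice, not the receiver's actual rate).<br />

```python
import numpy as np

# FIR comb: y[n] = x[n] - x[n-D]. With D samples spanning exactly one
# 37.5 kHz period, the filter has nulls at 37.5 kHz and every harmonic.
def comb_filter(x, D):
    y = x.copy()
    y[D:] -= x[:-D]
    return y

fs = 600_000                 # assumed: fs / 37500 = 16 is an integer
D = fs // 37500
t = np.arange(2048) / fs
# Synthetic periodic ballast interference: several harmonics of 37.5 kHz.
ballast = sum(np.sin(2 * np.pi * k * 37500 * t) for k in (1, 2, 4, 8))
y = comb_filter(ballast, D)
# After the first D samples, the periodic interference is cancelled.
assert np.max(np.abs(y[D:])) < 1e-9
```

A higher-order comb (cascading this stage) narrows the notches so less of the data spectrum around each harmonic is removed.<br />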

If the low frequency and high frequency ambient<br />
signal model matches the practical case, we can use<br />
this model and apply the adaptive noise cancellation<br />
method to eliminate this noise, because the noise is<br />
uncorrelated with our desired signal. The adaptive noise<br />
cancellation method can also eliminate white Gaussian<br />
noise (the thermal noise model). The advantage of<br />
using the adaptive noise cancellation method is that we<br />
can improve the SNR if the reference signal is<br />
correlated with the noise contained in the primary signal<br />
but uncorrelated with the desired signal. The limitation of<br />
adaptive filtering is that if the reference signal is<br />
completely uncorrelated with both the signal and noise<br />
components of the primary signal, the adaptive noise<br />
canceller has no effect on the primary signal, and the<br />
output signal-to-noise ratio remains unchanged. We<br />
used the well-known LMS algorithm [3] to remove<br />
the thermal noise from the signal. The ease of<br />
implementation of the LMS algorithm is achieved at<br />
the expense of convergence rate. To accelerate the<br />
convergence rate of the LMS algorithm, the step-size<br />
parameter was selected as a function of the eigenvalues<br />
of the autocorrelation matrix of the input signal<br />
(which is dependent on the instantaneous energy of the<br />
signal).<br />
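The LMS canceller described above can be sketched as follows; the step size, filter length and simulated noise channel are illustrative assumptions, not the paper's implementation.<br />

```python
import numpy as np

# Adaptive noise cancellation: the reference input is correlated with
# the noise in the primary input but uncorrelated with the desired
# signal, so the error output converges to the cleaned signal.
rng = np.random.default_rng(0)
N, L, mu = 20_000, 8, 0.01

desired = np.sign(np.sin(2 * np.pi * 0.01 * np.arange(N)))  # data signal
ref = rng.standard_normal(N)                 # noise reference channel
h = np.array([0.6, -0.3, 0.2, 0.1, 0.0, 0.05, 0.0, 0.02])
noise = np.convolve(ref, h)[:N]              # noise reaching the receiver
primary = desired + noise

w = np.zeros(L)
out = np.zeros(N)
for n in range(L, N):
    x = ref[n - L + 1:n + 1][::-1]           # reference tap vector
    e = primary[n] - w @ x                   # error = cleaned output
    w += mu * e * x                          # LMS weight update
    out[n] = e

# After convergence the output is much closer to the desired signal.
tail = slice(N // 2, N)
before = np.mean((primary[tail] - desired[tail]) ** 2)
after = np.mean((out[tail] - desired[tail]) ** 2)
assert after < before / 5
```

A normalized step size (dividing mu by the tap vector's instantaneous energy) is one simple way to realize the energy-dependent step-size selection mentioned above.<br />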

In cases where the reference channel is not available, or<br />
where the reference signal is uncorrelated with the noise in<br />
the primary channel, signal decomposition<br />
techniques can be a better alternative. In cases where<br />

signal and noise spectra overlap, fixed filtering or<br />
adaptive filtering of the noise may not be the best<br />
approach. In such a situation, noise filtering using<br />
mathematical decomposition techniques might be the<br />
best alternative; such methods are commonly<br />
known as de-noising techniques. The de-noising<br />
approach has been successfully applied to data such<br />
as knee sounds [4] and ultrasound signals [5]. The<br />
problem of enhancing signals degraded by<br />
uncorrelated additive noise, when the noisy signal<br />
alone is available, has received much attention in<br />
recent years [6-10]. The main problem arises when the<br />
de-noising filters cannot distinguish between noise and<br />
noise-like important signal components, and remove<br />
both, thereby decreasing the intelligibility of the signal.<br />
Among the mathematical transformation techniques, a<br />
time-frequency (TF) decomposition technique might<br />
be a suitable choice since it exploits the simultaneous<br />
overlap in the time and frequency domains, and filters the<br />
noise accordingly.<br />

The complexity of the structures present at the infrared<br />
wireless receiver requires the development of adaptive,<br />
low-level representations in order to provide a<br />
meaningful analysis. In Fourier analysis, the sine and<br />
cosine basis functions are not suitable for capturing<br />
subtle changes in signals because of their inability to<br />
localize time information. A signal<br />
representation using basis functions that can capture<br />
both temporal and spectral information is more<br />
useful. <strong>Signal</strong> representations such as wavelets and<br />
wavelet packets can provide this information. The<br />
signal decomposition techniques considered in<br />
this paper are wavelet-packets; they are briefly<br />
described in the subsequent sections.<br />

In wavelets, any signal can be decomposed into<br />

components with good time and scale properties.<br />

Wavelets have the advantage to express any signal<br />

with a fewer coefficients. The design of basis functions<br />

must be optimized, so that the number of non-zero<br />

coefficients will be minimum and the input signal is<br />

approximated by projecting it over M basis functions<br />

selected adaptively. It is represented as follows:<br />

x(t) = Σ_{m ∈ I_M} ⟨x, g_m⟩ g_m<br />

where x(t) is the signal to be decomposed, and ⟨x, g_m⟩<br />

denotes the inner product between the signal and the<br />

basis functions g_m. The basis functions are obtained by<br />

shifting and scaling a prototype<br />

function called the mother wavelet, given by:<br />

ψ_{s,u}(t) = (1/√s) ψ((t − u)/s)<br />

CCECE 2003 - CCGEI 2003, Montréal, May/mai 2003<br />

0-7803-7781-8/03/$17.00 © 2003 IEEE<br />


where s is the scale parameter, and u is the translation<br />

parameter. Wavelet analysis uses long time intervals for<br />

low frequency detailed analysis and short time<br />

intervals for high frequency information. That offers<br />

good frequency resolution at low frequencies and good<br />

time resolution at high frequencies.<br />

The main difference between wavelet and<br />

wavelet packet analysis is that the latter allows an<br />

adjustable resolution of frequencies through filter bank<br />

decomposition. It splits the whole spectrum into two<br />

equal bands at different levels, obtaining a general tree<br />

structure that is called the wavelet packet expansion.<br />

Basis functions are generated with an<br />

algorithm that uses quadrature mirror filter (QMF)<br />

banks, and divides the spectrum as a tree with multiple<br />

branches. Wavelet packets allow searching for the<br />

optimum decomposition of the binary tree by looking for<br />

the branch with the best entropy criterion of the input<br />

signal [7]. Once the wavelet or the wavelet packet<br />

decomposition of the signal is achieved, the next step<br />

is thresholding the resulting coefficients. This can be<br />

done in two ways, hard and soft thresholding.<br />

If w_m denotes the wavelet/wavelet packet<br />

coefficients, then hard thresholding [7] is applied as:<br />

ŵ_m = w_m if |w_m| ≥ T, and ŵ_m = 0 otherwise,<br />

where T is the selected threshold.<br />

To avoid the effect of certain de-noising filters<br />

that remove the sharp features of the signals along with<br />

important components, soft thresholding discards<br />

terms with a small or insignificant contribution to the<br />

information.<br />

Soft thresholding is performed as:<br />

ŵ_m = sgn(w_m)(|w_m| − T) if |w_m| ≥ T, and ŵ_m = 0 otherwise.<br />

Different methods are used for selecting the<br />

best threshold T and also rescaling the coefficients<br />

according to the noise level.<br />
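The hard and soft rules above reduce to two one-line operations on the coefficient array. The sketch below is illustrative (the coefficient values and the threshold are made up, and the wavelet/wavelet-packet coefficients are assumed to have been computed already):

```python
import numpy as np

def hard_threshold(w, T):
    """Hard thresholding: keep w_m unchanged when |w_m| >= T, else zero it."""
    return np.where(np.abs(w) >= T, w, 0.0)

def soft_threshold(w, T):
    """Soft thresholding: zero small terms AND shrink survivors toward zero by T."""
    return np.sign(w) * np.maximum(np.abs(w) - T, 0.0)

# Toy coefficient vector (illustrative values, not from the paper's data)
w = np.array([0.05, -0.3, 1.2, -0.8, 0.01])
print(hard_threshold(w, 0.5))   # small terms zeroed; 1.2 and -0.8 kept as-is
print(soft_threshold(w, 0.5))   # survivors are additionally shrunk by 0.5
```

Soft thresholding avoids the discontinuity at |w_m| = T, which is why it better preserves sharp signal features after reconstruction.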

4. RESULTS and CONCLUSIONS<br />

As described in Section 3, three noise removal<br />

experiments were performed: (1) removal of high<br />

Authorized licensed use limited to: Ryerson University Library. Downloaded on July 7, 2009 at 10:36 from IEEE Xplore. Restrictions apply.


frequency periodic interference due to fluorescent<br />

lamps equipped with electronic ballasts (2) removal of<br />

ambient noise of random nature with adaptive<br />

algorithms such as the LMS, and (3) automatic<br />

denoising of uncorrelated structures in a received<br />

signal by using the wavelet-packet technique.<br />

Removal of High Frequency Periodic Interference:<br />

Periodic interference in the signal due to electronic<br />

ballasts has even harmonics extending up to 1 MHz.<br />

In Fig. 2 a synthetic signal with periodic interference is<br />

shown. As the periodic interference is represented as<br />

spectral peaks in the signal's spectrum, a series of<br />

notch filters have to be designed to attenuate these<br />

spectral peaks. As spectral attenuation can be achieved<br />

by placing zeros on the unit circle or close to the unit<br />

circle in the z-plane at the exact frequency points, a<br />

linear phase finite impulse response (FIR) filter was<br />

determined to be the best option. Detailed analysis for<br />

the right filter type revealed that a 26th-order FIR filter<br />

could totally suppress the harmonic interference<br />

present in the signal caused by electronic ballasts.<br />

The magnitude and phase responses of the filter are<br />

shown in Fig. 3. It can easily be seen that the filter<br />

has linear phase response, and the magnitude response<br />

clearly represents the comb filter characteristics. The<br />

output of the filter is shown in Fig. 4, and it is evident<br />

that FIR comb filtering satisfactorily reduces the<br />

harmonics due to electronic ballast interference.<br />
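A linear-phase FIR comb with zeros on the unit circle can be sketched, in its simplest form, as y[n] = x[n] − x[n−N], which places N zeros evenly at f_k = k·fs/N. The sampling rate below and the alignment of the nulls with the ballast harmonics are illustrative assumptions, not the paper's actual design:

```python
import numpy as np

fs = 1.0e6   # assumed sampling rate, purely illustrative
N = 26       # comb order: one spectral null every fs/N

# Impulse response of y[n] = x[n] - x[n-N]: a linear-phase FIR comb whose
# N zeros sit evenly on the unit circle at f_k = k*fs/N, k = 0..N-1.
h = np.zeros(N + 1)
h[0], h[N] = 1.0, -1.0

n = np.arange(2000)
interferer = np.cos(2 * np.pi * (fs / N) * n / fs)   # harmonic sitting on a null
filtered = np.convolve(interferer, h, mode="valid")

print(np.max(np.abs(filtered)))   # essentially zero: the harmonic is notched out
```

Any interference component falling exactly on one of the N nulls is cancelled; components between nulls (including the desired signal) pass with only the comb's ripple.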

Removal of Ambient Noise:<br />

In this study, an infrared wireless system was modeled<br />

with a signal of interest, and noise was added at<br />

different power levels. The desired response in the<br />

training stage of the filter was assumed to be a delayed<br />

version of the clean signal free of ambient noise. A<br />

12th-order transversal filter was trained with the LMS<br />

adaptive filter algorithm. The step size parameter in<br />

the LMS that governs the convergence rate of the<br />

algorithm was selected in an adaptive manner, in such<br />

a way that the step size is inversely proportional to the<br />

instantaneous energy of the signal. It was found that<br />

the step size selected in this manner provides an<br />

optimal convergence suitable in an infrared wireless<br />

communications environment. Fig. 5 shows the<br />

original and the denoised signal by using this<br />

approach. It can be seen in panel B that a clear<br />

signal is obtained, but the convergence of the LMS has<br />

caused some transient-like disturbances in the filtered<br />

output. The transient-like disturbance was minimized<br />

by selecting the step size based on the instantaneous<br />

energy of the signal of interest.<br />
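Choosing the step size inversely proportional to the instantaneous input energy is the normalized-LMS (NLMS) idea. The sketch below illustrates it on a toy system-identification setup; the training signal, noise level, base step size and the "unknown" system are assumptions for illustration, not the paper's experiment:

```python
import numpy as np

rng = np.random.default_rng(0)
M = 12                # transversal filter order, as in the text
mu0, eps = 0.5, 1e-6  # base step size; eps guards the division

# Toy stand-in for the training stage: the desired response is a filtered
# version of the reference input plus a little measurement noise.
x = rng.standard_normal(5000)
h_true = 0.3 * rng.standard_normal(M)                 # assumed "unknown" system
d = np.convolve(x, h_true)[:x.size] + 0.05 * rng.standard_normal(x.size)

w = np.zeros(M)
err = np.zeros(x.size)
for n in range(M, x.size):
    u = x[n - M + 1:n + 1][::-1]        # current tap vector [x[n], ..., x[n-M+1]]
    e = d[n] - w @ u
    # Step size inversely proportional to the instantaneous input energy (NLMS)
    w += (mu0 / (eps + u @ u)) * e * u
    err[n] = e

print(np.mean(err[M:512] ** 2), np.mean(err[-500:] ** 2))   # error power shrinks
```

Normalizing by `u @ u` keeps the effective step small when the input is energetic and larger when it is weak, which is what stabilizes convergence under non-stationary power levels.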


Removal of Uncorrelated Signal/Noise Structures:<br />

In case of noise spectra overlapping with the signal<br />

spectrum, and where the reference channel is not<br />

available, or if the reference signal is uncorrelated with<br />

the noise in the primary channel, then signal<br />

decomposition techniques could be a better alternative.<br />

As explained in Section 3, wavelet-packet techniques<br />

are promising tools for removal of structures that are<br />

not correlated to the signal of interest. Wavelet-packet<br />

analysis picks the best basis functions by using the entropy<br />

optimization criterion. In this study, Coiflet,<br />

Daubechies, Haar and Symlet wavelets were tried as<br />

choices with soft thresholding criteria, and among<br />

them Daubechies (db6) seemed to outperform the other<br />

commonly used wavelets in terms of removing<br />

structures that are not relevant in an infrared wireless<br />

receiver system. Fig. 6 shows the original and the<br />

denoised signals, and it can clearly be seen that<br />

wavelet-packet has performed extremely well in<br />

removing irrelevant components from the signal of<br />

interest.<br />

Fig. 2: Fluorescent lamp geared by electronic ballasts.<br />

Fig. 3: Magnitude and phase response of FIR Comb<br />

filter of order 26.


Fig. 4: Output of the Comb filter.<br />


Fig. 5: Original ambient noise and LMS<br />

filtered output.<br />


Fig. 6: Original and wavelet-packet denoised signals.<br />


References<br />

[1] R. Narasimhan, M. D. Audeh and J. M. Kahn. Effect of<br />

electronic-ballast fluorescent lighting on wireless<br />

infrared links, IEE Proc.-Optoelectron., Vol. 143, No.<br />

6, December 1996.<br />

[2] A. J. C. Moreira, R. T. Valadas and A. M. de Oliveira<br />

Duarte. Optical interference produced by<br />

artificial light, Wireless Networks, Vol. 3, 1997, pp.<br />

131-140.<br />

[3] S. Haykin. Adaptive Filter Theory, Prentice Hall,<br />

New Jersey, 2002.<br />

[4] S. Krishnan and R. Rangayyan. Automatic de-<br />

noising of knee joint vibration signals using adaptive<br />

time-frequency representations, Medical and<br />

Biological Engineering and Computing, Vol. 38, No. 1,<br />

pp. 2-8, January 2000.<br />

[5] S. Johnston, A. Diaz and S. Doctor. De-noising of<br />

ultrasonic signals backscattered from coarse-<br />

grained materials: wavelet processing and<br />

maximum-entropy reconstruction, 67th Annual<br />

Meeting of the Southeastern Section of the American<br />

Physical Society.<br />

[6] X. Xie and J. Kuang. A noise canceller for mobile<br />

communications utilizing time-frequency analysis,<br />

Fourth Asia-Pacific Conference on Optoelectronics and<br />

Communications, Vol. 1, pp. 504-507, October 1999.<br />

[7] D. Donoho. Nonlinear wavelet methods for<br />

recovery of signals, densities, and spectra from<br />

indirect and noisy data, Proceedings of Symposia in<br />

Applied Mathematics, pp. 173-205, 1993.<br />

[8] M. Bahoura and J. Rouat. Wavelet speech<br />

enhancement based on the Teager energy operator,<br />

IEEE Signal Processing Letters, Vol. 8, No. 1, January 2001.<br />

[9] N. Virag. Single channel speech enhancement<br />

based on masking properties of the human auditory<br />

system, IEEE Transactions on Speech and Audio<br />

Processing, Vol. 7, Issue 2, March 1999.<br />

[10] L. Arslan, A. McCree and V. Viswanathan. New<br />

methods for adaptive noise suppression, Proceedings<br />

of the International Conference on Acoustics, Speech<br />

and Signal Processing, Vol. 1, pp. 812-815, Detroit,<br />

USA, May 1995.<br />



Adaptive denoising at Infrared wireless receivers<br />

Xavier N. Fernando, Sridhar Krishnan, Hongbo Sun and Kamyar Kazemi-Moud<br />

Department of Electrical and Computer Engineering, Ryerson University<br />

Toronto, ON, M5B 2K3, Canada<br />

(fernando@ee.ryerson.ca)<br />

ABSTRACT<br />

This paper proposes an innovative approach for noise cancellation at infrared (IR) wireless receivers. Ambient noise due<br />

to artificial lighting and the sun has been a major concern in infrared systems. The background-induced shot noise<br />

typically has a power 20 to 40 dB higher than the signal-induced shot noise and varies with time. Due to these<br />

changing conditions, infrared wireless receivers experience high levels of non-stationary noise. The objective of the work<br />

mentioned in this paper is to develop digital signal processing algorithms at the infrared wireless system to combat high<br />

power non-stationary noise. The noisy signal is decomposed, using a joint time-frequency representation such as<br />

wavelets and wavelet packets, into transform domain coefficients, and the lower order coefficients are removed by<br />

applying a threshold. The denoised version is obtained by reconstructing the signal with the remaining coefficients. In this<br />

paper, we evaluate different wavelet methods for denoising at an infrared wireless receiver. Simulation results indicate<br />

that if the noise is uncorrelated with the signal and the channel model is unavailable, the wavelet denoising method with<br />

different wavelet analyzing functions improves the signal-to-noise ratio (SNR) from 4 dB to 7.8 dB.<br />

Keywords: optical wireless, infrared, receiver, noise, wavelet transform, denoising<br />

1. INTRODUCTION<br />

The emerging technologies like mobile portable computing and multimedia terminals at living and work environments<br />

are the main forces driving companies, scientists and researchers to progress in the challenging field of wireless local<br />

area networks (WLAN). The need for higher speed and wider bandwidth in data communication networks is gradually<br />

shifting the transmission medium from electrical to optical. Wireless infrared LANs are an important part of indoor transmission<br />

systems and enable high bit-rate data transfer over short distances [1]. Infrared systems occupy no radio frequency<br />

(RF) spectrum and they can be used where electromagnetic interference is critical. The infrared spectral region offers a<br />

large, virtually unlimited, bandwidth that is unregulated worldwide. Since infrared communications are confined to<br />

rooms, there is no interference between communication systems operating in different rooms, which results in secure<br />

communications. In contrast to RF transmission systems, the light is reflected diffusely on the wall surfaces of the rooms<br />

and the channel estimation will be a non-trivial subject for infrared systems. A non-directed wireless optical<br />

communication system can be either line-of-sight (LOS) or diffuse. A LOS system is designed under the assumption that<br />

the LOS path between transmitter and receiver is unobstructed. A diffuse system is defined as one which does not rely<br />

upon the LOS path, but instead relies on reflections from a large diffusive reflector such as the ceiling. In both cases, an<br />

optical signal in transit from transmitter to receiver undergoes temporal dispersion due to reflections from walls and<br />

other reflectors; the intersymbol interference (ISI) that results is an impediment to communication at high speeds. Single<br />

diffuse infrared links can operate with bit rates as high as 100 Mb/s [2]. Since it is possible to operate at least one<br />

infrared link in every room of a building without interference, the potential capacity of an infrared-based network is<br />

extremely high. The propagation characteristics of diffuse infrared signals resemble those of radio signals. The measured<br />

received power at different positions using a photodetector much smaller than the light wavelength will result in<br />

multipath fading like fluctuations in received power. In the real diffuse infrared systems, however, the detector size is<br />

much larger than the wavelength, so that the multipath fading like power fluctuations are averaged out effectively. While<br />

multipath propagation does not lead to fading, it causes temporal dispersion. The tail caused by higher order taps of the<br />

indoor channel impulse response induces ISI in high bit-rate communications.<br />

Infrared Technology and Applications XXIX, Bjørn F. Andresen, Gabor F. Fulop, Editors,<br />

Proceedings of SPIE Vol. 5074 (2003) © 2003 SPIE · 0277-786X/03/$15.00<br />



Indoor infrared transmission suffers from a number of impairments, the most important ones being shot noise<br />

from the ambient light and restricted symbol rate due to multipath dispersion. Noise plays a severe role in the<br />

performance of wireless infrared networks. Background illumination has two distinct effects in the performance of<br />

optical receivers; one is noise due to the steady and invariant irradiance from undesired light sources, which results in<br />

shot noise at the photodetector, the other one is interference generated by high frequency components of some light<br />

sources. Typically, natural and artificial ambient light contribute to high levels of shot noise in a photodetector which<br />

degrades the performance of the transmission system. For data-rates up to 10 Mbps, the major degrading factor of the<br />

infrared communication systems is the shot noise induced in the receiver due to ambient light. Unfortunately, ambient<br />

light sources (sunlight and artificial light) also radiate in the same spectral wavelengths used by infrared transducers.<br />

Thus shot noise presents a strong spatial and temporal dependence. Several advanced techniques for the design of nondirected<br />

wireless infrared communication systems have already been proposed in order to minimize these signal-to-<br />

noise ratio (SNR) fluctuation effects. These ambient light levels to a significant degree determine the optical power<br />

required for reliable transmission. The shot noise induced by ambient light may vary over several decades during a day<br />

in a typical indoor environment.<br />

The interfering signal from the fluorescent light is periodic and deterministic. The spectrum of fluorescent<br />

lights driven by electronic ballasts may extend up to frequencies around 1MHz interference of which will cause serious<br />

degradation at infrared receivers even after high-pass electrical filtering [3-5].<br />

The objective of the work mentioned in this paper is to develop a digital signal processing algorithm at the<br />

infrared wireless system to combat uncorrelated noise without a reference channel model. In Section 2, we introduce and<br />

classify different noise sources at the infrared receivers [3]. Section 3 will focus on the definition of wavelet transform<br />

and analyzing functions which will be used in Section 4 to introduce a new methodology for noise cancellation. The new<br />

wavelet-based denoising technique and the results of wavelet denoising are discussed in Section 4. The conclusions are<br />

provided in Section 5.<br />

2. NOISE AT THE RECEIVERS<br />

Noise in the infrared optical receivers is a critical parameter of performance analysis. There are different sources of noise<br />

that contribute to overall performance of the wireless network link. Thermal noise of the photodetector is dominant for<br />

weak steady background illumination. Thermal noise is critically dependent on the front-end design of the receiver (e.g.<br />

preamplifier). Shot noise is induced by the quantum nature of photons randomly arriving at the photodetector. It is<br />

proportional to the average received optical power. Natural and artificial background light may come from different light<br />

sources. Different background noise source contributions are from sun, incandescent lamps, fluorescent lamps with<br />

conventional ballasts and electronic ballasts. The slow variations in intensity of the light coming from the Sun make it a<br />

strong source of shot noise. The spectrum of natural light coming from the Sun on a sunny day is spread over the entire<br />

responsivity curve of the PIN photodetector, resulting in a steady background noise current on the order of a mA, stronger<br />

than a well artificially illuminated room. Shot noise is larger under directional lamps and near windows exposed to<br />

sunlight. Furthermore, it can vary drastically during a normal day with the position of the sun and with the indoor<br />

lighting conditions. Due to the temporal variation and directional nature of both signal and noise, the SNR at the receiver<br />

can vary significantly.<br />

Artificial light sources also contribute to shot noise as well as interference at the infrared receiver. Incandescent<br />

lamp interference is periodic with a frequency of 100 Hz. Its spectrum has frequency components up to 2 kHz.<br />

Harmonics at frequencies higher than 800 Hz do not carry a significant amount of energy and they are 60 dB<br />

below the fundamental harmonic.<br />

In case of the incandescent lamps the amplitude of the interference is one tenth of the current generated by the<br />

slow variations of intensity. <strong>Research</strong>ers have already extracted an experimental interference model for typical<br />

incandescent lamps [3].<br />



Fluorescent lamps equipped with conventional ballasts, driven at power-line frequencies of 50 or 60 Hz, induce<br />

interference at harmonics up to 20 kHz. This interference is periodic with a frequency of 50 Hz and its harmonics are 50<br />

dB below the 100 Hz component for frequencies higher than 5 kHz. The interference amplitude in this case is 2 to 6 times<br />

lower than the shot noise current. An interference model for the fluorescent lamps driven by conventional ballasts has also<br />

been extracted experimentally [3].<br />

The fluorescent lamps with electronic ballasts have higher power efficiencies and use the same concept as<br />

switching power supplies. Interference generated by fluorescent lamps with electronic ballasts has a lower amplitude<br />

compared to other types of ambient light, but its spectrum is very broad and has frequency harmonics up to 1 MHz. The<br />

spectrum produced by an electronic-ballast-driven lamp consists of low and high frequency regions. The low frequency<br />

region resembles the spectrum of a conventional fluorescent lamp while the high frequency region is attributable to the<br />

electronic ballast. These two components of the spectrum have been modeled using the same experimental approaches as<br />

the other noise sources [3].<br />

In these model equations the relative amplitude and phase of the harmonics can be easily identified. For<br />

different classes of lamps, all the average parameters for the interference models can be easily identified [3]. Several schemes<br />

have been proposed in order to reduce the power penalty induced by ambient artificial light sources in an indoor infrared<br />

wireless system [5-7].<br />

3. WAVELET TRANSFORM<br />

Wavelets are functions that satisfy certain mathematical requirements and are used in representing data or other<br />

functions. The wavelet analysis procedure is to adopt a wavelet prototype function, called an "analyzing wavelet". The<br />

wavelet transform has become a powerful tool for signal analysis and is widely used in many applications, including<br />

signal detection and denoising.<br />

The complexity of structures present at infrared wireless receivers requires the development of an adaptive,<br />

low-level representation in order to provide a meaningful analysis of the system. In Fourier analysis, the basis functions are sine<br />

and cosine, which are not suitable for capturing the subtle changes of received signals at the infrared receivers because of<br />

their inability to localize the temporal information.<br />

The wavelet transformation is a time-frequency decomposition technique and with the choice of smooth multiresolution<br />

wavelet analyzing functions that use long time intervals for capturing low frequency information of the<br />

desired signal and short time intervals for high frequency information, one can have a joint temporal and spectral<br />

representation of that signal. Temporal analysis is performed with a contracted, high-frequency version of the prototype<br />

wavelet, while frequency analysis is performed with a dilated, low-frequency version of the prototype wavelet. Because<br />

the original signal or function can be represented in terms of a wavelet expansion (using linear combination of the<br />

coefficients and the wavelet basis functions), data operations can be performed using just the corresponding wavelet<br />

coefficients. In wavelet transformation, any signal can be decomposed into components with good time and scale<br />

properties. Wavelets have the advantage of expressing any signal with fewer coefficients [9].<br />

The basis functions are obtained by shifting and modulating the amplitude of the “analyzing wavelet”. The<br />

design of basis functions must be optimized, so that the number of non-zero coefficients will be minimal and the input<br />

signal is approximated by projecting it over the basis functions selected adaptively. In wavelet-based denoising, the<br />

noisy signal is decomposed into transform domain coefficients, and the lower order coefficients are removed by applying<br />

a threshold. If we assume that Ψ(t) is the analyzing wavelet function then the continuous multi-resolution wavelet frame<br />

transform, F[m,n], of a signal f(t) is defined as:<br />

F[m, n] = ⟨Ψ_{m,n}(t), f(t)⟩ = ∫_{−∞}^{+∞} Ψ_{m,n}(t) · f(t) dt<br />


The inverse wavelet transform is defined as<br />

f(t) = Σ_{m∈ℑ} Σ_{n∈ℑ} F[m, n] · Ψ_{m,n}(t)<br />

where m and n belong to the set ℑ of integer numbers defining each wavelet basis function, Ψ_{m,n}(t), in the two-<br />

dimensional wavelet space.<br />

The main difference between wavelet and wavelet packet analysis is that the latter allows an adjustable<br />

resolution of frequencies through filter bank decomposition. Filter banks split the whole spectrum into two equal bands<br />

at different frequency levels, obtaining a general tree structure that is called the wavelet packet expansion. Wavelet<br />

packet allows searching the optimum decomposition of the tree looking for the branch with the best entropy criterion of<br />

the input signal [7].<br />
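One stage of such a filter-bank split can be sketched with the Haar QMF pair; applying the split recursively to both outputs grows the wavelet-packet binary tree described above. The filter pair and signal here are illustrative:

```python
import numpy as np

# Haar QMF pair: lowpass h and highpass g split the band into two equal halves.
h = np.array([1.0, 1.0]) / np.sqrt(2)
g = np.array([1.0, -1.0]) / np.sqrt(2)

def qmf_split(x):
    """One analysis stage: filter with the QMF pair, keep every second sample."""
    lo = np.convolve(x, h)[1::2]   # lower half of the spectrum
    hi = np.convolve(x, g)[1::2]   # upper half of the spectrum
    return lo, hi

def qmf_merge(lo, hi):
    """Matching synthesis stage: perfect reconstruction for the Haar pair."""
    x = np.empty(2 * lo.size)
    x[0::2] = (lo - hi) / np.sqrt(2)
    x[1::2] = (lo + hi) / np.sqrt(2)
    return x

x = np.array([4.0, 6.0, 10.0, 12.0, 8.0, 6.0, 5.0, 5.0])
lo, hi = qmf_split(x)
lo2, hi2 = qmf_split(lo)     # recursing on BOTH outputs grows the packet tree
print(np.allclose(qmf_merge(lo, hi), x))   # True: the split is lossless
```

In a full wavelet-packet implementation, an entropy measure evaluated at each node of this tree decides which splits to keep, yielding the best basis of [7].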

<strong>Research</strong>ers in related engineering and applied mathematics areas have developed many different wavelet<br />

transform systems each with specific properties. The difference between these wavelet transforms is mainly their<br />

analyzing functions and the way that they are developed. There are two major classes of wavelet transform systems. One<br />

class consists of orthogonal wavelets and the other one consists of biorthogonal wavelets. Other wavelet transform<br />

systems, not included in the two main categories, have generally limited applications [8].<br />

4. NOISE CANCELATION METHOD<br />

In order to cancel the effect of uncorrelated Gaussian noise in the indoor infrared wireless channel, we introduce the<br />

wavelet transform applied to the signal in the electrical domain. Figure 1 shows the schematic diagram of the wireless<br />

infrared link and the receiver with the wavelet transform denoising block.<br />


Figure 1 – Schematic of the wavelet based denoising wireless infrared link<br />



In this system, the high-pass electrical filter will reduce the interference induced by incandescent light and<br />

fluorescent light driven by conventional ballasts. The comb filter block will cancel the high frequency interference from the<br />

fluorescent lamps driven by electronic ballasts [11]. In the wavelet denoising block, the received signal is<br />

transformed using a pre-defined analyzing function. Once the wavelet decomposition of the signal is achieved, the next<br />

step is thresholding. The thresholder block will remove the coefficients of the signal which have a smaller absolute value than<br />

a predefined threshold. Different methods can be used to determine the threshold level that results in performance<br />

improvement, in addition to rescaling the coefficients to the noise level. If w_m denotes the wavelet coefficients of the<br />

decomposed signal and A the threshold level, then hard thresholding can be described mathematically as:<br />

ŵ_m = w_m if |w_m| ≥ A, and ŵ_m = 0 if |w_m| < A.<br />

In order to avoid the denoising effect of certain filters that remove the sharp features of the signals, soft thresholding will<br />

discard the coefficients with a small and insignificant contribution to the information, and can be performed as:<br />

ŵ_m = Sgn(w_m)(|w_m| − A) if |w_m| ≥ A, and ŵ_m = 0 if |w_m| < A,<br />

where Sgn(.) is the signum function.<br />

The remaining wavelet coefficients produce the denoised signal, which will be demodulated and decoded. The<br />

aim is to alleviate the shot noise generated by incandescent light and the thermal noise from the receiver electronics by this<br />

denoising block. For simulation, the denoising algorithm is applied to a pulse train with a frequency of 10 kHz that passes<br />

through an infrared channel that contributes additive Gaussian noise with an SNR of 4 dB. The data signal with additive white<br />

Gaussian noise and its spectrum are shown in Figures 2-a and 2-b respectively.<br />


Figure 2: The received signal passed over an additive white Gaussian noise channel (a) and its spectrum (b).<br />



The simulations have been done using different wavelet analyzing functions and the resulting SNR<br />

improvements are summarized in Table 1. The “SNR improvement” is defined as the SNR after denoising<br />

minus the SNR before denoising. The orthogonal wavelet transforms used in the simulation were the Haar, Daubechies,<br />

Coiflet and Symlet transforms and the discrete Meyer wavelet transform.<br />

Figure 3 shows the original received noisy signal (above) and denoised version of the same signal after<br />

applying discrete Meyer’s wavelet transform (below). SNR improvement of the denoised signal in this case is 3.8 dB. In<br />

the thresholding block the wavelet coefficients obtained from signal decomposition that are lower than the threshold<br />

level are discarded. Figure 4 shows the original Gaussian noise of the channel (above) and the temporal representation of<br />

the discarded coefficients (below).<br />
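The whole decompose–threshold–reconstruct loop and the SNR-improvement measure can be sketched with a single-level Haar stage. The paper uses deeper wavelet decompositions; the pulse train, noise level, random seed and universal-threshold choice below are assumptions for illustration only:

```python
import numpy as np

def snr_db(clean, noisy):
    """SNR in dB of `noisy` measured against the `clean` reference."""
    noise = noisy - clean
    return 10 * np.log10(np.sum(clean ** 2) / np.sum(noise ** 2))

rng = np.random.default_rng(1)
t = np.arange(1024)
clean = np.sign(np.sin(2 * np.pi * t / 64))          # toy pulse train
noisy = clean + 0.63 * rng.standard_normal(t.size)   # roughly 4 dB input SNR

# One-level Haar analysis: approximation (lo) and detail (hi) bands
lo = (noisy[0::2] + noisy[1::2]) / np.sqrt(2)
hi = (noisy[1::2] - noisy[0::2]) / np.sqrt(2)

# Soft-threshold the detail band; universal threshold, noise level assumed known
A = 0.63 * np.sqrt(2 * np.log(hi.size))
hi = np.sign(hi) * np.maximum(np.abs(hi) - A, 0.0)

# Synthesis from the surviving coefficients
den = np.empty_like(noisy)
den[0::2] = (lo - hi) / np.sqrt(2)
den[1::2] = (lo + hi) / np.sqrt(2)

improvement = snr_db(clean, den) - snr_db(clean, noisy)
print(round(improvement, 2))    # positive: denoising raised the SNR
```

Most of the noise energy in the detail band falls below the threshold and is discarded, while the pulse train survives mainly in the approximation band, so the output SNR rises.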

Waveform SNR improvement<br />

‘Haar’ 2.3279<br />

‘db’ 3.4801<br />

‘sym’ 3.4522<br />

‘coif’ 3.5583<br />

‘bior’ 3.7485<br />

‘dmey’ 3.8281<br />

Table 1: SNR improvement of wavelet denoising method using different analyzing functions<br />

Figure 3: The original noisy 10 KHz pulse train (above) and the denoised version using the discrete Meyer’s transform (below)<br />



Figure 4: The original Gaussian noise (above) and the reconstruction of wavelet coefficients discarded by thresholder (below)<br />

Figure 5: The original noisy 10 KHz pulse train (above) and the denoised version using the Haar transform (below)<br />



Figure 6: The original Gaussian noise (above) and the reconstruction of wavelet coefficients discarded by thresholder (below)<br />

Figure 5 shows the denoised version of the received signal using the Haar wavelet transform<br />

(below). The signal reconstructed from the coefficients discarded in the thresholder is shown in Figure 6 (below). By using<br />

the Haar wavelet transform, an SNR improvement of 2.3 dB has been achieved.<br />

Haar wavelet analyzing has sharp edges compared to the Meyer’s wavelet mother function which is smoother<br />

and this results to the loss of signal information over those sharp edges therefore a lower SNR improvement. Overall the<br />

use of the wavelet deoinsing method with any of the analyzing functions results to a SNR improvement of approximately<br />

3 to 4 dB which means a signal twice more powerful than the noisy one. This improvement can be achieved for a noise<br />

which is uncorrelated with the information signal, and where a reference channel for noise is not available.<br />
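The decompose-threshold-reconstruct chain described above can be sketched with a single-level Haar transform. This is a minimal illustration, not the paper's implementation: the pulse train, noise level, and threshold rule below are assumptions.<br />

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed test signal: a square pulse train in additive white Gaussian noise.
n = 1024
t = np.arange(n)
clean = np.where((t // 64) % 2 == 0, 1.0, -1.0)
noisy = clean + 0.5 * rng.standard_normal(n)

# One level of the orthonormal Haar transform: approximation + detail bands.
approx = (noisy[0::2] + noisy[1::2]) / np.sqrt(2.0)
detail = (noisy[0::2] - noisy[1::2]) / np.sqrt(2.0)

# Thresholding block: discard detail coefficients that fall below the threshold.
thr = 3.0 * np.median(np.abs(detail))
detail = np.where(np.abs(detail) > thr, detail, 0.0)

# Reconstruct the signal from the remaining coefficients (inverse Haar).
denoised = np.empty(n)
denoised[0::2] = (approx + detail) / np.sqrt(2.0)
denoised[1::2] = (approx - detail) / np.sqrt(2.0)

def snr_db(ref, sig):
    """SNR of sig against the clean reference, in dB (+3 dB ~ double the power)."""
    return 10.0 * np.log10(np.sum(ref**2) / np.sum((sig - ref)**2))

improvement = snr_db(clean, denoised) - snr_db(clean, noisy)
```

The positive `improvement` here plays the role of the SNR gains listed in Table 1; deeper decompositions with smoother analyzing functions are what push the gain toward the reported 3 to 4 dB.<br />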

5. CONCLUSIONS<br />

The different noise contributions at infrared wireless receivers have been reviewed. A new denoising method for uncorrelated noise in wireless infrared receivers was introduced using the wavelet transform. In this method, the denoised version is obtained by reconstructing the signal from the coefficients remaining after thresholding. We evaluated the Coiflet, Daubechies, Haar, Symmlet, biorthogonal, and Meyer wavelet analyzing functions for denoising at an infrared wireless receiver. Overall, using the wavelet transform with any of the analyzing functions in the simulation resulted in an SNR improvement of approximately 3 to 4 dB for an input SNR of 4 dB. If the power density function of the noise, which is uncorrelated with the information signal, is known while the reference channel model is unknown, the use of self-defined adaptive wavelet analyzing functions can improve the SNR of a received signal whose spectrum overlaps with that of the noise.<br />

A comparison of the SNR improvement for the different wavelet analyzing functions has been carried out. The results indicate that smoother analyzing functions preserve more signal information and hence yield a higher SNR improvement. However, since the overall SNR improvement of the wavelet decomposition method lies between 3 and 4 dB across the different wavelets, we suggest using wavelets that are easier to implement on digital signal processor (DSP) chips and have efficient computation time, in order to satisfy the speed constraints of the electronics used in the lightwave system.<br />







Authorized licensed use limited to: <strong>Ryerson</strong> <strong>University</strong> Library. Downloaded on July 7, 2009 at 10:46 from IEEE Xplore. Restrictions apply.




Fixed Block-based Lossless Compression of Digital Mammograms<br />

Marwan Y. Al-Saiegh and Sridhar Krishnan<br />

Dept. of Electrical and Computer Engineering,<br />

<strong>Ryerson</strong> Polytechnic <strong>University</strong>, Toronto, ON M5B 2K3, CANADA.<br />

Email: malsaie@ee.ryerson.ca, krishnan@ee.ryerson.ca<br />

Abstract: Breast cancer is a leading cause of death among women in Canada. Computer-aided diagnosis of mammograms (X-ray films of breast tissue) is a non-invasive and inexpensive way of diagnosing breast cancer. The objective of this project is to investigate image compression schemes for faithful transmission and reproduction of digital mammography data over a communication link. A fixed block-based (FBB) near-lossless compression scheme for mammograms has been developed which runs in conjunction with traditional compression schemes such as Huffman coding and LZW (Lempel-Ziv Welch) coding. The algorithm codes blocks of pixels within the image that contain the same intensity value (the odds of having blocks of the same pixel values in a mammography image are very high), thus reducing the size of the image substantially while encoding the image at the same time. The proposed compression scheme was applied to 44 mammograms (22 benign and 22 malignant), and provided a compression ratio of 1.7:1. When Huffman coding and LZW coding were used in conjunction with the FBB compression scheme, the compression ratio was 3.81:1 for Huffman and 5:1 for LZW coding. The proposed FBB lossless compression technique seems to be promising for teleradiology applications.<br />

1 Introduction<br />

Breast cancer is one of the leading causes of death among women worldwide. In the U.S. alone in 2000, more than 40,000 women died of breast cancer. Therefore, early diagnosis is extremely important to reduce the mortality rate [1]. American Cancer Society guidelines for women aged 40-50 advocate screening every 1-2 years, with frequency based on the patient's risk factors. This procedure would result in some 20 million mammograms per year. Archiving and retaining these data for at least three years will be expensive and difficult, requiring sophisticated data compression techniques [2]. Screening of mammograms in rural clinics is also a growing concern: owing to the scarcity of radiologists, a subject may have to wait for weeks to get her diagnosis result. The delay in producing the results is mainly due to infrequent visits of radiologists to rural clinics, and the non-availability of an efficient communication link through which a mammographic image could be faithfully transmitted to a city clinic. Teleradiology of digital mammograms could significantly alleviate this problem, and may facilitate early diagnosis and reduce the incidence of this killer disease. The above facts warrant an efficient data compression scheme for mammograms.<br />

Figure 1: Block diagram of the FBB technique using Huffman and Lempel-Ziv Welch coding<br />

Physicians and radiologists are reluctant to consider a technique that would discard even a small amount of information from a mammographic image. By exploiting the redundancy, or correlation, of pixels in an image, a data compression technique can be designed to compress an image efficiently. Current compression techniques are based on transform coding, such as the discrete wavelet transform and the discrete cosine transform [3]. Although transform coding techniques have claimed a compression ratio of 10:1, they are lossy compression schemes and require extensive receiver operating characteristic (ROC) studies of the compressed images. The proposed near-lossless compression scheme is shown in block diagram form in Fig. 1. The proposed technique is "minimally" lossy and does not require any ROC studies to evaluate compressed images. The paper is organized as follows: Section 2 covers the fixed block-based (FBB) compression scheme; Huffman coding and LZW coding are briefly covered in Section 3; Section 4 covers results and discussion; and the paper is concluded in Section 5.<br />



Figure 2: Block diagram of the FBB compression<br />


2 Fixed block-based (FBB) compression scheme<br />

The fixed block-based (FBB) compression scheme takes advantage of pixel correlation while scanning the image from left to right. It is known that adjacent pixels in a mammographic image are highly correlated. Adjacent pixels can therefore be combined to reduce redundancy, and that is what the proposed FBB algorithm is based upon. The FBB compression scheme reads the pixels one at a time and stores them in a two-dimensional array (e.g. 448 x 448). The histogram of the mammogram is used to identify pixel values that do not appear in the image. This procedure is essential because it provides two redundant pixel values that are used as keys throughout the algorithm to avoid overlapping and to represent blocks of zeros in the output file.<br />

2.1 Algorithm of the compression scheme<br />

The proposed FBB compression scheme is shown in block diagram form in Fig. 2. The steps involved are:<br />

1. If the difference between the current pixel x[i][j] and each of the surrounding pixels x[i][j+1], x[i+1][j+1], and x[i+1][j] is 0, 1, or -1, then go to step 3. Otherwise go to step 2.<br />

2. Write the current pixel x[i][j] to the output file, move the sliding window to the next column of the two-dimensional array, and go back to step 1.<br />

3. If the current pixel is not zero, write (-1)*current pixel to the output file. If the current pixel is 0, write the second 'key' pixel to the output file. Go to step 4.<br />

4. Replace the block of pixels in the two-dimensional array by the first 'key' pixel, to avoid overlapping when the algorithm is implemented, and go to step 5.<br />

5. Force the sliding window to skip one column, and go back to step 1 (i.e. instead of sliding the window from column two, start from column three).<br />

It is important to realize that the sliding-window approach can be improved by making the window size bigger, to absorb more pixels when needed; the tradeoff is that more 'key' pixels must then be chosen, one for every window size, to distinguish between the different window sizes. Since 0 cannot be positive or negative, a 'key' pixel is also needed for the 0 pixel: whenever the sliding window algorithm finds a block of 0 pixels, the second 'key' pixel is written to the output file.<br />

Figure 3: Block diagram of the FBB decompression<br />



2.2 Algorithm of the decompression scheme<br />

The FBB decompression algorithm is shown in block diagram form in Fig. 3. While performing decompression, the compressed file is read from the standard input and stored in a one-dimensional array. A temporary two-dimensional array is used to reconstruct the image. The temporary array is initialized with the first 'key' value. The algorithm is as follows:<br />


1. If the current pixel in the one-dimensional array, x[i], is negative, write its value in positive form, x[i]*(-1), as a block (i.e. a 4x4 matrix) to the temporary two-dimensional array and go to step 4.<br />

2. If the current pixel value is the same as the second 'key' value, write a block of 0's (i.e. a 4x4 matrix) to the temporary two-dimensional array and go to step 4.<br />

3. If the current pixel in the one-dimensional array is positive, write the current pixel value to the temporary array and go to step 5.<br />

4. Skip the following column in the temporary two-dimensional array to account for the block of pixels, and go to step 5.<br />

5. Increment the index into the one-dimensional array by one and go back to step 1.<br />

Once the decompression algorithm is completed, the temporary two-dimensional array is written to an output file (i.e. a decompressed version of the original file).<br />

3 Huffman coding and Lempel-Ziv-Welch coding<br />

After performing FBB compression of the mammogram image, it is further compressed using popular lossless compression schemes such as Huffman coding and LZW coding.<br />

3.1 Huffman coding<br />

Huffman codes belong to a family of codes with variable codeword length, which means that the individual symbols that make up a message are represented (encoded) with bit sequences of distinct lengths [4]. This helps to decrease the amount of redundancy in message data. The reduction of redundancy by Huffman codes is based on the fact that distinct symbols have distinct probabilities of incidence. Symbols with higher probabilities of incidence are coded with shorter code words, while symbols with lower probabilities are coded with longer code words.<br />
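A minimal sketch of this construction, using a generic heap-based Huffman builder rather than the paper's implementation:<br />

```python
import heapq
from collections import Counter

def huffman_codes(symbols):
    """Return a prefix-free code {symbol: bitstring}; frequent symbols get short codes."""
    freq = Counter(symbols)
    if len(freq) == 1:  # degenerate single-symbol input
        return {next(iter(freq)): "0"}
    # Heap entries: [weight, tiebreaker, {symbol: partial codeword}].
    heap = [[w, i, {s: ""}] for i, (s, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    count = len(heap)
    while len(heap) > 1:
        lo = heapq.heappop(heap)  # pop the two least probable subtrees...
        hi = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in lo[2].items()}
        merged.update({s: "1" + c for s, c in hi[2].items()})
        heapq.heappush(heap, [lo[0] + hi[0], count, merged])  # ...merge, reinsert
        count += 1
    return heap[0][2]
```

For the input "aaaabbc", the most frequent symbol 'a' receives the shortest codeword, and no codeword is a prefix of another, so the bitstream decodes unambiguously.<br />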

3.2 Lempel-Ziv-Welch coding<br />

The LZW algorithm relies on the re-occurrence of byte sequences (strings) in its input [5]. It maintains a table mapping input strings to their associated output codes. The goal of LZW compression is to replace repeating input strings with n-bit codes. This is done by generating a string table on the fly, which is a mapping between pixel values and compression codes. The string table is built by the encoder as it processes the data, and, due to the encoding method, the decoder can reconstruct the string table as it processes the compressed data. This differs from other compression algorithms, such as Huffman coding, where the lookup table needs to be included with the compressed data.<br />

LZW exploits the fact that many groupings of pixels are common in images: it goes through the image data and tries to encode as large a grouping of pixels as possible with an encoding from the string table, placing unrecognized groupings into the string table so they can be compressed on later occurrences. For an image with n-bit pixel values, it uses compression codes that are n+1 bits or larger. While a smaller compression code helps gain larger amounts of compression, the size of the compression code limits the size of the string table.<br />
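The on-the-fly table construction can be sketched as follows. This is a generic byte-oriented LZW encoder/decoder pair, not the paper's pixel implementation; codes are kept as plain integers rather than packed n+1-bit fields.<br />

```python
def lzw_compress(data):
    """Replace repeating byte strings with integer codes, building the table on the fly."""
    table = {bytes([i]): i for i in range(256)}
    next_code, w, out = 256, b"", []
    for b in data:
        wc = w + bytes([b])
        if wc in table:
            w = wc                   # grow the current string while it is known
        else:
            out.append(table[w])     # emit the code for the longest known prefix
            table[wc] = next_code    # learn the new string
            next_code += 1
            w = bytes([b])
    if w:
        out.append(table[w])
    return out

def lzw_decompress(codes):
    """Rebuild the string table while decoding; no table is transmitted."""
    table = {i: bytes([i]) for i in range(256)}
    next_code = 256
    w = table[codes[0]]
    out = [w]
    for c in codes[1:]:
        entry = table[c] if c in table else w + w[:1]  # KwKwK corner case
        out.append(entry)
        table[next_code] = w + entry[:1]
        next_code += 1
        w = entry
    return b"".join(out)
```

Note how the decoder recreates each table entry one step behind the encoder, which is why no lookup table has to accompany the compressed data.<br />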

4 Results and discussion<br />

The proposed FBB compression scheme was tested on the MiniMammographic database of 44 images from the Mammographic Image Analysis Society (MIAS). The MIAS is an organisation of UK research groups interested in the understanding of mammograms. Films taken from the UK National Breast Screening Programme have been digitized to a 50 micron pixel edge with a Joyce-Loebl scanning microdensitometer, a device linear in the optical density range 0 to 3.2 and representing each pixel with an 8-bit word. The database also includes the radiologist's 'truth' markings on the locations of any abnormalities that may be present.<br />

The total numbers of benign and malignant images in the database are 22 and 22 respectively. A benign mammographic image is shown in Fig. 4, and its compressed version is shown in Fig. 5. Perceptually there is no difference in image quality between the original image and its compressed version. Fig. 6 illustrates a malignant image; the compressed image is shown in Fig. 7. In this case, too, there is no difference between the original and the compressed images.<br />

Table 1 shows the advantage of using FBB in conjunction with Huffman coding and LZW coding. The mean compression ratio of benign and malignant images using the FBB scheme alone was approximately 1.7:1. The mean<br />



Figure 4: Benign mammogram before FBB compression<br />

Figure 5: Same benign image as in Fig. 4 after FBB compression<br />




Figure 6: Malignant mammogram before FBB compression<br />

Figure 7: Same malignant image in Fig. 6 after FBB compression<br />





Table 1: Compression ratios of benign and malignant images. Legend: CR = compression ratio<br />

compression ratio of benign and malignant images using FBB with Huffman coding was approximately 3.81:1. The mean compression ratio of benign and malignant images using FBB with LZW coding was approximately 5:1.<br />

Figure 8: Bar graph for different compression schemes applied to benign images<br />

The two bar graphs illustrate the usefulness of combining FBB with other standard compression schemes such as Huffman coding and LZW coding. The first bar graph, in Fig. 8, is for benign images, and the second, in Fig. 9, is for malignant images. The y-axis in each bar graph denotes the mean file size of the benign or malignant images in bytes, while the x-axis denotes the scheme applied to those images (e.g. Huffman with FBB).<br />

5 Conclusion<br />

In this paper a novel method of compressing mammographic images is proposed. The scheme is based upon FBB scanning of the pixels of an image. FBB codes blocks of pixels within the image that contain the same value (the odds of having blocks of the same pixel values in a mammography image are very high), thus reducing the size of the image substantially while encoding the image at the same time. The FBB compression scheme alone provided a compression ratio of 1.7:1. When Huffman coding and LZW coding were used in conjunction with the FBB compression scheme, the compression ratio was 3.8:1 for Huffman and 5:1 for LZW coding. The proposed FBB lossless compression technique seems to be promising for teleradiology applications. Future work involves investigation of the compression scheme for transmission of mammography data over the internet protocol.<br />

Figure 9: Bar graph for different compression schemes applied to malignant images<br />

Acknowledgment<br />

We would like to acknowledge the Mammographic Image<br />

<strong>Analysis</strong> Society (MIAS) for granting us permission to use<br />

their database. We would also like to acknowledge <strong>Ryerson</strong><br />

<strong>University</strong> and NSERC for providing financial support.<br />

References<br />

[1] S. A. Feig, "Decreased breast cancer mortality through mammographic screening: results of clinical trials," Radiology, vol. 167, pp. 659-665, 1988.<br />

[2] H. A. Frazer, "Computerized diagnosis comes to mammography," Diagnostic Imaging, pp. 91-95, June 1991.<br />

[3] Z. Yang, M. Kallergi, R. A. DeVore, B. Lucier, W. Qian, R. A. Clark, and L. P. Clarke, "Effect of wavelet bases on compressing digital mammograms," IEEE Engineering in Medicine and Biology Magazine, vol. 14, no. 5, pp. 570-577, Sep/Oct 1995.<br />

[4] D. A. Huffman, "A method for the construction of minimum redundancy codes," Proc. IRE, vol. 40, pp. 1098-1101, 1952.<br />

[5] J. Ziv and A. Lempel, "Compression of individual sequences via variable-rate coding," IEEE Trans. Information Theory, vol. IT-24, pp. 530-536, 1978.<br />




Instantaneous mean frequency estimation using adaptive time-frequency distributions<br />

Sridhar Krishnan<br />

Dept. of Electrical and Computer Engineering,<br />

<strong>Ryerson</strong> Polytechnic <strong>University</strong>, Toronto, ON M5B 2K3, CANADA.<br />

Email: krishnan@ee.ryerson.ca<br />

Abstract: Analysis of non-stationary signals is a challenging task. True non-stationary signal analysis involves monitoring the frequency changes of the signal over time (i.e., monitoring the instantaneous frequency (IF) changes). The IF of a signal is traditionally obtained by taking the first derivative of the phase of the signal with respect to time. This poses some difficulties because the derivative of the phase of the signal may take negative values, thus misleading the interpretation of instantaneous frequency. In this paper, a novel approach to extract the IF of a signal from its adaptive time-frequency distribution is proposed. The adaptive time-frequency distribution of a signal is obtained by decomposing the signal into components with good time-frequency localization and combining the Wigner distributions of the components. The adaptive time-frequency distribution thus obtained is free of cross-terms and is a positive time-frequency distribution with good time and frequency localization. The IF may be obtained as the first central moment of the adaptive time-frequency distribution. The proposed method of IF estimation is very powerful for applications with low SNR. The proposed technique was tested with synthetic signals of known IF dynamics, and the method successfully extracted the IF of the signals.<br />

Keywords: instantaneous frequency, non-stationary signals, positive time-frequency distributions, matching pursuit, average frequency.<br />

1 Introduction<br />

The instantaneous frequency (IF) of a signal is a parameter of practical importance in areas such as seismic, radar, sonar, communications, and biomedical applications. In all these applications the IF describes some physical phenomenon associated with the signal. Like most other signal processing concepts, the IF was originally used to describe FM modulation in communications. In a typical radar application, the IF aids in the detection, tracking, and imaging of targets whose radial velocities change with time. When the radial velocity is not constant, the radar's Doppler-induced frequency has a nonstationary spectrum, which can be tracked by IF estimation techniques. Also, in biomedical signal analysis, the IF is used in studying electroencephalogram (EEG) signals to monitor key neural activities of the brain.<br />

The importance of the IF concept arises from the fact that in most applications a signal processing engineer is confronted with the task of processing signals whose spectral characteristics (in particular, the frequencies of the spectral peaks) vary with time. Such signals are often referred to as non-stationary signals. A chirp signal is a simple example of a non-stationary signal, in which the frequency of the sinusoid changes linearly with time. It is theoretically difficult to describe the IF of a signal since most signals are multicomponent, and it is difficult to define a unique parameter for each time instant. Also, since frequency is usually defined as the number of oscillations or vibrations undergone in a unit time period, the association of the words "instantaneous" and "frequency" is still controversial.<br />

Several authors have tried to define the IF of a signal. In this paper the IF is defined by using an adaptive time-frequency distribution (TFD). The paper is organized as follows: a brief review of the topic of IF is given in Section 2; the proposed adaptive TFD technique is described in Section 3; results with synthetic and real-world signals are discussed in Section 4; and the paper is concluded in Section 5.<br />

2 Review<br />


The classical definition of the IF of a signal [1] is<br />

f_i(t) = (1/(2*pi)) * d(phi(t))/dt,<br />

where phi(t) is the phase of the analytic signal. Ville formulated a joint TFD of the signal energy called the Wigner-Ville distribution (WVD), and defined the IF as the first central moment (average frequency) of the WVD,<br />

f_i(t) = ( integral of f*W(t,f) df ) / ( integral of W(t,f) df ).<br />


Most Cohen’s class TFDs derived from the WVD yield the IF by a correct first-moment calculation, but this is often computationally expensive and is adversely affected by noise.<br />
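The first-moment estimator can be illustrated with a spectrogram, itself a simple positive TFD, standing in for the adaptive TFD; the chirp, sampling rate, and window sizes below are assumptions made for the sketch:<br />

```python
import numpy as np

fs = 1000.0                               # assumed sampling rate, Hz
t = np.arange(0.0, 1.0, 1.0 / fs)

# Assumed test signal: linear chirp whose true IF sweeps 50 -> 250 Hz.
x = np.cos(2.0 * np.pi * (50.0 * t + 100.0 * t**2))

win, hop = 128, 64
freqs = np.fft.rfftfreq(win, 1.0 / fs)
if_est = []
for start in range(0, len(x) - win, hop):
    seg = x[start:start + win] * np.hanning(win)
    power = np.abs(np.fft.rfft(seg))**2   # one time slice of a positive TFD
    # First moment of the slice = mean frequency at this time instant.
    if_est.append(np.sum(freqs * power) / np.sum(power))
```

For this chirp the estimates track the true IF, 50 + 200t Hz, rising from about 60 Hz in the first frame toward 230 Hz in the last; the estimator never leaves the spectral range of the signal, which is the property the positive-TFD formulation is after.<br />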

Most TFDs, such as the WVD, provide high signal energy concentration in time and frequency, so it is tempting to use them to measure the spread of frequencies with time. Unfortunately, the spread of the IF of the WVD is only positive for certain types of signals. Even when the spread is positive, some negative distribution values may appear in the calculation, and thus its usefulness is questionable. From the literature it appears that there are still many unresolved issues regarding the IF of a signal (a detailed review of the fundamentals of IF is available in [2]). It has been shown that the usual way of interpreting the IF as the average frequency at each time brings out unexpected results with Cohen's class of bilinear TFDs. If the IF is interpreted as the average frequency, then the IF need not be a frequency that appears in the spectrum of the signal. If the IF is interpreted as the derivative of the phase, then the IF can extend beyond the spectral range of the signal. It has recently been reported that estimating the IF of a signal using a positive TFD brings out a meaningful interpretation of the IF of the signal [3]. The motivation behind this paper is to adaptively construct a positive TFD suitable for estimating the IF of a signal.<br />

3 Adaptive Time-Frequency Distributions<br />

The purpose of this paper is to explore the best available TFD for estimating the IF of a signal. For simple applications, Cohen's-class TFDs or model-based TFDs may be applied. It is widely accepted that, in the case of complex signals with multiple frequency components, there is no definite TFD that will satisfy all the criteria and still give optimal performance. The purpose of this section is to construct TFDs according to the application at hand, i.e., to tailor the TFD to the properties of the signal being analyzed. It is appropriate to call such TFDs adaptive TFDs. In the present work, the concept of adaptive TFDs is based on signal decomposition.<br />

In practice, no TFD may satisfy all the requirements. In the method proposed in this section, the TFDs are modified, by using constraints, to satisfy certain specified criteria. It is assumed that the given signal is somehow decomposed into components of a specified mathematical representation. By knowing the components of a signal, the interaction between them can be established and used to remove or prevent cross-terms. This avoids the main drawback associated with Cohen's-class TFDs; numerous efforts have been directed at developing kernels to overcome the cross-term problem [4, 5, 6].<br />

The key to the successful design of adaptive TFDs lies in the selection of the decomposition algorithm. The components obtained from a decomposition algorithm depend largely on the type of basis functions used. For example, the basis function of the Fourier transform decomposes the signal into tonal (sinusoidal) components, and the basis function of the wavelet transform decomposes the signal into components with good time and scale properties. For TF representation, it is beneficial if the signal is decomposed using basis functions with good TF properties. The components obtained by decomposing a signal using such basis functions may be termed TF atoms. An algorithm that can decompose a signal into TF atoms is the MP algorithm described in the next section.<br />

3.1 Matching Pursuit<br />

The MP algorithm decomposes the given signal using basis functions that have excellent TF properties. The MP algorithm selects the decomposition vectors depending upon the signal properties. The vectors are selected from a family of waveforms called a dictionary. The signal x(t) is projected on to a dictionary of TF atoms obtained by scaling, translating, and modulating a window function g(t):<br />

x(t) = \sum_{n=0}^{\infty} a_n g_{\gamma_n}(t),   (3)<br />

where<br />

g_{\gamma_n}(t) = \frac{1}{\sqrt{s_n}}\, g\!\left(\frac{t - p_n}{s_n}\right) \exp[j(2\pi f_n t + \phi_n)],   (4)<br />

and a_n are the expansion coefficients. The scale factor s_n is used to control the width of the window function, and the parameter p_n controls temporal placement. The factor 1/\sqrt{s_n} normalizes g_{\gamma_n}(t) to unit norm. The parameters f_n and \phi_n are the frequency and phase of the exponential function, respectively. \gamma_n represents the set of parameters (s_n, p_n, f_n, \phi_n).<br />

In the present work, the window is a Gaussian function, i.e., g(t) = 2^{1/4} \exp(-\pi t^2); the TF atoms are then called Gabor atoms, and they provide the optimal TF resolution in the TF plane.<br />

In practice, the algorithm works as follows. The signal is iteratively projected on to a Gabor function dictionary. The first projection decomposes the signal into two parts:<br />

x(t) = \langle x, g_{\gamma_0} \rangle g_{\gamma_0}(t) + R^1 x(t),   (5)<br />

where \langle x, g_{\gamma_0} \rangle denotes the inner product (projection) of x(t) with the first TF atom g_{\gamma_0}(t). The term R^1 x(t) is the residue after approximating x(t) in the direction of g_{\gamma_0}(t). This process is continued by projecting the residue on to the subsequent functions in the dictionary; after M iterations,<br />

x(t) = \sum_{n=0}^{M-1} \langle R^n x, g_{\gamma_n} \rangle g_{\gamma_n}(t) + R^M x(t),   (6)<br />

with R^0 x(t) = x(t). There are two ways of stopping the iterative process: one is to use a pre-specified limiting number M of TF atoms, and the other is to check the energy of the residue R^M x(t). A very high value of M and a near-zero threshold on the residual energy will decompose the signal completely, at the expense of increased computational complexity.<br />
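The greedy projection loop of Eqs. 5-6 can be sketched as follows. This is an illustrative NumPy implementation using real-valued Gabor atoms and a small hand-built dictionary; the paper's atoms are complex with an explicit phase \phi_n, and practical dictionaries are far larger:

```python
import numpy as np

def gabor_atom(N, s, p, f):
    # real Gabor atom: Gaussian window of width s, centred at sample p,
    # modulated to frequency f (cycles/sample), normalized to unit energy
    t = np.arange(N)
    g = np.exp(-np.pi*((t - p)/s)**2) * np.cos(2*np.pi*f*(t - p))
    return g / np.linalg.norm(g)

def matching_pursuit(x, atoms, M):
    # greedy loop of Eqs. 5-6: at each step project the current residue
    # on every atom and subtract the best-matching one
    residue = x.astype(float).copy()
    picks = []
    for _ in range(M):
        coeffs = atoms @ residue            # <R^n x, g_gamma> for all atoms
        k = int(np.argmax(np.abs(coeffs)))
        picks.append((k, coeffs[k]))
        residue -= coeffs[k] * atoms[k]
    return picks, residue

np.random.seed(0)
N = 256
atoms = np.array([gabor_atom(N, s, p, f)
                  for s in (8, 16, 32)
                  for p in range(0, N, 16)
                  for f in (0.05, 0.10, 0.20)])
# a test signal: one dictionary atom plus a little noise
x = 2.0*gabor_atom(N, 16, 128, 0.10) + 0.01*np.random.randn(N)
picks, residue = matching_pursuit(x, atoms, M=5)
```

On this signal the first pick recovers the embedded atom with a coefficient close to 2, and the residual energy drops to roughly the noise floor after a few iterations.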

3.2 Matching Pursuit TFD<br />

A signal decomposition-based TFD may be obtained by taking the WVD of the TF atoms in Eq. 6, and is given as [7]:<br />

W x(t,\omega) = \sum_{n=0}^{M-1} |\langle R^n x, g_{\gamma_n} \rangle|^2 W_{g_{\gamma_n}}(t,\omega) + \sum_{n=0}^{M-1} \sum_{m=0,\, m \neq n}^{M-1} \langle R^n x, g_{\gamma_n} \rangle \langle R^m x, g_{\gamma_m} \rangle^{*}\, W_{[g_{\gamma_n}, g_{\gamma_m}]}(t,\omega),   (7)<br />

where W_{g_{\gamma_n}}(t,\omega) is the WVD of the Gaussian window function. The double sum corresponds to the cross-terms of the WVD, indicated by W_{[g_{\gamma_n}, g_{\gamma_m}]}(t,\omega), and should be rejected in order to obtain a cross-term-free energy distribution of x(t) in the TF plane. Thus only the first term is retained, and the resulting TFD is given by<br />

W'(t,\omega) = \sum_{n=0}^{M-1} |\langle R^n x, g_{\gamma_n} \rangle|^2 W_{g_{\gamma_n}}(t,\omega).   (8)<br />

This cross-term-free TFD, also known as the matching pursuit TFD (MPTFD), has very good readability and is appropriate for the analysis of nonstationary, multicomponent signals. The extraction of coherent structures makes MP an attractive tool for TF representation of signals with unknown SNR.<br />
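Eq. 8 can be evaluated without numerically computing any WVD, because the WVD of a unit-energy Gaussian atom has the closed form W(t,f) = 2 exp{-2\pi[((t-p)/s)^2 + s^2 (f - f_n)^2]} (Mallat and Zhang [7]). A sketch, assuming each MP pick is stored as its parameters and coefficient (the pick values below are illustrative):

```python
import numpy as np

def atom_wvd(t, f, s, p, fc):
    # closed-form WVD of a unit-energy Gaussian (Gabor) atom with scale s,
    # time centre p, and centre frequency fc (cycles/sample)
    T, F = np.meshgrid(t, f, indexing="ij")
    return 2.0 * np.exp(-2.0*np.pi*(((T - p)/s)**2 + (s*(F - fc))**2))

def mptfd(picks, t, f):
    # Eq. (8): cross-term-free distribution as the |coefficient|^2-weighted
    # sum of the atoms' own WVDs; the cross-terms of Eq. (7) are discarded
    D = np.zeros((len(t), len(f)))
    for (s, p, fc), a in picks:
        D += (abs(a)**2) * atom_wvd(t, f, s, p, fc)
    return D

# two hypothetical MP picks: ((scale, position, frequency), coefficient)
picks = [((16.0, 100.0, 0.10), 2.0), ((32.0, 300.0, 0.25), 1.0)]
t = np.arange(512.0)
f = np.linspace(0.0, 0.5, 128)
D = mptfd(picks, t, f)    # positive everywhere, atoms well localized
```

Because each summand is a positive Gaussian bump, the resulting distribution is nonnegative and peaks at the TF locations of the extracted atoms, which is what gives the MPTFD its readability.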

3.3 Minimum Cross-Entropy Optimization of the MPTFD<br />

One of the drawbacks of the MPTFD is that it does not satisfy the marginal properties. If a TFD is positive and satisfies the marginals, it may be considered to be a proper TFD for the extraction of time-varying frequency parameters such as the IF. This is because positivity coupled with correct marginals ensures that the TFD is a true probability density function, and the parameters extracted are meaningful [8]. The MPTFD may be modified to satisfy the marginal requirements and still preserve its other important characteristics. One way to optimize the MPTFD is by using the cross-entropy minimization method [9, 10]. Cross-entropy minimization is a general method of inference about an unknown probability density when there exists a prior estimate of the density and new information in the form of constraints on expected values is available. If the optimized MPTFD or OMP TFD (an unknown probability density function) is denoted by M(t,\omega), then it should satisfy the marginals<br />

\int M(t,\omega)\, d\omega = |x(t)|^2 = m(t),   (9)<br />

and<br />

\int M(t,\omega)\, dt = |X(\omega)|^2 = m(\omega),   (10)<br />

where X(\omega) is the Fourier transform of x(t).<br />

Eqs. 9 and 10 may be treated as constraint equations (new information) for optimization. Now, M(t,\omega) may be obtained from W'(t,\omega) (a prior estimate of the density) by minimizing the cross-entropy between them, given by<br />

S(M, W') = \int\!\!\int M(t,\omega) \log \frac{M(t,\omega)}{W'(t,\omega)}\, dt\, d\omega.   (11)<br />

As we are interested only in the marginals, the OMP TFD may be written as [10]:<br />

M(t,\omega) = W'(t,\omega) \exp\{-(\alpha_0(t) + \beta_0(\omega))\},   (12)<br />

where the \alpha's and \beta's are Lagrange multipliers which may be determined using the constraint equations. An iterative algorithm to obtain the Lagrange multipliers and solve for M(t,\omega) is presented next.<br />

At the first iteration, we define<br />

M^1(t,\omega) = W'(t,\omega) \exp(-\alpha_0(t)).   (13)<br />

As the marginals are to be satisfied, the time marginal constraint has to be imposed in order to solve for \alpha_0(t). By imposing the time marginal constraint given by Eq. 9 on Eq. 13, we obtain<br />

\exp(-\alpha_0(t)) = \frac{m(t)}{m'(t)},   (14)<br />

where m(t) is the desired time marginal and m'(t) is the time marginal estimated from W'(t,\omega). Now, Eq. 13 can be rewritten as<br />

M^1(t,\omega) = W'(t,\omega)\, \frac{m(t)}{m'(t)}.   (15)<br />
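On a discretized TFD, the alternating marginal-imposition steps (the time step above, followed by the analogous frequency step) amount to iterative proportional fitting of a positive matrix. A sketch (illustrative NumPy; the random prior stands in for an MPTFD, and the two marginals are assumed to carry equal total energy):

```python
import numpy as np

def fit_marginals(W, m_t, m_f, n_iter=200):
    # alternately impose the time and frequency marginals on a positive
    # prior TFD; each pass multiplies by m/m', i.e. the exp(-multiplier)
    # factor of the cross-entropy solution, and the result stays positive
    M = W.copy()
    for _ in range(n_iter):
        M *= (m_t / M.sum(axis=1))[:, None]   # time-marginal step
        M *= (m_f / M.sum(axis=0))[None, :]   # frequency-marginal step
    return M

rng = np.random.default_rng(1)
W = rng.random((8, 6)) + 0.1        # stand-in for a positive prior MPTFD
m_t = rng.random(8) + 0.1
m_t /= m_t.sum()                    # desired time marginal m(t)
m_f = rng.random(6) + 0.1
m_f /= m_f.sum()                    # desired frequency marginal m(w)
M = fit_marginals(W, m_t, m_f)
```

For a strictly positive prior the iteration converges quickly, and the fitted distribution matches both marginals while remaining positive, consistent with the cross-entropy argument in the text.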


At this point, M^1(t,\omega) is a modified MPTFD with the desired time marginal; however, it need not necessarily have the desired frequency marginal m(\omega). In order to obtain the desired frequency marginal, the following equation has to be solved:<br />

M^2(t,\omega) = M^1(t,\omega) \exp(-\beta_0(\omega)).   (16)<br />

Note that the TFD obtained after the first iteration, M^1(t,\omega), is used as the incoming estimate in Eq. 16. By imposing the frequency marginal constraint given by Eq. 10 on Eq. 16, we obtain<br />

\exp(-\beta_0(\omega)) = \frac{m(\omega)}{m^1(\omega)},   (17)<br />

where m(\omega) is the desired frequency marginal, and m^1(\omega) is the frequency marginal estimated from M^1(t,\omega). Now, Eq. 16 can be rewritten as<br />

M^2(t,\omega) = M^1(t,\omega)\, \frac{m(\omega)}{m^1(\omega)}.   (18)<br />

By incorporating the desired frequency marginal constraint, the M^1(t,\omega) TFD is altered and need not necessarily retain the desired time marginal. Successive iterations can overcome this problem and bring the estimated TFD closer to M(t,\omega). This follows from the fact that the cross-entropy between the desired TFD and the estimated TFD decreases with the number of iterations [10].<br />

As the iterative procedure is started with a positive distribution W'(t,\omega), the TFD at the nth iteration, M^n(t,\omega), is guaranteed to be a positive distribution. Such a class of distributions belongs to the Cohen-Posch class of positive distributions [8]. The OMP TFDs may also be regarded as adaptive TFDs because they are constructed on the basis of the properties of the signal being analyzed.<br />

A method for constructing a positive distribution using the spectrogram as a priori knowledge was developed by Loughlin et al. [11]. The major drawback of using the spectrogram as a priori knowledge is the loss of TF resolution; this effect may be minimized by taking multiple spectrograms with different sizes of analysis windows as initial estimates of the desired distribution. The method proposed in this section starts with the MPTFD, avoids the need for multiple spectrograms as initial estimates, and produces a high-resolution TFD tailored to the signal properties. The OMP TFD may be used to derive higher moments by estimating the higher-order Lagrange multipliers. Such measures are not necessary in the present work, and are beyond the scope of this paper.<br />

The IF of a signal can be computed as the first moment of TFD(t,\omega) along each time slice, given by<br />

\mathrm{IF}(t) = \frac{\sum_{\omega} \omega\, \mathrm{TFD}(t,\omega)}{\sum_{\omega} \mathrm{TFD}(t,\omega)}.   (19)<br />

The IF characterizes the frequency dynamics of the signal.<br />

4 Results<br />

The proposed method of extracting the IF of a signal was applied to a synthetic signal with known IF, and to a real-world example of a knee-joint sound signal.<br />

4.1 Synthetic <strong>Signal</strong><br />

The synthetic signal "syn1" is composed of nonoverlapping chirp, transient, and sinusoidal FM components, and is shown in Fig. 1. The frequency behavior of the signal is shown in Fig. 2. "syn1" is an example of a monocomponent signal with linear and nonlinear frequency dynamics. To simulate noisy signal conditions, "syn1" was corrupted by adding random noise to an SNR of 10 dB.<br />

Figure 1: Monocomponent, nonstationary, synthetic signal "syn1" consisting of a chirp, an impulse, and a sinusoidal FM component (SNR = 10 dB).<br />

Figure 2: Ideal TFD depicting the frequency laws of the signal "syn1" in Fig. 1.<br />

The MP method has given a clear picture of the IF representation: the three simulated components are perfectly localized in the TFD shown in Fig. 3. This is because the OMP TFD provides an adaptive representation of the signal components, and because each high-energy component is analyzed by the TF representation independently of its bandwidth and duration. The good localization of transients produced by MP is due to the good TF localization properties of the basis functions, whereas with other techniques, such as Fourier and wavelet analysis, the transient information gets diluted across the whole basis, and the collection of basis functions is not as large as that in the MP dictionary.<br />

Figure 3: OMP TFD of the signal "syn1" in Fig. 1.<br />

4.2 Real World Example<br />

The proposed technique was applied to real-world signals, namely, the knee sound signals. Due to the differences in<br />

the cartilage surface between normal and abnormal knees, sound signals with different IFs are produced [12]. Fig. 4 shows the knee sound signal of a normal subject. The IF of the same signal is shown in Fig. 5. Automatic classification of the sound signals, using the IF as a feature for pattern classification, has produced good results in screening abnormal knees from normal knees.<br />
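On a discretized TFD, the first-moment IF extraction of Eq. 19 is a one-line computation. The sketch below (illustrative NumPy; a synthetic Gaussian ridge stands in for a real OMP TFD) recovers a known IF law:

```python
import numpy as np

def if_from_tfd(tfd, freqs):
    # Eq. (19): first moment of the TFD along each time slice
    return (tfd @ freqs) / tfd.sum(axis=1)

freqs = np.linspace(0.0, 0.5, 101)       # normalized frequency grid
true_if = np.linspace(0.10, 0.40, 50)    # a slowly rising IF law, one per slice
# synthetic positive TFD: a narrow Gaussian ridge following true_if
tfd = np.exp(-((freqs[None, :] - true_if[:, None]) / 0.02)**2)
est_if = if_from_tfd(tfd, freqs)
```

Because the stand-in TFD is positive and well concentrated, the first moment of each slice lands on the ridge centre, mirroring how the positive OMP TFD makes the extracted IF meaningful.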

Figure 4: Knee sound signal of a normal subject.<br />

Figure 5: IF estimated from the OMP TFD of the normal<br />

knee sound signal in Fig. 4.<br />

5 Conclusion<br />


A novel method of extracting the IF of a signal is proposed in this paper. The extraction of the IF is based on constructing an adaptive TFD and extracting the IF as the first central moment for each time slice. The method was tested on synthetic signals with known IF, and the results were found to be satisfactory even for low-SNR cases.<br />

Acknowledgment<br />

We would like to acknowledge Micronet and NSERC for providing financial support.<br />

References<br />

[1] J. Carson and T. Fry. Variable frequency electric circuit theory with application to the theory of frequency modulation. Bell System Technical Journal, 16:513-540, 1937.<br />
[2] B. Boashash. Estimating and interpreting the instantaneous frequency of a signal - Part 1: Fundamentals. Proc. IEEE, 80(4):519-538, April 1992.<br />
[3] P. J. Loughlin. Comments on the interpretation of instantaneous frequency. IEEE <strong>Signal</strong> Processing Letters, 4(5):123-125, May 1997.<br />
[4] H. I. Choi and W. J. Williams. Improved time-frequency representation of multicomponent signals using exponential kernels. IEEE Trans. Acoustics, Speech, and <strong>Signal</strong> Processing, 37(6):862-871, 1989.<br />
[5] Z. Guo, L. G. Durand, and H. C. Lee. The time-frequency distributions of nonstationary signals based on a Bessel kernel. IEEE Trans. <strong>Signal</strong> Processing, 42:1700-1707, 1994.<br />
[6] R. G. Baraniuk and D. L. Jones. <strong>Signal</strong>-dependent time-frequency representation: optimal kernel design. IEEE Trans. <strong>Signal</strong> Processing, 41:1589-1602, 1993.<br />
[7] S. G. Mallat and Z. Zhang. Matching pursuit with time-frequency dictionaries. IEEE Trans. <strong>Signal</strong> Processing, 41(12):3397-3415, 1993.<br />
[8] L. Cohen and T. Posch. Positive time-frequency distribution functions. IEEE Trans. Acoustics, Speech, and <strong>Signal</strong> Processing, 33:31-38, 1985.<br />
[9] J. Shore and R. Johnson. Axiomatic derivation of the principle of maximum entropy and the principle of minimum cross-entropy. IEEE Trans. Information Theory, 26(1):26-37, 1980.<br />
[10] J. Shore and R. Johnson. Properties of cross-entropy minimization. IEEE Trans. Information Theory, 27(4):472-482, 1981.<br />
[11] P. Loughlin, J. Pitton, and L. Atlas. Construction of positive time-frequency distributions. IEEE Trans. <strong>Signal</strong> Processing, 42:2697-2705, 1994.<br />
[12] S. Krishnan, R. M. Rangayyan, G. D. Bell, and C. B. Frank. Adaptive time-frequency analysis of knee joint vibroarthrographic signals for non-invasive screening of articular cartilage pathology. IEEE Transactions on Biomedical Engineering, in press, 2000.<br />



Proceedings of the 22nd Annual EMBS International Conference, July 23-28, 2000, Chicago IL.<br />

Sonification of Knee-joint Vibration <strong>Signal</strong>s<br />

Sridhar Krishnan(1), Rangaraj M. Rangayyan(2,3), G. Douglas Bell(2,3,4), and Cyril B. Frank(2,3,4)<br />
(1) Dept. of Electrical and Computer Engineering, <strong>Ryerson</strong> Polytechnic <strong>University</strong>,<br />
Toronto, Ontario, M5B 2K3, CANADA. (Email: krishnan@ee.ryerson.ca)<br />
(2) Dept. of Electrical and Computer Engineering, (3) Dept. of Surgery, (4) Sport Medicine Centre,<br />
<strong>University</strong> of Calgary, Calgary, Alberta, T2N 1N4, CANADA. (Email: ranga@enel.ucalgary.ca)<br />

Abstract: Sounds generated due to rubbing of knee-joint surfaces may be a potential tool for non-invasively assessing articular cartilage degeneration. In this paper, an attempt is made to perform computer-assisted auscultation of knee joints by auditory display (AD) of the vibration signals (also known as vibroarthrographic or VAG signals) emitted during active movement of the leg. The AD technique is based on a sonification algorithm, in which the instantaneous mean frequency and envelope of the VAG signals were used in improving the potential diagnostic quality of VAG signals. Auditory classification experiments were performed by two orthopedic surgeons with a database of 37 VAG signals that includes 19 normal and 18 abnormal cases. Sensitivities of 31% and 83% were obtained with direct playback and the sonification method, respectively.<br />

1 Introduction<br />

Auscultation, the method of examining functions and con-<br />

ditions of physiological systems by listening to the sounds<br />

they produce, is one of the ancient modes of diagnosis.<br />

The first use of vibration or acoustic emission as a diag-<br />

nostic aid for bone and joint disease is found in Laennec's<br />

treatise on mediate auscultation, as cited by Mollan et<br />

al. [l]. Laennec was able to diagnose fractures by aus-<br />

cultating crepitus caused by the moving broken ends of<br />

bone. As quoted by Mollan et al. [l], Heuter, in 1885,<br />

used a myodermato-osteophone (a type of stethoscope) to<br />

localize loose bodies in human knee joints. In 1929, Wal-<br />

ters reported on auscultation of 1600 joints and detected<br />

certain sounds before symptoms become apparent [2]; he<br />

suggested that the sounds might be an early sign of arthri-<br />

tis.<br />

After 1933, most of the works reported on knee-joint sounds have been on objective analysis of the sound or vibration signals (also known as vibroarthrographic or VAG signals) for noninvasive diagnosis of knee-joint pathology<br />

0-7803-6465-1/00/$10.00 02000 IEEE<br />

[3, 4, 5, 6, 7]. Although auscultation of knee joints using stethoscopes is occasionally practised by clinicians, there is no published evidence of their diagnostic value. Also, no study has been reported on computer-aided auscultation of knee-joint sounds. This paper proposes methods for computer-aided auscultation of knee-joint sounds based on an auditory display (AD) technique.<br />

2 Data Acquisition<br />

Each subject sat on a rigid table in a relaxed position with the leg being tested freely suspended in air. The VAG signal was detected at the mid-patella position of the knee by using vibration sensors (accelerometers) as the subject swung the leg over an approximate angle range of 135° → 0° → 135° in 4 s. Informed consent was obtained from every subject. The experimental protocol was approved by the Conjoint Health <strong>Research</strong> Ethics Board of the <strong>University</strong> of Calgary.<br />

The VAG signal was prefiltered (10 Hz to 1 kHz) and amplified before being digitized at a sampling rate of 2 kHz. The details of data acquisition may be found in Krishnan et al. [7]. The database consists of 37 signals (19 normal and 18 abnormal). The abnormal signals were collected from symptomatic patients scheduled to undergo arthroscopy, and there was no restriction imposed on the type of pathology.<br />

3 Sonification<br />

AD may be defined as the aural representation of a stream of data. The field of AD is emerging, and has recently drawn attention in the areas of geophysics, biomedical engineering, speech signal analysis, image analysis, aids for the handicapped, and computer graphics [8]. AD has to be performed in such a manner as to take advantage of the psychoacoustics of the human ear. The AD technique considered in the present work is a sonification technique. In sonification, features extracted from the data are used<br />




to control a sound synthesizer. The sound signal generated does not bear a direct relationship to the data being analyzed. A simple example of a sonification technique is the mapping of parameters derived from a data stream to AD parameters such as pitch and loudness.<br />

4 Motivation for AD of VAG<br />

Prior to graphical recording and analysis of VAG signals, auscultation of knee joints was the only noninvasive method available to distinguish normal joints from degenerative joints. Significant success has been claimed by several researchers using the auscultation technique [1]. However, classification of knee joints by auscultation is a highly subjective technique. Further, a significant portion of VAG signal energy lies below the threshold of auditory perception of the human ear in terms of frequency and/or intensity. The situation may be ameliorated by developing AD methods for computer-aided auscultation of knee-joint vibrations. The main motivating factors in applying AD techniques to VAG are:<br />

- It has been established through objective signal analysis of VAG that sounds generated by abnormal knees are distinctive and different from those produced by normal knees [3, 4, 5, 6, 9]. Sounds of diagnostic value may be made prominent by applying suitable AD techniques to VAG.<br />

- AD of VAG obtained using vibration sensors may facilitate relatively noise-free and localized auditory analysis when compared to direct auditory analysis of acoustic sensor data.<br />

The work described in this paper hypothesizes that auditory analysis of VAG data may aid an orthopedic surgeon in making diagnostic inferences. In the next section, a sonification technique for AD of VAG data is developed. This study is the first attempt to listen to knee sounds detected by vibration sensors.<br />

5 Sonification of VAG <strong>Signal</strong>s<br />

In an effort to facilitate AD of only the important characteristics of VAG signals, a sonification algorithm is proposed. The sonification algorithm involves amplitude modulation (AM) and frequency modulation (FM). The instantaneous mean frequency (IMF) FP(t) is an important parameter in characterizing multicomponent nonstationary signals such as VAG [7]. The IMF of a signal can be extracted from a positive time-frequency distribution (TFD) of the signal [7]. The FM part of the sonified signal is obtained by frequency modulating a sinusoidal waveform with the IMF of the signal. The auditory characteristics of the FM part alone will be tonal, which could quickly<br />

cause boredom and fatigue. To obviate this problem, an AM part a(t) is obtained as the absolute value of the analytic version of the VAG signal. The AM part provides an envelope to the signal and contributes to the frequency deviation (bandwidth) about the IMF.<br />

For the sake of illustration, plots of an abnormal VAG signal with chondromalacia patella (a type of cartilage pathology), and the processed versions of the signal, are presented. Fig. 1 shows the original VAG; the spectrogram (a joint time-frequency representation) of the signal is shown in Fig. 2. The related entities of the sonified versions of the signal are shown in Figs. 3 to 5. The envelope and the IMF of the signal are shown in Figs. 3 and 4, respectively. The spectrogram shown in Fig. 5 clearly illustrates the envelope-IMF behavior of the sonified signal with a time-scale factor of two.<br />

Figure 1: An abnormal VAG signal of a patient with chondromalacia patella.<br />

Figure 2: Spectrogram of the VAG signal in Fig. 1.<br />
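A minimal sketch of the AM/FM synthesis described above (illustrative NumPy; the envelope and IMF below are synthetic stand-ins for the features a(t) and FP(t) extracted from a real VAG signal):

```python
import numpy as np

fs = 2000.0                                   # sampling rate of the VAG data
t = np.arange(0, 2.0, 1.0/fs)

# stand-ins for the two extracted features of a VAG signal:
env = np.abs(np.sin(2*np.pi*0.5*t))           # a(t): envelope (AM part)
imf = 200.0 + 100.0*np.sin(2*np.pi*0.25*t)    # FP(t): inst. mean frequency, Hz

# FM part: integrate FP(t) to obtain a continuous phase, then apply the
# envelope as amplitude modulation
phase = 2*np.pi*np.cumsum(imf)/fs
sonified = env * np.cos(phase)
```

Because the phase is obtained by integrating FP(t), it is continuous by construction and no phase unwrapping is needed, which also makes the synthesis tolerant of noisy IMF estimates.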

Figure 3: Envelope of the VAG signal in Fig. 1.<br />

Figure 4: IMF of the VAG signal in Fig. 1 estimated using its positive TFD.<br />

Figure 5: Spectrogram of the IMF-based sonified version of the VAG signal in Fig. 1. A time-scale factor of 2 was used. Note that the figure window has been divided into two parts to show the time-scale expansion.<br />

The advantages of the IMF-based sonification method are:<br />

- It helps in the auditory analysis of a multicomponent nonstationary signal in terms of its main features, FP(t) and a(t).<br />

- FP(t) takes high values for transients and noise. However, by making use of the envelope (intensity) information, noise can be made less audible than transients.<br />

- Integration of FP(t) ensures a continuous phase, and the method does not require any phase unwrapping.<br />

- Integration of FP(t) makes the method insensitive to noisy FP(t) estimates.<br />

The IMF-based method has the following disadvantages:<br />

- In the case of a noisy signal, FP(t) will have an almost uniform waveform, and does not provide much information unless the envelope can contribute some information. In the present study, this problem is overcome by processing denoised versions of the VAG signals [10].<br />

- The method may not be applicable to information-rich signals such as speech: the formant structure of voiced speech cannot be represented by the relatively simple IMF.<br />

6 Experiments and Results<br />

Auditory analysis of VAG signals was performed by two<br />

orthopedic surgeons (GDB and CBF) with significant ex-<br />

perience in knee-joint auscultation and arthroscopy. The<br />

experiment was conducted in two stages: In the fist stage,<br />

familiarization and training were provided through the re-<br />

sults of application of the IMF-based sonification methods<br />

to a speech signal and four VAG signals (two normals and<br />

two abnormals). In the second stage, the methods were<br />

tested with the database of 37 VAG signals.<br />

From the initial evaluation (first stage), GDB selected<br />

the two-times time-scaled IMF-based sonification method<br />

for the test (second) stage. The purpose of the classifi-<br />

cation experiment in the test stage was to determine the<br />

diagnostic improvement provided by the processed sounds<br />

when compared to direct playback. The test stage in-<br />

cluded auditory classification experiments performed with<br />

the same database three times: twice by GDB with a time<br />

gap of 45 days between the repeat experiments and once<br />

by CBF. The direct playback of VAG signals provided a<br />

- sensitivity of 31% and a specificity of 74%, whereas au-<br />

ral analysis of the sonified signals provided a sensitivity of<br />

83% and a specificity of 32%.<br />

The results suggest that computer-aided auscultation<br />

of VAG signals may be a potential tool for improved diag-<br />

nosis of knee-joint cartilage pathology. The specificity and<br />

sensitivity may be increased with more auditory training.<br />
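For reference, the reported rates follow the usual definitions of sensitivity and specificity. The confusion counts below are hypothetical (the paper does not give them); they were chosen only to illustrate how rates close to the reported sonification figures (83% sensitivity, 32% specificity) can arise from the 18 abnormal and 19 normal cases:

```python
def sensitivity_specificity(tp, fn, tn, fp):
    # sensitivity = detected abnormal / all abnormal (true-positive rate)
    # specificity = detected normal / all normal (true-negative rate)
    return tp / (tp + fn), tn / (tn + fp)

# hypothetical counts for one sonified-playback reading:
# 15 of 18 abnormal and 6 of 19 normal cases correctly labelled
sens, spec = sensitivity_specificity(tp=15, fn=3, tn=6, fp=13)
```

With these illustrative counts, sens is 15/18 (about 83%) and spec is 6/19 (about 32%), matching the order of magnitude of the reported sonification results.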


Acknowledgements<br />

We gratefully acknowledge support from the Alberta Heritage Foundation for Medical <strong>Research</strong> and the Faculty of Engineering, <strong>Ryerson</strong> Polytechnic <strong>University</strong>.<br />

References<br />


[1] R. A. B. Mollan, G. C. McCullagh, and R. I. Wilson. A critical appraisal of auscultation of human joints. Clinical Orthopaedics and Related <strong>Research</strong>, 170:231-237, 1982.<br />
[2] C. F. Walters. The value of joint auscultation. Lancet, 1:920-921, 1929.<br />
[3] M. L. Chu, I. A. Gradisar, and R. Mostardi. A noninvasive electroacoustical evaluation technique of cartilage damage in pathological knee joints. Medical and Biological Engineering and Computing, 16:437-442, 1978.<br />
[4] Y. Nagata. Joint-sounds in gonoarthrosis - clinical application of phonoarthrography for the knees. Journal of UOEH, 10(1):47-58, 1988.<br />
[5] N. P. Reddy, B. M. Rothschild, M. Mandal, V. Gupta, and S. Suryanarayanan. Noninvasive acceleration measurements to characterize knee arthritis and chondromalacia. Annals of Biomedical Engineering, 23:78-84, 1995.<br />
[6] R. M. Rangayyan, S. Krishnan, G. D. Bell, C. B. Frank, and K. O. Ladly. Parametric representation and screening of knee joint vibroarthrographic signals. IEEE Trans. Biomedical Engineering, 44(11):1068-1074, Nov. 1997.<br />
[7] S. Krishnan. Adaptive signal processing techniques for analysis of knee joint vibroarthrographic signals. Ph.D. dissertation, <strong>University</strong> of Calgary, Calgary, AB, Canada, June 1999.<br />
[8] G. Kramer. An introduction to auditory display. In G. Kramer, editor, Auditory Display: Sonification, Audification, and Auditory Interfaces, pages 1-78. Addison Wesley, Reading, MA, 1994.<br />
[9] S. Krishnan, R. M. Rangayyan, G. D. Bell, and C. B. Frank. Adaptive time-frequency analysis of knee joint vibroarthrographic signals for non-invasive screening of articular cartilage pathology. IEEE Transactions on Biomedical Engineering, in press, 2000.<br />
[10] S. Krishnan and R. M. Rangayyan. Automatic denoising of knee joint vibration signals using adaptive time-frequency representations. Medical and Biological Engineering and Computing, in press, 2000.<br />


Proceedings of the 1999 IEEE Canadian Conference on Electrical and Computer Engineering<br />
Shaw Conference Center, Edmonton, Alberta, Canada, May 9-12, 1999<br />

Denoising Knee Joint Vibration <strong>Signal</strong>s Using Adaptive<br />

Time-Frequency Representations<br />

Sridhar Krishnan and Rangaraj M. Rangayyan<br />

Dept. of Electrical and Computer Engineering, University of Calgary,<br />
2500 University Drive NW, Calgary, Alberta T2N 1N4, CANADA.<br />
Email: (krishnan)(ranga)@enel.ucalgary.ca<br />

Abstract - A novel denoising method for improv-<br />

ing the signal-to-noise ratio (SNR) of knee joint vibra-<br />

tion signals (also known as vibroarthrographic or VAG<br />

signals) is proposed. The denoising methods consid-<br />

ered are based on signal decomposition techniques such<br />

as wavelets, wavelet packets, and the matching pur-<br />

suit method. Performance evaluation with synthetic<br />

signals simulated with characteristics expected of VAG<br />

signals indicated good denoising results with the match-<br />

ing pursuit method. Nonstationary signal features ex-<br />

tracted and identified from time-frequency distributions<br />

of denoised VAG signals have shown good potential in<br />

screening for articular cartilage pathology.<br />

Keywords: denoising, time-frequency distributions,<br />

matching pursuit, knee joint sounds, vibroarthrography.<br />

I. INTRODUCTION<br />

Vibration signals sensed using an accelerometer at<br />

the mid-patella position of the knee joint during normal<br />

leg movement could be used to develop a non-invasive<br />

tool for monitoring and screening of articular cartilage<br />

pathology. The knee joint vibration signals are referred<br />

to as vibroarthrographic or VAG signals.<br />

VAG signals have the following important charac-<br />

teristics:<br />

They are nonstationary and multicomponent in na-<br />

ture.<br />

Although the accelerometer placed at the mid-patella position has excellent immunity to background noise, random noise is expected to combine with the VAG signal during leg movement and data acquisition.<br />

There is no underlying model available as yet for<br />

VAG signal generation from which the signal-to-<br />

noise ratio (SNR) could be determined a priori.<br />

In order to analyze VAG signals and to extract<br />

discriminant features, nonstationary and multicompo-<br />
nent signal analysis tools such as time-frequency dis-<br />

tributions (TFDs) could be used. TFDs give the sig-<br />

nal energy distribution at different time instants and<br />

frequencies. The features extracted from a TFD will<br />

contain the combined time-frequency (TF) dynamics of<br />

the given signal as opposed to features along either the<br />

0-7803-5579-2/99/$10.00 © 1999 IEEE<br />

time or the frequency axis alone as provided by con-<br />

ventional techniques. However, TFD features may be<br />

biased due to the presence of random noise. Because<br />
of its random behavior and wide frequency range, a noise<br />

process is localizable neither in time nor in frequency,<br />

and appears all over the TF plane.<br />

Filtering of noise from VAG signals may help in<br />

extracting and identifying significant TF features use-<br />

ful in screening applications. In circumstances where<br />

the SNR of a signal is not known a priori, optimal lin-<br />

ear filtering techniques such as Wiener filtering may<br />

not be the best solution. In such cases, approaches<br />

based on signal decomposition using orthogonal or<br />

non-orthogonal bases may be an interesting alterna-<br />

tive. This paper is a first attempt to automatically<br />

denoise VAG signals using signal decomposition. The<br />

commonly-used denoising methods such as wavelets and<br />

wavelet packets are compared with an adaptive TF de-<br />

composition method such as matching pursuit with a<br />

Gabor dictionary.<br />

In Section II the methodology is described. Section<br />
III presents the results and discussion on the perfor-<br />

mance of the denoising methods studied with synthetic<br />

and real VAG signals. The paper is concluded in Sec-<br />

tion IV with a brief summary.<br />

II. METHODS<br />

The Wiener filter is an optimal filter for removing<br />

Gaussian random noise provided the noise statistics are<br />

available a priori. In real-world situations, signals ac-<br />

quired from an unknown system may have an unknown<br />

SNR. In cases where the SNR is not known a priori, sig-<br />

nal decomposition using an appropriate basis may help<br />

in extracting the coherent structures of a signal with<br />

respect to the basis dictionary. In the following sec-<br />

tions, methods for linear and nonlinear approximation<br />

or decomposition of signals are briefly described.<br />

A. Linear Approximation<br />

In linear approximation, the given signal is projected over M orthogonal basis vectors that are chosen a priori. Linear approximation of a discrete signal x(n) may be written as<br />
x̂(n) = Σ_{m=0}^{M-1} ⟨x, g_m⟩ g_m(n),   (1)<br />
where ⟨x, g_m⟩ denotes the inner product of x(n) with the orthogonal basis vectors g_m that are selected a priori. It has been shown that an optimal linear approximation is provided by the Karhunen-Loève basis [1]. The approximation may be improved by choosing the M orthogonal basis vectors depending on the properties of the given signal rather than selecting them beforehand. The selection of signal-adaptive basis functions leads to the concept of nonlinear decomposition.<br />
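The projection in Eq. 1 can be sketched in a few lines of Python. The snippet below is illustrative only and is not from the paper: it uses an orthonormal DCT-II basis as one concrete choice of basis vectors g_m fixed a priori, projects a signal onto the first M of them, and measures the squared approximation error.

```python
import math

def dct_basis(N):
    """Orthonormal DCT-II basis vectors g_m, m = 0..N-1."""
    basis = []
    for m in range(N):
        scale = math.sqrt(1.0 / N) if m == 0 else math.sqrt(2.0 / N)
        basis.append([scale * math.cos(math.pi * (n + 0.5) * m / N)
                      for n in range(N)])
    return basis

def linear_approx(x, basis, M):
    """Project x onto the first M basis vectors chosen a priori (cf. Eq. 1)."""
    xhat = [0.0] * len(x)
    for m in range(M):
        g = basis[m]
        coeff = sum(xi * gi for xi, gi in zip(x, g))  # inner product <x, g_m>
        for n in range(len(x)):
            xhat[n] += coeff * g[n]
    return xhat

N = 64
B = dct_basis(N)
# a signal built from basis vectors 2 and 5, so the error is predictable
x = [3.0 * B[2][n] + 2.0 * B[5][n] for n in range(N)]
err8 = sum((a - b) ** 2 for a, b in zip(x, linear_approx(x, B, 8)))
err4 = sum((a - b) ** 2 for a, b in zip(x, linear_approx(x, B, 4)))
```

Because the basis is orthonormal, the error of the M-term approximation is exactly the energy of the discarded coefficients, which makes the linear scheme easy to analyze but blind to where a particular signal concentrates its energy.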

B. Nonlinear Approximation<br />

In the case of nonlinear approximation, the given signal is approximated with M vectors selected adaptively. The nonlinear decomposition of a signal x(n) may be written as<br />
x̂(n) = Σ_{m∈I_M} ⟨x, g_m⟩ g_m(n),   (2)<br />
where I_M denotes a group of basis functions from a dictionary that provides the first M inner product values ⟨x, g_m⟩ arranged in decreasing order. The M vectors in I_M are the basis vectors that correlate best with x(n), and may be interpreted as the main features of x(n). One such possible approximation is the wavelet transform, where the basis vectors are obtained by dilating and translating a prototype function (also known as a wavelet), given by<br />
ψ_{s,u}(t) = (1/√s) ψ((t-u)/s),   (3)<br />
where s denotes the dilation parameter and u is the translation parameter. Nonlinear decomposition based on wavelets outperforms linear decomposition because the former is equivalent to the construction of an irregular sampling grid adapted to the local sharpness of the signal variations. Efficient denoising may be performed using wavelets by approximating the signal with a small number of non-zero wavelet coefficients; thresholding of the wavelet coefficients may be hard or soft [2].<br />
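As a rough sketch of denoising by wavelet-coefficient thresholding, the following fragment is illustrative only: it uses the simple Haar wavelet and the common median-based universal threshold, not the wavelet or threshold choices made later in this paper.

```python
import math, random

def haar_dwt(x):
    """One level of the Haar wavelet transform: approximation and detail."""
    a = [(x[2*i] + x[2*i+1]) / math.sqrt(2) for i in range(len(x) // 2)]
    d = [(x[2*i] - x[2*i+1]) / math.sqrt(2) for i in range(len(x) // 2)]
    return a, d

def haar_idwt(a, d):
    """Invert one level of the Haar transform."""
    x = []
    for ai, di in zip(a, d):
        x.append((ai + di) / math.sqrt(2))
        x.append((ai - di) / math.sqrt(2))
    return x

def soft(coeffs, t):
    """Soft threshold: shrink every coefficient toward zero by t."""
    return [math.copysign(max(abs(v) - t, 0.0), v) for v in coeffs]

def denoise(x, levels=4):
    """Decompose, soft-threshold the detail coefficients, reconstruct."""
    a, details = list(x), []
    for _ in range(levels):
        a, d = haar_dwt(a)
        details.append(d)
    # noise level from the median of the finest details; universal threshold
    sigma = sorted(abs(v) for v in details[0])[len(details[0]) // 2] / 0.6745
    t = sigma * math.sqrt(2.0 * math.log(len(x)))
    details = [soft(d, t) for d in details]
    for d in reversed(details):
        a = haar_idwt(a, d)
    return a

random.seed(7)
N = 1024
clean = [math.sin(2 * math.pi * 4 * n / N) for n in range(N)]
noisy = [c + random.gauss(0.0, 0.3) for c in clean]
den = denoise(noisy)
mse_noisy = sum((a - b) ** 2 for a, b in zip(noisy, clean)) / N
mse_den = sum((a - b) ** 2 for a, b in zip(den, clean)) / N
```

A hard threshold would keep surviving coefficients unchanged instead of shrinking them; both variants approximate the signal with a small number of non-zero coefficients, as described above.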

To further optimize nonlinear signal approxima-<br />

tion, one could adaptively choose the basis depending<br />

upon the given signal. This approach of selecting the<br />

“best” basis among a dictionary of bases by minimiz-<br />

ing a cost function or entropy is known as the method<br />

of wavelet packets (WP) [3]. The WP approach uses<br />

a large family of orthogonal bases that include differ-<br />

ent types of local TF functions (also known as TF<br />

atoms). The bases are computed using a quadrature<br />


mirror filter-bank algorithm. WP decomposes the sig-<br />

nal into TF atoms that are adapted to the TF structures<br />

present in the signal. A denoised version of a signal may<br />

be obtained by soft thresholding or hard thresholding<br />

the WP coefficients.<br />

Another way to optimize a TF decomposition is<br />

by using non-orthogonal basis functions. An example<br />

of such a decomposition is the matching pursuit (MP)<br />

algorithm [4]. In this case, the non-orthogonal basis<br />

functions are Gaussian functions with good time and<br />

frequency localization characteristics. In MP, the signal<br />

is first projected onto the dictionary, and the Gabor<br />

TF atom with the highest correlation with the signal<br />

is selected. The residue of the signal is then projected<br />

onto the dictionary, and the component with the highest<br />

correlation is selected. The decay parameter, denoted by λ(m) and given by<br />
λ(m) = ‖R^m x‖² / ‖x‖²,   (4)<br />
may be used as the stopping criterion of the decomposition process. In Eq. 4, ‖R^m x‖² denotes the residual energy level at the mth iteration. The decomposition<br />

is continued until the decay parameter does not reduce<br />

any further. At this stage, the selected components represent coherent structures and the residue represents<br />

incoherent structures in the signal with respect to the<br />

dictionary. The residue may be assumed to be due to<br />

random noise, since it does not show any TF localiza-<br />

tion.<br />
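A minimal sketch of the matching pursuit iteration with a Gabor-style dictionary and an energy-decay stopping rule follows. The dictionary grid, atom count, and tolerance here are illustrative assumptions, not the settings used in the paper.

```python
import math

def gabor_atom(N, s, u, f):
    """Unit-norm real Gabor atom: a Gaussian window of width s at
    position u, modulated to normalized frequency f."""
    g = [math.exp(-math.pi * ((n - u) / s) ** 2) *
         math.cos(2.0 * math.pi * f * (n - u)) for n in range(N)]
    norm = math.sqrt(sum(v * v for v in g))
    return [v / norm for v in g]

def matching_pursuit(x, dictionary, max_atoms=10, decay_tol=1e-3):
    """Greedy MP: subtract the best-correlated atom from the residue at
    each step; stop when the residual energy stops decaying."""
    residue = list(x)
    approx = [0.0] * len(x)
    prev_e = sum(v * v for v in residue)
    for _ in range(max_atoms):
        best_c, best_g = 0.0, None
        for g in dictionary:                       # project residue on atoms
            c = sum(r * gi for r, gi in zip(residue, g))
            if abs(c) > abs(best_c):
                best_c, best_g = c, g
        for n in range(len(x)):
            approx[n] += best_c * best_g[n]
            residue[n] -= best_c * best_g[n]
        e = sum(v * v for v in residue)
        if prev_e - e < decay_tol * prev_e:        # decay has flattened out
            break
        prev_e = e
    return approx, residue

N = 256
dictionary = [gabor_atom(N, s, u, k / 30.0)
              for s in (8, 32, 128)
              for u in range(0, N, 16)
              for k in range(16)]
# two well-separated Gabor components; MP should recover both
x = [2.0 * a + 1.5 * b for a, b in
     zip(gabor_atom(N, 32, 64, 0.1), gabor_atom(N, 32, 192, 0.3))]
approx, residue = matching_pursuit(x, dictionary)
res_energy = sum(v * v for v in residue)
```

On this toy signal, built from two atoms of the dictionary itself, the greedy loop recovers both components in two passes and leaves an essentially energy-free residue, mirroring the coherent/incoherent split described above.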

III. RESULTS AND DISCUSSION<br />

Before denoising VAG signals for feature extrac-<br />

tion, the best available denoising method was selected<br />

on the basis of performance with synthetic signals sim-<br />

ulated with characteristics similar to those expected of<br />

VAG signals. The synthetic signal used for illustra-<br />

tion in the present paper includes a linear frequency-<br />

modulated (FM) componen t, a nonlinear FM compo-<br />

nent, and a transient. The synthetic signal possesses<br />

the multicomponent and nonstationary characteristics<br />

typical of VAG signals. The reason to use FM components in synthetic signals in the present study is that dominant pole analysis of VAG signals has indicated time-varying frequency characteristics [5]. Transients<br />

may depict joint clicks produced during movement of<br />

the knee. Random noise at different levels was added<br />

to the synthetic signal to simulate good and poor signal<br />

recording conditions.<br />

To evaluate the performance of the denoising methods chosen for the present study, the normalized root mean squared (NRMS) error measure was used. NRMS is given by<br />
NRMS = [ Σ_{n=1}^{N} (s(n) - d(n))² / Σ_{n=1}^{N} s²(n) ]^{1/2},   (5)<br />


Fig. 1. Multicomponent, nonstationary, synthetic signal composed with a linear FM component, a nonlinear FM component, and a transient.<br />

Fig. 2. The synthetic signal in Fig. 1 with noise added (SNR = 0 dB).<br />

where s(n) is the original signal without noise, d(n) is<br />

the denoised signal, and N is the number of samples<br />

in the signal. A small NRMS measure indicates good<br />

denoising performance.<br />
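In code, the measure reads as follows. The exact normalization in the printed equation was lost in scanning, so this sketch assumes the common form in which the residual energy is divided by the energy of the noise-free signal s(n):

```python
import math

def nrms(s, d):
    """NRMS error: energy of the residual s(n) - d(n), normalized by the
    energy of the noise-free signal s(n), then square-rooted."""
    num = sum((si - di) ** 2 for si, di in zip(s, d))
    den = sum(si ** 2 for si in s)
    return math.sqrt(num / den)

s = [1.0, 2.0, 3.0, 4.0]
perfect = nrms(s, s)            # perfect denoising gives zero error
all_zero = nrms(s, [0.0] * 4)   # losing the whole signal gives an error of one
```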

A. Results with Synthetic <strong>Signal</strong>s<br />

The denoising methods were applied to the syn-<br />

thetic signal with two levels of Gaussian random noise<br />

added. The noise levels were such that the resulting<br />

signals had an SNR of 10 dB and 0 dB. The symmlet 4<br />

wavelet [1] was used for wavelet-based denoising. A soft<br />


Fig. 3. Wavelet-based denoised version of the noisy signal in<br />

Fig. 2.<br />


Fig. 4. Wavelet packet-based denoised version of the noisy signal<br />

in Fig. 2.<br />


Fig. 5. Matching pursuit-based denoised version of the noisy<br />

signal in Fig. 2.<br />




Fig. 6. Comparison of the NRMS error values of the denoised<br />

versions of the synthetic signal with SNR = 10 dB.<br />

threshold was applied to the wavelet coefficients; coef-<br />

ficients that did not pass the soft threshold test were<br />

discarded. In case of the WP method, the “best” basis<br />

was selected on the basis of the Schur concavity cost<br />

function [3], and the denoised version was obtained by<br />

soft thresholding the WP coefficients. Gaussian func-<br />

tions were used for the MP method. Gaussian functions<br />

provide the optimal TF resolution and satisfy the equality condition of the uncertainty principle. The threshold<br />

for denoising was based on the decay parameter as given<br />

by Eq. 4.<br />

Fig. 1 shows the original synthetic signal, and<br />

Fig. 2 shows the signal with noise added to an SNR<br />

of 0 dB. The denoised versions of the signals using the<br />

wavelets method, the WP method, and the MP method<br />

are shown in Figs. 3, 4, and 5, respectively. Visual com-<br />

parison indicates that the MP-denoised result has pre-<br />

served most of the important characteristics, especially<br />

the transient component.<br />

Fig. 6 shows a bar graph comparing the NRMS<br />

values of the results of the three denoising methods ap-<br />

plied to the synthetic signal with an SNR of 10 dB. From<br />
Fig. 6 it is evident that adaptive denoising using<br />

MP provides good denoising for a moderately high SNR<br />

case. The case of the low SNR of 0 dB was simulated<br />

to depict very poor signal recording conditions (not ex-<br />

pected in VAG signals). From the bar chart in Fig. 7 we<br />

can deduce that the MP technique has again provided<br />

the best denoising result (lowest NRMS value) of the<br />

three methods studied.<br />

It is worthwhile to mention that the denoising re-<br />

sults with wavelets and WP are highly dependent on<br />

the selection of the threshold value for the coefficients.<br />

In the case of MP, the decay parameter is a more ob-<br />

jective measure. Fig. 8 shows the reduction of the de-<br />

cay parameter with the number of TF atoms used for<br />



Fig. 7. Comparison of the NRMS error values of the denoised<br />
versions of the synthetic signal with SNR = 0 dB.<br />


Fig. 8. Plot of the decay parameter versus the number of TF<br />

atoms for the synthetic signal with SNR = 10 dB and SNR<br />

= 0 dB.<br />

the synthetic signal with SNR = 10 dB and SNR = 0<br />

dB. It is clearly evident that, in denoising the signal<br />

with SNR = 0 dB, the MP method has been able to<br />

extract fewer coherent structures as compared to the<br />

10 dB case. This result indicates that the higher level<br />

of noise has destroyed some of the low-energy coherent<br />

structures in the 0 dB version of the signal.<br />

The WP method may give better results if the<br />

threshold is selected in an optimal manner. The per-<br />

formance of the WP method for denoising cannot be<br />

appreciated much in the present application, since for<br />

highly nonstationary signals such as the synthetic sig-<br />

nals shown, the WP method produces a mismatch be-<br />

tween the “best” orthogonal basis and many local signal<br />

components. On the contrary, MP is a “greedy” algo-<br />

rithm that locally optimizes the choice of the wavelet<br />

packet function for the signal residue at each stage. The<br />




Fig. 9. Abnormal VAG signal of a subject with cartilage pathol-<br />

ogy.<br />


Fig. 10. MP-denoised version of the VAG signal in Fig. 9.<br />

Fig. 11. Difference between the original VAG signal in Fig. 9 and<br />

the MP-denoised version in Fig. 10.<br />


Fig. 12. TFD of the abnormal VAG signal in Fig. 9 computed<br />

using the spectrogram.<br />


Fig. 13. TFD of the MP-denoised VAG signal in Fig. 10 com-<br />

puted using the spectrogram.<br />

good optimization property of MP is achieved at the ex-<br />

pense of increased computational load as a result of the<br />

greedy approach. Also, in the case of a multicompo-<br />

nent signal where different types of energy structures<br />

are located at different times but in the same frequency<br />

interval, there is no WP basis that is well adapted to all<br />

of them. WP-based decomposition using an orthogonal<br />

basis lacks translation invariance and is thus difficult<br />

to use for pattern recognition. MP is a translation-<br />

invariant method if a translation-invariant dictionary<br />

such as a Gabor dictionary is used. Based on these ob-<br />

servations, the MP technique was selected for denoising<br />

VAG signals.<br />

B. Results with VAG signals<br />

The MP technique was applied to 90 VAG signals<br />

(51 normal and 39 abnormal). TFDs were constructed<br />



using the denoised signals. As an illustration, an abnormal<br />

VAG signal of a subject with cartilage pathology<br />

is shown in Fig. 9, its MP-denoised version is shown in<br />

Fig. 10, and the difference between the original and the<br />

denoised versions of the abnormal VAG signal is shown<br />

in Fig. 11. From Fig. 11, we could observe that a signif-<br />

icant amount of random noise has been removed from<br />

the original signal by the MP-denoising method.<br />

The TFD computed using the spectrogram of the<br />

original signal in Fig. 9 is shown in Fig. 12. The TFD of<br />

the MP-denoised version of the same VAG signal com-<br />

puted using the spectrogram is shown in Fig. 13. The<br />

spectrograms of the original and the denoised VAG sig-<br />

nals were computed using the same short-time Fourier<br />

transform parameters. Tonal and FM components are<br />

clearly seen in the TFD of the denoised VAG signal,<br />

thus facilitating enhanced feature identification. In-<br />

stantaneous features based on energy and frequency pa-<br />

rameters were computed as marginal values of the TFDs<br />

[6] of the MP-denoised VAG signals; pattern analysis of<br />

the features indicated screening accuracy of up to 70%.<br />
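A spectrogram of this kind is simply a windowed DFT, and marginal-style features reduce each frame or frequency bin to a single number. The sketch below is illustrative only (window length, hop, and the chirp test signal are assumptions, not the STFT parameters used for Figs. 12 and 13); it tracks the dominant frequency bin per frame of a linear up-chirp.

```python
import cmath, math

def spectrogram(x, win_len=128, hop=64):
    """Squared-magnitude short-time DFT: one row per time frame."""
    win = [0.5 - 0.5 * math.cos(2 * math.pi * n / (win_len - 1))
           for n in range(win_len)]              # Hann window
    frames = []
    for start in range(0, len(x) - win_len + 1, hop):
        seg = [x[start + n] * win[n] for n in range(win_len)]
        row = []
        for k in range(win_len // 2 + 1):        # one-sided spectrum
            z = sum(seg[n] * cmath.exp(-2j * math.pi * k * n / win_len)
                    for n in range(win_len))
            row.append(abs(z) ** 2)
        frames.append(row)
    return frames

# linear up-chirp: instantaneous frequency rises from 0.05 to 0.25
N = 2048
f0, f1 = 0.05, 0.25
x = [math.cos(2 * math.pi * (f0 * n + (f1 - f0) * n * n / (2 * N)))
     for n in range(N)]
S = spectrogram(x)
inst_energy = [sum(row) for row in S]            # time marginal per frame
peaks = [row.index(max(row)) for row in S]       # dominant bin per frame
```

The rising sequence of peak bins is a crude instantaneous-frequency feature in the spirit of the marginal-based parameters mentioned above.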

IV. CONCLUSION<br />

A novel approach to denoise VAG signals for en-<br />

hanced feature extraction and identification was pro-<br />

posed. The denoising methods considered were based<br />

on nonlinear decomposition of signals. The MP method<br />

of denoising is more promising for application to non-<br />

stationary signals such as VAG than the commonly-<br />

used wavelet-based denoising and WP-based denoising<br />

techniques. The wavelet techniques are best adapted<br />

to global signal properties, whereas the MP method is<br />

based on local optimization. Nonstationary signal fea-<br />

tures extracted from the TFDs of MP-denoised VAG<br />

signals have shown good potential for screening normal<br />

knees from abnormal knees.<br />

Acknowledgements: We gratefully acknowledge support<br />

from the Alberta Heritage Foundation for Medical Re-<br />

search and the Natural Sciences and Engineering Re-<br />

search Council of Canada.<br />

REFERENCES<br />

[1] S. Mallat. A wavelet tour of signal processing. Academic Press, San Diego, CA, 1998.<br />
[2] D. Donoho. Unconditional bases are optimal bases for data compression and for statistical estimation. Journal of Appl. and Comput. Harmonic Analysis, 1:100-115, 1993.<br />
[3] M. V. Wickerhauser. Adapted wavelet analysis from theory to software. IEEE Press, Piscataway, NJ, 1994.<br />
[4] S. G. Mallat and Z. Zhang. Matching pursuits with time-frequency dictionaries. IEEE Trans. on Signal Processing, 41(12):3397-3415, 1993.<br />
[5] R. M. Rangayyan, S. Krishnan, G. D. Bell, C. B. Frank, and K. O. Ladly. Parametric representation and screening of knee joint vibroarthrographic signals. IEEE Trans. on Biomedical Engineering, 44(11):1068-1074, Nov. 1997.<br />
[6] S. Krishnan, R. M. Rangayyan, G. D. Bell, C. B. Frank, and K. O. Ladly. Time-frequency signal feature extraction and screening of knee joint vibroarthrographic signals using the matching pursuit method. CD-ROM Proceedings, 19th Annual International Conference of the IEEE Engineering in Medicine and Biology Society, Chicago, IL, October 1997.<br />



Comparative Analysis of the Performance of the Time-Frequency<br />
Distributions with Knee Joint Vibroarthrographic Signals<br />

Rangaraj M. Rangayyan and Sridhar Krishnan<br />

Dept. of Electrical and Computer Engineering, University of Calgary,<br />
2500 University Drive NW, Calgary, Alberta T2N 1N4, CANADA.<br />
Email: (ranga)(krishnan)@enel.ucalgary.ca<br />

Abstract - Vibroarthrographic (VAG) signals emitted by human knee joints can be used to develop a non-invasive diagnostic tool to detect articular cartilage degeneration. VAG signals are nonstationary and multicomponent in nature; time-frequency distributions (TFDs) provide powerful means to analyze such signals. The objective of this paper is to determine the TFD suitable for identification and extraction of VAG signal features of clinical significance. The TFDs considered are: the autoregressive (AR) model-based TFD; the reassigned smoothed pseudo-Wigner-Ville (RSPWV) distribution; and a TFD based on signal decomposition using the matching pursuit (MP) algorithm. As the true TFD of a VAG signal is not known, the results of the TFDs were compared based on the expected characteristics using synthetic signals. The MP TFD shows good potential in analyzing multicomponent signals with low signal-to-noise ratio when compared to the AR model-based TFD and the RSPWV method. The TFD techniques were also tested on VAG signals with additional information provided by auscultation and arthroscopy. The results indicate that the MP TFD is the best available TFD to analyze VAG signals.<br />

I. INTRODUCTION<br />

Vibroarthrography (VAG), the recording of human knee joint vibration/acoustic signals during active movement of the leg, can be used as a non-invasive diagnostic tool to detect articular cartilage degeneration. The currently used "gold standard" for assessment of cartilage surface degeneration is arthroscopy, where the cartilage surface is inspected and palpated with a cable. The disadvantage with arthroscopy is that it cannot be applied to patients whose knees are in a highly degenerated state due to osteoarthritis, ligamentous instability, meniscectomy, or patellectomy. The drawbacks with arthroscopy and the limitations of imaging techniques have motivated researchers to look for tools such as VAG. In our work, the VAG signal is detected at the mid-patella position on the surface of the knee as the leg is swung over the angle range of 135° → 0° → 135° in a time period of 4 s. The signals are filtered to the range 10 Hz to 1 kHz and amplified before sampling at a rate of 2 kHz. The cartilage surfaces of a normal knee are smooth and slippery. Vibrations generated due to friction between articulating surfaces of degenerated cartilage are expected to be different in amplitude and frequency from those of normal knees. The important characteristics of VAG signals are listed below.<br />

0-7803-5073-1/98/$10.00 © 1998 IEEE<br />


The VAG signal is expected to be a multicomponent<br />

signal due to the possibility that during movement of<br />

the knee, the rubbing of the femoral condyle on the<br />

patella surface provides multiple sources of vibration,<br />

and also due to the possibility that the signal from a<br />

single source can propagate through different channels<br />

of tissue to the mid-patella position, thus giving rise to<br />

multiple energy components at different frequencies for<br />

a given time.<br />

VAG signals are nonstationary due to the fact that the<br />

quality of joint surfaces coming into contact may not be<br />

the same from one angular position (point of time) to<br />

another during articulation of the joint.<br />

Due to the differences in cartilage structures in normal<br />

and abnormal knees, VAG signals with different<br />

frequency law components may be generated. Identification<br />

of such frequency dynamics may help in classification<br />

of normal and abnormal knees.<br />

Our previous approaches tackled the nonstationarity of VAG<br />

signals by adaptively segmenting the signals into stationary<br />

components; each segment was parametrically represented<br />

using a separate set of autoregressive coefficients,<br />

dominant poles, or cepstral coefficients. Dominant poles<br />

(poles corresponding to dominant spectral peaks in the signal)<br />

of each segment have provided good discriminant information<br />

for classifying signals into normal and abnormal<br />

groups [1], validating the assumption that the frequency dynamics<br />

of normal VAG signals differ from those of abnormal<br />

signals. A major drawback of the segmentation-based technique<br />

lies in associating the clinical information obtained<br />

during arthroscopy or auscultation with the segments of a<br />

signal. This problem can be overcome by using nonstationary<br />

signal analysis tools such as time-frequency distributions<br />

(TFDs). TFDs reveal frequency and temporal information<br />

simultaneously, and are particularly attractive for analysis<br />

of multicomponent signals, depiction of frequency laws, and<br />

noise suppression. The purpose of this work is to identify<br />

the best available TFD for objective identification and extraction<br />

of TF structures in VAG signals.<br />

II. TIME-FREQUENCY DISTRIBUTIONS<br />

The right TFD would be one t


signals are: 1) model-based TFD, 2) Cohen's class TFDs,<br />

and 3) TFD based on decomposition of signals.<br />

A. Autoregressive Model-based TFD<br />

In the model-based TFD, the autoregressive (AR) model<br />

coefficients of t,he signal segments are used in estimating the<br />

power spectral density of each segment. In our work, the<br />

model coefficients were estimated using the Burg method.<br />

Fixed segment length was used for synthetic signals, and<br />

adaptive segment length was used for real VAG signals.<br />

B. Reassigned Smoothed Pseudo Wigner- Ville Distribution<br />

The Wigner-Ville distribution (WVD) is the most pop-<br />

ular TFD of Cohen's class. The main drawback with the<br />

WVD is that, in the case of multicomponent signals, cross-<br />

terms are generated in the TFD. Cross-terms can be min-<br />

imized by using two-dimensional low-pass filtering in the<br />

ambiguity domain, and the smoothed version of the WVD<br />

can be obtained. In this paper, the most commonly used<br />

smoothed version of the WVD, namely the smoothed pseudo<br />

Wigner-Ville distribution (SPWVD), is considered. The SP-<br />

WVD reduces cross-terms significantly. The extent of reduc-<br />

tion in cross-terms depends upon the type of signal being an-<br />

alyzed. In our applications with synthetic and VAG signals,<br />

the smoothing windows used are Gaussian functions.<br />

The smoothing windows suppress cross-terms in the<br />

WVD but smear localized components, leading to less ac-<br />

curate TF localization of signal components as compared to<br />

the WVD. Recently, a reassignment method has been pro-<br />

posed by Auger and Flandrin [2] to improve TF localization<br />

in smoothed TFDs such as SPWVDs.<br />

In the reassignment method, the window is moved from<br />

the geometric center (t, ω) to the energy center (t̂, ω̂) of the<br />

TFD. The reassigned SPWVD (RSPWVD) improves the TF<br />

localization of smeared components and provides good read-<br />

ability in the TFD.<br />

C. Matching Pursuit<br />

The TFD generated by the matching pursuit (MP)<br />

method is based on signal decomposition. The MP algo-<br />

rithm selects the decomposition vectors depending upon the<br />

signal properties. The vectors are selected from a family of<br />

waveforms called a dictionary. The signal x(t) is projected<br />
on to a dictionary of Gabor atoms obtained by scaling, trans-<br />
lating, and modulating a Gaussian window function g(t):<br />
$x(t) = \sum_{n=0}^{\infty} a_n \, g_{\gamma_n}(t), \qquad g_{\gamma_n}(t) = \frac{1}{\sqrt{s_n}} \, g\!\left(\frac{t - p_n}{s_n}\right) \exp\left[j(2\pi f_n t + \phi_n)\right],$<br />
and $a_n$ are the expansion coefficients. The scale factor $s_n$<br />
is used to control the width of the window function, and<br />
the parameter $p_n$ controls temporal placement. $1/\sqrt{s_n}$ is a<br />
normalizing factor which keeps the norm of $g_{\gamma_n}(t)$ equal to 1.<br />
The parameters $f_n$ and $\phi_n$ are the frequency and phase of the<br />
exponential function, respectively. In our application, the<br />
window is a Gaussian function, i.e., $g(t) = 2^{1/4}\exp(-\pi t^2)$;<br />
the TF atoms are then called Gabor functions.<br />

In practice, the algorithm works as follows. The signal<br />
is iteratively projected on to a Gabor function dictionary.<br />
The first projection decomposes the signal into two parts:<br />
$x(t) = \langle x, g_{\gamma_0} \rangle \, g_{\gamma_0}(t) + R^1 x(t),$<br />
where $\langle x, g_{\gamma_0} \rangle$ denotes the inner product (projection) of x(t)<br />
with the first TF atom $g_{\gamma_0}(t)$. The term $R^1 x(t)$ is the residue<br />
after approximating x(t) in the direction of $g_{\gamma_0}(t)$. This process<br />
is continued by projecting the residue on to the subsequent<br />
functions in the dictionary, and after M iterations<br />
$x(t) = \sum_{n=0}^{M-1} \langle R^n x, g_{\gamma_n} \rangle \, g_{\gamma_n}(t) + R^M x(t),$<br />
with $R^0 x(t) = x(t)$. There are two ways of stopping the iterative<br />
process: one is to use a pre-specified limiting number<br />

M of the TF atoms, and the other is to check the energy of<br />

the residue $R^M x(t)$. A very high value of M and a zero value<br />
for the residual energy will decompose the signal completely<br />
at the expense of increased computational complexity.<br />

In this work, M was limited to 1000 atoms and the resid-<br />

ual energy limit was set to be 0.5% of the total energy. For<br />

VAG signals, the maximum octave length given by $\log_2 N$<br />
(where N is the number of samples) was set to 11 due to the<br />
nonstationary nature of the signal. Also, in MP analysis,<br />
only coherent structures [3] of the signals can be extracted:<br />
the residual components that do not have a high correlation<br />
with the vectors in the dictionary are rejected.<br />
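The MP iteration described above can be sketched as follows; the greedy selection and the residual-energy stopping rule follow the text, while the Gabor dictionary grid and all constants in the usage below are illustrative choices of ours:<br />

```python
import numpy as np

def gabor_atom(N, s, p, f):
    """Unit-norm real Gabor atom: Gaussian of scale s at position p, frequency f."""
    n = np.arange(N)
    g = np.exp(-np.pi * ((n - p) / s) ** 2) * np.cos(2 * np.pi * f * (n - p))
    return g / np.linalg.norm(g)

def matching_pursuit(x, D, max_atoms=1000, energy_frac=0.005):
    """Greedy MP: repeatedly subtract the best-matching dictionary row (atom)."""
    residue = np.asarray(x, dtype=float).copy()
    total = np.dot(residue, residue)
    picks = []
    for _ in range(max_atoms):
        proj = D @ residue                    # inner products with every atom
        i = int(np.argmax(np.abs(proj)))
        picks.append((i, proj[i]))            # (atom index, expansion coefficient)
        residue = residue - proj[i] * D[i]
        if np.dot(residue, residue) <= energy_frac * total:
            break
    return picks, residue
```

With a dictionary built over a coarse grid of scales, positions, and frequencies, a signal composed of dictionary atoms is recovered in a few iterations, and the residue carries only what the dictionary cannot represent coherently.<br />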

The Wigner distribution of x(t) based on the TF atoms<br />
is given as [3]:<br />
$W x(t,\omega) = \sum_{n=0}^{M-1} |\langle R^n x, g_{\gamma_n} \rangle|^2 \, W g_{\gamma_n}(t,\omega) + \sum_{n=0}^{M-1} \sum_{\substack{m=0 \\ m \neq n}}^{M-1} \langle R^n x, g_{\gamma_n} \rangle \langle R^m x, g_{\gamma_m} \rangle^{*} \, W[g_{\gamma_n}, g_{\gamma_m}](t,\omega),$<br />
where $W g_{\gamma_n}(t,\omega)$ is the Wigner transform of the Gaussian<br />
window function. The double sum corresponds to<br />
the cross-terms of the Wigner distribution, indicated by<br />
$W[g_{\gamma_n}, g_{\gamma_m}](t,\omega)$, and should be rejected in order to obtain<br />
a cross-term-free energy distribution of x(t) in the TF plane.<br />
Thus only the first term is computed, and the resulting TFD<br />
is given by<br />
$W'(t,\omega) = \sum_{n=0}^{M-1} |\langle R^n x, g_{\gamma_n} \rangle|^2 \, W g_{\gamma_n}(t,\omega).$<br />
The cross-term-free TFD $W'(t,\omega)$ has very good readability<br />

and is appropriate for multicomponent signal analysis. The<br />

extraction of coherent structures makes MP an attractive<br />

tool for TF representation of signals with unknown SNR.
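Since the Wigner transform of a Gaussian atom is itself a 2-D Gaussian centred at the atom's time-frequency location, the cross-term-free TFD $W'(t,\omega)$ can be rendered directly from the decomposition parameters. A sketch, where the grid resolution and the closed-form Gaussian spread are our own choices:<br />

```python
import numpy as np

def mp_tfd(atoms, N, n_freq=128):
    """Cross-term-free TFD: sum of Wigner transforms of Gaussian (Gabor) atoms.

    atoms: iterable of (coeff, s, p, f) -- expansion coefficient, scale,
    position (samples), and normalized frequency (cycles/sample)."""
    t = np.arange(N)[None, :]                        # time axis (columns)
    fgrid = np.linspace(0.0, 0.5, n_freq)[:, None]   # frequency axis (rows)
    W = np.zeros((n_freq, N))
    for a, s, p, f in atoms:
        # Wigner transform of a scaled Gaussian atom: 2-D Gaussian at (p, f),
        # wide in time for large s and correspondingly narrow in frequency
        W += a ** 2 * np.exp(-2 * np.pi * ((t - p) / s) ** 2
                             - 2 * np.pi * (s * (fgrid - f)) ** 2)
    return W
```

Because the terms are squared magnitudes of Gaussians, the resulting distribution is nonnegative everywhere, which is what makes it usable for the marginal-based feature extraction discussed later.<br />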


Fig. 1. (a) VAG signal of a normal subject. Grinding sound was heard during auscultation at an angle range of 50° → 0° (3400 to 4000 samples) during extension of the knee. au: acceleration units.<br />

Fig. 2. TFDs of the signal in Figure 1. (a) AR model-based TFD. X’s<br />

denote dominant poles. (b) RSPWV distribution. (c) MP TFD.<br />


III. RESULTS<br />

A. Results with Synthetic Signals<br />

Before applying the TFDs to real VAG signals, the<br />

TFDs were evaluated with synthetic signals. For exam-<br />

ple, one of the synthetic signals “syn” was composed with<br />

overlapping chirp, impulse, and sinusoidal FM components.<br />

The signal “syn” is a nonstationary signal since its spectrum<br />

varies with time. Transients such as an impulse represent<br />

clicks heard during knee movement. Chirp and sinusoidal<br />

FM components are examples of linear and nonlinear fre-<br />

quency dynamics; their physiological relevance needs to be<br />

studied. As VAG classification experiments based on dom-<br />

inant poles (adaptive pole or spectral peak tracking) have<br />

provided very good accuracy [1], we believe it is worthwhile<br />

to study VAG signals in terms of their frequency dynamics<br />

with improved time tracking.<br />

To simulate noisy signal conditions the synthetic signals<br />

were corrupted by adding Gaussian noise to an SNR of 10<br />

dB, and to simulate worse signal recording conditions the<br />

synthetic signals were corrupted by adding Gaussian noise<br />

to an SNR of 0 dB.<br />

The results obtained with synthetic signals are pre-<br />

sented below in a summarized version for the sake of brevity.<br />

The synthetic signals were segmented into fixed segments,<br />

and each segment was AR modeled using the Burg lattice<br />

method; a model order of 15, determined empirically, was<br />

used for each segment. The advantage of this method as<br />

compared with other segment-based methods such as the<br />

spectrogram is that the model presumes that the signal out-<br />

side the segment is nonzero as opposed to the spectrogram,<br />

where the signal outside the window is assumed to be zero.<br />

The TFD was free of cross-terms with reasonable TF local-<br />

ization, a,nd the TFD did not include the impulse component;<br />

this is because a transient of very short duration cannot be<br />

modeled by prediction inherent in the AR model. In the<br />

case of low SNR of 0 dB, AR modeling failed to give good<br />

spectral estimates.<br />

Although cross-terms were suppressed in the RSP-<br />

WVDs, the TFDs generated by RSPWVD had negative val-<br />

ues, and may not be suitable for feature extraction as an<br />

accurate estimate of the mean frequency or spread for each<br />

time instant cannot be reliably obtained. The method of<br />

reassignment improved the localization of the components<br />

significantly, but the problem of negative distribution values<br />

exists. In the case of the lower SNR of 0 dB, it was hard to<br />

distinguish the components of interest from cross-terms.<br />

The MP method gave a clear picture of the TF rep-<br />

resentation; the three simulated components were perfectly<br />

localized in the TFDs. This is because the MP TFD pro-<br />

vides adaptive representation of signal components, and due<br />

to the possibility that each high-energy component is ana-<br />

lyzed by the TF representation independent of its bandwidth<br />

and duration. The poor localization of transients by other<br />

techniques such as Fourier and wavelets is due to the fact<br />

that the transient information gets diluted across the whole<br />

basis, and the collection of basis functions is not as large as that in the MP dictionary.<br />

At the lower SNR, the MP TFD was better than those<br />

obtained using the other techniques. The MP TFD could be<br />

made more readable by extracting only the coherent struc-<br />

tures of the signal. The MP technique has the facility to<br />

include automatic denoising of the signal, which is useful in<br />

situations where the SNR of the signal is not known.<br />

B. Results with VAG signals<br />

The TFD methods were tested on ten VAG signals.<br />

For computing the AR model-based TFDs, the signals were<br />

adaptively segmented into quasi-stationary segments using<br />

the recursive least-squares lattice (RLSL) algorithm [4]. The<br />

segments were AR modeled using the Burg-lattice algorithm<br />

and the model order used was 40 [4]. For the sake of illustra-<br />

tion, the VAG signal (“vag1”) of a normal subject is shown<br />

in Figure l(a). Grinding sound was heard during auscul-<br />

tation at an angle range of 50° → 0° (approximately in the<br />

range of 2500 to 4000 samples) for this subject. From the AR<br />

model-based TFD of the signal “vag1” in Figure 2(a), we can<br />

observe that the TF representation is cross-term-free. The<br />

grinding sound is shown as a high-frequency activity. The<br />

localization of the component corresponding to the grinding<br />
sound is coarse, and the precise angle (or time) at which the<br />
sound was heard cannot be readily determined. Because of<br />
the coarse estimation of components, the AR model-based<br />
TFD may not be appropriate for instantaneous parameter<br />
extraction. The most dominant poles, indicated by the ‘X’<br />
marks superimposed on the AR model-based TFD in Figure<br />
2(a), indicate the dominant spectral peaks in the signal. As<br />
the dominant poles are selected on a segment-by-segment<br />
basis, they are also not suitable for instantaneous parameter tracking.<br />

Figure 2(b) shows the TFD obtained using the RSPWV<br />

method. The TFD is obviously not readable except for the<br />

component corresponding to the “grinding” sound. Further,<br />

the TFD has negative values due to cross-terms. The neg-<br />

ative values may mislead parameter calculation, and hence<br />

the RSPWVD may not be appropriate for feature extraction.<br />

The MP TFD is shown in Figure 2(c). The TFD was<br />

constructed using the coherent structures of the signal only,<br />

and the number of TF atoms was 441. The TFD has clearly<br />

represented the “grinding” sound with very good localization.<br />
The TFD obtained is a positive distribution and is free<br />

of cross-terms, and is suitable for feature extraction.<br />

C. Classification Experiments<br />

A database of 90 VAG signals was compiled, including<br />

51 normal and 39 abnormal signals. Although the MP TFD<br />

does not satisfy true marginal properties, time-varying pa-<br />

rameters with discriminant information can be computed as<br />

marginal values of an MP TFD. The time-varying parame-<br />

ters that were extracted from the MP TFD were:<br />

1. Energy Parameter: the mean of $W'(t,\omega)$ along each time<br />
slice, which gives a measure of energy variation with time.<br />
2. Energy Spread Parameter: the standard deviation of<br />
$W'(t,\omega)$ along each time slice.<br />
3. Frequency Parameter: the first moment along each time<br />
slice, as given by the expression<br />
$IMF(t) = \dfrac{\sum_{\omega} \omega \, W'(t,\omega)}{\sum_{\omega} W'(t,\omega)}.$<br />
4. Frequency Spread Parameter: the second central moment<br />
along each time slice, as given by<br />
$IMFS(t) = \dfrac{\sum_{\omega} \left[\omega - IMF(t)\right]^2 W'(t,\omega)}{\sum_{\omega} W'(t,\omega)}. \qquad (7)$<br />

The mean and standard deviation of the parameters over<br />

the duration of each signal were computed, and each VAG<br />

signal was represented by a set of eight features. Statistical<br />

pattern classification based on stepwise logistic regression<br />

analysis [5] of the features of the 90 VAG signals as nor-<br />

mal/abnormal was achieved with an accuracy of 74.4%. The<br />

frequency parameter significantly contributed towards accu-<br />

rate classification of VAG signals. This gives motivation to<br />

search for linear and nonlinear frequency components in the<br />

TF plane of VAG signals.<br />

IV. CONCLUSION<br />

Segmentation-based analysis of VAG signals has limita-<br />

tions in correlating the estimated angle range of pathology as<br />

observed during arthroscopy with the segments of the signal.<br />

The problem of segmentation can be avoided by using non-<br />

stationary signal analysis tools such as TFDs. It is difficult<br />

to interpret VAG TFDs, and even harder to train clinicians<br />
in interpreting TFDs. Therefore, TFDs should be selected so<br />
as to facilitate objective feature identification and extraction.<br />

Analysis of the performance of the different TFDs shows that<br />

the MP TFD is the most suitable TFD for VAG signal anal-<br />

ysis. Preliminary results with 90 VAG signals suggest that<br />

the parameters extracted from the MP-based TFD provide<br />

good discriminant information. Compared with our previ-<br />

ous methods, the proposed method does not need any joint<br />

angle and clinical information, and shows good potential for<br />

noninvasive diagnosis of articular cartilage pathology.<br />

Acknowledgements: We gratefully acknowledge support from<br />

the Alberta Heritage Foundation for Medical Research.<br />

REFERENCES<br />

[1] R.M. Rangayyan, S. Krishnan, G.D. Bell, C.B. Frank, and K.O. Ladly. Parametric representation and screening of knee joint vibroarthrographic signals. IEEE Transactions on Biomedical Engineering, 44(11):1068-1074, Nov. 1997.<br />
[2] F. Auger and P. Flandrin. Improving the readability of time-frequency and time-scale representations by the reassignment method. IEEE Transactions on Signal Processing, 43(5):1068-1089, May 1995.<br />
[3] S.G. Mallat and Z. Zhang. Matching pursuit with time-frequency dictionaries. IEEE Transactions on Signal Processing, 41(12):3397-3415, 1993.<br />
[4] S. Krishnan, R.M. Rangayyan, G.D. Bell, C.B. Frank, and K.O. Ladly. Adaptive filtering, modelling, and classification of knee joint vibroarthrographic signals for non-invasive diagnosis of articular cartilage pathology. Medical and Biological Engineering and Computing, 35:677-684, Nov. 1997.<br />
[5] A.A. Afifi and S.P. Azen. Statistical Analysis: A Computer Oriented Approach. Academic Press, New York, NY, 2nd edition, 1979.<br />


DETECTION OF NONLINEAR FREQUENCY-MODULATED<br />

COMPONENTS IN THE TIME-FREQUENCY PLANE USING AN<br />

ARRAY OF ACCUMULATORS<br />

Sridhar Krishnan and Rangaraj M. Rangayyan<br />

Dept. of Electrical and Computer Engineering, University of Calgary,<br />
2500 University Drive NW, Calgary, Alberta T2N 1N4, CANADA.<br />

Email: (krishnan) (ranga)@enel.ucalgary.ca<br />

Abstract - We propose a novel approach to detect<br />

nonlinear frequency-modulated (FM) components such<br />

as sinusoidal and hyperbolic FM components in multi-<br />

component, nonstationary signals in the time-frequency<br />

(TF) plane. The approach, based upon the use of an<br />

array of accumulators, can be used to detect nonlinear<br />

FM components with varying energy in low signal-to-<br />

noise ratio environments.<br />

I. INTRODUCTION<br />

Instantaneous frequency (IF) is an important pa-<br />

rameter in characterizing the nonstationary behavior<br />

of a signal. IF could be frequency modulated (FM)<br />

as a linear component (e.g. chirp) or as a nonlinear<br />

component (e.g. quadratic FM) with time. Detection<br />

of linear and nonlinear FM components in a nonsta-<br />

tionary signal has been studied extensively by using<br />

time-frequency (TF) representations [1], [2] and poly-<br />

nomial phase transforms (PPT) [3]. In PPT, the FM<br />

component is detected by estimating the phase coeffi-<br />

cients of the given complex signal. The disadvantages<br />

with PPT are that it can only be applied to signals<br />

whose amplitude variations are slower than their phase<br />

variations, and that reliable estimates of the phase co-<br />

efficients are not guaranteed under low signal-to-noise<br />

ratio (SNR) conditions. Barbarossa and Lemoine es-<br />

timated nonlinear FM parameters by using the reas-<br />

signed, smoothed, pseudo- Wigner-Ville representmation<br />

and the Hough transform. Although the method is at-<br />

tractive, accurate estimation of FM parameters is not<br />

possible in the presence of cross-terms.<br />

In our work, the nonlinear frequency parameters of<br />

a signal are estimated via its TF representation. The<br />

TF representation is treated as an image, where each<br />

pixel corresponds to the energy present at a particular<br />

time and frequency.<br />

II. TIME-FREQUENCY DISTRIBUTIONS<br />

The main conditions under which a TF distribution<br />

(TFD) can be treated as an image are:<br />

The TFD should be positive.<br />
The TFD should satisfy the marginal properties.<br />
Cross-terms should be negligible in order to avoid a false search.<br />
0-7803-5073-1/98/$10.00 © 1998 IEEE<br />

The widely used Cohen’s class TFDs do not satisfy the<br />

above requirements as the kernel used is functionally<br />

independent of the signal. TFDs based on linear combinations<br />

of the Wigner distributions of TF atoms, as<br />

given by a decomposition algorithm such as matching<br />

pursuit [4], are positive distributions and are cross-term<br />

free; however, they do not satisfy the marginal properties.<br />

TFDs that are positive and satisfy the marginals<br />

do exist, and one can obtain an infinite number of them<br />

for any signal. Such TFDs are nonlinear functions of the<br />

signal; the kernels for these TFDs are generally signal-dependent,<br />

and are known as Cohen-Posch TFDs [5].<br />

Accordingly, while the Cohen-Posch TFDs can, in theory,<br />

be obtained from Cohen’s general formulation, the<br />

signal-dependence of the kernel, coupled with its possible<br />

unbounded nature, calls for alternative formulations<br />

for practical implementation of the Cohen-Posch<br />

TFDs. One formulation which is particularly tractable<br />

and readily demonstrates the positivity and marginal<br />

conditions is:<br />

$P(t,\omega) = P(t)\,P(\omega)\,Q\big(u(t), v(\omega)\big), \qquad (1)$<br />
where $P(t) = |s(t)|^2$ and $P(\omega) = |S(\omega)|^2$ are the<br />
marginal densities, with s(t) being the given time-<br />
domain signal and $S(\omega)$ the Fourier transform of the<br />
signal, and Q(u, v) is any positive function of the variables<br />
(u, v) over $0 \le (u, v) \le 1$, normalized to one:<br />
$\int_0^1 Q(u,v)\,du = \int_0^1 Q(u,v)\,dv = 1. \qquad (2)$<br />
In Eq. (1), we have<br />
$u = u(t) = \int_{-\infty}^{t} P(t')\,dt', \qquad v = v(\omega) = \int_{-\infty}^{\omega} P(\omega')\,d\omega'. \qquad (3)$<br />
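The construction of Eqs. (1)-(3) is easy to check numerically. The sketch below uses the simplest admissible kernel, $Q \equiv 1$, which satisfies Eq. (2) trivially; any positive Q with unit line integrals, e.g. $Q(u,v) = 1 + \epsilon \cos 2\pi(u - v)$, works the same way (the discretization is our own):<br />

```python
import numpy as np

def positive_tfd(s, Q=lambda u, v: np.ones_like(u * v)):
    """Discrete Cohen-Posch-style positive TFD: P(t,w) = P(t) P(w) Q(u(t), v(w))."""
    s = np.asarray(s, dtype=complex)
    Pt = np.abs(s) ** 2
    Pt = Pt / Pt.sum()                         # normalized time marginal P(t)
    Pw = np.abs(np.fft.fft(s)) ** 2
    Pw = Pw / Pw.sum()                         # normalized frequency marginal P(w)
    u = np.cumsum(Pt)                          # u(t) of Eq. (3), discretized
    v = np.cumsum(Pw)                          # v(w) of Eq. (3), discretized
    return Pt[:, None] * Pw[None, :] * Q(u[:, None], v[None, :])
```

With $Q \equiv 1$ the distribution is trivially positive and reproduces both marginals exactly; the MCE algorithm of Loughlin et al. can be viewed as choosing a far more structured Q so that the distribution also concentrates along the signal's instantaneous frequency.<br />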

It is obvious that the density $P(t,\omega)$ is positive, and<br />
straightforward to show that the marginals are satisfied.<br />
In addition to being positive and yielding the correct<br />
marginals, the Cohen-Posch TFDs are also shift-<br />

invariant and scale-invariant. An algorithm to effi-<br />

ciently compute the Cohen-Posch TFDs has been pro-<br />

posed by Loughlin et al. [6]. The algorithm is based<br />

on minimum cross-entropy (MCE) optimization of the<br />

density functions.<br />

III. TFD ANALYSIS USING AN ARRAY OF ACCUMULATORS<br />

The approach of the present work is to apply the<br />

Hough transform based on an array of accumulators<br />

to positive TFDs obtained using the MCE method to<br />

detect energy-varying quadratic FM components such<br />

as sinusoidal and hyperbolic FM components with un-<br />

known parameters. Sinusoidal and hyperbolic FM sig-<br />

nals are common in synthetic aperture radar, multi-<br />

path communication channels, helicopter recognition,<br />

and sonar.<br />

The detection algorithm is based upon the use of<br />

an array of accumulators; the dimensionality of the ar-<br />

ray depends upon the number of parameters to be es-<br />

timated. The TFD is treated as an image with gray<br />

values corresponding to the normalized energy values<br />

of the components.<br />

Let us first consider the procedure for detecting a<br />

sinusoidal FM component in the TF plane. In practice,<br />

a sinusoidal FM component may occur at any location<br />

in the TF plane, and hence a generalized expression of<br />

a sine wave is considered: $f_k = A + m \sin(2\pi f_0 k + \phi)$,<br />
where A is the frequency shift in the TF plane, $f_k$ is the<br />
frequency at time k, $f_0$ is the number of cycles of the<br />
sinusoidal FM, $\phi$ is the phase shift, and m is the am-<br />

plitude. In practice, a sinusoidal FM component may<br />

not have a constant amplitude. In order to minimize<br />

the effect of amplitude variations, an attempt is made<br />

to make the waveform continuous by using edge-linking<br />

techniques based upon a gradient method (i.e., by ap-<br />

plying an image processing algorithm to the TFD).<br />

The algorithm works as follows:<br />

1. Each parameter is bounded by a minimum and<br />

a maximum value. For each point $(f_k, k)$ in the<br />
TF plane carrying a nonzero value, we let the pa-<br />
rameters m, $f_0$, and $\phi$ equal each of the allowed<br />
(quantized or binned) values and solve for the<br />
corresponding A using the equation $A = f_k - m \sin(2\pi f_0 k + \phi)$.<br />
The parameter A is rounded to the nearest allowed quantized value.<br />

2. If the choice of the parameters results in a nonzero<br />
value for A, we increment the corresponding cell of the<br />
four-dimensional accumulator array (initialized to zero)<br />
and add the energy value of the point to the cell. It is obvious<br />

that sinusoidal FM components will correspond to<br />

high-intensity hypersurfaces in the Hough param-<br />

eter domain.<br />

3. A threshold is then applied to the total energy<br />

value and the number of points. This facilitates the<br />

detection of energy-varying sinusoidal FM compo-<br />

nents of significant duration.<br />

4. The mean and standard deviation of the array indices<br />
are computed using those accumulator cells<br />
whose values have passed the threshold test. A<br />
high value for the standard deviation of the parameters<br />
indicates the presence of multiple sinusoidal<br />
FM components in the TF plane.<br />
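Steps 1-4 above can be sketched directly as a brute-force accumulator. The parameter grids, the row-to-frequency convention, and the synthetic test track used below are deliberately tiny illustrative choices, not the settings of the paper:<br />

```python
import numpy as np

def detect_sin_fm(tfd, m_vals, f0_vals, phi_vals, A_vals):
    """Accumulator-array (Hough-style) detection of f_k = A + m sin(2 pi f0 k + phi).

    tfd: nonnegative TF image whose row r is taken to represent normalized
    frequency r / (2 * n_rows). Returns the 4-D accumulator over (m, f0, phi, A)."""
    n_rows = tfd.shape[0]
    acc = np.zeros((len(m_vals), len(f0_vals), len(phi_vals), len(A_vals)))
    rows, cols = np.nonzero(tfd)
    for r, k in zip(rows, cols):
        fk, e = r / (2.0 * n_rows), tfd[r, k]
        for i, m in enumerate(m_vals):
            for j, f0 in enumerate(f0_vals):
                for l, phi in enumerate(phi_vals):
                    A = fk - m * np.sin(2 * np.pi * f0 * k + phi)
                    a = int(np.argmin(np.abs(A_vals - A)))
                    acc[i, j, l, a] += e       # accumulate the point's energy
    return acc
```

A true sinusoidal FM track concentrates all of its energy in a single accumulator cell, while wrong parameter combinations scatter it across many A bins; thresholding the accumulator then implements steps 3-4.<br />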

For the detection of hyperbolic FM, the parametric<br />
equation $f_k = A + b/k$ is considered, where b is a constant<br />
related to the time shift. The two parameters A<br />
and b can provide a generalized representation of any<br />
hyperbolic FM phenomenon. The estimation procedure<br />
is similar to that for sinusoidal FM; however, the dimension<br />
of the array is two (A and b) instead of four.<br />

IV. RESULTS<br />

The proposed method was tested on synthetic sig-<br />

nals containing sine and hyperbolic FM components.<br />

The TFDs were obtained using the MCE algorithm with<br />

narrowband and wideband spectrograms as the a priori<br />
estimate. The TFDs were of high TF resolution and<br />

free of cross-terms.<br />

Figure 1(a) shows a nonstationary signal with a<br />

sinusoidal FM component with SNR = 0 dB. The TFD<br />

obtained using the MCE method is shown in Figure<br />

1(b). The Hough parameter domain (which cannot be<br />

displayed due to its dimensionality of four) indicated<br />

the presence of a sinusoidal FM component, and the<br />

estimation of the parameters A, $f_0$, $\phi$, and m of<br />

the sinusoidal FM was accurate. The threshold for the<br />

accumulator cells was set to be 3% of the total energy<br />

value of the signal.<br />

Figure 2(a) shows a nonstationary signal with a hy-<br />

perbolic FM component embedded in white noise at an<br />

SNR of 0 dB. Figure 2(b) shows the TFD of the signal<br />

computed using the MCE method. The Hough param-<br />

eter space is shown in Figure 2(c) with the co-ordinates<br />

corresponding to A and b. The threshold was set to 9%<br />

of the total energy value of the signal. From the Hough<br />

parameter space we can infer that the highest-intensity<br />
point corresponds to A = 23 and b = 507, the<br />
parameters of the hyperbolic FM.<br />

A multicomponent nonstationary signal consisting<br />

of a sinusoidal FM component and a hyperbolic FM<br />

component corrupted by random noise to an SNR of<br />

0 dB is shown in Figure 3(a). The MCE-based TFD<br />

is shown in Figure 3(b). The sinusoidal FM detection<br />




Fig. 1. (a) Nonstationary signal with a sinusoidal FM component<br />

at an SNR of 0 dB. (b) TFD of the signal in (a) computed<br />

using the MCE method.<br />

method successfully indicated the presence of a sinu-<br />

soidal FM component, and the hyperbolic FM detec-<br />

tion method indicated the presence of a hyperbolic FM<br />

component. The Hough parameter space of hyperbolic<br />

FM detection is shown in Figure 3(c).<br />

V. CONCLUSION<br />

The proposed method successfully detected non-<br />

linear FM components, and the parameters estimated<br />
were accurate within ±10% of their actual values even<br />

at a low SNR of -5 dB. The nonlinear FM components<br />

were not detected at an SNR of -10 dB. Better esti-<br />

mates of the parameters under low-SNR conditions can<br />

be achieved by increasing the number of quantization<br />

levels for each parameter and by denoising the signals<br />

before computing their TFDs. A difficulty with the pro-<br />

posed method lies in the selection of a suitable thresh-<br />

old (lower thresholds have to be selected for low SNR).<br />


Fig. 2. (a) Nonstationary signal with a hyperbolic FM compo-<br />

nent at an SNR of 0 dB. (b) TFD of the signal in (a) computed<br />

using the MCE method. (c) Hough parameter space.<br />

The performance of the method needs to be evaluated<br />

in comparison with the existing methods under differ-<br />

ent SNR conditions. The method can be extended to<br />

detect any pattern (signature) in the TFD provided the<br />

pattern can be expressed in a parametric form.<br />

The proposed method may find application in the<br />

detection of the presence of nonlinear FM components<br />

in biomedical signals such as knee joint sound signals,<br />

and facilitate screening of normal and abnormal signals.<br />

Acknowledgements: We gratefully acknowledge support<br />
from the Natural Sciences and Engineering Research<br />
Council of Canada (NSERC) and the Alberta Heritage<br />
Foundation for Medical Research (AHFMR).<br />
Fig. 3. (a) Nonstationary signal with a sinusoidal FM component and a hyperbolic FM component at an SNR of 0 dB. (b) TFD of the signal in (a) computed using the MCE method. (c) Hough parameter space of hyperbolic FM detection.<br />
REFERENCES<br />
[1] S. Barbarossa. Analysis of multicomponent LFM signals by combined Wigner-Hough transform. IEEE Transactions on Signal Processing, 43(6):1511-1515, June 1995.<br />
[2] S. Barbarossa and O. Lemoine. Analysis of nonlinear FM signals by pattern recognition of their time-frequency representation. IEEE Signal Processing Letters, 3(4):112-115, April 1996.<br />
[3] S. Peleg and B. Friedlander. Multicomponent signal analysis using the polynomial-phase transform. IEEE Transactions on Aerospace and Electronic Systems, 32(1):378-387, January 1996.<br />
[4] S.G. Mallat and Z. Zhang. Matching pursuit with time-frequency dictionaries. IEEE Transactions on Signal Processing, 41(12):3397-3415, 1993.<br />
[5] L. Cohen and T. Posch. Positive time-frequency distribution functions. IEEE Transactions on Acoustics, Speech, and Signal Processing, 33:31-38, 1985.<br />
[6] P. Loughlin, J. Pitton, and L. Atlas. Construction of positive time-frequency distributions. IEEE Transactions on Signal Processing, 42:2697-2705, 1994.<br />



Proceedings - 19th International Conference - IEEE/EMBS, Oct. 30 - Nov. 2, 1997, Chicago, IL, USA<br />

TIME-FREQUENCY SIGNAL FEATURE EXTRACTION AND<br />

SCREENING OF KNEE JOINT VIBROARTHROGRAPHIC<br />

SIGNALS USING THE MATCHING PURSUIT METHOD<br />

Sridhar Krishnan¹, Rangaraj M. Rangayyan¹,², G. Douglas Bell¹,²,³, Cyril B. Frank<br />
¹Dept. of Electrical and Computer Engineering, ²Dept. of Surgery, ³Sport Medicine Centre<br />
The University of Calgary, Alberta, T2N 1N4, CANADA. (Email: ranga@enel.ucalgary.ca)<br />

Abstract - Nonstationary features of knee joint vi-<br />

broarthrographic (VAG) signals were extracted from<br />

their time-frequency distributions (TFDs) obtained us-<br />

ing the matching pursuit method. Features computed<br />

as marginal calculations of the TFDs were instantaneous<br />

energy, instantaneous energy spread, instantaneous mean<br />

frequency, and instantaneous mean frequency spread.<br />

The features carry information about the combined time-<br />

frequency dynamics of the signals. The mean and stan-<br />

dard deviation of the features were also computed, and<br />

each VAG signal was represented by a set of just eight<br />

parameters. The method was tested on 37 VAG signals<br />

(19 normal and 18 abnormal) with no restriction on the<br />

type of articular cartilage pathology. Discriminant analy-<br />

sis of the parameters showed an accuracy of 89.5% at the<br />

training stage and 77.8% at the test stage. Compared<br />

to our previous methods, the proposed method does not<br />

need any joint angle and clinical information, and shows<br />

good potential for noninvasive diagnosis and monitoring<br />

of articular cartilage pathology.<br />

Keywords: Vibroarthrography, Knee sounds, Time-<br />

frequency analysis, Articular cartilage, Matching pursuit.<br />

I. INTRODUCTION<br />

Knee joint vibration or sound signals, also known as<br />

vibroarthrographic (VAG) signals, emitted during active<br />

movement of the leg are expected to be associated with<br />

pathological conditions of the articular cartilage. VAG<br />

signal analysis could lead to a clinical tool for diagno-<br />

sis and monitoring of true articular cartilage pathology<br />

such as chondromalacia of the patella. A variety of VAG<br />

signal analysis techniques have been proposed in the lit-<br />

erature [1], [2], [3], [4], [5]. All of the previous methods<br />

used standard signal processing techniques based on the<br />

Fourier transform or autoregressive modeling, by assum-<br />

ing the signal to be either stationary or by segmenting<br />

the signal into quasi-stationary parts.<br />

In the present work, the nonstationarity of VAG sig-<br />

nals is taken into consideration, which arises due to the<br />

fact that different joint surfaces come in contact during<br />

movement, and the nature and quality of the joint sur-<br />

faces coming in contact may not be the same from one posi-<br />

tion to the next. Hence both intra- and inter-subject vari-<br />

(0-7803-4262-3/97/$10.00 © 1997 IEEE)<br />


ability of signal characteristics are expected. Although<br />

our previous approaches [4], [5] addressed nonstationar-<br />

ity to some extent by using robust adaptive segmentation<br />

algorithms, there was a difficulty in labeling individual<br />

segments as normal or abnormal. This is because an<br />

accurate estimation of the joint angle corresponding to<br />

pathology as observed during arthroscopy could not be<br />

achieved. The problem could be completely avoided by<br />

using nonstationary signal analysis tools such as time-<br />

frequency (TF) and wavelet transforms. The objective<br />

of our current work is to extract and identify relevant<br />

features in the TF plane which could discriminate abnor-<br />

mal knees from normal knees based solely on VAG signal<br />

features.<br />

II. METHODS<br />

A. Data Acquisition<br />

Each subject sat on a rigid table in a relaxed position<br />

with his/her leg freely suspended in air. The VAG signal<br />

was recorded at the mid-patella position of the knee as<br />

the subject swung his/her leg over an approximate an-<br />

gle range of 135° → 0° → 135° in 4 s. The signal was<br />

prefiltered and amplified before digitizing at a sampling<br />

rate of 2 kHz. A database of 37 signals was prepared, in-<br />

cluding 18 signals of symptomatic patients scheduled to<br />

undergo arthroscopy. There was no restriction imposed<br />

on the type of pathology, and the abnormal signals in-<br />

cluded chondromalacia of different grades at the patella,<br />

meniscal tear, tibial chondromalacia, and anterior cruci-<br />

ate ligament injuries.<br />

B. Time-Frequency Analysis<br />

Features of VAG signals were extracted from their<br />

time-frequency distributions (TFDs) obtained using the<br />

matching pursuit (MP) method [6]. In MP analysis,<br />

the given signal is decomposed into a linear expansion<br />

of waveforms, known as TF atoms, selected from a re-<br />

dundant dictionary of functions. The TF atoms in the<br />

dictionary are generated by scaling, translating, and fre-<br />

quency modulating a normalized window function g7(t).<br />

194<br />

Authorized licensed use limited to: <strong>Ryerson</strong> <strong>University</strong> Library. Downloaded on July 6, 2009 at 16:11 from IEEE Xplore. Restrictions apply.


Proceedings - 19th International Conference - IEEE/EMBS Oct. 30 - Nov. 2, 1997 Chicago, IL. USA<br />

The MP method represents a signal $x(t)$ as:<br />

$x(t) = \sum_{n=0}^{M-1} a_n g_{\gamma_n}(t)$, (1)<br />

where<br />

$g_{\gamma_n}(t) = \frac{1}{\sqrt{s_n}}\, g\!\left(\frac{t - p_n}{s_n}\right) \exp[j(2\pi f_n t + \phi_n)]$, (2)<br />

and $a_n$ are the expansion coefficients. The scale factor $s_n$ is used to control the width of the envelope of $g_{\gamma_n}(t)$, and the parameter $p_n$ controls the temporal placement. $\frac{1}{\sqrt{s_n}}$ is a normalizing factor, which keeps the norm of $g_{\gamma_n}(t)$ equal to 1. The parameters $f_n$ and $\phi_n$ are the frequency and phase of the exponential function, respectively. In our application, the envelope function is a Gaussian function, i.e., $g(t) = 2^{1/4} \exp(-\pi t^2)$; the TF atoms are then called Gabor functions.<br />

In practice, the algorithm works as follows. First, the signal is projected onto a Gabor function dictionary. The projection decomposes the signal into two parts:<br />

$x(t) = \langle x, g_{\gamma_0}\rangle g_{\gamma_0}(t) + R^1 x(t)$, (3)<br />

where $\langle x, g_{\gamma_0}\rangle$ denotes the inner product (projection) of $x(t)$ with the first TF atom $g_{\gamma_0}(t)$. The second term $R^1 x(t)$ is the residual vector after approximating $x(t)$ in the direction of $g_{\gamma_0}(t)$. This process is continued by projecting the residue onto the dictionary, and after $M$ iterations<br />

$x(t) = \sum_{n=0}^{M-1} \langle R^n x, g_{\gamma_n}\rangle g_{\gamma_n}(t) + R^M x(t)$, (4)<br />

with $R^0 x(t) = x(t)$. There are two ways of stopping the iterative process: one is to use a pre-specified limiting number $M$ of TF atoms, and the other is to verify the energy of the residue $R^M x(t)$. A very high value of $M$ and a zero value for the residual energy will decompose the signal completely at the expense of increased computational complexity.<br />

In this work, M was chosen to be 1000 atoms, the<br />

residual energy limit was set to be zero, and only coherent<br />

structures were extracted (i.e., components determined to<br />

be noise by the MP algorithm were rejected).<br />
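The greedy projection loop of Eqns. 3 and 4 can be sketched as follows; this is an illustrative reimplementation over a generic unit-norm dictionary, not the authors' code, and the demo dictionary is deliberately trivial (orthonormal) so the decomposition is exact:<br />

```python
import numpy as np

def matching_pursuit(x, dictionary, n_atoms=10, tol=1e-10):
    """Greedy MP: repeatedly project the residue onto the dictionary.

    dictionary: (n_dict, n_samples) array of unit-norm atoms.
    Returns expansion coefficients, chosen atom indices, final residue.
    """
    residue = x.astype(float).copy()
    coeffs, chosen = [], []
    for _ in range(n_atoms):
        proj = dictionary @ residue          # <R^n x, g_gamma> for every atom
        k = int(np.argmax(np.abs(proj)))     # best-matching atom
        a = proj[k]
        coeffs.append(a)
        chosen.append(k)
        residue = residue - a * dictionary[k]  # R^{n+1} x
        if residue @ residue < tol:            # residual-energy stopping rule
            break
    return np.array(coeffs), chosen, residue

# Tiny demo: the standard basis as a (trivially complete) dictionary
D = np.eye(4)
x = np.array([3.0, 0.0, -2.0, 0.0])
a, idx, r = matching_pursuit(x, D, n_atoms=4)
# With an orthonormal dictionary, MP recovers x exactly
x_hat = sum(c * D[i] for c, i in zip(a, idx))
```

With a redundant Gabor dictionary the same loop applies; only the atom generation changes.<br />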

Now, the Wigner distribution of $x(t)$ based on the TF atoms is given as [6]:<br />

$W(t,\omega) = \sum_{n=0}^{M-1} |\langle R^n x, g_{\gamma_n}\rangle|^2\, W_{g_{\gamma_n}}(t,\omega) + \sum_{n=0}^{M-1} \sum_{\substack{m=0 \\ m \neq n}}^{M-1} \langle R^n x, g_{\gamma_n}\rangle \langle R^m x, g_{\gamma_m}\rangle^*\, W_{[g_{\gamma_n}, g_{\gamma_m}]}(t,\omega)$,<br />

Fig. 1. (a) A normal VAG signal. (b) TFD of (a) obtained using the matching pursuit method. au: Acceleration units.<br />

Fig. 2. (a) The VAG signal of a pathological knee with chondromalacia of the patella. (b) TFD of (a) obtained using the matching pursuit method. au: Acceleration units.<br />


where $W_{g_{\gamma_n}}(t,\omega)$ is the Wigner transform of the Gaussian window function. The double sum corresponds to the cross-terms of the Wigner distribution, and should be removed in order to obtain an interference-free energy distribution of $x(t)$ in the TF plane. Thus only the first term is retained, and the interference-free TFD is given by<br />

$W'(t,\omega) = \sum_{n=0}^{M-1} |\langle R^n x, g_{\gamma_n}\rangle|^2\, W_{g_{\gamma_n}}(t,\omega)$. (5)<br />
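Because the atoms are Gaussian, the Wigner transform of each atom in Eqn. 5 has a closed form, so the interference-free TFD can be accumulated atom by atom. A minimal sketch, assuming the standard Gabor parameterization (scale s, time centre u, frequency centre xi); the grids and function names are our own:<br />

```python
import numpy as np

def wigner_gabor(t, w, s, u, xi):
    """Closed-form Wigner distribution of a Gabor atom built from
    g(t) = 2**0.25 * exp(-pi * t**2), scaled by s and centred at (u, xi)."""
    return 2.0 * np.exp(-2.0 * np.pi * ((t - u) / s) ** 2
                        - (s ** 2) * (w - xi) ** 2 / (2.0 * np.pi))

def mp_tfd(atoms, t_grid, w_grid):
    """Interference-free TFD: sum of atom Wigner distributions,
    weighted by |coefficient|**2 as in Eqn. 5.
    atoms: list of (coeff, s, u, xi) tuples."""
    T, W = np.meshgrid(t_grid, w_grid, indexing="ij")
    tfd = np.zeros_like(T)
    for a, s, u, xi in atoms:
        tfd += np.abs(a) ** 2 * wigner_gabor(T, W, s, u, xi)
    return tfd

t = np.linspace(-2, 2, 64)
w = np.linspace(-10, 10, 64)
tfd = mp_tfd([(1.0, 1.0, 0.0, 3.0)], t, w)
i, j = np.unravel_index(np.argmax(tfd), tfd.shape)
# The single atom's energy peaks at its time-frequency centre (0, 3),
# and the distribution is positive everywhere
```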

Figs. 1(b) and 2(b) show the TFDs of the normal<br />

VAG signal in Fig. 1(a) and the abnormal VAG signal in<br />

Fig. 2(a). The TFD of the normal signal was computed<br />

using 392 TF atoms, and the abnormal signal’s TFD<br />

was computed using 441 TF atoms. The TFDs obtained<br />

are of very high resolution, and with synthetic signals of<br />

known TF dynamics, the TFDs based on the MP algo-<br />

rithm showed very good localization in both time and<br />

frequency. The bright spots in the TFD figures corre-<br />

spond to the TF atoms; the brightness increases with<br />

energy. In the illustration, the TFD of the abnormal sig-<br />

nal shows more high-frequency activity than the TFD of<br />

the normal signal. However, this need not be always true<br />

(especially in the case of normal noisy knees), and mere<br />

visual interpretation of the TFDs will not help in discrim-<br />

inating pathological knees from normal knees. In order<br />

to differentiate the signals, features of diagnostic value<br />

need to be extracted from the TF plane.<br />

C. Time-Frequency <strong>Signal</strong> Features<br />

As the TFD obtained using the TF atoms is an<br />

interference-free distribution and is always positive, fea-<br />

tures derived from the TF plane will posses a high de-<br />

gree of accuracy as compared to features obtained with<br />

Cohen’s class TF transforms. The features used in the<br />

present work were computed as marginal calculations of<br />

the TFDs. The four TF features of relevance derived<br />

from the TFDs of VAG signals were:<br />

Instantaneous energy (IE): As the TFDs were ob-<br />

tained using TF atoms that were coherent with the<br />

signal structure and the signal components that were<br />

determined to be noise by the MP algorithm were re-<br />

jected, the IE obtained as a function of time will have<br />

a high signal-to-noise ratio. The IE was computed<br />

as the mean of $W'(t,\omega)$ along each time slice, which<br />

gives a measure of energy variation with time. Sig-<br />

nals generated by pathological knees will be highly<br />

time-variant (i.e., they are highly nonstationary) be-<br />

cause of the differences in cartilage roughness and<br />

nonuniformity. Thus the IE of an abnormal signal is<br />

expected to show large variations with time.<br />

Instantaneous energy spread (IES): IES measures<br />

the spread of energy over frequency for each time<br />

slice. This was computed as the standard deviation<br />


of W’(t,w) along each time slice. This is a good<br />

measure if the signal is multicomponent in nature.<br />

Abnormal VAG signals generated as a result of fric-<br />

tion between rough cartilage surfaces may have more<br />

components because of the nonuniformity of the sur-<br />

faces, and a high signal energy spread is expected<br />

around the IE.<br />

Instantaneous mean frequency (IMF): IMF was com-<br />

puted as the first moment along each time slice, $\mathrm{IMF}(t) = \frac{\int \omega\, W'(t,\omega)\, d\omega}{\int W'(t,\omega)\, d\omega}$. IMF measures the frequency dynamics of the sig-<br />

nal. The movement of the knee during signal acqui-<br />

sition may cause some linear or nonlinear frequency<br />

modulation of the signal, with the modulation in-<br />

dex depending on the state of lubrication, stiffness,<br />

and roughness of the cartilage surfaces. Pathologi-<br />

cal knees have less lubricated and rougher cartilage<br />

surfaces than normal knees, and hence the IMF of<br />

pathological knees will be different from that of nor-<br />

mal knees.<br />

Instantaneous mean frequency spread (IMFS): IMFS<br />

is given by the second central moment along each<br />

time slice, $\mathrm{IMFS}(t) = \frac{\int [\omega - \mathrm{IMF}(t)]^2\, W'(t,\omega)\, d\omega}{\int W'(t,\omega)\, d\omega}$.<br />

IMFS gives the spread of frequency about the mean<br />

frequency for each time instant. The spread of fre-<br />

quency at a time instant arises as a result of am-<br />

plitude modulation. Amplitude modulation is pos-<br />

sible in VAG signals, and may be dependent on the<br />

quality and intensity of sound produced due to joint<br />

vibration. IMFS could be an excellent feature in<br />

identifying noisy knees.<br />

The four features derived above are dependent on the<br />

functional state of the cartilage surfaces in the knee joint,<br />

and are expected to be suitable for discriminating patho-<br />

logical knees from normal knees.<br />
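The four marginal features can be computed directly from a discretized TFD, with rows indexing time slices and columns indexing frequency. The normalizations below are one reasonable choice, since the paper does not spell them out; the function names are our own:<br />

```python
import numpy as np

def tf_features(tfd, freqs):
    """Per-time-slice marginals of a TFD (rows: time, cols: frequency).

    IE  : mean energy of each time slice
    IES : standard deviation of each slice (energy spread)
    IMF : first moment of each slice over frequency
    IMFS: second central moment of each slice about the IMF
    """
    energy = tfd.sum(axis=1) + 1e-12                 # slice normalizers
    ie = tfd.mean(axis=1)
    ies = tfd.std(axis=1)
    imf = (tfd * freqs).sum(axis=1) / energy
    imfs = (tfd * (freqs - imf[:, None]) ** 2).sum(axis=1) / energy
    return ie, ies, imf, imfs

def eight_parameters(tfd, freqs):
    """Summarize a VAG signal by the mean and spread of each feature."""
    feats = tf_features(tfd, freqs)
    return np.array([f.mean() for f in feats] + [f.std() for f in feats])

# Toy TFD: two time slices, each with energy at a single frequency
freqs = np.arange(4.0)
tfd = np.array([[0.0, 1.0, 0.0, 0.0],   # all energy at f = 1
                [0.0, 0.0, 0.0, 2.0]])  # all energy at f = 3
ie, ies, imf, imfs = tf_features(tfd, freqs)
```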

D. Pattern Classification<br />

The features discussed in the previous section are time<br />

dependent. This could be easily observed in the wave-<br />

forms shown in fig.3, which were derived from the TFD<br />

shown in fig.2(b). In order to facilitate a global deci-<br />

sion on the signal, the mean and variance of the features<br />

were calculated. Therefore, a given VAG signal will have<br />

eight parameters. The classification of knees as normal or<br />

pathological was achieved using a statistical pattern clas-<br />

sifier based on discriminant analysis of the parameters [7].<br />

The database was randomly split into two (almost) equal<br />

parts. The features of signals in one part of the database<br />


Fig. 3. Features of the abnormal signal in fig. 2. (a) IE waveform.<br />

(b) IES waveform. (c) IMF waveform. (d) IMFS waveform.<br />

were used to train the classifier. The classifier was then<br />

tested on the second part of the database. The signals<br />

used in the test stage were different from those in the<br />

training stage. The classification accuracy is given as the<br />

percentage of the number of correctly classified signals to<br />

the number of signals in the group.<br />
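A minimal sketch of this train/test discriminant classification; the paper used SPSS [7] for the discriminant analysis, so the Fisher linear discriminant below is only an illustrative stand-in, and the eight-parameter vectors here are synthetic, not real VAG data:<br />

```python
import numpy as np

def fit_lda(X, y):
    """Two-class Fisher linear discriminant: returns (weights, threshold)."""
    X0, X1 = X[y == 0], X[y == 1]
    m0, m1 = X0.mean(axis=0), X1.mean(axis=0)
    # Pooled within-class scatter, lightly regularized for stability
    Sw = np.cov(X0.T) * (len(X0) - 1) + np.cov(X1.T) * (len(X1) - 1)
    Sw += 1e-6 * np.eye(X.shape[1])
    w = np.linalg.solve(Sw, m1 - m0)
    thresh = 0.5 * (X0 @ w).mean() + 0.5 * (X1 @ w).mean()
    return w, thresh

def predict(X, w, thresh):
    return (X @ w > thresh).astype(int)

rng = np.random.default_rng(0)
# Hypothetical 8-parameter vectors: class 0 "normal", class 1 "abnormal"
train0 = rng.normal(0.0, 1.0, (9, 8))
train1 = rng.normal(2.5, 1.0, (10, 8))
X = np.vstack([train0, train1])
y = np.array([0] * 9 + [1] * 10)
w, th = fit_lda(X, y)
train_acc = (predict(X, w, th) == y).mean()
```

In practice the classifier would be fit on one half of the database and scored on the held-out half, as described above.<br />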

III. RESULTS AND DISCUSSION<br />

The classifier was trained with the TF parameters of<br />

19 signals (9 normal and 10 abnormal), and was tested<br />

with 18 signals (10 normal and 8 abnormal). The classifi-<br />

cation accuracy in training was 89.5%, and the test stage<br />

accuracy was 77.8%.<br />

Among the features used, IE and IES contributed<br />

significantly to the classification accuracy. This shows<br />

that abnormal VAG signals possess highly time-varying<br />

energy as compared to normal VAG signals. The perfor-<br />

mance of IES leads to the conclusion that VAG signals<br />

of pathological knees are more multicomponent in na-<br />


ture than those of normal knees. This may be due to<br />

the possibility that the roughness of cartilage surfaces in<br />

pathological knees gives rise to multiple sources of vibra-<br />

tion signals. Though IMF did not show high discrimina-<br />

tion, it can be used to detect linear frequency-modulated<br />

components or chirp signals (if any) in VAG signals. We<br />

are currently working on detection of chirp components<br />

in TFDs from an image processing perspective. While<br />

IMFS may not be a very good feature for classification of<br />

knees as normal or abnormal, it could be used to study<br />

the sound patterns of knees, and to investigate whether<br />

the sound patterns of pathological knees differ from those<br />

of normal knees.<br />

The use of the variance of the features could allow<br />

this method to be applied on a larger set of signals, and<br />

could avoid any bias as a result of variations in transducer<br />

placement, amplifier-filter settings, etc.<br />

IV. CONCLUSIONS<br />

A novel approach to analyze nonstationary VAG sig-<br />

nals was used. The method does not require any joint an-<br />

gle information to label the components of a VAG signal,<br />

and is independent of patient information such as age,<br />

activity level, and gender. Signal features of diagnostic<br />

relevance were extracted from their TFDs obtained using<br />

the MP method. The extracted TF features demonstrate<br />

the good potential of this method for screening of VAG sig-<br />

nals.<br />

Acknowledgements: We gratefully acknowledge support<br />

from the Alberta Heritage Foundation for Medical Re-<br />

search and the Arthritis Society of Canada.<br />

REFERENCES<br />

[1] M.L. Chu, I.A. Gradisar, and R. Mostardi. A noninvasive electroacoustical evaluation technique of cartilage damage in pathological knee joints. Medical and Biological Engineering and Computing, 16:437-442, 1978.<br />

[2] W.G. Kernohan, D.E. Beverland, G.F. McCoy, A. Hamilton, P. Watson, and R.A.B. Mollan. Vibration arthrometry. Acta Orthop. Scand., 61(1):70-79, 1990.<br />

[3] N.P. Reddy, B.M. Rothschild, M. Mandal, V. Gupta, and S. Suryanarayanan. Noninvasive acceleration measurements to characterize knee arthritis and chondromalacia. Annals of Biomedical Engineering, 23:78-84, 1995.<br />

[4] Z.M.K. Moussavi, R.M. Rangayyan, G.D. Bell, C.B. Frank, K.O. Ladly, and Y.T. Zhang. Screening of vibroarthrographic signals via adaptive segmentation and linear prediction modeling. IEEE Transactions on Biomedical Engineering, 43:15-23, 1996.<br />

[5] S. Krishnan, R.M. Rangayyan, G.D. Bell, C.B. Frank, and K.O. Ladly. Screening of knee joint vibroarthrographic signals by statistical analysis of dominant poles. In CDROM proceedings, 18th Annual International Conference of the IEEE Engineering in Medicine and Biology Society, Amsterdam, The Netherlands, October 1996.<br />

[6] S.G. Mallat and Z. Zhang. Matching pursuits with time-frequency dictionaries. IEEE Trans. on Signal Processing, 41(12):3397-3415, 1993.<br />

[7] SPSS Inc., Chicago, IL. SPSS Advanced Statistics User's Guide, 1990.<br />



Detection of Chirp and Other Components in the Time-Frequency Plane<br />

using the Hough and Radon Transforms<br />

Sridhar Krishnan and Rangaraj M. Rangayyan<br />

Dept. of Electrical and Computer Engineering, The <strong>University</strong> of Calgary,<br />

2500 <strong>University</strong> Drive NW, Calgary, Alberta T2N 1N4, CANADA.<br />

Email: (krishnan) (ranga)@enel.ucalgary.ca<br />

Abstract: We propose a novel approach to detect chirp<br />

(linear frequency modulated) components in multicomponent<br />

nonstationary signals in the time-frequency (TF) plane.<br />

The approach, based on the Hough and Radon transforms<br />

of TF distributions, can be used to detect chirp components<br />

with varying energy in unknown signal-to-noise ratio environments.<br />

In addition to detection of chirps, the proposed<br />

technique could also be used as a tool to evaluate the TF<br />

resolution provided by different TF analysis methods.<br />

I. INTRODUCTION<br />

Time-frequency distributions (TFD) give the energy dis-<br />

tribution of a signal in the time-frequency (TF) plane, and<br />

are suitable for analyzing nonstationary signals. In particu-<br />

lar, a TFD gives information about the time, frequency, and<br />

combined TF dynamics of a signal. Stationary, Dirac, and<br />

chirp (linear frequency modulated or FM) components of a<br />

signal appear as directional components in the TF plane.<br />

The directional components may be narrow or broad in the<br />

TF plane depending upon the resolution of the TF transfor-<br />

mation used and the energy spread of the component. If the<br />

signal energy is oriented only horizontally in the TF plane<br />

(i.e., a stationary component) or only vertically (i.e., a Dirac<br />

component), then signal detection is easy, and optimal de-<br />

tection can be achieved by computing the marginal densities<br />

of the TFD. However, in practice, chirp components may<br />

occur with arbitrary TF orientations.<br />

Detection of chirp components helps in understanding<br />

the underlying TF dynamics of a signal. Many methods of<br />

chirp detection have been proposed in the literature; typ-<br />

ical applications of chirp detection are found in synthetic<br />

aperture radar, communication over time-varying multipath<br />

channels, and seismology. A method for optimal detection<br />

of chirp components based on a maximum likelihood ap-<br />

proach was proposed by Kay and Boudreaux-Bartels [1].<br />

This method of chirp detection is equivalent to the Radon<br />

transform (RT) of the TFD obtained using the Wigner dis-<br />

tribution. The Radon-Wigner method of chirp detection is<br />

computationally expensive, and an efficient implementation<br />

based on a dechirping method was proposed by Wood and<br />

Barry [2]. The Hough transform (HT) could be used instead<br />

of the RT to detect arbitrary shapes in TF planes which<br />

are not necessarily straight lines (chirps). A Wigner-Hough<br />

method to detect chirp and nonlinear FM components was<br />

0-7803-3905-3/97/$10.00 © 1997 IEEE<br />



Authorized licensed use limited to: <strong>Ryerson</strong> <strong>University</strong> Library. Downloaded on July 6, 2009 at 16:05 from IEEE Xplore. Restrictions apply.<br />

Fig. 1. Block diagram of the proposed method: nonstationary signal → matching pursuit algorithm → time-frequency (TF) atoms → Wigner distribution of TF atoms (TFD) → Hough/Radon (HR) transform → threshold → detection of chirps.<br />

proposed by Barbarossa and Lemoine [3], [4].<br />

The motivation of this work is to apply a combined<br />

Hough-Radon (HR) transform (HRT) on TFDs obtained us-<br />

ing the matching pursuit (MP) method to detect energy-<br />

varying chirps with unknown parameters. The block dia-<br />

gram of the proposed method is shown in Fig.1. The use of<br />

the MP technique facilitates application of this method in<br />

environments with unknown signal-to-noise ratio (SNR).<br />

II. THE HOUGH-RADON TRANSFORM<br />

The TFD of a multicomponent nonstationary signal can<br />

be obtained using the MP method proposed by Mallat and<br />

Zhang [5]. In MP, the given signal is decomposed into a linear<br />

expansion of waveforms, known as TF atoms, selected<br />

from a large dictionary of Gabor functions. The TF atoms<br />

corresponding to only the coherent structures of the signal<br />

can be extracted, and the SNR of the signal with unknown<br />

noise power can be increased. The TFD obtained as a summation<br />

of the Wigner transforms of TF atoms is of high TF<br />

resolution and free of interference.<br />

To detect chirp components (straight lines at arbitrary<br />

orientations in the TF plane), the HT may be used. The HT<br />

is a common technique to detect lines and curves that satisfy<br />

a parametric constraint [6].<br />

The HT is most commonly used as follows: Consider a point $(x_i, y_i)$ in the image plane (please note that the image plane here denotes the TF plane). The general equation of a straight line in the slope-intercept form is $y_i = m x_i + b$, where $m$ is the slope, $b$ is the intercept with the $y$ axis, and<br />


the x and y axes correspond to the t and w axes respectively.<br />

There are an infinite number of lines that pass through a<br />

point (xi, yi) and still satisfy the equation yi = mxi + b, for<br />

varying values of the parameters m and b. Parameterizing<br />

the TF plane into the (m, b) parameter space poses a problem<br />

because of the unbounded nature of m and b. One way to<br />

avoid this problem is to use the normal representation of a<br />

line given by<br />

$x\cos\theta + y\sin\theta = \rho$. (1)<br />

The parameter space $(\rho, \theta)$, also known as the Hough do-<br />

main, is now bounded in $\theta$ to the interval $[0, \pi]$ and in $\rho$<br />

by the Euclidean distance to the farthest point in the image<br />

from the centre of the image.<br />

From Eqn. 1, for a specific point in the TF plane $(t_i, \omega_i)$,<br />

we obtain a sinusoidal curve in the Hough domain. All of<br />

the sinusoids resulting from a mapping of a line in the TF<br />

plane have a common point of intersection in the Hough<br />

domain. Thus, chirps in the TF plane will correspond to<br />

high-intensity points in the Hough domain.<br />
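The Hough mapping above, extended with the gray-level weighting used by the HRT, can be sketched as a weighted accumulator (our own minimal discretization; the count-times-sum cell weighting follows the paper's description of the HRT):<br />

```python
import numpy as np

def hough_radon(image, n_theta=180, n_rho=64):
    """Weighted Hough (HR) accumulator for a gray-level image/TFD.

    Each nonzero pixel (x, y) votes on every (theta, rho) cell satisfying
    x*cos(theta) + y*sin(theta) = rho. Per cell we keep both the sum of
    gray values and the count of collinear points, and return their product,
    so high-energy directional components map to bright HR cells.
    """
    thetas = np.linspace(0.0, np.pi, n_theta, endpoint=False)
    rho_max = np.hypot(*image.shape)
    sums = np.zeros((n_theta, n_rho))
    counts = np.zeros((n_theta, n_rho))
    ys, xs = np.nonzero(image)
    for x, y, g in zip(xs, ys, image[ys, xs]):
        rhos = x * np.cos(thetas) + y * np.sin(thetas)
        j = np.clip(np.round((rhos + rho_max) / (2 * rho_max) * (n_rho - 1)),
                    0, n_rho - 1).astype(int)
        sums[np.arange(n_theta), j] += g
        counts[np.arange(n_theta), j] += 1
    return thetas, sums * counts

# A vertical line of varying gray level (a "Dirac-like" TF component):
# its votes pile up at theta near 0 in the HR domain
img = np.zeros((32, 32))
img[:, 16] = np.linspace(1.0, 2.0, 32)
thetas, acc = hough_radon(img)
theta_peak = thetas[np.unravel_index(np.argmax(acc), acc.shape)[0]]
```

Thresholding the resulting accumulator (rather than binarizing the TFD beforehand) is what lets energy-varying chirps survive detection.<br />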

A. Hough-Radon Algorithm for Chirp Detection<br />

The computational attractiveness of the HT arises from subdivision of the Hough domain into accumulator cells. The cell at coordinates $(i, j)$, with accumulator value $A(i, j)$, corresponds to the square associated with the parameter coordinates $(\theta_i, \rho_j)$. Initially, the cells are set to zero.<br />

For every point $(t_k, \omega_k)$ in the TF plane, we let the parameter $\theta$ equal each of the allowed subdivision values on the $\theta$ axis and solve for the corresponding $\rho$ using Eqn. 1. The resulting $\rho$'s are then rounded off to the nearest allowed value on the $\rho$ axis. If a particular $\theta_i$ value results in the solution $\rho_j$, the corresponding accumulator $A(i, j)$ is incremented.<br />

At the end of the procedure, a value of $M$ in $A(i, j)$ corresponds to $M$ points in the TF plane lying on the line $t\cos\theta_i + \omega\sin\theta_i = \rho_j$. It is evident that more subdivisions in the Hough domain will lead to a more accurate estimate of collinear points, but at the expense of additional computational complexity. In this work, the full ranges of $\theta$ and $\rho$ were used.<br />

The main drawback of the HT is that it is usually performed on binary images, and hence may not be appropriate for gray-level images and TFDs. As energy values of chirps vary in the TF plane, they occupy different gray levels, with 255 corresponding to the highest scaled energy component. It is not appropriate to binarize the TF image, since the HT will then not be able to detect energy-varying chirp components. This drawback can be avoided by using the combined HRT.<br />

With the HRT, the algorithm is exactly similar to the one discussed earlier, except that instead of counting the number of collinear points in a cell, the gray values of collinear points are added to each cell and then multiplied with the total number of collinear points in that cell. In this way, chirp components appear as high-intensity points in the HR domain, and the brightness increases with the energy of the chirp.<br />

The HRT is a powerful way of determining directional elements (such as chirps) in gray-level images (such as TFDs), but lacks by itself the capability to eliminate components that do not contribute coherently to a particular directional pattern. The high-intensity components (features of interest) in the HR domain can be extracted by using gradient operators. A gradient operator that may be used in the Hough domain is the simple 3×3 mask shown below [7]:<br />

$\begin{bmatrix} 0 & -2 & 0 \\ 1 & 2 & 1 \\ 0 & -2 & 0 \end{bmatrix}$<br />

A drawback with this approach is that the filter was designed for detecting lines of one pixel width, and cannot be used to detect broad directional components. As chirps are normally broad components because of the inherent tradeoff between time and frequency resolution of TF transforms, the above filter may often fail.<br />

The problem discussed above can be overcome by not using a convolution mask, but instead using a method to detect the peaks corresponding to broad components by applying a suitable threshold in the HR domain. Values in cells greater than the threshold will be retained, and values lower than the threshold will be set to zero. The threshold is selected based on local statistics in the HR domain. First, the histogram (probability density function) of the HR image is calculated, and the mean is computed as $M = \frac{\sum_g g\, h(g)}{r \times c}$, where $h(g)$ is the frequency of occurrence of the $g$th gray level of the HR image with $r$ rows and $c$ columns. Then, the threshold is computed as<br />

threshold = signal factor $\times\ M$. (3)<br />

The signal factor is dependent on the type of the signal being analyzed.<br />

B. Mathematical Proof<br />

It can be mathematically shown that the HRT (or more generally, the RT) of a TFD provides maximum likelihood (ML) detection of a chirp signal. Wood and Barry [2] have stated that the RT of the general Wigner TFD is equivalent to the ML estimate of a chirp. In this paper, the above statement is mathematically proved, and the results can be directly extended to the interference-free TFD obtained using MP. The ML detection of a linear chirp is given by [1]:<br />

$L = \max_{\omega_0, m} \int_{-\infty}^{\infty} W(t, \omega_0 + mt)\, dt > \eta$, (4)<br />

where<br />

$W(t,\omega) = \frac{1}{2\pi} \int_{-\infty}^{\infty} x^*\!\left(t + \frac{\tau}{2}\right) x\!\left(t - \frac{\tau}{2}\right) e^{-j\omega\tau}\, d\tau$ (5)<br />



Fig. 2. Graphical illustration of the HRT of a TFD.<br />

is the Wigner distribution of signal $x(t)$, and $m$ is the slope of the path of integration in the TF plane as shown in Fig. 2. Geometrically, the line integration in Eqn. 4 is similar to the RT of $W(t,\omega)$ given by<br />

$\mathcal{R}\{W(t,\omega)\} = \int_{-\infty}^{\infty} W(\rho\cos\theta - s\sin\theta,\ \rho\sin\theta + s\cos\theta)\, ds$, (6)<br />

where $\mathcal{R}$ is the Radon operator. Using Eqn. 5 in Eqn. 6,<br />

$\mathcal{R}\{W(t,\omega)\} = \frac{1}{2\pi}\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} x^*(\rho\cos\theta - s\sin\theta + \tau/2)\, x(\rho\cos\theta - s\sin\theta - \tau/2)\, \exp(-j(\rho\sin\theta + s\cos\theta)\tau)\, d\tau\, ds$. (7)<br />

From Fig. 2, $m = -\cot\theta$ and $\omega_0 = \rho/\sin\theta$. For ML estimation of a chirp, $\omega = \omega_0 + mt$ from Eqn. 4. Therefore<br />

$\omega_0 + mt = \frac{\rho}{\sin\theta} - \frac{\cos\theta}{\sin\theta}(\rho\cos\theta - s\sin\theta) = \frac{\rho - \rho\cos^2\theta + s\sin\theta\cos\theta}{\sin\theta} = \rho\sin\theta + s\cos\theta$. (8)<br />

Now, transforming Eqn. 7 to Cartesian co-ordinates and using Eqn. 8, we get<br />

$\mathcal{R}\{W(t,\omega)\} = \frac{1}{2\pi}\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} x^*\!\left(t + \frac{\tau}{2}\right) x\!\left(t - \frac{\tau}{2}\right) \exp(-j(\omega_0 + mt)\tau)\, d\tau\, dt = \int_{-\infty}^{\infty} W(t, \omega_0 + mt)\, dt$. (9)<br />

This proves that the RT or the HRT of a general Wigner TFD (or the TFD obtained by MP) is equivalent to ML detection of chirps.<br />
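This equivalence can be checked numerically on a discrete TFD: the line-integral statistic of Eqn. 4, maximized over a small parameter grid, recovers the parameters of a synthetic chirp ridge (illustrative data, not from the paper):<br />

```python
import numpy as np

def line_integral(tfd, w0, m):
    """Discrete version of Eqn. 4: sum W(t, w0 + m*t) over time samples."""
    n_t, n_w = tfd.shape
    total = 0.0
    for t in range(n_t):
        w = int(round(w0 + m * t))
        if 0 <= w < n_w:
            total += tfd[t, w]
    return total

# Synthetic TFD containing one chirp ridge w = 5 + 2t over weak noise
rng = np.random.default_rng(1)
tfd = 0.01 * rng.random((20, 50))
for t in range(20):
    tfd[t, 5 + 2 * t] = 1.0

# Exhaustive ML search over a small (w0, m) grid
best = max(((line_integral(tfd, w0, m), w0, m)
            for w0 in range(10) for m in range(4)),
           key=lambda z: z[0])
_, w0_hat, m_hat = best
```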

C. Analysis of TF Resolution<br />

The proposed technique can also be used to evaluate<br />

the TF resolution provided by different TFDs. A test signal<br />

composed of a sine and a Dirac component is passed through<br />

the TFD generator (e.g., Choi-Williams, spectrogram), and<br />

the HRT of the output is computed. The TF resolution of<br />

Fig. 3. Testing the thresholded HR method with an image. (a) Original image, (b) HR image, (c) After applying the gradient mask operator to (b), and (d) After thresholding (b).<br />


Fig. 4. Directional concentration plots provided by two methods. (a)<br />

Gradient mask method, (b) Threshold method.<br />

the TFD can be evaluated by observing the peaks at 0 and 90<br />

degrees in the HR domain. A TFD with good time resolution<br />

will have a narrow component at 90 degrees, whereas a TFD<br />

with good frequency resolution will have a narrow component<br />

at 0 degree. At present there is no technique available to<br />

check the TF resolution provided by different TFDs. The<br />

proposed method should be a good tool to evaluate the TF<br />

resolution performance of different TF methods.<br />
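The resolution test described above amounts to measuring component widths for the sine and Dirac parts of the test signal. A simple discrete stand-in (our own formulation, measuring half-maximum widths of the TFD marginals on hypothetical sharp and smeared TFDs rather than in the HR domain itself):<br />

```python
import numpy as np

def width_at_half_max(profile):
    """Number of samples at or above half of the profile's maximum."""
    return int(np.sum(profile >= 0.5 * profile.max()))

def tf_resolution(tfd):
    """rows: time, cols: frequency.
    Time resolution: width of the Dirac component in the time marginal.
    Frequency resolution: width of the sine component in the frequency
    marginal. Smaller widths mean better resolution."""
    time_marginal = tfd.sum(axis=1)
    freq_marginal = tfd.sum(axis=0)
    return width_at_half_max(time_marginal), width_at_half_max(freq_marginal)

# Hypothetical TFDs of a sine + Dirac test signal: one sharp, one smeared
sharp = np.zeros((64, 64))
sharp[32, :] += 1.0      # Dirac at time sample 32
sharp[:, 20] += 1.0      # sine at frequency bin 20
blurred = np.zeros((64, 64))
for d in (-2, -1, 0, 1, 2):
    blurred[32 + d, :] += 0.5
    blurred[:, 20 + d] += 0.5
t_sharp, f_sharp = tf_resolution(sharp)
t_blur, f_blur = tf_resolution(blurred)
```

A high-resolution TFD (e.g., the MP-based one) would behave like `sharp`; a spectrogram with a long window would smear the Dirac and behave like `blurred` along time.<br />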

III. RESULTS<br />

Experiment 1: The proposed method was tested on the non-<br />

TF image shown in Fig. 3(a). The image has broad direc-<br />

tional features, comparable to what is expected with chirp<br />

signals in the TF plane. The HR image is shown in Fig.3(b),<br />

from which it is evident that the broad directional compo-<br />

nents in the test image correspond to bright, broad com-<br />

ponents in the HR domain. The HR image also has other<br />

less intense components that do not relate to the directional


Fig. 5. Results with synthetic signal 1. (a) Synthetic signal (x axis: time samples, y axis: amplitude), (b) TFD of (a) (x axis: time, y axis: frequency), (c) HR image (after thresholding), x axis: θ (0 to π), y axis: ρ.<br />

features. The less-intense components are reduced to some<br />

extent by using the 3x3 mask. The mask did not perform<br />

well in extracting the broad components. By thresholding<br />

the parameter space using the threshold as in Eqn.3, the<br />

broad components were extracted, as illustrated in Fig.3(d).<br />

Figs. 4(a) and 4(b) show the directional concentration of the<br />

HR distributions in Figs. 3(c) and 3(d). From Fig. 4(b) it is evi-<br />

dent that the directional components are well resolved by the<br />

threshold method as compared to the gradient mask method.<br />

Experiment 2: The proposed method of chirp detection was<br />

tested on two synthetic signals of known time and frequency<br />

dynamics. The synthetic signals were computed using a si-<br />

nusoid (frequency dynamics), and two chirps (TF dynamics)<br />

with some random noise. Both the synthetic signals had the<br />

above components, but with different time durations. Each<br />

signal was decomposed into TF atoms (Gabor functions) by<br />

using the MP algorithm, with the maximum number of TF<br />

atoms allowed being 200. The TFDs of the signals were com-<br />

puted by adding the Wigner distributions of the TF atoms.<br />

The TFDs of the two signals are shown in Figs.5 and 6. It<br />

is interesting to note that frequency dynamics (sine com-<br />

ponent) and TF dynamics (chirp components) are treated<br />

equally well by this representation. The TFD obtained us-<br />

ing MP is interference-free, and gives the best possible TF<br />

resolution among all the TFD methods available.<br />
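The MP-based TFD construction described above can be sketched numerically. The Gaussian-blob shape used here for each atom's Wigner distribution, and all atom parameters, are illustrative assumptions rather than the implementation used in the paper:<br />

```python
import numpy as np

def atom_wvd(t, f, t0, f0, scale):
    # Wigner distribution of a single Gabor atom modelled as a
    # cross-term-free Gaussian blob centred at (t0, f0); the fixed
    # frequency spread is chosen purely for display
    st, sf = scale, 0.02
    return np.exp(-0.5 * (((t - t0) / st) ** 2 + ((f - f0) / sf) ** 2))

def mp_tfd(atoms, n_t=256, n_f=256):
    # sum the atom Wigner distributions (weighted by squared MP
    # coefficients) to obtain an interference-free TFD
    t = np.linspace(0.0, 1.0, n_t)[None, :]
    f = np.linspace(0.0, 0.5, n_f)[:, None]
    tfd = np.zeros((n_f, n_t))
    for coeff, t0, f0, scale in atoms:
        tfd += coeff ** 2 * atom_wvd(t, f, t0, f0, scale)
    return tfd

# a chirp ridge approximated by a chain of atoms, plus a constant-frequency
# (sine) component -- both are treated identically by this representation
atoms = [(1.0, tc, 0.1 + 0.3 * tc, 0.05) for tc in np.linspace(0.1, 0.9, 20)]
atoms += [(0.8, tc, 0.25, 0.05) for tc in np.linspace(0.1, 0.9, 20)]
tfd = mp_tfd(atoms)
```

Because each atom contributes only its own blob, no cross terms appear between the chirp and the sine component.<br />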

The HRT was applied to the TFDs in Figs.5(b) and<br />

6(b). The two chirp components appear as bright spots in<br />

the HR image at angles of about 60° and 120°. The TFD<br />

of Fig.5(b) has broader components than that in Fig.6(b);<br />

this is because of the lower TF resolution of shorter-duration<br />

signals. This difference can be seen in Figs.5(c) and 6(c).<br />

Authorized licensed use limited to: <strong>Ryerson</strong> <strong>University</strong> Library. Downloaded on July 6, 2009 at 16:05 from IEEE Xplore. Restrictions apply.<br />

Fig. 6. Results with synthetic signal 2. (a) Synthetic signal, (b) TFD<br />

of (a), (c) HR image (after thresholding).<br />

IV. CONCLUSION<br />

A novel approach to detect chirps in unknown SNR<br />

environments has been proposed in this paper. The com-<br />

bined Hough and Radon transform of a TFD obtained using<br />

the MP method provides maximum likelihood detection of<br />

chirps. The problem of identifying directional components<br />

and their dynamics in the TF plane simplifies to locating<br />

high-intensity spots in the HR plane. The method could be<br />

extended to detect arbitrary patterns in the TF plane by<br />

using the generalized HT.<br />
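The core idea, locating directional TF components as high-intensity spots in a parameter plane, can be sketched with a plain rotate-and-sum Radon transform; the combined Hough-Radon transform and the generalized HT of the paper are not reproduced here, and the test image is made up:<br />

```python
import numpy as np
from scipy.ndimage import rotate

def radon(image, angles_deg):
    # discrete Radon transform by rotate-and-sum: each row of the output
    # is the projection of the image onto the x axis at one angle
    return np.stack([rotate(image, a, reshape=False, order=1).sum(axis=0)
                     for a in angles_deg])

def detect_line(tf_image, n_angles=180):
    # directional components map to high-intensity spots in the
    # (angle, offset) parameter plane; return the brightest one
    angles = np.linspace(0.0, 180.0, n_angles, endpoint=False)
    hr = radon(tf_image, angles)
    i, j = np.unravel_index(np.argmax(hr), hr.shape)
    return angles[i], j, hr

# synthetic TF image containing one straight (chirp-like) ridge
img = np.zeros((128, 128))
idx = np.arange(128)
img[idx, idx] = 1.0              # a diagonal line
theta, offset, hr = detect_line(img)
```

The diagonal ridge collapses into a single bright spot in the parameter plane, which is what makes peak picking sufficient for detection.<br />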

Acknowledgements: We gratefully acknowledge support from<br />

the Alberta Heritage Foundation for Medical <strong>Research</strong> and<br />

the Natural Sciences and Engineering <strong>Research</strong> Council of<br />

Canada.<br />

REFERENCES<br />

[1] S. Kay and G.F. Boudreaux-Bartels. On the optimality of the Wigner<br />
distribution for detection. IEEE ICASSP, pages 2331-2334, 1986.<br />
[2] J.C. Wood and D.T. Barry. Radon transformation of time-<br />
frequency distributions for analysis of multicomponent sig-<br />
nals. IEEE Transactions on Signal Processing, 42(11):3166-3177,<br />
November 1994.<br />
[3] S. Barbarossa. Analysis of multicomponent LFM signals by com-<br />
bined Wigner-Hough transform. IEEE Transactions on Signal Pro-<br />
cessing, 43(6):1511-1515, June 1995.<br />
[4] S. Barbarossa and O. Lemoine. Analysis of nonlinear FM signals<br />
by pattern recognition of their time-frequency representation. IEEE<br />
Signal Processing Letters, 3(4):112-115, April 1996.<br />
[5] S.G. Mallat and Z. Zhang. Matching pursuits with time-frequency<br />
dictionaries. IEEE Transactions on Signal Processing, 41(12):3397-<br />
3415, 1993.<br />
[6] R.C. Gonzalez and P. Wintz. Digital Image Processing. Addison-<br />
Wesley, Reading, MA, second edition, 1987.<br />
[7] W.A. Rolston. Directional image analysis. Master's thesis, Dept.<br />
of Elect. and Comp. Engg., The Univ. of Calgary, Calgary, AB,<br />
Canada, 1994.<br />


Spatial-Temporal Decorrelating Decision-Feedback Multiuser Detector<br />

for Synchronous Code-Division Multiple- Access Channels<br />

Sridhar Krishnan and Brent R. Petersen<br />

Dept. of Electrical and Computer Engineering, The University of Calgary,<br />

2500 <strong>University</strong> Drive NW, Calgary, Alberta T2N 1N4, CANADA.<br />

Email: {krishnan, bp}@enel.ucalgary.ca<br />

Abstract - In this paper, a new multiuser detector<br />

for synchronous code-division multiple-access channels is<br />

developed, and the performance is compared with other<br />

multiuser detectors. The proposed multiuser detector is<br />

based on spatial-temporal filtering and decision-feedback<br />

techniques, hence the name spatial-temporal decorrelat-<br />

ing decision-feedback (STDF) detector. An optimum<br />

STDF detector is expected to have an exponential com-<br />
plexity as the number of users grows. A suboptimum<br />

STDF detector shows a better performance in terms of<br />

probability of error (or SNR) and asymptotic efficiency<br />

as compared to the other suboptimum detectors. Simu-<br />

lation results under diverse channel conditions show that<br />

STDF is a bandwidth efficient technique, which is an es-<br />

sential requirement for modern wireless communications.<br />

Also the results indicate that STDF performance gains<br />

are more significant for relatively weak users.<br />

I. INTRODUCTION<br />

Multiuser communications has been an active area<br />

of research and has numerous applications, especially in<br />

the area of wireless communications. There are several<br />

different ways in which multiple users can communicate<br />

through the channel to the receiver. One method is to<br />

allow more than one user to share a channel or a sub-<br />

channel by use of a unique code sequence or signature<br />

waveforms that allows the user to spread the informa-<br />

tion signal across the assigned bandwidth. This multiple-<br />

access method is called the code-division multiple-access<br />

(CDMA). The objective of this work is to develop an effi-<br />

cient multiuser detector based on spatial filtering (beam-<br />

formers) and decision feedback for synchronous CDMA<br />

channels.<br />

In a CDMA system, the receiver may have an idea<br />

about the assigned signature waveforms and observes the<br />

sum of the transmitted signals in additive white Gaussian<br />

noise. The conventional single-user detector consists of a<br />

bank of single-user ma,tched filters followed by thresh-<br />

old detectors. If the assigned signature waveforms are<br />

orthogonal and if the powers of the users are not very<br />

different then the conventional detector would achieve<br />

optimum demodulation [l]. It has been shown that the<br />

optimum maximum likelihood receiver employing a gen-<br />

eralization of the Viterbi algorithm significantly outper-<br />

0-7803-3905-3/97/$10.00 © 1997 IEEE<br />

forms the conventional single-user detector at the expense<br />

of a high computational complexity [2]. Since these conditions<br />

are often difficult to satisfy in practice, several<br />

suboptimum detectors have been proposed in literature<br />

[1], [3], [4], [5].<br />

A. The Linear Decorrelating Detector<br />

The linear decorrelating detector, also known as the<br />
decorrelator, can significantly outperform the conven-<br />
tional single-user detector [1]. This is because the decor-<br />
relator takes into account the outputs of all the matched<br />
filters in making a decision, as opposed to the single<br />
matched-filter output used in the conventional detector. As the outputs of<br />

all the matched filters are considered in making decisions,<br />

the multiuser interference among users can be easily ex-<br />

ploited. The signal at the output of the matched filter is<br />

given as<br />

y = RWb + n, (1)<br />

where R is the crosscorrelation matrix of the signature waveforms, and<br />
W is a diagonal matrix with W_k,k = √E_k, k = 1, ..., K,<br />
E_k being the received energy of user k.<br />


The ordering of the users is explained in the next section. Also, a decorrelating<br />
decision-feedback (DF) detector for synchronous<br />
relating decision-feedback (DF) detector for synchronous<br />

and asynchronous CDMA channels have been proposed<br />

in the literature [3], [4]. Performance gains of DF detec-<br />

tors are more significant than those of the decorrelator, especially for<br />

relatively weak users [3], [4].<br />
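The decorrelator of Eqn. 1 can be illustrated numerically. The correlation, amplitude, and noise values below are made up; the matched-filter noise is drawn with covariance σ²R, and the weaker user's error rate is compared for the conventional detector and the decorrelator:<br />

```python
import numpy as np

rng = np.random.default_rng(0)

# two synchronous users; R models highly correlated signature waveforms
R = np.array([[1.0, 0.7],
              [0.7, 1.0]])
W = np.diag([1.0, 0.4])            # user 2 is the weaker user
b = rng.choice([-1.0, 1.0], size=(2, 5000))

sigma = 0.3
Lr = np.linalg.cholesky(R)
n = sigma * Lr @ rng.standard_normal(b.shape)   # noise covariance sigma^2 R
y = R @ W @ b + n                               # matched-filter outputs (Eqn. 1)

b_conv = np.sign(y)                        # conventional single-user detector
b_deco = np.sign(np.linalg.solve(R, y))    # decorrelator: R^{-1} y = W b + noise

err_conv = np.mean(b_conv[1] != b[1])      # weaker user's bit error rate
err_deco = np.mean(b_deco[1] != b[1])
```

With these values the strong user's interference exceeds the weak user's amplitude at the matched-filter output, so the decorrelator's error rate is markedly lower for the weak user, at the cost of noise enhancement by R⁻¹.<br />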

B. Spatial-Temporal Decorrelating (STD) Detector<br />

Integrated spatial-temporal processing of the re-<br />

ceived signal has been shown to provide significant per-<br />

formance improvement over the decorrelator [5]. In most<br />

cases the multiple users are distributed in space in such<br />

a way that they are intercepted at the detector from var-<br />

ious directions. By exploiting the signals' spatial dis-<br />

tribution (direction matrix) and the temporal properties<br />

(crosscorrelation matrix R) followed by linear decorrela-<br />

tion (decorrelator) a uniformly superior performance has<br />

been achieved. The signal at the output of the matched<br />

filter is given as<br />

y = MWb + n. (3)<br />

The above equation is similar to Eqn. 1 except for the<br />
matrix M, which here is the spatial-temporal<br />
crosscorrelation matrix. The next stage of processing is<br />

similar to the decorrelator and the output of the matrix<br />

filter in this case is<br />

ỹ = M⁻¹y = Wb + ñ, (4)<br />

where ñ is a Gaussian noise vector with the autocorrela-<br />
tion matrix σ²M⁻¹. Comparative results with<br />

other suboptimum detectors have shown superior perfor-<br />

mance gains of STD detectors [5]. The superior perfor-<br />

mance of STD detectors has been the motivation of this<br />

work, in which spatial filtering is combined with decision-<br />

feedback (STDF). The STDF detector is derived in the next<br />
section, and its performance is compared with the other<br />

suboptimum detectors.<br />

II. SPATIAL-TEMPORAL DECORRELATING<br />

DECISION-FEEDBACK (STDF) DETECTOR<br />

It has already been shown that for STD the complex-<br />

ity of the detector grows exponentially as the number of<br />

users increases [5]. Hence, the same complexity can be ex-<br />

pected for optimum STDF detectors. Here only a subop-<br />

timum STDF is considered. The suboptimum STDF de-<br />

tector is derived by exploiting the spatial-temporal cross-<br />

correlation matrix M given in Eqn.3. The matrix M can<br />

be written as the Hadamard or Schur product of matrices<br />

A and R.<br />

M = R ∘ (AᴴA), (5)<br />

where ∘ denotes the Hadamard product of matrices [6].<br />

A is the direction matrix comprising the direction vec-<br />
tors a₁, a₂, ..., a_K, i.e., A = [ a₁ a₂ ... a_K ], and the<br />

[Figure: block diagram of the detector (bank of matched filters followed by a spatial filter and feedback section); graphic not recoverable from the scan.]<br />

direction vector a_k = [ 1 e^{jθ_k2} ... e^{jθ_kP} ]ᵀ. The direction<br />

vector ak expresses the phases and gains of the P<br />

sensors relative to a reference sensor in the direction of<br />

arrival of the wavefront of user k. Aᴴ is the conjugate<br />
transpose of A. Analysis shows that R is a positive-definite<br />
matrix.<br />

Also, it can be easily shown that AᴴA is always a<br />
positive-definite matrix, which implies that M will also<br />
be a positive-definite matrix. If M is a positive-definite<br />

matrix, then it can be decomposed as<br />

M = LᴴL, (6)<br />

where Lᴴ is an upper-triangular matrix and L is a lower<br />

triangular matrix. The above method of matrix decom-<br />

position is known as the Cholesky decomposition [7]. The<br />

matrices Lᴴ and L correspond to causal and stable ma-<br />

trix filters. If the filters are represented as a spectrum<br />

(frequency domain representation), then the Cholesky de-<br />

composition can be viewed as a spectral factorization the-<br />

orem [8].<br />
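The construction of M and its Cholesky factorization can be sketched as follows. The steering-vector model (uniform linear array, half-wavelength spacing) and the angles are assumptions for illustration; note that numpy returns the lower factor C of M = C Cᴴ, whereas the paper writes the same factorization as M = LᴴL:<br />

```python
import numpy as np

def steering(theta, P):
    # direction vector of a P-sensor uniform linear array (half-wavelength
    # spacing assumed), phases relative to the reference sensor
    return np.exp(1j * np.pi * np.arange(P) * np.sin(theta))

R = np.array([[1.0, 0.7],
              [0.7, 1.0]])                    # temporal crosscorrelation
A = np.column_stack([steering(np.deg2rad(10.0), 4),
                     steering(np.deg2rad(60.0), 4)])

M = R * (A.conj().T @ A)                      # Hadamard product R o (A^H A)

# by Schur's product theorem M inherits positive definiteness from
# R and A^H A, so a Cholesky factorization exists
C = np.linalg.cholesky(M)
```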

In STDF, the sampled output y of the matched fil-<br />
ter is passed through the feedforward filter (Lᴴ)⁻¹, and<br />
the resulting output vector of the matrix filter (Lᴴ)⁻¹ is<br />
given as<br />

ỹ = (Lᴴ)⁻¹y = LWb + ñ, (7)<br />

where ñ is a white Gaussian noise vector whose autocorrelation<br />
matrix is diagonal with variance σ². Therefore, the<br />
feedforward filter (Lᴴ)⁻¹ is nothing but a whitening fil-<br />

ter. The model given in Eqn.7 is a white noise model of<br />

the CDMA channel. Also the expression given in Eqn.7<br />

makes analysis simpler. The kth component of the vector<br />


ỹ can be written as<br />

ỹ_k = L_k,k √E_k b_k + Σ_{j=1}^{k-1} L_k,j √E_j b_j + ñ_k.<br />

From the above equation it can be seen that the kth<br />
component has a desired term, an interference term, and<br />
a white noise term. With the interference removed by<br />
feedback, the SNR of the kth user is L_k,k² E_k / σ².<br />
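A toy simulation of the whitening-plus-decision-feedback chain built on Eqn. 7, using numpy's M = C Cᵀ Cholesky convention; with this convention the users are ordered by increasing energy, so the strongest user sits at the last index and is detected first (the mirror image of the paper's strongest-first ordering). All numerical values are made up:<br />

```python
import numpy as np

rng = np.random.default_rng(1)
K, nbits, sigma = 2, 5000, 0.3

M = np.array([[1.0, 0.6],
              [0.6, 1.0]])         # spatial-temporal crosscorrelation
W = np.diag([0.4, 1.0])            # user 0 weak, user 1 strong
b = rng.choice([-1.0, 1.0], size=(K, nbits))

C = np.linalg.cholesky(M)          # M = C C^T, C lower triangular
y = M @ W @ b + sigma * C @ rng.standard_normal((K, nbits))

# feedforward (whitening) filter: C^{-1} y = C^T W b + white noise
yt = np.linalg.solve(C, y)
U = C.T                            # upper triangular

# decision feedback: detect bottom-up, subtracting already-decided users
bhat = np.zeros_like(b)
for k in range(K - 1, -1, -1):
    interf = U[k, k + 1:] @ (W[k + 1:, k + 1:] @ bhat[k + 1:])
    bhat[k] = np.sign(yt[k] - interf)

err_weak = np.mean(bhat[0] != b[0])
err_strong = np.mean(bhat[1] != b[1])
```

The strongest user sees no feedback, so its error rate matches the STD case, while the weak user benefits from having the strong user's contribution cancelled before slicing.<br />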

The above model gives rise to the decision-<br />

feedback technique used in STDF. The users' signals<br />
in STDF are arranged in decreasing order of energy<br />
(E₁ ≥ E₂ ≥ ... ≥ E_K), where user 1 is the strongest<br />
user and user K the weakest.


Fig. 2. Probability of error curves for weaker user in a two-user system for different combinations of signal cross correlations and angular<br />

separations. Legend: corr- correlation, sep- separation, SD- spatial decorrelator, STDF- spatial-temporal decorrelating decision<br />

feedback, STD- spatial-temporal decorrelating, DF- decision feedback.<br />

received energy of the user divided by the power spectral<br />

density level (No) of the background thermal white Gaus-<br />

sian noise (not including interference from other users).<br />

In essence, the efficiency represents the performance loss<br />

due to multiuser interference. The desirable figure of<br />

merit is the asymptotic efficiency η_k, obtained as the back-<br />
ground Gaussian noise level goes to zero, i.e.,<br />

η_k = lim_{N₀→0} SNR_eff / SNR_act,<br />

which characterizes the underlying performance loss<br />

when the dominant impairment is the existence of other<br />

users rather than the additive channel noise.<br />

For the STDF detector, the expression in<br />
Eqn. 17 shows that the ideal probability of error does not depend on<br />
the noise level or the power of the interferers; the ideal<br />
asymptotic efficiency of STDF follows directly from that expression.<br />

In the following simulation experiments, the perfor-<br />

mance figures of the proposed STDF detector have been<br />

compared with that of the other suboptimum detectors.<br />

IV. SIMULATION RESULTS<br />

A. Signal-to-Noise Ratio<br />

For comparing the SNR of the multiuser detectors, two<br />

experiments were performed. In experiment 1, a two-user<br />

system was considered. For the two-user system different<br />

types of channel and direction matrix combinations were<br />

tried. The two channels considered were<br />

R1 has a low crosscorrelation factor, whereas the cross-<br />
correlations between the users' signature waveforms are<br />
relatively high in the case of R2. In simple words, R2 simu-<br />
lates a bandwidth-efficient channel. Also, two different<br />
spatial distributions of the users were considered.<br />

A1 corresponds to a low angular (13°) separation<br />

between the two users, whereas in case of A2 the users<br />

are separated by an angle of 67.5°.<br />

Fig.2 shows the probability of error graphs for the weaker<br />

user. In Fig.2(a) the users have low signal crosscorrela-<br />

tions and low angular separation between them. STDF<br />

performs slightly better than STD. Also, DF performs bet-<br />
ter than the decorrelator. The advantage of spatial filtering is<br />

clearly evident from the superior performances of STDF<br />

and STD over DF and decorrelator. The same observa-<br />

tions can be found in Fig.2(b), but the spatial decorrela-<br />

tor (SD) [5] shows some performance improvement, which<br />
is to be expected for highly separated users.<br />

Figs.2(c) and 2(d), corresponding to highly corre-<br />
lated user signature waveforms, show interesting results.<br />



Fig. 3. Probability of error curves for stronger user in a two-user system for different combinations of signal cross correlations and<br />

angular separations. Legend same as in Fig.2.<br />

The STDF detector clearly outperforms the other detec-<br />

tors. The results indicate that STDF can be used in<br />

bandwidth-efficient CDMA channels where the signature<br />

waveforms have significant crosscorrelations. In case of<br />

Fig.2(d) the SD shows a slight improvement over the<br />

decorrelator.<br />

Fig.3 shows the graphs for the stronger user of the<br />

two. The graphs clearly indicate that there is no perfor-<br />

mance difference between STDF and STD, and also be-<br />

tween DF and decorrelator. The only factor which makes<br />

STDF or STD better is the spatial filtering. As there is<br />

no feedback involved for the strongest user in STDF and DF,<br />
the error rates are identical to those of STD and the decor-<br />
relator, respectively (this agrees with theory). Also,<br />
in Fig.3(d) the SD shows a slightly better performance<br />

than the decorrelator, further confirming that in case of<br />

highly correlated and highly separated users, spatial fil-<br />

tering will make a significant contribution towards SNR<br />

improvement.<br />

In experiment 2, a four-user system was considered.<br />

The signal crosscorrelation matrix was given by<br />

     [ 1    0.5  0.4  0.2 ]<br />
R3 = [ 0.5  1    0.8  0.6 ]<br />
     [ 0.4  0.8  1    0.3 ]<br />
     [ 0.2  0.6  0.3  1   ]<br />

and the direction matrix of a sensor with respect to a<br />

reference sensor in the direction of arrival of the users'<br />

wavefronts is<br />

Fig.4 shows the probability of error curves for the<br />

above four-user system. The four-user system simulates a<br />

multiuser environment better than the two-user examples<br />

considered earlier. From the graphs, it is clearly evident<br />

that the STDF detector shows a better performance than<br />

the other suboptimum detectors considered in this work.<br />

As the users become stronger the performance difference<br />

between STDF and STD narrows down. The results con-<br />

firm that STDF is a very powerful detection technique<br />

for relatively weak users. Also, in case of the strongest<br />

user there is no performance difference between STDF<br />

and STD, which coincides with our earlier observations.<br />

B. Asymptotic Efficiency<br />

In experiment 3, the asymptotic efficiencies of differ-<br />
ent detectors were compared. As asymptotic efficiency<br />
basically measures the performance loss of a detector due<br />
to multiuser interference, a four-user system (R3 and<br />
A3) was considered: a four-user system with different<br />
signal crosscorrelations better simulates a hostile<br />
environment, with different levels of multiuser<br />
interference, than a two-user<br />

system. Fig.5 shows the histogram plot of asymptotic ef-<br />

ficiencies for all the four users. The plot shows that STDF<br />

is always more efficient than the other suboptimum de-<br />

tectors, except in the case of the strongest user, where STDF


[Fig. 4 panels (weakest user, 2nd strongest user; x axis: SNR in dB); graphics not recoverable from the scan.]<br />


18th Annual International Conference of the IEEE Engineering in Medicine and Biology Society, Amsterdam 1996<br />

4.2.2: Time-varying Analysis of Various Signals<br />

Screening of Knee Joint Vibroarthrographic <strong>Signal</strong>s by Statistical<br />

Pattern <strong>Analysis</strong> of Dominant Poles<br />

S. KRISHNAN¹, R.M. RANGAYYAN¹,², G.D. BELL²,³, C.B. FRANK²,³, K.O. LADLY³<br />

¹Dept. of Electrical and Computer Engineering, ²Dept. of Surgery, ³Sport Medicine Centre<br />

The <strong>University</strong> of Calgary, Alberta, T2N 1N4, CANADA. (Email : ranga@enel.ucalgary.ca)<br />

Abstract-<strong>Analysis</strong> of human knee joint vibration signals<br />

or vibroarthrographic (VAG) signals could lead to a non-<br />

invasive method for the diagnosis of cartilage pathol-<br />

ogy. In this study, the nonstationary VAG signals<br />

were adaptively segmented into locally stationary seg-<br />

ments. Autoregressive (AR) model coefficients were de-<br />

rived from the stationary segments by using the Burg-<br />

lattice method. The dominant poles of the models ex-<br />

tracted from the AR polynomials and a signal variability<br />

parameter were used as VAG signal features. The VAG<br />

signal features with a few relevant clinical parameters<br />

were used as feature vectors in statistical pattern classifi-<br />

cation experiments based on logistic regression analysis.<br />

The results indicated a classification accuracy of 81.7% in<br />

screening 90 VAG signals with no restriction imposed on<br />

the type of abnormal signals, and an accuracy of 93.7%<br />

in classifying 71 VAG signals with abnormal signals re-<br />

stricted to a specific type of articular cartilage pathology<br />

known as chondromalacia patella.<br />

I. INTRODUCTION<br />

Vibration signals emitted from human knee joints<br />

during normal movement of the leg, known as vi-<br />

broarthrographic (VAG) signals, are expected to be as-<br />

sociated with roughness, softening, or the state of lubri-<br />

cation of the cartilage surfaces, and may be useful indi-<br />

cators of early joint degeneration or disease. VAG sig-<br />

nal analysis could decrease the need for diagnostic use of<br />

arthroscopy. A variety of VAG signal analysis techniques<br />

have been proposed in the literature (for a review of pre-<br />

vious publications, please see Moussavi et al. [1]). The<br />

present work investigates, with a large database of sig-<br />

nals, the diagnostic potential of VAG based on pattern<br />

classification experiments performed using signal model<br />

parameters and a few clinical parameters as features.<br />

II. METHODS<br />

A. Data Acquisition<br />

In order to detect the VAG signal, a Dytran ac-<br />

celerometer (model 3115a) was placed on the surface of<br />

the skin at the mid-patella position of the knee, and the<br />

signal was recorded during swinging movement of the leg<br />

from 135° to 0° to 135° in a total time period of 4 s. An<br />

electronic goniometer was placed on the lateral side of the<br />

knee to measure the angle of motion. Before digitizing the<br />

signal at a sampling rate of 2.5 kHz and 12 bits/sample,<br />

the signal was amplified and conditioned using a 10 Hz<br />

to 1 kHz bandpass filter.<br />
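The signal-conditioning chain can be sketched with scipy; the paper specifies only the 2.5 kHz sampling rate and the 10 Hz to 1 kHz band, so the filter order and the zero-phase (filtfilt) filtering below are assumptions:<br />

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

fs = 2500.0                                  # 2.5 kHz sampling rate
# 10 Hz - 1 kHz bandpass conditioning (4th-order Butterworth assumed)
sos = butter(4, [10.0, 1000.0], btype="bandpass", fs=fs, output="sos")

t = np.arange(0.0, 4.0, 1.0 / fs)            # 4 s swing record
x = np.sin(2 * np.pi * 50.0 * t)             # toy in-band vibration
x += np.sin(2 * np.pi * 2.0 * t)             # out-of-band motion drift
y = sosfiltfilt(sos, x)                      # zero-phase conditioned signal
```

The slow 2 Hz drift (e.g., limb motion) is suppressed while the in-band 50 Hz component passes essentially unchanged.<br />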

Auscultation was performed during swinging move-<br />

ment of the leg by placing a stethoscope on the lateral,<br />

medial, and anterior surfaces of the knee. The sounds<br />

heard were categorized and coded along with the ap-<br />

proximate corresponding joint angle for use as features in<br />

classification experiments. For subjects who underwent<br />

arthroscopy, the location of the pathology was used to es-<br />

timate the joint angle at which the pathological surfaces<br />

would come in contact and contribute to the VAG signal.<br />

Two databases were used in this study: (1) Database<br />
A consists of 51 normal and 39 abnormal signals, with<br />
no restriction on the type of cartilage pathology; and (2)<br />
Database B, extracted from database A, consists of 51<br />
normal and 20 abnormal signals, with the abnormals re-<br />

stricted to chondromalacia patella only. (Chondromala-<br />

cia patella is a common type of articular cartilage pathol-<br />

ogy in which the cartilage softens, fibrillates, and finally<br />

the bone is exposed.)<br />

0-7803-3811-1/97/$10.00 © IEEE 968<br />

B. Feature Extraction<br />

Like many other biological signals, VAG signals are<br />

nonstationary. Hence, in order to apply standard sig-<br />

nal processing methods such as parametric modeling or<br />

spectral analysis, the signals have to be segmented into<br />

quasi-stationary segments. In this work, VAG signals were<br />

adaptively segmented into stationary segments by using a<br />

recursive least-squares lattice algorithm [2]. An example<br />

of a VAG signal of an abnormal subject with chondroma-<br />

lacia patella, along with the final segment boundaries, is<br />

illustrated in figure 1. The VAG signal segments were au-<br />

toregressive (AR) modeled using the Burg-lattice method<br />

[2]. The transfer function of the AR or the “all pole” filter<br />

may be written as<br />

H(z) = 1 / (1 + a₁z⁻¹ + a₂z⁻² + ... + a_M z⁻ᴹ), (1)<br />

where M is the model order, and aₖ are the AR coeffi-<br />

cients [3]. By factorizing the denominator, Eq. 1 can be<br />

rewritten as<br />

H(z) = 1 / [(z − p₁)(z − p₂)(z − p₃) ⋯ (z − p_M)], (2)<br />


Fig. 1. VAG signal of an abnormal subject with chondromalacia<br />

patella. The vertical lines represent segment boundaries. au:<br />

Arbitrary units.<br />

where p₁, p₂, ..., p_M are the complex poles of the model. A<br />

model order of 40 was used [2]. Since the model order was<br />

an even number, the poles occurred in conjugate pairs.<br />

The distance r of a pole from the origin in the complex<br />
z-plane determines its spectral bandwidth, f_B, as<br />

f_B = cos⁻¹[ ((1 + r²) − 2(1 − r)²) / (2r) ].<br />

Poles with a large r contribute to the dominant peaks<br />

in the signal spectrum [4]. The superior performance of<br />

poles in tracking the frequency or spectral behavior of a<br />

signal makes them an appropriate choice for parametric<br />

representation of signals with multi-peaked spectra, such<br />

as VAG signals.<br />
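A compact sketch of the Burg-lattice AR fit, the extraction of dominant poles, and the pole-bandwidth formula above; the test signal and the small model order are illustrative, not the paper's order-40 VAG models:<br />

```python
import numpy as np

def burg(x, order):
    # Burg-lattice AR estimate; returns [1, a1, ..., aM] for
    # A(z) = 1 + a1 z^-1 + ... + aM z^-M
    f = np.asarray(x, float).copy()
    b = f.copy()
    a = np.array([1.0])
    for _ in range(order):
        num = -2.0 * np.dot(f[1:], b[:-1])
        den = np.dot(f[1:], f[1:]) + np.dot(b[:-1], b[:-1])
        k = num / den                       # reflection coefficient, |k| < 1
        f, b = f[1:] + k * b[:-1], b[:-1] + k * f[1:]
        a = np.concatenate([a, [0.0]]) + k * np.concatenate([[0.0], a[::-1]])
    return a

def pole_features(a, n_dominant=3):
    poles = np.roots(a)
    poles = poles[poles.imag >= 0]          # one pole per conjugate pair
    poles = poles[np.argsort(-np.abs(poles))][:n_dominant]
    r = np.abs(poles)
    # spectral bandwidth of each pole; the clip keeps arccos in range
    # for poles far from the unit circle
    fb = np.arccos(np.clip(((1 + r**2) - 2 * (1 - r)**2) / (2 * r), -1.0, 1.0))
    return poles, r, fb

rng = np.random.default_rng(0)
n = np.arange(1000)
x = np.sin(2 * np.pi * 0.1 * n) + 0.05 * rng.standard_normal(1000)
a = burg(x, 6)
poles, r, fb = pole_features(a)
```

The dominant pole lands near the sinusoid's frequency with a radius close to 1, i.e., a narrow bandwidth, which is exactly what makes dominant poles good spectral-peak trackers.<br />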

C. Pattern <strong>Analysis</strong><br />

From the twenty poles (complex conjugate pole pairs<br />

were represented by only one pole from each pair) of the<br />

model of each VAG signal segment, six poles with the<br />

highest r were selected as the dominant poles. The six<br />

dominant poles; a signal variability parameter computed<br />

as the variance of the means (VM) of the segments of a<br />

VAG signal record; and a few clinical parameters such as<br />

the type of sound heard during auscultation, age, gender,<br />

and activity level of the subject were used to form feature<br />

vectors for use in classification experiments.<br />

The accuracy rate in classification of VAG signal<br />

segments into normal and abnormal groups was deter-<br />

mined by applying logistic regression analysis [5] on ran-<br />

dom splits of the databases into training and test sets.<br />

The VAG signal segments used in the test sets were to-<br />
tally different from and independent of the VAG signal<br />

segments used in the training sets. The overall accuracy<br />

rate was calculated as the percentage of the correctly-<br />

classified segments in the test set.<br />
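A minimal stand-in for the classification experiment: plain gradient-ascent logistic regression on synthetic feature vectors with a random training/test split. The paper used the SPSS logistic regression routine on real pole and clinical features; the data, dimensions, and optimizer here are assumptions:<br />

```python
import numpy as np

def train_logistic(X, y, lr=0.1, iters=3000):
    # gradient-ascent fit of a logistic model (bias + weights)
    Xb = np.hstack([np.ones((len(X), 1)), X])
    w = np.zeros(Xb.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-np.clip(Xb @ w, -30.0, 30.0)))
        w += lr * Xb.T @ (y - p) / len(y)
    return w

def accuracy(w, X, y):
    Xb = np.hstack([np.ones((len(X), 1)), X])
    return np.mean(((Xb @ w) > 0.0) == (y > 0.5))

rng = np.random.default_rng(0)
# synthetic stand-ins for the pole + clinical feature vectors
X = np.vstack([rng.normal(0.0, 1.0, (100, 7)),     # "normal" class
               rng.normal(1.0, 1.0, (100, 7))])    # "abnormal" class
y = np.r_[np.zeros(100), np.ones(100)]

idx = rng.permutation(200)              # random split: disjoint train/test
tr, te = idx[:140], idx[140:]
w = train_logistic(X[tr], y[tr])
acc = accuracy(w, X[te], y[te])
```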

III. RESULTS<br />

Several random split experiments were conducted<br />

with database A and database B. Table I shows<br />

TABLE I<br />
CLASSIFICATION ACCURACY WITH DATABASE A AND DATABASE B<br />

Database | Normal Segments | Abnormal Segments | Overall<br />
A        | 201/211 (95.3%) | 36/79 (45.6%)     | 237/290 (81.7%)<br />
B        | 188/195 (96.4%) | 35/43 (81.4%)     | 223/238 (93.7%)<br />

the best test results obtained. The use of poles instead<br />

of the AR model coefficients [2] has provided an increase<br />

in classification accuracy of about 2 to 3%.<br />

IV. DISCUSSION<br />

The results confirm that VAG signal analysis is in-<br />

deed a potential tool for noninvasive diagnosis of artic-<br />

ular cartilage pathology. The results with database B<br />

further indicate that the proposed methods have poten-<br />

tial in detecting chondromalacia patella with noninvasive<br />

procedures.<br />

The use of AR model poles has the advantage that<br />

the pole frequencies could be directly related to domi-<br />

nant frequency components present in the signals. Such a<br />

parametric representation of signals should facilitate bet-<br />

ter description and understanding of signal and system<br />

characteristics than the use of more abstract parameters<br />

such as the AR model coefficients.<br />

Future work will be directed towards wavelet analysis<br />

for improved feature analysis of the nonstationary VAG<br />

signals, which may overcome some of the approximations<br />

involved in our current segmentation-based approach.<br />

ACKNOWLEDGEMENT<br />

We gratefully acknowledge support of this project<br />

with grants from the Arthritis Society of Canada and the<br />

Alberta Heritage Foundation for Medical <strong>Research</strong>.<br />

REFERENCES<br />

[1] Z.M.K. Moussavi, R.M. Rangayyan, G.D. Bell, C.B. Frank,<br />
K.O. Ladly, and Y.T. Zhang. Screening of vibroarthrographic<br />
signals via adaptive segmentation and linear prediction model-<br />
ing. IEEE Transactions on Biomedical Engineering, 43:15-23,<br />
1996.<br />
[2] S. Krishnan. Adaptive filtering, modeling, and classification of<br />
knee joint vibroarthrographic signals. Master's thesis, Dept. of<br />
Electrical and Computer Engineering, The University of Cal-<br />
gary, Calgary, AB, Canada, April 1996.<br />
[3] S. Haykin. Adaptive Filter Theory. Prentice-Hall, Englewood<br />
Cliffs, NJ, 2nd edition, 1990.<br />
[4] O. Paiss and G.F. Inbar. Autoregressive modeling of surface<br />
EMG and its spectrum with application to fatigue. IEEE<br />
Transactions on Biomedical Engineering, BME-34(10):761-<br />
769, 1987.<br />
[5] SPSS Inc., Chicago, IL. SPSS Advanced Statistics User's<br />
Guide, 1990.<br />



RECURSIVE LEAST-SQUARES LATTICE-BASED ADAPTIVE<br />

SEGMENTATION AND AUTOREGRESSIVE MODELING OF KNEE JOINT<br />

VIBROARTHROGRAPHIC SIGNALS<br />

S. Krishnan¹, R.M. Rangayyan¹,², G.D. Bell²,³, C.B. Frank²,³, K.O. Ladly³<br />

¹Department of Electrical and Computer Engineering, ²Department of Surgery, ³Sport Medicine Centre<br />

The University of Calgary, Alberta, T2N 1N4, Canada. (Email : ranga@enel.ucalgary.ca)<br />

Abstract: Vibration signals emitted during movement of the knee, known as vibroarthrographic (VAG) signals, may bear diagnostic information. We propose a new adaptive segmentation technique based on the recursive least-squares lattice algorithm to segment the non-stationary VAG signals into locally-stationary components, which are then modeled using the autoregressive Burg-Lattice method. Classification of 90 VAG signals as normal or abnormal using the signal and clinical parameters provided an accuracy of 71.1% with the leave-one-out method. When the abnormal signals were restricted to chondromalacia patella only, the classification accuracy increased to 80.3%. The results indicate that VAG is a potential tool for non-invasive screening for chondromalacia patella.<br />

1 Introduction<br />

Based on the many investigations that have been carried<br />

out on vibroarthrographic (VAG) signal analysis in the<br />

past few years, there is good evidence to suggest that<br />

the VAG or knee joint sound signal has an exciting poten-<br />

tial for distinguishing between normal and abnormal car-<br />

tilage surfaces [1]. However, in previous studies on VAG,<br />

signal classification experiments were performed on a lim-<br />

ited number of signals. Using different adaptive signal<br />

processing techniques, the present work closely investi-<br />

gates the diagnostic potential of VAG based on extensive<br />

pattern classification experiments.<br />

In this paper, utilizing a reasonably large database<br />

of 90 subjects, the following approaches and techniques<br />

are addressed:<br />

• Improved adaptive segmentation of the non-<br />

stationary VAG signals using the recursive least-<br />

squares lattice (RLSL) algorithm;<br />

• Improved autoregressive (AR) modeling of VAG sig-<br />

nal segments using the Burg-Lattice method; and<br />

• Classification of VAG signals into two groups - Nor-<br />

mal and Abnormal - using logistic regression analysis<br />

and the leave-one-out method.<br />

CCECE'96 0-7803-3143-5/96/$4.00 © 1996 IEEE<br />


The proposed methods should be useful as clinical<br />

tools for diagnosis of cartilage pathology or as tests before<br />

arthroscopy or major surgery.<br />

2 Clinical Data Acquisition<br />

Subjects sat on a rigid table with both legs suspended,<br />

and repeatedly extended and flexed their legs at an approximate angular speed of 67°/s; the range of motion was<br />
approximately 135° to 0° to 135° in a total time period<br />
of 4 s [1]. It has been found that this rate of movement is<br />

the most comfortable rate for subjects to move their legs<br />

smoothly with consistency [2].<br />

Auscultation was performed during swinging move-<br />

ment of the leg by placing a stethoscope on the medial,<br />

lateral, and anterior surfaces of the knee. Sounds such as<br />

pops, clicks, grinds, and clunks heard during auscultation<br />

were coded along with the approximate corresponding<br />

joint angle for use as discriminant features in classification<br />

experiments. For patients who underwent arthroscopy,<br />

the position of the observed pathology was used to esti-<br />

mate the joint angle at which the affected surfaces could<br />

come into contact and contribute to VAG or sound sig-<br />

nals. For all subjects who participated in the study, the<br />

following information was also documented: age, gender,<br />

and number of times the subject exercised per week.<br />

3 VAG <strong>Signal</strong> Recording Setup<br />

The VAG signal was detected by a Dytran (Dytran,<br />

Chatsworth, CA) miniature accelerometer (model 3115a)<br />

placed on the skin over the mid-patella of the subject dur-<br />

ing dynamic movement of the knee. The signal was amplified and conditioned by a bandpass filter of bandwidth 10 Hz to 1 kHz using Gould (Gould, Cleveland, OH) isolation pre-amplifiers (model 11-5407-58) and Gould universal amplifiers (model 13-4615-18), and recorded on a Hewlett Packard (Hewlett Packard, San Diego, CA) instrumentation recorder (model 3968A). The bandpass filter minimizes low-frequency movement artifacts and also<br />

prevents aliasing effects. A National Instruments (Na-<br />

tional Instruments, Austin, TX) AT-MIO-16L data acquisition board and Lab Windows (National Instruments,<br />

Austin, TX) software on a Zenith (Zenith, Los Angeles,<br />

CA) 386 computer were used to digitize the signals at a<br />

sampling rate of 2.5 kHz and 12 bits/sample. The data<br />

were then transferred to a SUN (SUN, Cupertino, CA)<br />

Sparcstation for processing.<br />

An electronic goniometer to measure the angle of the<br />

limb during movement was placed on the lateral aspect of<br />

the knee with the axis of rotation at the joint line. The<br />

signal from the goniometer was converted after digitiza-<br />

tion to the real angle in degrees based on the voltage of<br />

the goniometer at 0' and 90'. In this study, two databases<br />

were used :<br />

• Database AB, which consists of VAG signals of 51<br />

normal subjects, includes historically normal as well<br />

as symptomatic subjects who underwent arthroscopy<br />

and were found to be normal, and VAG signals<br />

of 39 symptomatic subjects with arthroscopically-<br />

confirmed cartilage pathology; and<br />

• Database C, extracted from database AB, which con-<br />

sists of 51 normal VAG signals and 20 abnormal<br />

VAG signals (restricted to chondromalacia patella<br />

only). Among the 20 chondromalacia patella cases,<br />

17 had additional pathology such as meniscal tears<br />

and chondromalacia of the tibial plateau.<br />

4 Adaptive Segmentation<br />

VAG signals are recorded during swinging movement of<br />

the knee, over a range of motion of 135' to 0' (exten-<br />

sion) and 0' to 135' (flexion). This kind of movement<br />

causes the joint surfaces to rub against each other, and<br />

also against the under-surface of the patella. The regions<br />

of the joint surfaces coming in contact are different at<br />

each position during the swing. The contact area may<br />

not be the same for every swing even for the same angle<br />

position: further, the quality of the joint surfaces com-<br />

ing in contact may change with joint angle. This means<br />

that signals of different characteristics are expected at<br />

different joint angles. As the statistical characteristics of<br />

the VAG signals are time-variant, the signals are non-<br />

stationary in nature. Hence, in order to apply standard<br />

signal processing techniques such as parametric modeling<br />

or spectral analysis on VAG signals, the signals have to<br />

be first adaptively segmented into locally-stationary seg-<br />

ments or components.<br />

Adaptive segmentation of VAG signals has already<br />

been reported in the literature by Tavathia et al. [3]<br />

and Moussavi et al. [1]. The new adaptive segmenta-<br />

tion method developed in the present work is based on<br />

the RLSL algorithm. The advantage in using a lattice fil-<br />

ter for segmentation of VAG signals is that the statistical<br />


changes in the signals are well reflected in the filter pa-<br />

rameters, and hence segment boundaries can be detected<br />

by monitoring any one of the filter parameters such as<br />

the mean squared error, conversion factor, or the reflec-<br />

tion coefficients. Also, under certain circumstances, the<br />

required segmentation filter order can be predicted from<br />

the forward prediction error power. It was found that for<br />

VAG signals, the ensemble-averaged forward prediction<br />

error power (computed using 35 primary VAG signals)<br />

reaches a constant value after a model order of six. In<br />

this study, the conversion factor (γ) has been used to<br />
monitor statistical changes in the VAG signals. In a stationary environment, γ starts with an initial value of zero<br />
and remains small during the early part of the initialization period. After a few iterations, γ begins to increase<br />
rapidly towards a final value of unity [4]. In the case of<br />
non-stationary signals such as VAG, γ will fall from its<br />

steady value of unity whenever a change occurs in the<br />

statistics of the signal. This can be used in segmenting<br />

the VAG signal into locally-stationary components. The<br />

segmentation algorithm, in brief, is as follows:<br />

1. The VAG signal is passed twice through the segmen-<br />

tation filter. The first pass is used to allow the filter<br />

to converge, and the second pass is used to test the γ<br />

value at each sample with a preferred fixed threshold<br />

value for detection of segment boundaries.<br />

2. Whenever γ at a particular time sample during the<br />
second pass is less than the threshold value, a primary segment boundary (PSB) is marked.<br />

3. If the difference between a PSB and the previous<br />

PSB of the same signal is greater than or equal to the<br />

minimum desired segment length of 120 data points<br />

[1], the PSB is marked as a final segment boundary;<br />

if not, the PSB is deleted and the process continued<br />

until all the PSBs are tested.<br />
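The boundary-marking logic of steps 2 and 3 can be sketched as follows. This is a simplified, single-pass illustration in our own notation (not the authors' code): it assumes the conversion-factor series γ has already been produced by the second pass of the RLSL filter, and it tests each candidate boundary against the last accepted boundary.<br />

```python
def segment_boundaries(gamma, threshold, min_length=120):
    """Mark a primary segment boundary (PSB) wherever gamma drops below
    the fixed threshold; keep a PSB as a final boundary only if it lies
    at least min_length samples after the previously kept boundary."""
    boundaries = [0]  # a segment implicitly starts at sample 0
    for n, g in enumerate(gamma):
        if g < threshold and n - boundaries[-1] >= min_length:
            boundaries.append(n)
    return boundaries
```

With the minimum segment length of 120 samples used in the paper, closely spaced dips in γ collapse into a single boundary.<br />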

Test results of the RLSL-based adaptive segmenta-<br />

tion method on different non-stationary synthetic signals<br />

indicated the high efficiency of this method in detecting<br />

rapid and gradual changes in signals [5]. The main advan-<br />

tage of the new method of adaptive segmentation is that<br />

the threshold is a fixed value as opposed to a variable<br />

value that was adopted in the previous study of Moussavi et al. [1]. For some signals, especially normal VAG<br />

signals, it was found that the adaptive segmentation pro-<br />

cedure gave almost the same results as manual segmen-<br />

tation based on auscultation and/or arthroscopy. Figure<br />

2 shows the plot of γ for the corresponding VAG signal<br />

in figure 1. The dashed lines in figure 1 show the final<br />

segment boundaries for the corresponding VAG signal.<br />

On the average, eight segments per VAG signal were ob-<br />

tained.


[Plot: VAG signal amplitude (au), roughly -300 to 300, versus time samples 0 to 10000.]<br />

Figure 1: VAG signal of an arthroscopically abnormal<br />

subject (chondromalacia grade III) with the final segment<br />

boundaries shown by vertical dashed lines. The final seg-<br />

ment boundaries were determined by the RLSL adaptive<br />

segmentation algorithm. au: Arbitrary units.<br />

[Plot: conversion factor γ, roughly 0.995 to 0.996, versus number of iterations 0 to 10000.]<br />

Figure 2: Plot of the conversion factor (γ) for the abnor-<br />

mal VAG signal shown in figure 1. The horizontal dashed<br />

line is the fixed threshold line.<br />


5 Autoregressive Modeling<br />

Modeling techniques such as autoregressive (AR) mod-<br />

eling, also referred to as “all-pole” modeling, provide<br />

parameters which could potentially be correlated with<br />

the physiological system producing the signals. The AR<br />

model is a linear, second-moment stationary model. Al-<br />

though VAG signals are neither linear nor stationary,<br />

second-moment stationarity holds over VAG signal segments. Hence, appropriate analysis of VAG segments<br />
may be based on an AR model to extract all linearly-<br />

retrievable information from the signal in a minimum-<br />

variance manner. Some of the common ways to estimate<br />

the AR parameters are the autocorrelation or the Yule-Walker method [6], covariance method [6], Cholesky decomposition method [4], least-squares method [4], and the Burg-Lattice method [4]. In this study on VAG signals, an AR modeling method based on the Burg-Lattice algorithm is investigated.<br />

The Burg-Lattice method was applied to stationary<br />

VAG signal segments and the AR prediction coefficients<br />

were derived. The model order used was 40. This order<br />

was chosen based on application of the Akaike Informa-<br />

tion Criterion (AIC), and models of this order were ob-<br />

served to predict the VAG signal segments well [l]. How-<br />

ever, a performance analysis of AR model coefficients in<br />

terms of the classification accuracy rate indicated that<br />

the first six AR coefficients of VAG signal segments are<br />

adequate for pattern classification of VAG signals [5].<br />
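As an illustration of the Burg-Lattice estimation step itself, the following is a minimal NumPy sketch in our own notation (not the authors' implementation); it returns the prediction polynomial [1, a1, ..., ap] of the all-pole model x[n] + a1·x[n-1] + ... + ap·x[n-p] = e[n].<br />

```python
import numpy as np

def burg_ar(x, order):
    """Burg (lattice) estimate of AR coefficients; returns [1, a1, ..., ap]."""
    x = np.asarray(x, dtype=float)
    N = len(x)
    f = x.copy()  # forward prediction errors
    b = x.copy()  # backward prediction errors
    a = np.array([1.0])
    for m in range(1, order + 1):
        fm = f[m:].copy()           # forward errors available at stage m
        bm = b[m - 1:N - 1].copy()  # one-sample-delayed backward errors
        # reflection coefficient minimizing forward + backward error power
        k = -2.0 * np.dot(fm, bm) / (np.dot(fm, fm) + np.dot(bm, bm))
        f[m:] = fm + k * bm         # lattice order-update of the errors
        b[m:] = bm + k * fm
        # Levinson-Durbin update of the prediction polynomial
        a = np.concatenate((a, [0.0])) + k * np.concatenate(([0.0], a[::-1]))
    return a
```

The reflection coefficients k computed at each stage are the same lattice parameters whose changes the segmentation filter of Section 4 monitors.<br />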

6 VAG Pattern Classification<br />

As described in the previous section, the AR prediction<br />

coefficients were derived by modeling each VAG segment<br />

by the Burg-Lattice method. One of the obvious visual<br />

differences between normal and abnormal signals was that<br />

the abnormal signals were much more variable in ampli-<br />

tude across a swing cycle than the normal signals. How-<br />

ever, this difference is lost in the process of dividing the<br />

signals into segments and considering each segment as a<br />

separate signal. To overcome this problem, the means<br />

(time averages) of the segments of each subject’s signal<br />

were computed, and the variance of these means (VM)<br />

was computed across the various segments of the same<br />

signal. The first six AR model coefficients, the VM pa-<br />

rameter, and a few clinical parameters such as sound,<br />

age, gender, and activity level were used as discriminant<br />

features in the classification experiments.<br />
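The VM parameter can be computed directly; a small sketch in our own notation, where `segments` holds the locally-stationary segments of one subject's signal:<br />

```python
import numpy as np

def variance_of_means(segments):
    """Variance, across segments, of the per-segment means (time averages)
    of one subject's VAG signal -- the VM discriminant feature."""
    means = np.array([np.mean(seg) for seg in segments])
    return float(np.var(means))
```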

In this study, the classification of signals was done<br />

using the logistic analysis subroutine available in the Sta-<br />

tistical Package for Social Sciences (SPSS) software [7],<br />

and the leave-one-out method [8] was used to estimate<br />

the correct classification accuracy rate. In applying this<br />

method, all the segments of the VAG signal of one sub-<br />



Table 1: Comparison of different classification experi-<br />

ments by using the accuracy rates determined by applying<br />

the leave-one-out method, and the best test classification<br />

results obtained with the random split method.<br />

ject were excluded from the database, the classifier was<br />

designed with the segments of the VAG signals of the re-<br />

maining subjects, and then the VAG signal segments of<br />

the excluded subject were tested by the classifier. This<br />

operation was repeated to test all the subjects in each<br />

database. If segments spanning more than 10% of the<br />

duration of a subject’s signal were classified as abnormal,<br />

the subject was labeled as an abnormal subject; other-<br />

wise the subject was labeled as a normal subject. The<br />

number of correctly-classified subjects was then counted<br />

to estimate the classification accuracy rate. Since each<br />

test subject is excluded from the training sample set in<br />

turn, independence between the test set and the training<br />

set is maintained.<br />
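The subject-level decision rule just described can be sketched as follows (our own notation; the per-segment labels would come from the logistic-regression classifier, which we do not reproduce here):<br />

```python
def label_subject(segment_lengths, segment_labels, frac=0.10):
    """Label a subject abnormal (1) if segments classified abnormal span
    more than `frac` of the total signal duration, else normal (0)."""
    total = sum(segment_lengths)
    abnormal = sum(n for n, lab in zip(segment_lengths, segment_labels) if lab == 1)
    return 1 if abnormal > frac * total else 0
```

Note that a subject whose abnormal segments span exactly 10% of the signal is labeled normal, since the rule requires more than 10%.<br />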

Further, in another procedure, the accuracy rate in<br />

classifying the VAG signal segments into two groups was<br />

determined by applying logistic analysis on random splits<br />

of the databases into training and test sets. The VAG<br />

signal segments used in the test sets were totally different<br />

and independent from the VAG signal segments used in<br />

the training sets. The overall accuracy rate of a training<br />

or a test set was given as the percentage of the number of<br />

correctly-classified segments in the training/test stage to<br />

the total number of segments in the training/test stage.<br />

7 Results<br />

Table 1 shows the classification results with database AB<br />

and database C. Several random split experiments were<br />

conducted [5], and Table 1 shows the best test classifi-<br />

cation results obtained with the random split method.<br />

From the results of the leave-one-out and random split<br />

methods, we can infer that the proposed method shows a<br />

better classification result with database C, and is sensi-<br />

tive to chondromalacia patella cases.<br />

8 Discussion and Further Work<br />

Substantial numbers of normal and abnormal VAG sig-<br />

nals were analyzed in this work, and the results confirm<br />
that VAG signal analysis is indeed a potential tool<br />

for non-invasive diagnosis of articular cartilage pathology.<br />


Also, the proposed method has shown tremendous po-<br />

tential in detecting chondromalacia patella (results with<br />

database C) with non-invasive procedures.<br />

Future work will be directed towards time-<br />

scale/time-frequency analysis for improved feature anal-<br />

ysis of the non-stationary VAG signals, which may over-<br />

come the approximations involved in our current para-<br />

metric approach and the difficulties in segment-based<br />

analysis, and could lead to improved identification of dif-<br />

ferent types of cartilage pathology.<br />

Acknowledgements<br />

We gratefully acknowledge support of this project<br />

with grants from the Arthritis Society of Canada and the<br />

Alberta Heritage Foundation for Medical <strong>Research</strong>.<br />

References<br />


[1] Z.M.K. Moussavi, R.M. Rangayyan, G.D. Bell, C.B.<br />

Frank, K.O. Ladly, and Y.T. Zhang. Screening of<br />

vibroarthrographic signals via adaptive segmentation<br />

and linear prediction modeling. IEEE Transactions<br />

on Biomedical Engineering, 43(1):15-23, 1996.<br />

[2] K.O. Ladly. <strong>Analysis</strong> of patellofemoral joint vibration<br />

signals. Master’s thesis, The <strong>University</strong> of Calgary,<br />

Calgary, AB, Canada, 1992.<br />

[3] S. Tavathia, R.M. Rangayyan, C.B. Frank, G.D. Bell,<br />

K.O. Ladly, and Y.T. Zhang. <strong>Analysis</strong> of knee vi-<br />

bration signals using linear prediction. IEEE Trans-<br />

actions on Biomedical Engineering, 39(9):959-970,<br />

1992.<br />

[4] S. Haykin. Adaptive filter theory. Prentice-Hall, En-<br />

glewood Cliffs, N.J., 2nd edition, 1990.<br />

[5] S. Krishnan. Adaptive filtering, modeling, and classi-<br />

fication of knee joint vibroarthrographic signals. Mas-<br />

ter’s thesis, Submitted, Dept. of Electrical and Com-<br />

puter Engineering, The <strong>University</strong> of Calgary, Cal-<br />

gary, AB, Canada, April 1996.<br />

[6] J. Makhoul. Linear prediction: A tutorial review.<br />

Proc. IEEE, 63(4):561-580, 1975.<br />

[7] SPSS Inc., Chicago, IL. SPSS Advanced Statistics<br />

User’s Guide, 1990.<br />

[8] K. Fukunaga. Introduction to Statistical Pattern<br />

Recognition. Academic Press, Inc., San Diego, CA.,<br />

2nd edition, 1990.


Other Refereed Conference Papers<br />

T. Tabatabaei, S. Krishnan and A. Guergachi, Speech-based emotion recognition using<br />

sequence discriminant Support Vector Machines, 4 pages in CDROM Proc. Canadian Medical<br />

and Biological Engineering Conference (CMBEC), Toronto, Ontario, May 2007.<br />

O. Nedjah, A. Hussein, S. Krishnan, R. Sotudeh, CN tower lightning current derivative Heidler<br />

model for the validation of wavelet de-noising algorithm, In Proc. 18th International Wroclaw<br />

Symposium and Exhibition on Electromagnetic Compatibility, Wroclaw, Poland, pp:282 – 287,<br />

June 2006.<br />

A. Morrison, S. Krishnan, A. Anpalagan and B. Grush, Receiver autonomous mitigation of GPS<br />

non-line-of-sight multipath errors, 6 pages in Proc. ION National Technical Meeting, Monterey,<br />

California, January 2006.<br />

A. Ramalingam and S. Krishnan, Video fingerprinting using space-time and Gaussian mixture<br />

models, 4 pages in Proc. Canadian Workshop on Information Technology (CWIT), Montreal,<br />

Quebec, June 2005.<br />

K. Momen, S. Krishnan, D. Beal, E. Bouffet, B. Kavanagh, T. Chau, Self-organization of the<br />

communication space based on user range-of-motion: a framework for configuring non-contact<br />

augmentative communication devices, 4 pages in Proc. Canadian Medical And Biological<br />

Engineering Conference, Quebec City, Quebec, September 9-11, 2004<br />

J. Lukose and S. Krishnan, EEG signal analysis for screening alcoholics, 4 pages in Proc.<br />

International Dynamics of Continuous, Discrete, and Impulsive Systems (DCDIS) Conference,<br />

Guelph, Ontario, May 2003.<br />

K. Umapathy and S. Krishnan, Pathological voice screening using local discriminant bases,<br />

4 pages in Proc. International Dynamics of Continuous, Discrete, and Impulsive<br />

Systems (DCDIS) Conference, Guelph, Ontario, May 2003.<br />

S. Erkucuk, S. Krishnan and M. Zeytinoglu, A novel technique for digital audio watermarking,<br />

Student Contest Presentation at the IEEE International Conference on Multimedia and Expo<br />

(ICME), Lausanne, Switzerland, August 2002. (Won the IBM T.J. Watson <strong>Research</strong> award for<br />

innovative ideas)<br />

K. Umapathy, S. Krishnan, and S. Jimaa, Time-frequency analysis of wideband speech and<br />

audio, 2 pages in Proc. Micronet Annual Workshop, Aylmer, Quebec, April 2002.<br />

S. Krishnan, R.M. Rangayyan, and K. Umapathy, A time-frequency approach for auditory<br />

display of time-varying signals, in Proc. IASTED International Conference on <strong>Signal</strong> and Image<br />

Processing, Hawaii, USA, pp 236-241, August 2001.<br />

K. Umapathy and S. Krishnan, Joint time-frequency coding of audio signals, in Proc. 5th<br />

WSES/IEEE Multiconference on Circuits, Systems, Communications, and Computers, Crete,<br />

Greece, pp 32-36, July 2001.<br />



K. Umapathy and S. Krishnan, Low bit-rate time-frequency coding of wideband audio signals, in<br />

Proc. IASTED International Conference on <strong>Signal</strong> Processing, Pattern Recognition and<br />

Applications, Rhodes, Greece, pp 101-105, July 2001.<br />

R.M. Rangayyan, S. Krishnan, G.D. Bell, and C.B. Frank, Computer-aided auscultation of knee<br />

joint vibration signals. In Proc. European Medical and Biological Engineering Conference,<br />

Vienna, Austria, pp: 464-465, Nov. 1999.<br />

S. Krishnan and R.M. Rangayyan, Knee joint vibration signal analysis using adaptive time-frequency<br />

distributions. In Proc. European Medical and Biological Engineering Conference,<br />

Vienna, Austria, pp: 466-467, Nov. 1999.<br />

S. Krishnan and R.M. Rangayyan, Feature identification in the time-frequency distributions of<br />

knee joint vibroarthrographic signals using Hough and Radon transforms. In Proc. International<br />

Conference on Robotics, Vision, and Parallel Processing, Tronoh, Malaysia, pp: 82-89, July<br />

1999.<br />

R.M. Rangayyan, S. Krishnan, G.D. Bell, C.B. Frank, and K.O. Ladly, Impact of muscle<br />

contraction interference cancellation on vibroarthrographic screening, Proc. International<br />

Conference on Biomedical Engineering, Kowloon, Hong Kong, pp 16-19, June 1996. (invited<br />

paper)<br />

S. Krishnan, R.M. Rangayyan, G.D. Bell, C.B. Frank, and K.O. Ladly, Adaptive segmentation<br />

and cepstral analysis of vibroarthrographic signals for non-invasive diagnosis of knee joint<br />

cartilage pathology, Proc. 22nd Canadian Medical and Biological Engineering Conference,<br />

Charlottetown, PEI, Canada, pp 8-9, June 1996.<br />

N. Kumaravel and S. Krishnan, Knowledge based biosignal processing system for diagnosing<br />

heart disorders, Proc. International Conference on Robotics, Vision, and Parallel Processing,<br />

Ipoh, Malaysia, pp 602-609, May 1994.<br />

S. Krishnan, An expert diagnostic system using signal processing tool, in Proc. International<br />

conference on expert systems for development, Bangkok, Thailand, March 1994.<br />

