Signal Analysis Research (SAR) Group - RNet - Ryerson University
Signal Analysis Research (SAR) Group
Refereed Conference Papers
May 1996 - December 2007
Table of Contents

2007.12 Construction of Discriminative Positive Time-Frequency Distributions
K. Umapathy and S. Krishnan

2007.10 Combining Vocal Source and MFCC Features for Enhanced Speaker Recognition Performance Using GMMs
D. Hosseinzadeh and S. Krishnan

2007.08 Multiresolution Analysis and Classification of Small Bowel Medical Images
A. Khademi and S. Krishnan

2007.06 Interference Detection in Spread Spectrum Communication Using Polynomial Phase Transform
R. Zarifeh, S. Krishnan, A. Anpalagan and N. Alinier

2007.05 Emotion Recognition Using Novel Speech Signal Features
T.S. Tabatabaei, S. Krishnan and A. Guergachi

2007.04 A Watermarking Method for Speech Signals Based on the Time-Warping Signal Processing Concept
C. Ioana, A. Jarrot, A. Quinquis and S. Krishnan

2006.12 Chirp-Based Image Watermarking as Error-Control Coding
B. Ghoraani and S. Krishnan

2006.07 Automatic Content-Based Image Retrieval Using Hierarchical Clustering Algorithms
K. Jarrah, S. Krishnan and L. Guan

2006.07 Computational Intelligence Techniques and their Applications in Content-Based Image Retrieval
K. Jarrah, M. Kyan, S. Krishnan and L. Guan

2006.07 Discrete Polynomial Transform for Digital Image Watermarking Application
L. Le, S. Krishnan and B. Ghoraani

2006.05 Improving Position Estimates From a Stationary GNSS Receiver Using Wavelets and Clustering
M. Aram, B. Li, S. Krishnan and A. Anpalagan

2006.05 Keystroke Identification Based on Gaussian Mixture Models
D. Hosseinzadeh, S. Krishnan and A. Khademi

2006.05 Soccer Video Retrieval Using Adaptive Time-Frequency Methods
J. Marchal, C. Ioana, E. Radoi, A. Quinquis and S. Krishnan
2006.05 Support Vector Machines Based Approach for Chemical Phosphorus Removal Process in Wastewater Treatment Plant
T.S. Tabatabaei, T. Farooq, A. Guergachi and S. Krishnan

2005.11 Data Embedding in Miu-Law Speech with Spread Spectrum Techniques
L. Zhang, H. Ding and S. Krishnan

2005.09 Comparison of JPEG 2000 and Other Lossless Compression Schemes for Digital Mammograms
A. Khademi and S. Krishnan

2005.07 Gaussian Mixture Modeling Using Short Time Fourier Transform Features for Audio Fingerprinting
A. Ramalingam and S. Krishnan

2005.05 Multipath Mitigation of GNSS Carrier Phase Signals for an On-Board Unit for Mobility Pricing
R. Puri, A. El Kaffas, A. Anpalagan, S. Krishnan and B. Grush

2005.03 A Signal Classification Approach Using Time-Width VS Frequency Band Sub-Energy Distributions
K. Umapathy and S. Krishnan

2005.03 Indexing of NFL Video Using MPEG-7 Descriptors and MFCC Features
S.G. Quadri, S. Krishnan and L. Guan

2004.12 Audio Signal Feature Extraction and Classification Using Local Discriminant Bases
K. Umapathy, S. Krishnan and R.K. Rao

2004.05 A Novel Robust Image Watermarking Using a Chirp Based Technique
A. Ramalingam and S. Krishnan

2004.05 A Novel Way of Lossless Compression of Digital Mammograms Using Grammar Codes
X. Li, S. Krishnan and N.-W. Ma

2004.05 Content Based Audio Classification and Retrieval Using Joint Time-Frequency Analysis
S. Esmaili, S. Krishnan and K. Raahemifar

2004.05 Modified Local Discriminant Bases and its Applications in Signal Classification
K. Umapathy and S. Krishnan

2004.05 Radio Over Multimode Fiber for Wireless Access
R. Yuen, X.N. Fernando and S. Krishnan
2004.05 Sub-Dictionary Selection Using Local Discriminant Bases Algorithm for Signal Classification
K. Umapathy, S. Krishnan and A. Das

2003.09 Ultrasound Backscatter Signal Characterization and Classification Using Autoregressive Modeling and Machine Learning Algorithms
N.R. Farnoud, M. Kolios and S. Krishnan

2003.07 Robust Audio Watermarking Using a Chirp Based Technique
S. Erkucuk, S. Krishnan and M. Zeytinoglu

2003.07 Time-Frequency Filtering of Interferences in Spread Spectrum Communications
S. Erkucuk and S. Krishnan

2003.05 A General Perceptual Tool for Evaluation of Audio Codecs
K. Umapathy, S. Krishnan and G. Sinanian

2003.05 Non-Stationary Noise Cancellation in Infrared Wireless Receivers
S. Krishnan, X. Fernando and H. Sun

2003.04 Adaptive Denoising at Infrared Wireless Receivers
X.N. Fernando, S. Krishnan, H. Sun and K. Kazemi-Moud

2003.04 Audio Watermarking Using Time-Frequency Characteristics
S. Esmaili, S. Krishnan and K. Raahemifar

2002.10 Time-Frequency Modeling and Classification of Pathological Voices
K. Umapathy, S. Krishnan, V. Parsa and D. Jamieson

2002.08 Audio Signal Classification Using Time-Frequency Parameters
K. Umapathy, S. Krishnan and S. Jimaa

2002.05 Detection of Linear Chirp and Non-Linear Chirp Interferences in a Spread Spectrum Signal by Using Hough-Radon Transform
S. Thayilchira and S. Krishnan

2002.05 Discrimination of Pathological Voices Using an Adaptive Time-Frequency Approach
K. Umapathy, S. Krishnan, V. Parsa and D. Jamieson

2002.05 Interference Excision in Spread Spectrum Communications Using Adaptive Positive Time-Frequency Distributions
S. Erkucuk and S. Krishnan

2001.05 Fixed Block-Based Lossless Compression of Digital Mammograms
M.Y. Al-Saiegh and S. Krishnan
2001.05 Instantaneous Mean Frequency Estimation Using Adaptive Time-Frequency Distributions
S. Krishnan

2000.07 Sonification of Knee-joint Vibration Signals
S. Krishnan, R.M. Rangayyan, G.D. Bell and C.B. Frank

1999.05 Denoising Knee Joint Vibration Signals Using Adaptive Time-Frequency Representations
S. Krishnan and R.M. Rangayyan

1998.10 Comparative Analysis of the Performance of the Time-Frequency Distributions with Knee Joint Vibroarthrographic Signals
R.M. Rangayyan and S. Krishnan

1998.10 Detection of Nonlinear Frequency-Modulated Components in the Time-Frequency Plane Using an Array of Accumulators
S. Krishnan and R.M. Rangayyan

1997.10 Time-Frequency Signal Feature Extraction and Screening of Knee Joint Vibroarthrographic Signals Using the Matching Pursuit Method
S. Krishnan, R.M. Rangayyan, G.D. Bell and C.B. Frank

1997.08 Detection of Chirp and Other Components in the Time-Frequency Plane Using the Hough and Radon Transform
S. Krishnan and R.M. Rangayyan

1997.08 Spatial-Temporal Decorrelating Decision-Feedback Multiuser Detector for Synchronous Code-Division Multiple-Access Channels
S. Krishnan and B.R. Petersen

1996.10 Screening of Knee Joint Vibroarthrographic Signals by Statistical Pattern Analysis of Dominant Poles
S. Krishnan, R.M. Rangayyan, G.D. Bell, C.B. Frank and K.O. Ladly

1996.05 Recursive Least-Squares Lattice-Based Adaptive Segmentation and Autoregressive Modeling of Knee Joint Vibroarthographic Signals
S. Krishnan, R.M. Rangayyan, G.D. Bell, C.B. Frank and K.O. Ladly

Other Refereed Conference Papers
Construction of Discriminative Positive Time-frequency Distributions

Karthikeyan Umapathy and Sridhar Krishnan
Dept. of Electrical and Computer Engineering, Ryerson University, Toronto, Canada
Email: (karthi)(krishnan)@ee.ryerson.ca
Abstract— Positive time-frequency energy distributions (PTFD) are suitable for studying the non-stationary dynamics of a signal. Instantaneous features extracted from the PTFD are often used in classification applications where the discriminatory clue lies in the non-stationary behavior of the signal. From a classification point of view, it would be desirable to identify and extract instantaneous features that correspond to only the discriminative portion of the signal. By doing so, we gain the added advantage of eliminating the overlap from the non-discriminatory portion of the signal in the instantaneous feature extraction process. In this paper, we propose a front-end processing scheme using a novel time-width versus frequency band mapping that facilitates the construction of PTFDs corresponding to only the discriminatory portion of the signal. Instantaneous features extracted from these PTFDs are readily discriminative and attractive for classification and characterization applications. The proposed method is demonstrated with a speech classification example.

Keywords: Time-frequency, Positive time-frequency distributions, Instantaneous features, Matching pursuits, Time-width versus frequency band mapping.
I. INTRODUCTION

Joint time-frequency (TF) analysis has been widely employed for extracting TF features from non-stationary signals. While parametric TF decomposition approaches are highly suitable for objective feature extraction, the non-parametric Cohen's class TF energy distributions (TFDs) are usually used to extract instantaneous TF features. In order to extract meaningful instantaneous features from a TFD, the TFD has to satisfy the following properties: (i) positivity and (ii) the time and frequency marginals [1]. Positivity requires that the energy values of the TFD be >= 0. The marginal property states that integrating a TFD in the time and frequency directions should yield the instantaneous energy and the energy spectral density of the signal, respectively.
In reality, the presence of cross-terms with multicomponent signals affects the positivity of a TFD. Although in some cases we could achieve positive TFDs by compromising TF resolution, such TFDs do not satisfy the marginals. Various methods and conditions have been proposed over the years to construct PTFDs that satisfy both positivity and the marginals [2], [3]. One known way of constructing PTFDs with high TF resolution and free from cross-terms is to use the adaptive TF transformation (ATFT, i.e. matching pursuits with TF dictionaries) approach [4]. While this approach produces cross-term-free, high-resolution PTFDs, it does not satisfy the marginal properties. A correction to the marginals using minimum cross entropy (MCE) optimization was proposed in [5], which makes the ATFT-based PTFDs suitable for instantaneous feature extraction. An added advantage of the ATFT-based PTFD approach is that the building blocks of the PTFD are TF functions (represented by a set of decomposition parameters) whose parameters can be cleverly manipulated to achieve desirable effects in the PTFDs.
In the authors' previous works [6], [7], [8], a novel time-width versus frequency band (TWFB) mapping based on the ATFT was used to identify the discriminative decomposition subspaces between classes of signals. Objective TF features were extracted from these subspaces and successfully applied in real-world applications to achieve high classification accuracies. The TWFB mapping utilizes the parametric benefits of the ATFT in identifying the discriminative decomposition subspaces that are suitable for objective feature extraction. In order to extend this approach to extract instantaneous TF features from the discriminatory portion of the signal, the information provided by the TWFB mapping has to be combined with the ability of the ATFT to construct PTFDs. This possibility is explored in this paper, and a methodology to construct discriminative PTFDs using TWFB mappings is presented. These discriminative PTFDs are constructed using only the discriminative portion of a signal, which ensures that the instantaneous features extracted from them truly reflect the discriminating dynamics between different classes of signals.
The block diagram of the proposed methodology is shown in Fig. 1. The solid lines in the block diagram show the main components of the proposed work. The paper is organized as follows: Section 2 covers the methodology, with subsections on TWFB mappings, discriminative subspace selection, and the construction of discriminative PTFDs; Section 3 presents the results and discussion; and conclusions are given in Section 4.
II. METHODOLOGY

A. Time-width versus Frequency Band Mappings

TWFB mappings are constructed using the decomposition parameters of an ATFT [6], [7], [8]. In the ATFT, any given real signal is modeled using a sum of L TF functions selected from a redundant dictionary of TF functions. The TF functions used to model a real signal can be represented using five model or decomposition parameters (a_i, s_i, p_i, f_i, and φ_i). The parameter a_i is the expansion coefficient of the TF function, s_i its time-width or scale parameter, p_i its time position, f_i its center frequency, and φ_i
1-4244-0983-7/07/$25.00 © 2007 IEEE. ICICS 2007.
Authorized licensed use limited to: Ryerson University Library. Downloaded on July 7, 2009 at 11:34 from IEEE Xplore. Restrictions apply.
Fig. 1. Block diagram of the proposed methodology (Class A/B signals, TWFB mapping, discriminative TF functions, marginals, Wigner-Ville distributions, MCE optimization, discriminative PTFD). TWFB: time-width vs frequency band mapping; MCE: minimum cross entropy; PTFD: positive time-frequency distribution.
represents the phase of the TF function. The index i represents the iteration number. In our study, Gaussian TF functions were used due to their excellent TF resolution properties [1].

In order to effectively utilize the ATFT decompositions for discriminant subspace selection, the decomposition parameters need to be rearranged in a pseudo-dictionary format. Of the five TF decomposition parameters explained above, only the energy parameter a_i, the time-width s_i, and the frequency f_i are of relevance from a TF decomposition subspace point of view. This is because the main features of a TF function, and thereby of the decomposition itself, lie in these three parameters. The phase parameter φ_i provides the information about how the TF functions combine to form the signal, which is of more relevance in a signal reconstruction scenario. The information provided by the time parameter p_i is not important for identifying global signal patterns, since in most cases pattern recognition applications look only for global patterns irrespective of their occurrence in time (translation invariance). Also, the time (p_i) independence is key to bringing generality and organization to the TWFB mapping. Hence only a_i, s_i, and f_i are used in the construction of the TWFB mapping. However, it is obvious that without the time and time-varying information, neither instantaneous features nor features related to time-triggered events can be extracted. So, after locating the discriminant subspaces using the TWFB mapping, all five ATFT decomposition parameters (including p_i and φ_i) of the TF functions that correspond to the discriminatory subspaces will be used to construct the PTFD.
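The five-parameter Gaussian (Gabor) decomposition described above can be sketched as a greedy matching pursuit over a small discrete dictionary. This is an illustrative sketch, not the paper's implementation: the dictionary grids, signal length, and the function names (`gabor_atom`, `matching_pursuit`) are our own assumptions, and the phase is fixed at zero for simplicity.

```python
import numpy as np

def gabor_atom(n, s, p, f, phi=0.0):
    """Real Gaussian (Gabor) TF function of length n with time-width s,
    time position p, center frequency f (cycles/sample), phase phi;
    normalized to unit L2 norm."""
    t = np.arange(n)
    g = np.exp(-np.pi * ((t - p) / s) ** 2) * np.cos(2 * np.pi * f * (t - p) + phi)
    return g / np.linalg.norm(g)

def matching_pursuit(x, n_iter=10):
    """Greedy decomposition of x into n_iter Gabor atoms.
    Returns the list of parameter tuples (a_i, s_i, p_i, f_i, phi_i)
    and the final residue."""
    n = len(x)
    # Small illustrative dictionary: dyadic widths, coarse position/frequency grids.
    widths = [2 ** w for w in range(2, 8)]
    positions = range(0, n, max(1, n // 16))
    freqs = np.linspace(0.05, 0.45, 9)
    residual = x.astype(float).copy()
    params = []
    for _ in range(n_iter):
        best = None
        for s in widths:
            for p in positions:
                for f in freqs:
                    g = gabor_atom(n, s, p, f)
                    a = residual @ g          # expansion coefficient a_i
                    if best is None or abs(a) > abs(best[0]):
                        best = (a, s, p, f, 0.0, g)
        a, s, p, f, phi, g = best
        residual -= a * g                     # subtract best atom, iterate on the residue
        params.append((a, s, p, f, phi))
    return params, residual
```

Each iteration removes the atom most correlated with the current residue, so the residual energy decreases monotonically, mirroring the L-term ATFT model of the signal.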
Let us redefine the subscript of s_i to s_w, f_i to f_b, and the energy parameter a_i^2 to a^2(j, s_w, f_b). The index w in s_w now represents the possible time-width values of the TF function; s_w then represents all the TF functions with a particular time-width w. Similarly, the index b in f_b represents the possible frequency band values; f_b then represents all the TF functions that occur within a particular frequency band b. The range of values for w and b is determined by the discrete implementation of the ATFT algorithm and the choice of the frequency band resolution. Depending upon the application requirement, a finer-resolution TWFB can be constructed by controlling the step size of the time-width and frequency parameters. The subscript (j, s_w, f_b) of the energy parameter corresponds to the j-th TF function that has a time-width value of w and occurs within the b-th frequency band. For every combination of (s_w, f_b), there may be j = 1, ..., M TF functions. M varies for each combination of (s_w, f_b) and is signal dependent.

The TWFB mapping can then be defined as the cumulative energy distribution of the TF functions over all the possible combinations of the time-widths (s_w) and frequency bands (f_b), and is given by

TWFB(s_w, f_b) = \sum_{j=1}^{M(s_w, f_b)} a^2(j, s_w, f_b).   (1)
In the implementation used in this study, the index w takes values from 1 to 14 (which translates into a length of 2^w time samples) and b takes values from 1 to 4 (each covering one quarter of the normalized frequency spectrum). This gives us 56 possible combinations of s_w and f_b. Let us call each of these combinations a tile (or cell) in the TWFB mapping. Each cell corresponds to the cumulative energy of all the TF functions with a particular s_w and f_b combination. Fig. 2 shows a sample signal, its spectrogram, and its TWFB mapping. The X axis of the TWFB map corresponds to the time-width or scale parameter s_w, and the Y axis corresponds to the frequency bands f_b. Here we would like to stress that the TWFB mapping is independent of the time occurrences of a signal pattern, which is a desirable property (translation invariance) for pattern recognition. The Z axis (color) indicates the magnitude of the cumulative energy of the TF functions in each cell. The next subsection will explain how the TWFB mapping can be used for identifying discriminative TF decomposition subspaces.
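Eq. (1) amounts to accumulating a^2 into a (band, width) grid while discarding p_i and φ_i. A minimal sketch, assuming atoms with dyadic widths s = 2^w and normalized center frequencies in [0, 0.5); the function name and parameter layout are illustrative, not from the paper.

```python
import numpy as np

def twfb_map(params, n_widths=14, n_bands=4):
    """Cumulative-energy TWFB mapping (Eq. 1): tile (b, w) holds the sum of
    a^2 over all TF functions with time-width 2^w whose center frequency
    falls in band b.  `params` is a list of (a, s, p, f, phi) tuples with
    s = 2^w samples and f a normalized frequency in [0, 0.5)."""
    m = np.zeros((n_bands, n_widths))
    for a, s, p, f, phi in params:
        w = int(round(np.log2(s)))                    # width index: s = 2^w
        b = min(int(f / 0.5 * n_bands), n_bands - 1)  # quarter of the spectrum
        if 1 <= w <= n_widths:
            # p and phi are deliberately ignored: translation invariance
            m[b, w - 1] += a ** 2
    return m
```

Two atoms with the same width and frequency but different time positions land in the same tile, which is exactly the translation invariance the text stresses.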
B. Discriminative Subspace Selection

TWFB maps of different classes of signals can be compared to arrive at the TWFB tiles (a difference mapping) that demonstrate high discrimination between the classes of signals [6], [7], [8]. As an example, for a two-class problem (class A and class B, as shown in Fig. 1), we compute the average TWFB mapping for each class using a set of training signals.

Fig. 2. A sample TWFB mapping of a synthetic signal. From top to bottom: time domain signal, spectrogram, and TWFB map.
The corresponding cells of the average TWFB mappings are then compared by applying a dissimilarity measure D to obtain a difference mapping. The choice of the dissimilarity measure D can be made depending on the application. In the proposed method, the simple absolute energy difference between the cells was used as the dissimilarity measure. The difference mapping is then given by

TWFB_diff(s_w, f_b) = |TWFB_A(s_w, f_b) - TWFB_B(s_w, f_b)|.   (2)

After arranging the cells of the difference mapping in decreasing order of their absolute difference values, the top P cells are chosen as the discriminatory cells. The restricted span of s_w and f_b (the discriminant TF decomposition subspace) corresponding to these P cells then represents the highly discriminatory clues of a signal. Once the span of s_w and f_b is identified using the training signals, the TF functions corresponding to this span can be used to construct the discriminative PTFD.
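The selection step above (average per class, absolute difference per Eq. (2), keep the top P cells) can be sketched as follows; the function name and the map shape are illustrative assumptions.

```python
import numpy as np

def discriminative_cells(maps_a, maps_b, top_p=3):
    """Average the TWFB maps of each class, take the absolute energy
    difference of corresponding cells (Eq. 2), and return the top-P most
    discriminative (band, width) cells in decreasing order of difference."""
    avg_a = np.mean(maps_a, axis=0)           # average TWFB map, class A
    avg_b = np.mean(maps_b, axis=0)           # average TWFB map, class B
    diff = np.abs(avg_a - avg_b)              # dissimilarity measure D
    flat = np.argsort(diff, axis=None)[::-1][:top_p]
    return [tuple(np.unravel_index(i, diff.shape)) for i in flat]
```

The returned cells define the restricted span of s_w and f_b; the Q TF functions falling inside it are the ones carried forward to the PTFD construction.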
C. Construction of Discriminative PTFD

The ATFT-based PTFD is constructed by adding the Wigner-Ville distributions (WVDs) of the individual TF functions [4]. The WVD is a quadratic classical TFD and is known to have the best TF resolution [1]. However, it suffers from cross-term artifacts when dealing with multicomponent signals due to its bilinear nature. For notational convenience, from this point forward let us represent the ATFT-based PTFD simply as AP. In order to explain the cross-term-free nature of the AP, let us express a signal x(t) in terms of TF functions (g_{γi}) after L ATFT iterations as x(t) = \sum_{i=0}^{L-1} a_i g_{γi}(t) + R^L x(t) [4]. The first part of the preceding equation represents the signal modeled using the L TF functions, and the second part represents the residue of the signal x(t). As explained in Section II-A, each of the TF functions g_{γi} is represented by a set of decomposition parameters (a_i, s_i, p_i, f_i, and φ_i). Now, assuming that the signal x(t) is completely modeled with L TF functions (i.e., the residue is zero after L iterations), we could express the WVD of x(t) in terms of the TF functions g_{γi} as

WVD_x(t, f) = \sum_{i=0}^{L-1} |a_i|^2 W_{g_{γi}}(t, f) + \sum_{i=0}^{L-1} \sum_{h=0, h≠i}^{L-1} a_i a_h^* W[g_{γi}, g_{γh}](t, f),   (3)
where W_{g_γ}(t, f) is the WVD of the TF function. It should be noted that the TF functions g_{γi} used in this work are Gaussian; hence the WVDs of the individual TF functions W_{g_γ}(t, f) are positive [4]. The second term (the double sum) corresponds to the cross-term artifacts that occur if x(t) is a multicomponent signal. These cross terms can easily be removed by retaining only the first term in Eq. (3), which yields the AP:

AP(t, f) = \sum_{i=0}^{L-1} |a_i|^2 W_{g_{γi}}(t, f).   (4)

The AP generated this way is positive, free from cross terms, and has sufficiently high TF resolution for analyzing non-stationary signals. However, the AP does not satisfy the marginal properties. The marginals of the AP will be addressed in the later part of this section.
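Eq. (4) can be illustrated directly, since the WVD of a Gaussian atom has a well-known closed form, 2 exp(-2π((t - p)^2/s^2 + s^2 (f - f0)^2)) (a textbook identity in the spirit of Mallat and Zhang [4], stated here without derivation); summing these energy-weighted, everywhere-positive surfaces gives a cross-term-free AP. The discretization and function name below are our assumptions.

```python
import numpy as np

def ap_distribution(params, n, n_freq=128):
    """ATFT-based positive TFD (Eq. 4): sum of the cross-term-free WVDs of
    the individual Gaussian atoms, each weighted by its energy a^2.
    Uses the closed-form atom WVD
        Wg(t, f) = 2 * exp(-2*pi*((t - p)^2 / s^2 + s^2 * (f - f0)^2))
    for an atom of width s, position p, and center frequency f0."""
    t = np.arange(n)[None, :]                     # time axis (columns)
    f = np.linspace(0.0, 0.5, n_freq)[:, None]    # frequency axis (rows)
    ap = np.zeros((n_freq, n))
    for a, s, p, f0, phi in params:
        wvd = 2.0 * np.exp(-2 * np.pi * (((t - p) / s) ** 2 + (s * (f - f0)) ** 2))
        ap += (a ** 2) * wvd                      # no cross terms by construction
    return ap
```

Because each term is a positive Gaussian blob centered at (p, f0), the sum is positive everywhere, matching the positivity property claimed for the AP.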
Similar to the above case of constructing the AP for the complete signal x(t), we could compute the AP for the discriminatory portion of the signal identified using the TWFB mappings. Let us denote the discriminatory portion of the signal as \hat{x}(t), corresponding to the Q TF functions that occurred within the restricted span of s_w and f_b (as explained in Section II-B). \hat{x}(t) can then be written as \hat{x}(t) = \sum_{i=0}^{Q-1} a_i g_{γi}(t), where Q ≤ L. As an added advantage of this approach, the AP is automatically denoised from the non-discriminant and overlapping signal structures. Here the term "noise" means signal information that is irrelevant for a particular application.
As mentioned earlier, the AP(t, f)_{\hat{x}} still needs to be corrected for its marginals. The works of [2] and [3] have demonstrated successful ways of correcting the marginals of an improper TFD using minimum cross entropy (MCE) optimization and TF copulas, respectively. Although the TF copula based approach [3] is a recent work that is computationally attractive, we chose to use the MCE approach, since it has already been successfully applied to correct ATFT-based TFDs in [5]. Moreover, the choice between these two approaches does not affect the main focus of this paper, the construction of the discriminative PTFD. MCE is an iterative process in which the correction factor for the marginals is computed as the ratio of the true marginals extracted from the signal to the marginals of the improper PTFD. This procedure is applied alternately in the time and frequency directions until the marginals of the PTFD match those of the actual signal. Let AP(t, f)'_{\hat{x}} be the corrected ATFT-based TFD that satisfies the marginals, u(t) the true time marginal extracted from the time domain signal, and u'(t) the actual time marginal of AP(t, f)_{\hat{x}}; then, after simplification, the first iteration is

AP^1(t, f)'_{\hat{x}} = AP(t, f)_{\hat{x}} · u(t) / u'(t).   (6)
This operation scales AP(t, f)_{\hat{x}} by the time marginal ratio. After this operation, AP^1(t, f)'_{\hat{x}} is corrected in the time marginal. Now, to satisfy the frequency marginal, the operation is repeated with AP^1(t, f)'_{\hat{x}} as the prior estimate, computing the correction factor using u(f) and u'(f). This process is repeated alternately to correct the time and frequency marginals. The only difference in the above procedure compared to the previous works is that the true time marginal u(t) and frequency marginal u(f) are computed from the discriminatory portion of the signal \hat{x}(t): \hat{x}(t) is reconstructed using the discriminant TF functions, and the true marginals are computed from it before being used with MCE. The block diagram in Fig. 1 shows this operation of extracting the marginals from the discriminant TF functions. After n iterations, AP^n(t, f)'_{\hat{x}} becomes the corrected discriminative AP satisfying the marginal conditions and is suitable for extracting meaningful instantaneous features.
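The alternating ratio correction of Eq. (6) reduces, in discrete form, to iteratively rescaling the TFD's columns and rows; this is a simplified sketch of that scaling loop (not the exact MCE derivation of [5]), assuming the TFD is stored with frequency along rows and time along columns.

```python
import numpy as np

def correct_marginals(tfd, u_t, u_f, n_iter=5, eps=1e-12):
    """Alternately rescale a positive TFD so its time marginal (column sums)
    matches u_t and its frequency marginal (row sums) matches u_f, using the
    ratio correction of Eq. (6) in each direction."""
    p = tfd.copy()                                    # rows: frequency, cols: time
    for _ in range(n_iter):
        p *= (u_t / (p.sum(axis=0) + eps))[None, :]   # time-direction correction
        p *= (u_f / (p.sum(axis=1) + eps))[:, None]   # frequency-direction correction
    return p
```

Each pass leaves the last-corrected marginal exact and nudges the other closer, so a few iterations typically suffice when the two target marginals carry the same total energy.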
III. RESULTS AND DISCUSSION

To demonstrate the proposed construction of the AP, we present here a pathological speech classification (characterization) application. Fig. 3 has five rows and two columns containing ten images. The top row of the figure shows normal and pathological speech segments of length 16384 samples in the time domain. The second row shows the spectrograms of the speech segments, which give an idea about their time-frequency
energy distribution.

Fig. 3. Example of constructing the discriminatory PTFD: pathological speech classification application.

The TWFB mappings of the normal and
pathological speech segments are shown in the third row of the figure. These TWFB mappings were constructed using 1000 TF functions each. Please note that the X axis of the TWFB mappings is in time-width, not time. It is evident from the TWFB images that there is a significant difference in the cumulative energy distribution of the corresponding cells, especially between s_w of 5 to 11 and frequency bands f_b of 3 and 4. The difference mapping was computed, and the cells that exhibited a high difference between the normal and pathological speech segments were chosen. In the third row of the figure, the chosen cells are shown bounded by a rectangular box. All the Q TF functions that fell within these boxes were used to reconstruct the discriminatory portion of the normal and pathological speech segments (\hat{x}(t)). During the reconstruction, all five ATFT decomposition parameters were used. The discriminatory portions of the reconstructed signals are shown in the fourth row of the figure. By closely comparing the first row (the original signals x(t)) and the fourth row (\hat{x}(t)), it can be observed that the discriminatory cells of the TWFB mapping did capture the signal components that differ between the two speech segments. The discriminatory AP^n(t, f)'_{\hat{x}} was then computed in five MCE iterations using the Q TF functions and the marginals extracted from \hat{x}(t). The AP^n(t, f)'_{\hat{x}} of the normal and pathological speech segments are shown in the fifth
row of the figure.

Fig. 4. Instantaneous mean frequency (IMF) extracted from the discriminatory AP^n(t, f)'_{\hat{x}}'s of the normal (top) and pathological (bottom) speech segments, plotted as normalized frequency versus time instants.

Extracting instantaneous features from these
AP^n(t, f)'_{\hat{x}}'s will readily demonstrate the discrimination between the normal and pathological speech segments. As an example, we extracted the instantaneous mean frequency (IMF) from the AP^n(t, f)'_{\hat{x}}'s of the normal and pathological speech segments; these are shown in Fig. 4. The difference between the IMF patterns is evident from the figure. It should be noted that we achieved the above result using only the PTFD that was constructed from the discriminatory portion of the signal.
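The IMF used above is the first conditional moment of the PTFD along frequency, IMF(t) = Σ_f f·P(f, t) / Σ_f P(f, t), which is meaningful precisely because the distribution is positive and satisfies the marginals. A minimal sketch; the function name is our own.

```python
import numpy as np

def instantaneous_mean_frequency(tfd, freqs):
    """First conditional moment of a positive TFD along frequency:
    IMF(t) = sum_f f * P(f, t) / sum_f P(f, t).
    `tfd` has frequency along rows and time along columns; `freqs` holds
    the normalized frequency value of each row."""
    num = (freqs[:, None] * tfd).sum(axis=0)
    den = tfd.sum(axis=0)
    # return 0 where a time slice carries no energy
    return np.where(den > 0, num / np.maximum(den, 1e-12), 0.0)
```

Applied to the discriminative PTFD, this yields one IMF curve per signal, like the normal/pathological pair in Fig. 4.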
IV. CONCLUSIONS<br />
A novel approach to construct discriminatory PTFD for<br />
instantaneous feature extraction was presented. The proposed<br />
method used the TWFB mappings to identify the discriminatory<br />
decomposition subspaces and translated them into<br />
corresponding PTFD. The instantaneous features extracted<br />
from discriminatory PTFD are expected to be free of overlaps<br />
from irrelevant signal structures and ideally suitable for<br />
classification applications. Since PTFDs are well suited to extracting meaningful instantaneous features from a signal, the proposed approach is an effective way to perform classification that demands instantaneous features. A pathological
speech classification example was used to demonstrate<br />
the proposed technique and the results were discussed.<br />
Future work involves applying the proposed method to real-world applications and comparing its performance with conventional ways of extracting instantaneous features. A TF copula-based approach to constructing discriminative PTFDs will also be investigated.
REFERENCES<br />
[1] L. Cohen, “Time-frequency distributions - a review,” Proceedings of the IEEE, vol. 77, no. 7, pp. 941–981, 1989.
[2] P. Loughlin, J. Pitton, and L. Atlas, “Construction of positive time-frequency distributions,” IEEE Trans. on Signal Processing, vol. 42, pp. 2697–2705, 1994.
Authorized licensed use limited to: Ryerson University Library. Downloaded on July 7, 2009 at 11:34 from IEEE Xplore. Restrictions apply.
[3] M. Davy and A. Doucet, “Copulas: a new insight into positive time-frequency distributions,” IEEE Signal Processing Letters, vol. 10, no. 7, pp. 215–218, 2003.
[4] S. G. Mallat and Z. Zhang, “Matching pursuits with time-frequency dictionaries,” IEEE Trans. Signal Processing, vol. 41, no. 12, pp. 3397–3415, 1993.
[5] S. Krishnan, R. M. Rangayyan, G. D. Bell, and C. B. Frank, “Adaptive time-frequency analysis of knee joint vibroarthrographic signals for noninvasive screening of articular cartilage pathology,” IEEE Trans. on Biomedical Engineering, vol. 47, no. 6, pp. 773–783, June 2000.
[6] K. Umapathy and S. Krishnan, “Time-width versus frequency band mapping of energy distributions,” IEEE Transactions on Signal Processing, vol. 55, no. 3, pp. 978–989, Mar. 2007.
[7] K. Umapathy, S. Krishnan, and A. Das, “Sub-dictionary selection using local discriminant bases algorithm for signal classification,” in Proc. IEEE Canadian Conference on Electrical and Computer Engineering, Niagara Falls, Canada, May 2004, pp. 2001–2004.
[8] K. Umapathy and S. Krishnan, “A signal classification approach using time-width vs frequency band sub-energy distributions,” in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2005), Philadelphia, USA, Mar. 2005, pp. V-477–480.
Combining Vocal Source and MFCC Features for Enhanced
Speaker Recognition Performance Using GMMs

Danoush Hosseinzadeh and Sridhar Krishnan
Department of Electrical and Computer Engineering
Ryerson University, Toronto, ON - M5B 2K3 Canada
Email: (danoushh@hotmail.com) (krishnan@ee.ryerson.ca)

Abstract— This work presents seven novel spectral features for speaker recognition: the spectral centroid (SC), spectral bandwidth (SBW), spectral band energy (SBE), spectral crest factor (SCF), spectral flatness measure (SFM), Shannon entropy (SE) and Renyi entropy (RE). The proposed spectral features quantify some of the characteristics of the vocal source, or excitation component, of speech. This is useful for speaker recognition since vocal source information is known to be complementary to the vocal tract transfer function, which is usually obtained using Mel frequency cepstral coefficients (MFCC) or linear prediction cepstral coefficients (LPCC). To evaluate the performance of the spectral features, experiments were performed using a text-independent cohort Gaussian mixture model (GMM) speaker identification system. Based on 623 users from the TIMIT database, the spectral features achieved an identification accuracy of 99.33% when combined with the MFCC-based features on undistorted speech. This represents a 4.03% improvement over the baseline system trained with only MFCC and ΔMFCC features.

I. INTRODUCTION

Speaker recognition has many potential applications as a biometric tool for resources that can be accessed via the telephone or internet. In these applications, the identity of users cannot otherwise be verified because there is no direct contact between the user and the service provider. Hence, speaker recognition is a cost-effective and practical technology that can be used for enhanced security.

In the literature, the entire speech system is often modeled with a time-varying excitation and a short-time-varying filter [1]. Under this model, the source and filter are assumed independent, and hence the speech signal s(t) is modeled by the linear convolution

s(t) = x(t) ∗ h(t)    (1)
where x(t) is a periodic excitation (for voiced speech) or white noise (for unvoiced speech), and h(t) is a time-varying filter which constantly changes to produce different sounds. Although h(t) is time-varying, it can be considered stationary over a period of a few milliseconds (ms); frame lengths of around 10-30 ms are commonly used in the literature [1]. This convenient short-time stationary behavior is exploited
by many speaker recognition systems in order to characterize<br />
the vocal tract configuration given by h(t), which is known to be<br />
a unique speaker-dependent characteristic for a given sound. Under the linear model assumption, this information can be extracted from speech signals using well-established deconvolution techniques such as homomorphic filtering or linear prediction.
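Homomorphic filtering separates the convolved source and filter by taking the log magnitude spectrum, where convolution becomes addition, and keeping the low-quefrency (slowly varying, vocal tract) part. A minimal numpy sketch, not the paper's processing; the two-tone frame and the lifter cutoff of 30 are arbitrary stand-ins:

```python
import numpy as np

def real_cepstrum(frame):
    """Real cepstrum of a frame: IFFT of the log magnitude spectrum."""
    spectrum = np.fft.rfft(frame)
    log_mag = np.log(np.abs(spectrum) + 1e-12)  # avoid log(0)
    return np.fft.irfft(log_mag)

def vocal_tract_envelope(frame, n_lifter=30):
    """Low-quefrency liftering: keep the slowly varying (vocal tract) part
    of the log spectrum and return the smooth spectral envelope."""
    c = real_cepstrum(frame)
    lifter = np.zeros_like(c)
    lifter[:n_lifter] = 1.0
    lifter[-n_lifter + 1:] = 1.0      # cepstrum of a real frame is symmetric
    log_env = np.fft.rfft(c * lifter).real
    return np.exp(log_env)            # smooth estimate of |H(f)|

# toy 30 ms "voiced" frame at 8 kHz: two sinusoids standing in for speech
fs = 8000
t = np.arange(240) / fs
frame = np.sin(2 * np.pi * 500 * t) + 0.5 * np.sin(2 * np.pi * 1000 * t)
env = vocal_tract_envelope(frame * np.hamming(len(frame)))
```

The envelope has one value per rfft bin (121 for a 240-sample frame); MFCCs follow the same idea but warp the log spectrum onto the Mel scale first.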
To date, the most effective features for speaker recognition have<br />
been the Mel frequency cepstral coefficient (MFCC) and the linear<br />
prediction cepstral coefficients (LPCC) [2][1][3]. These features can<br />
accurately characterize the vocal tract configuration of a speaker and<br />
can achieve good performance. Part of the success of these features<br />
is that they provide a compact representation of the vocal tract which<br />
can be modeled effectively. The first several MFCCs can characterize<br />
the speaker’s vocal tract configuration and LPCCs generally define<br />
lower order polynomials [1]. Additionally, the first derivative of the<br />
MFCC feature (ΔMFCC) is largely uncorrelated with the MFCC<br />
feature and has been shown to enhance recognition performance.<br />
Although the MFCC and LPCC based features have proven to be<br />
effective for speaker recognition, they do not provide a complete<br />
description of the speaker’s speech system. Hence, vocal source<br />
information can complement these traditional features by quantifying<br />
some speaker-dependent characteristics such as pitch, harmonic<br />
structure and spectral energy distribution [4][5].<br />
This work proposes seven novel spectral features for speaker<br />
recognition that can quantify the vocal source. These features are<br />
the spectral centroid (SC), spectral bandwidth (SBW), spectral band<br />
energy (SBE), spectral crest factor (SCF), spectral flatness measure<br />
(SFM), Shannon entropy (SE), and Renyi entropy (RE). These<br />
spectral features can be used to complement the MFCC or LPCC<br />
features since they can quantify characteristics of the vocal source.<br />
It is also known that there is some degree of coupling between<br />
the vocal source and vocal tract [6][4] - i.e. the linear model<br />
assumed when calculating MFCC and LPCC is not entirely accurate.<br />
Therefore, the vocal source signal is to some extent predictable<br />
for a given vocal tract configuration. Given these factors, features<br />
that characterize the vocal source can be expected to improve the<br />
performance of existing speaker recognition systems. In this work,<br />
the seven proposed spectral features are extracted from the speech<br />
spectrum and used to enhance the performance of MFCC-based<br />
features in order to illustrate their effectiveness.<br />
Others have attempted to use the vocal source for improving<br />
performance of speaker recognition systems. Attempts have been<br />
made to develop features from the LPCC residual [7][8] with some<br />
success. In these cases, the authors have noted improved performance<br />
by complementing vocal tract features with vocal source information.<br />
The paper is organized as follows. Section II defines the baseline<br />
system used for testing and presents the spectral features. Section<br />
III presents the results as well as the experimental conditions and<br />
Section IV concludes the paper.<br />
II. PROPOSED TESTING METHOD<br />
GMM based speaker recognition systems have become the most<br />
popular method to date. This is because GMMs can capture the<br />
acoustic phenomena or acoustic classes that are present in speech<br />
[2]. In fact, some of the GMM clusters have been found to be<br />
highly correlated with particular phonemes [9]. As a result, good<br />
recognition performance can be achieved with GMM based systems.<br />
The performance of the proposed spectral features will be compared<br />
to the baseline system, which is a cohort text-independent GMM classifier trained with 14-dimensional MFCC vectors and 14-dimensional ΔMFCC vectors extracted from 30 ms speech frames.
The log-likelihood function is used to find the user model that best<br />
matches a given utterance.<br />
1-4244-1274-9/07/$25.00 ©2007 IEEE
MMSP 2007<br />
TABLE I<br />
SUBBAND ALLOCATION USED TO CALCULATE SPECTRAL FEATURES.
Subband Lower Edge (Hz) Upper Edge (Hz)<br />
1 300 627<br />
2 628 1060<br />
3 1061 1633<br />
4 1634 2393<br />
5 2394 3400<br />
A. Training and GMM Estimation<br />
The expectation maximization (EM) algorithm was used to estimate the parameters of the GMM models. Model orders of 8-32 have been commonly used in the literature; however, good results have been obtained with cohort GMM systems using as few as 16 components [2][10]. A model order of 24 was used in this work to account for the additional features in the system; moreover, preliminary experiments indicated that this order was optimal for the proposed feature set among models of order 16, 20, 24, 28 and 32. The k-means algorithm was used to obtain the initial estimate for each cluster, since it has been shown that the initial grouping of data does not significantly affect the performance of GMM-based recognition systems [2].
A diagonal covariance matrix was used to estimate the variances<br />
of each cluster in the models since it is well known that diagonal<br />
covariance matrices are much more computationally efficient than full<br />
covariance matrices. Furthermore, diagonal covariance matrices can<br />
provide the same level of performance as full covariance matrices<br />
because they can capture the correlation between the features if<br />
a larger model order is used [11]. For these reasons, diagonal covariance matrices have been used almost exclusively in previous speaker recognition work. Each element of these matrices was limited to a minimum value of 0.01 during the EM estimation process to prevent singularities in the matrix, as recommended by [2].
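The training setup above can be sketched with scikit-learn's `GaussianMixture`; this is an illustration, not the authors' implementation. Random vectors stand in for real 33-dimensional feature matrices, and `reg_covar` (which adds a constant to each variance) approximates, rather than exactly reproduces, the paper's variance floor of 0.01:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# stand-in for one speaker's features: 14 MFCC + 14 dMFCC + 5 spectral = 33 dims
features = rng.standard_normal((2000, 33))

gmm = GaussianMixture(
    n_components=24,         # model order chosen in the paper
    covariance_type="diag",  # diagonal covariances, as in the paper
    reg_covar=0.01,          # variance regularizer, approximating the 0.01 floor
    init_params="kmeans",    # k-means initialization of the clusters
    max_iter=100,
    random_state=0,
)
gmm.fit(features)

# scoring: average per-frame log-likelihood of a test utterance under this model
test = rng.standard_normal((300, 33))
score = gmm.score(test)
```

In a cohort system, one such model is trained per speaker and the test utterance is assigned to the speaker whose model gives the highest log-likelihood.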
B. Spectral Features<br />
The proposed spectral features can be expected to improve the<br />
performance of MFCC or LPCC features because they can capture<br />
complementary information related to the vocal source such as pitch,<br />
harmonic structure, energy distribution, bandwidth of the speech<br />
spectrum and even voiced or unvoiced excitation. To illustrate the<br />
effectiveness of these features, they are extracted from the speech<br />
spectrum and used to enhance the performance of MFCC and<br />
ΔMFCC features.<br />
Spectral features should be extracted from multiple subbands,<br />
as shown in Table I. This extraction method will provide better<br />
discrimination between different speakers because the trend for a<br />
given feature can be captured from the spectrum. This is better than<br />
obtaining one global value from the spectrum, which is not likely to<br />
show speaker-dependent characteristics.<br />
The proposed subbands are linearly spaced on the Mel scale and span the range of a practical telephone channel (300 Hz-3.4 kHz).
This allocation scheme reflects the fact that most of the energy<br />
of the speech signal is located in the lower frequency regions and<br />
therefore, narrowly defined subbands are used in the lower frequency<br />
regions in order to capture more detail. This is also consistent with<br />
the non-linearities of human auditory perception, which shows<br />
more sensitivity to lower frequencies than higher frequencies. This<br />
non-linearity has been shown to be important for cepstral based<br />
features such as the MFCC feature [3].<br />
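The subband edges of Table I can be reproduced by spacing five bands linearly on the Mel scale between 300 Hz and 3.4 kHz. A sketch using the common 2595·log10(1 + f/700) Mel mapping (the paper does not state which Mel formula it used, but this one recovers Table I to within rounding):

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_subband_edges(f_lo=300.0, f_hi=3400.0, n_bands=5):
    """Edges of n_bands subbands linearly spaced on the Mel scale."""
    mel_edges = np.linspace(hz_to_mel(f_lo), hz_to_mel(f_hi), n_bands + 1)
    return mel_to_hz(mel_edges)

edges = mel_subband_edges()
# edges come out near 300, 626, 1059, 1632, 2392, 3400 Hz,
# matching the Table I boundaries to within a couple of hertz
```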
Spectral features are extracted from 30 ms speech frames as follows. Let s_i[n], for n ∈ [0, N], represent the i-th speech frame and S_i[f] the spectrum of this frame. S_i[f] is divided into M non-overlapping subbands, where each subband b is defined by a lower frequency edge l_b and an upper frequency edge u_b. Each of the seven spectral features can then be calculated from S_i[f] as shown below.
Spectral Centroid (SC) - SC as given below is the weighted<br />
average frequency for a given subband, where the weights are the<br />
normalized energy of each frequency component in that subband.<br />
Since this measure captures the center of gravity of each subband it<br />
can locate large peaks in subbands. These peaks correspond to the<br />
approximate location of formants [12] or pitch frequencies.<br />
SC_{i,b} = \frac{\sum_{f=l_b}^{u_b} f\,|S_i[f]|^2}{\sum_{f=l_b}^{u_b} |S_i[f]|^2} \qquad (2)
Spectral Bandwidth (SBW) - SBW as given below is the weighted<br />
average distance from each frequency component in a subband to<br />
the spectral centroid of that subband. Here, the weights are the<br />
normalized energy of each frequency component in that subband.<br />
This measure quantifies the relative spread of each subband for<br />
a given sound and therefore, it might characterize some speaker-<br />
dependent information.<br />
SBW_{i,b} = \frac{\sum_{f=l_b}^{u_b} (f - SC_{i,b})^2\,|S_i[f]|^2}{\sum_{f=l_b}^{u_b} |S_i[f]|^2} \qquad (3)
Spectral Band Energy (SBE) - SBE as given below is the energy of<br />
each subband normalized with the combined energy of the spectrum.<br />
The SBE gives the trend of energy distribution for a given sound and<br />
therefore, it contains some speaker-dependent information.<br />
SBE_{i,b} = \frac{\sum_{f=l_b}^{u_b} |S_i[f]|^2}{\sum_{f,b} |S[f]|^2} \qquad (4)
Spectral Flatness Measure (SFM) - SFM as given below is a<br />
measure of the flatness of the spectrum, where white noise has a<br />
perfectly flat spectrum. This measure is useful for discriminating<br />
between voiced and un-voiced components of speech [13].<br />
SFM_{i,b} = \frac{\left(\prod_{f=l_b}^{u_b} |S_i[f]|^2\right)^{\frac{1}{u_b - l_b + 1}}}{\frac{1}{u_b - l_b + 1}\sum_{f=l_b}^{u_b} |S_i[f]|^2} \qquad (5)
Spectral Crest Factor (SCF) - SCF as given below provides a<br />
measure for quantifying the tonality of the signal. This measure is<br />
useful for discriminating between wideband and narrowband signals<br />
by indicating the relative peak of a subband. These peaks correspond<br />
to the most dominant pitch frequency in each subband.<br />
SCF_{i,b} = \frac{\max_{f \in [l_b, u_b]} |S_i[f]|^2}{\frac{1}{u_b - l_b + 1}\sum_{f=l_b}^{u_b} |S_i[f]|^2} \qquad (6)
Renyi Entropy (RE) - RE as given below is an information theoretic<br />
measure that quantifies the randomness of the subband. Here, the<br />
normalized energy of the subband can be treated as a probability<br />
distribution for calculating entropy and α is set to 3, as commonly<br />
found in literature [14]. This RE trend is useful for detecting the<br />
voiced and unvoiced components of speech.<br />
RE_{i,b} = \frac{1}{1-\alpha} \log\left(\sum_{f=l_b}^{u_b} \left(\frac{|S_i[f]|^2}{\sum_{f=l_b}^{u_b} |S_i[f]|^2}\right)^{\alpha}\right) \qquad (7)
Shannon Entropy (SE) - SE as given below is also an information<br />
theoretic measure that quantifies the randomness of the subband.<br />
Here, the normalized energy of the subband can be treated as a<br />
probability distribution for calculating entropy. Similar to the RE<br />
trend, the SE trend is also useful for detecting the voiced and unvoiced<br />
components of speech.<br />
SE_{i,b} = -\sum_{f=l_b}^{u_b} \frac{|S_i[f]|^2}{\sum_{f=l_b}^{u_b} |S_i[f]|^2} \cdot \log\left(\frac{|S_i[f]|^2}{\sum_{f=l_b}^{u_b} |S_i[f]|^2}\right) \qquad (8)
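Equations (2)-(8) can be sketched for a single subband as follows. This is an illustration, not the authors' code: subband edges are given as FFT bin indices rather than the Hz edges of Table I, and small constants guard the logarithms:

```python
import numpy as np

def subband_features(frame, l_b, u_b, alpha=3):
    """Seven spectral features of one subband [l_b, u_b] (FFT bin indices),
    following Eqs. (2)-(8)."""
    power = np.abs(np.fft.rfft(frame)) ** 2
    band = power[l_b:u_b + 1]
    n = u_b - l_b + 1
    f = np.arange(l_b, u_b + 1)
    e = band.sum()
    sc = (f * band).sum() / e                            # (2) spectral centroid
    sbw = ((f - sc) ** 2 * band).sum() / e               # (3) spectral bandwidth
    sbe = e / power.sum()                                # (4) band vs. total energy
    sfm = np.exp(np.log(band + 1e-12).mean()) / (e / n)  # (5) geometric/arithmetic mean
    scf = band.max() / (e / n)                           # (6) spectral crest factor
    p = band / e                                         # normalized energy as a pmf
    re = np.log((p ** alpha).sum()) / (1 - alpha)        # (7) Renyi entropy, alpha = 3
    se = -(p * np.log(p + 1e-12)).sum()                  # (8) Shannon entropy
    return sc, sbw, sbe, sfm, scf, re, se

rng = np.random.default_rng(1)
frame = rng.standard_normal(240)                  # stand-in for a 30 ms frame
sc, sbw, sbe, sfm, scf, re, se = subband_features(frame, 10, 40)
```

Note that SFM lies in (0, 1] by the AM-GM inequality and SCF is at least 1, which matches their interpretation as flatness and peakedness measures.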
To the best of our knowledge, these features are being used for<br />
the first time in speaker recognition although they have previously<br />
been used in other areas [15]. These spectral features along with the<br />
MFCC and ΔMFCC features will be extracted from each speech<br />
frame and appended together to form a combined feature matrix for<br />
the speech signal. These vectors can then be modeled and used for<br />
speaker recognition. Equation (9) shows the feature matrix extracted using only one spectral feature, say the SC feature, from i frames, where the bracketed number is the length of each feature. Any other spectral feature can be substituted for the SC feature in the feature matrix.
\vec{F} = \begin{bmatrix} \mathrm{MFCC}_1(14) & \Delta\mathrm{MFCC}_1(14) & \mathrm{SC}_1(5) \\ \vdots & \vdots & \vdots \\ \mathrm{MFCC}_i(14) & \Delta\mathrm{MFCC}_i(14) & \mathrm{SC}_i(5) \end{bmatrix} \qquad (9)
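Assembling the per-frame feature matrix of Eq. (9) is a simple horizontal stack; a sketch with random arrays standing in for the actual extracted features:

```python
import numpy as np

num_frames = 100
mfcc  = np.random.randn(num_frames, 14)   # stand-in: 14 MFCCs per frame
dmfcc = np.random.randn(num_frames, 14)   # stand-in: 14 delta-MFCCs per frame
sc    = np.random.randn(num_frames, 5)    # stand-in: one spectral feature, 5 subbands

F = np.hstack([mfcc, dmfcc, sc])          # Eq. (9): one 33-dim vector per frame
```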
The spectral features are expected to be largely uncorrelated with<br />
the MFCC based features because the spectral features can capture<br />
some information about the vocal source, whereas the MFCC features<br />
tend to capture information about the vocal tract. Among the spectral<br />
features, there may be some correlation between the SC and the<br />
SCF features because they both quantify information about the peaks<br />
(locations of energy concentration) of each subband. The difference is<br />
that the SCF feature describes the normalized strength of the largest<br />
peak in each subband while the SC feature describes the center of<br />
gravity of each subband. Therefore, these two features convey similar information when the largest peak in a given subband is much larger than all other peaks in that subband. The RE and SE features are also correlated
since they are both entropy measures. However, the RE feature is<br />
much more sensitive to small changes in the spectrum because of<br />
the exponent term α. Therefore, although these features quantify the<br />
same type of information, their performance may be different for<br />
speech signals.<br />
III. EXPERIMENTAL RESULTS<br />
All speech samples used in these experiments were obtained from<br />
623 speakers of the TIMIT speech corpus. Since the TIMIT database has a sampling frequency of 16 kHz, the signals were downsampled to 8 kHz, which is well suited for telephone applications. Features were extracted from 30 ms frames with 15 ms of overlap between consecutive frames, and a Hamming window was applied to each frame to reduce spectral leakage at the frame boundaries. Twenty
seconds of undistorted speech from each speaker was used to train the<br />
system and the remaining samples were used for testing. Although the<br />
tests were performed with undistorted audio, it is expected that some<br />
of these features will remain robust to different linear and non-linear<br />
distortions [15].<br />
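The frame extraction described above (30 ms frames, 15 ms hop, Hamming window) can be sketched as follows; `frame_signal` is a hypothetical helper, not from the paper:

```python
import numpy as np

def frame_signal(x, fs=8000, frame_ms=30, hop_ms=15):
    """Split a signal into overlapping Hamming-windowed frames
    (30 ms frames with a 15 ms hop, as in the experiments)."""
    flen = int(fs * frame_ms / 1000)   # 240 samples at 8 kHz
    hop = int(fs * hop_ms / 1000)      # 120 samples
    n_frames = 1 + (len(x) - flen) // hop
    window = np.hamming(flen)
    return np.stack([x[i * hop:i * hop + flen] * window
                     for i in range(n_frames)])

frames = frame_signal(np.random.randn(8000))  # 1 s of stand-in audio
```

Each row of `frames` then feeds the MFCC, ΔMFCC, and subband spectral feature extraction.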
A. Results and Discussions<br />
MFCC based features are very effective for characterizing the<br />
vocal tract configuration. Although this is the main reason for the<br />
success of the MFCC based features, they do not provide a complete<br />
description of the speaker’s speech system. The proposed spectral<br />
features are expected to increase identification accuracy of MFCC<br />
TABLE II<br />
EXPERIMENTAL RESULTS USING 7-SECOND TEST UTTERANCES (298 TESTS)
Feature Accuracy(%)<br />
MFCC & ΔMFCC (Baseline system) 95.30<br />
MFCC & ΔMFCC & SC 97.32<br />
MFCC & ΔMFCC & SBE 97.32<br />
MFCC & ΔMFCC & SBW 96.98<br />
MFCC & ΔMFCC & SCF 96.31<br />
MFCC & ΔMFCC & SFM 81.55<br />
MFCC & ΔMFCC & SE 90.27<br />
MFCC & ΔMFCC & RE 98.32<br />
MFCC & ΔMFCC & SBE & SC 96.98<br />
MFCC & ΔMFCC & SBE & RE 96.98<br />
MFCC & ΔMFCC & SC & RE 99.33<br />
based systems because they provide some information about the<br />
vocal source.<br />
Table II demonstrates the identification accuracy of the system<br />
when using spectral features in addition to the MFCC based features<br />
with undistorted speech. The table also shows several combinations<br />
of the best performing features. The accuracy rate represents the<br />
percentage of test samples that were correctly identified by the<br />
system, as shown below.<br />
\mathrm{Accuracy} = \frac{\text{Samples Correctly Identified}}{\text{Total Number of Samples}} \qquad (10)
It is evident from these results that there is some speaker-dependent<br />
information captured by most of the proposed features since they improved<br />
identification rates when combined with the standard MFCC<br />
based features. In fact, when two of the best performing spectral<br />
features (SC and RE) were simultaneously combined with the MFCC<br />
based features, an identification accuracy of 99.33% was achieved,<br />
which represents a 4.03% improvement over the baseline system.<br />
These results suggest that the proposed spectral features provide<br />
complementary and discriminatory information about the speaker’s<br />
vocal source and system, which leads to enhanced identification<br />
accuracies.<br />
The best performing feature was the RE feature. This feature is very effective at quantifying voiced speech, which is quasi-periodic (relatively low entropy), and unvoiced speech, which is often modeled as AWGN (relatively high entropy). However, we suspect that the RE feature may also be characterizing a phenomenon other than voiced and unvoiced speech. This is likely since the SE feature, which is also an entropy measure capable of discriminating between voiced and unvoiced speech, did not show any performance benefit. One possibility is that the exponent α in the RE definition is contributing to this performance improvement. Since the spectrum is normalized to the range [0,1] before calculating these features, the exponent α has the effect of significantly
reducing the contributions of the low energy components relative to<br />
the high energy components. Therefore, the RE feature is likely to<br />
produce a more reliable measure since it heavily relies on the high<br />
energy components of each subband. However, the entropy features<br />
in general are susceptible to random noise and will not perform well<br />
in all conditions.<br />
Figure 1(a) shows that the SC feature can capture the center of<br />
gravity of each subband. Since the subband’s center of gravity is<br />
related to the spectral shape of the speech signal, it implies that the SC<br />
feature can also detect changes in pitch and harmonic structure since<br />
they fundamentally affect the spectrum. Pitch and harmonic structure<br />
convey some speaker-dependent information and are complementary<br />
Fig. 1. Plot of the spectral features. Subband boundaries are indicated with dark solid lines and feature locations with dashed lines. (a) Location of the SC. (b) Location of the SCF. (c) SBW as a percentage across the five subbands. (d) SBE as a percentage of the whole spectrum.
to the vocal tract transfer function for speaker recognition. In addition, the SC feature indicates the approximate location of the dominant formant in each subband, since formants tend towards the subband's center of gravity. These properties of the SC feature provide complementary information and led to the improved performance of the MFCC-based classifier.
The SCF feature shown in Figure 1(b) quantifies the normalized<br />
strength of the dominant peak in each subband. Given that the<br />
dominant peak in each subband corresponds to a particular pitch<br />
frequency harmonic, it shows that the SCF feature is pitch dependent<br />
and therefore, it is also speaker-dependent for a given sound. This dependence on pitch frequency is useful when the vocal tract configuration (i.e. the MFCC) is known, as seen by the enhanced performance.
Moreover, the SCF feature is a normalized measure and should not<br />
be significantly affected by the intensity of speech from different<br />
sessions.<br />
The SBE feature, shown in Figure 1(d), also performed well in<br />
the experiments. This feature provides the distribution of energy in<br />
each subband as a percentage of the entire spectrum, which is another<br />
measure that can quantify the harmonic structure of the signal. The<br />
SBE feature is also a normalized energy measure and should not<br />
be significantly affected by the intensity (or relative loudness) of<br />
speech from different sessions. The results in Table II suggest
for a given vocal tract configuration the SBE trend is predictable and<br />
complementary for speaker recognition.<br />
The SBW feature depends largely on the SC feature and on the energy distribution of each subband; therefore, it also performed well for the reasons mentioned above. Figure 1(c) shows the SBW for each subband as a percentage of all subbands.
The SFM feature did not perform well because it quantifies characteristics<br />
that are not well defined in speech signals. For example,<br />
the SFM feature measures the tonality of the subband, a characteristic<br />
that is difficult to define in the speech spectrum since its energy is<br />
distributed across many frequencies.<br />
IV. CONCLUSION<br />
Features such as the SC, SCF and SBE provide vocal source<br />
information as it relates to harmonic structure, pitch frequency and<br />
spectral energy distribution, while the entropy features quantify the<br />
spectrum in terms of voiced and unvoiced speech. The proposed<br />
features were shown to be complementary in nature and enhanced<br />
performance when used with the vocal tract transfer function (i.e.<br />
MFCC). This is mainly because the vocal tract transfer function is<br />
the most discriminating feature for speaker recognition and it greatly<br />
influences the spectral shape and harmonic structures of speech.<br />
Experimental results show that the proposed spectral features<br />
improve the performance of MFCC based features. Based on 623<br />
users from the TIMIT database, the combined feature set of MFCC,<br />
ΔMFCC, SC and RE achieved an identification accuracy of 99.33%<br />
(for clean speech) by incorporating information about the vocal<br />
source. This represents a 4.03% improvement over the baseline<br />
system, which only used the MFCC based features.<br />
The good performance of the spectral features in this speaker identification system is very promising. These features
should also produce good results if used with more sophisticated<br />
speaker recognition techniques, such as universal background model<br />
(UBM) based approaches.<br />
REFERENCES<br />
[1] J. P. Campbell, “Speaker recognition: A tutorial,” Proc. of the IEEE, vol. 85, no. 9, pp. 1437–1462, Sep. 1997.
[2] D. A. Reynolds and R. C. Rose, “Robust text-independent speaker identification using Gaussian mixture speaker models,” IEEE Trans. on Speech and Audio Processing, vol. 3, no. 1, pp. 72–83, Jan. 1995.
[3] S. B. Davis and P. Mermelstein, “Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences,” IEEE Trans. on Acoustics, Speech, and Signal Processing, vol. 28, no. 4, pp. 357–366, Aug. 1980.
[4] M. D. Plumpe, T. F. Quatieri, and D. A. Reynolds, “Modeling of the glottal flow derivative waveform with application to speaker identification,” IEEE Transactions on Speech and Audio Processing, vol. 7, no. 5, pp. 569–586, Sept. 1999.
[5] J. M. Naik, “Speaker verification: A tutorial,” IEEE Communications Magazine, vol. 28, no. 1, pp. 42–48, Jan. 1990.
[6] C. Che and Q. Lin, “Speaker recognition using HMM with experiments on the YOHO database,” in Proc. Eurospeech, Sept. 1995, pp. 625–628.
[7] W. Chan, T. Lee, N. Zheng, and H. Ouyang, “Use of vocal source features in speaker segmentation,” in Proc. IEEE Int’l Conf. on Acoustics, Speech, and Signal Processing (ICASSP), vol. 1, May 2006, pp. 14–19.
[8] N. Zheng and P. Ching, “Using Haar transformed vocal source information for automatic speaker recognition,” in Proc. IEEE Int’l Conf. on Acoustics, Speech, and Signal Processing (ICASSP), vol. 1, May 2004, pp. 77–80.
[9] R. Auckenthaler, E. S. Parris, and M. J. Carey, “Improving a GMM speaker verification system by phonetic weighting,” in Proc. IEEE Int’l Conf. on Acoustics, Speech, and Signal Processing (ICASSP), Mar. 1999, pp. 313–316.
[10] J. Gonzalez-Rodriguez, S. Gruz-Llanas, and J. Ortega-Garcia, “Biometric identification through speaker verification over telephone lines,” in Proc. IEEE Int’l Carnahan Conf. on Security Technology, Oct. 1999, pp. 238–242.
[11] D. A. Reynolds, T. F. Quatieri, and R. B. Dunn, “Speaker verification using adapted Gaussian mixture models,” Digital Signal Processing, vol. 10, pp. 19–41, Jan. 2000.
[12] K. K. Paliwal, “Spectral subband centroid features for speech recognition,” in Proc. IEEE Int’l Conf. on Acoustics, Speech, and Signal Processing (ICASSP), vol. 2, May 1998, pp. 617–620.
[13] R. E. Yantorno, K. R. Krishnamachari, and J. M. Lovekin, “The spectral autocorrelation peak valley ratio (SAPVR) − a usable speech measure employed as a co-channel detection system,” in Proc. IEEE Int’l Workshop on Intelligent Signal Processing (WISP), May 2001.
[14] P. Flandrin, R. G. Baraniuk, and O. Michel, “Time-frequency complexity and information,” in Proc. IEEE Int’l Conf. on Acoustics, Speech, and Signal Processing (ICASSP), vol. 3, Apr. 1994, pp. 329–332.
[15] A. Ramalingam and S. Krishnan, “Gaussian mixture modeling of short-time Fourier transform features for audio fingerprinting,” IEEE Trans. on Information Forensics and Security, vol. 1, no. 4, pp. 457–463, 2006.
Authorized licensed use limited to: <strong>Ryerson</strong> <strong>University</strong> Library. Downloaded on July 7, 2009 at 11:54 from IEEE Xplore. Restrictions apply.
Proceedings of the 29th Annual International<br />
Conference of the IEEE EMBS<br />
Cité Internationale, Lyon, France<br />
August 23-26, 2007.<br />
Multiresolution Analysis and Classification of Small Bowel Medical<br />
Images<br />
Abstract— This paper presents the first reported work in the<br />
area of small bowel image classification; a novel analysis system<br />
was developed. Principles of human texture perception were used<br />
to design features which can discriminate between abnormal and<br />
normal images. The proposed method extracts statistical features<br />
from the wavelet domain which describe the homogeneity<br />
of localized areas within the small bowel images. To ensure that<br />
robust features were extracted, a shift-invariant discrete wavelet<br />
transform (SIDWT) was employed. LDA classification was used<br />
with the leave-one-out method to improve classification under<br />
the small-database scenario. A total of 75 abnormal and normal<br />
bowel images were used for experimentation, resulting in high<br />
classification rates: 85% specificity and 85% sensitivity. The<br />
success of the system can be attributed to the discriminatory<br />
and robust feature set (translation-, scale- and<br />
semi-rotation-invariant), which successfully classified various sizes and types<br />
of pathologies at multiple viewing angles.<br />
Index Terms— Biomedical image processing, feature extraction,<br />
computer-aided diagnosis, content-based image retrieval<br />
I. INTRODUCTION<br />
The PillCam™ SB is a tiny capsule endoscope (10 mm ×<br />
27 mm [1]) which is ingested through the mouth. As natural<br />
peristalsis moves the capsule through the gastrointestinal<br />
tract, it captures video and wirelessly transmits it to a data<br />
recorder worn around the patient’s waist.<br />
This video provides visualization of the 21-foot-long small<br />
bowel, which was originally seen as a “black box” to<br />
doctors [2]. Video is recorded for approximately 8 hours<br />
and then the capsule is excreted naturally. Clinical results<br />
for the PillCam™ show that it provides superior diagnostic<br />
capabilities for diseases of the small intestine [2].<br />
In the small intestine, there are four main types of cancers,<br />
which are named after the cell they originate from:<br />
adenocarcinoma, sarcoma, carcinoid and lymphoma. These<br />
types of cancers can occur in various sizes and shapes, and<br />
may be found anywhere along the small bowel tract. Since an<br />
internal view of the small bowel was previously not available,<br />
the PillCam TM offers gastroenterologists a new method of<br />
detecting disease. The drawback of this technology is that the<br />
doctor has to watch and diagnose approximately 8 hours of<br />
footage! Viewing this footage is a very laborious task for<br />
physicians, which could cause them to miss important clues<br />
due to fatigue, boredom or the repetitive nature of<br />
the task. Therefore, to aid doctors with this laborious<br />
task, a computer-aided diagnosis (CAD) system may be<br />
used to offer a secondary opinion of the images. Such a<br />
April Khademi and Sridhar Krishnan<br />
Dept. of Electrical and Computer Engineering<br />
Ryerson University, Toronto, ON, Canada<br />
akhademi@ieee.org, krishnan@ee.ryerson.ca<br />
1-4244-0788-5/07/$20.00 ©2007 IEEE 4524<br />
system would automatically isolate suspicious video instants<br />
(images) for the doctor. The extracted features may also<br />
be used for content-based image retrieval (CBIR), where<br />
physicians can locate abnormal image(s) based on their<br />
semantic content, not based on text annotations.<br />
There are several challenges associated with the development<br />
of an automated classification scheme for small bowel<br />
imagery: the camera angle can be expected to be different<br />
from patient to patient, suspicious regions may occur in<br />
several different places along the gastrointestinal tract and<br />
pathologies come in various forms, sizes and shapes. This<br />
work aims to develop a unified feature extraction algorithm<br />
which can account for all these scenarios. This is the first<br />
reported work in the area of small bowel image classification<br />
and the system aims to detect both malignant and benign<br />
pathologies with a high classification rate. The small bowel<br />
images (video instants) are stored as lossy JPEG images, so<br />
feature extraction is completed in the compressed domain.<br />
II. METHODOLOGY<br />
To extract highly discriminatory features, image processing<br />
techniques are needed to analyze and understand the biomedical<br />
images. Since biomedical signals (including small<br />
bowel images) contain a combination of information which is<br />
localized spatially (e.g. transients, edges) as well as structures<br />
which are more diffuse (e.g. small oscillations, texture) [3], a<br />
technique which can exploit both these characteristics (which<br />
may be related to the diagnosis) is required. To perform this<br />
task, the discrete wavelet transform (DWT) will be utilized<br />
due to its excellent space-localization properties [4], [6].<br />
A. DWT Properties for Feature Extraction<br />
The DWT is scale-invariant since a complete decomposition<br />
will contain all the basis functions needed to decompose<br />
various scaled versions of the input image. Since pathological<br />
features do not come in a predefined size, scale-invariance<br />
will help to capture pathologies of different sizes.<br />
Although the DWT offers good localization and scale-invariance<br />
properties, it is well known that the DWT is shift-variant<br />
[4]: different translations of an input image result<br />
in different sets of DWT coefficients. Therefore, extracting<br />
robust features from the wavelet domain is a challenging<br />
task.<br />
B. Shift-Invariant DWT<br />
To extract a consistent feature set, the 2-D version of<br />
Beylkin’s shift-invariant DWT (SIDWT) is utilized [5]. This<br />
algorithm computes the DWT for all circular translates of<br />
the image, in a computationally efficient manner. Coifman<br />
and Wickerhauser’s best basis algorithm [6] is employed to<br />
ensure the same set of coefficients are chosen, regardless<br />
of the input shift. This will permit for the selection of a<br />
consistent set of DWT coefficients, therefore allowing for<br />
the extraction of robust, shift-invariant features.<br />
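The circular-translate idea can be sketched numerically. Below is a minimal 1-D illustration, not the efficient algorithm of [5]: it computes a Haar DWT for every circular shift of the input and keeps the coefficient set that minimizes a Coifman–Wickerhauser-style entropy cost, so the same coefficients are selected regardless of the input shift. The wavelet, number of levels and signal length are illustrative assumptions.<br />

```python
import numpy as np

def haar_level(x):
    """One level of the Haar DWT: approximation and detail coefficients."""
    a = (x[0::2] + x[1::2]) / np.sqrt(2)
    d = (x[0::2] - x[1::2]) / np.sqrt(2)
    return a, d

def entropy_cost(c):
    """Additive entropy-type cost used in best-basis selection."""
    p = c**2 / np.sum(c**2)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def sidwt(x, levels=2):
    """DWT of every circular translate of x; keep the translate whose
    coefficients minimize the entropy cost (shift-invariant selection)."""
    best_cost, best_coeffs = np.inf, None
    for s in range(len(x)):
        a, coeffs = np.roll(x, s), []
        for _ in range(levels):
            a, d = haar_level(a)
            coeffs.append(d)
        coeffs.append(a)
        c = np.concatenate(coeffs)
        cost = entropy_cost(c)
        if cost < best_cost:
            best_cost, best_coeffs = cost, c
    return best_coeffs

x = np.random.default_rng(0).standard_normal(64)
same = np.allclose(sidwt(x), sidwt(np.roll(x, 5)))
print(same)  # the same coefficient set is selected for both inputs
```

This brute-force search over all shifts conveys the principle only; the algorithm in [5] avoids the redundant computation.<br />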
C. Image Texture<br />
When a textured surface is viewed, the human visual system<br />
can discriminate between textured regions quite easily.<br />
To understand how the human visual system can easily differentiate<br />
between textures, Julesz defined textons, which are<br />
elementary units of texture [7]. Various textured regions can<br />
be decomposed using these textons, which include elongated<br />
blobs, lines, terminators and more. It was also found that the<br />
frequency content, scale, orientation and periodicity of these<br />
textons can provide important clues on how to differentiate<br />
between two or more textured areas [7], [4]. Therefore,<br />
a robust texture analysis scheme would take into account<br />
the localized spatial properties of images to understand the<br />
orientation, periodicity, scale or frequency content of the<br />
primitive texture elements. Consequently, to differentiate<br />
between normal and pathological cases of the small intestine,<br />
the proposed work aims to develop an automated system<br />
which mimics the human visual system to understand the<br />
texture content of the small bowel images.<br />
Small Bowel Texture: Normal small bowel images<br />
contain smooth, homogeneous texture elements with very<br />
little disruption in uniformity except for folds and crevices.<br />
This is shown in Figure 1(d)-(f). Abnormal small bowel<br />
images (benign and malignant) can contain various pathologies<br />
(polyp, Kaposi’s sarcoma, carcinoma, etc.). These diseases<br />
may occur in various sizes, shapes, orientations and<br />
locations within the gastrointestinal tract. Although there<br />
are many types of diseases, small bowel pathologies have<br />
some common textural characteristics: (1) diseased regions<br />
contain a variety of textured areas simultaneously and (2)<br />
diseased areas are mostly composed of heterogeneous texture<br />
components. Typical, abnormal cases are shown in Figure<br />
1(a)-(c). Another important factor which must be considered<br />
is that the camera angle will vary from image to image.<br />
Therefore, textural characteristics may appear in several<br />
orientations.<br />
D. Features<br />
To extract texture-based features, normalized gray-level co-occurrence<br />
matrices (GCMs) are used. Let each entry of<br />
the normalized GCM be represented as p(l1, l2, d, θ), where<br />
l1 and l2 are two gray levels separated by a distance d at an angle θ.<br />
Normalized GCMs allow statistical quantities to be computed<br />
which reflect the textural properties of the region of interest.<br />
To exploit the textural characteristics of the small bowel<br />
images, texture features which describe the relative homogeneity<br />
or non-uniformity of the images are used since these<br />
texture properties differentiate between the normal and the<br />
abnormal images. The features used are homogeneity (h),<br />
which describes how uniform the texture is, and entropy (e),<br />
which is a measure of non-uniformity or the complexity of<br />
the texture:<br />
h(θ) = Σ_{l1=0}^{L−1} Σ_{l2=0}^{L−1} p²(l1, l2, d, θ),  (1)<br />
e(θ) = −Σ_{l1=0}^{L−1} Σ_{l2=0}^{L−1} p(l1, l2, d, θ) log₂ p(l1, l2, d, θ),  (2)<br />
where L is the number of gray levels.<br />
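A minimal sketch of Eqs. (1) and (2): the GCM is accumulated for a chosen distance d and angle θ, normalized, and the homogeneity and entropy statistics are computed. The offset convention for θ and the 8-level quantization are illustrative assumptions, not the paper’s exact implementation.<br />

```python
import numpy as np

def glcm(img, d=1, theta=0, levels=8):
    """Normalized gray-level co-occurrence matrix p(l1, l2, d, theta).
    theta in degrees (0, 45, 90 or 135); assumed pixel-offset convention."""
    dr, dc = {0: (0, d), 45: (-d, d), 90: (-d, 0), 135: (-d, -d)}[theta]
    P = np.zeros((levels, levels))
    rows, cols = img.shape
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + dr, c + dc
            if 0 <= r2 < rows and 0 <= c2 < cols:
                P[img[r, c], img[r2, c2]] += 1
    return P / P.sum()

def homogeneity(P):
    """Eq. (1): sum of squared entries (uniform texture -> large h)."""
    return np.sum(P**2)

def entropy(P):
    """Eq. (2): Shannon entropy of the co-occurrence distribution."""
    p = P[P > 0]
    return -np.sum(p * np.log2(p))

rng = np.random.default_rng(1)
flat = np.zeros((32, 32), dtype=int)       # perfectly uniform patch
noisy = rng.integers(0, 8, size=(32, 32))  # heterogeneous patch
P_flat, P_noisy = glcm(flat), glcm(noisy)
print(homogeneity(P_flat))  # 1.0: a perfectly uniform patch
print(homogeneity(P_noisy) < 1.0, entropy(P_noisy) > entropy(P_flat))  # True True
```

As expected, the uniform patch has maximal homogeneity and near-zero entropy, while the heterogeneous patch behaves oppositely, which is exactly the contrast the features exploit.<br />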
To gain a highly descriptive representation, textural features<br />
are computed from the wavelet domain. Extracting<br />
features from the wavelet domain will result in a localized<br />
texture description, since the DWT has excellent space-localization<br />
properties. To account for oriented texture, the<br />
GCMs are computed at various angles in the wavelet domain<br />
at d = 1 to account for fine texture. Typically, the DWT is not<br />
used for texture analysis due to its shift-variant property.<br />
However, using the SIDWT algorithm previously described<br />
will permit for the extraction of a consistent feature set,<br />
thus allowing for multiscale texture analysis. This scheme is<br />
devised to be in accordance with human texture perception.<br />
1) Multiresolutional Features: To examine texture features<br />
at various scales, GCMs p (l1, l2, d, θ) are computed<br />
from the wavelet domain for each scale j at several angles<br />
θ. Each subband isolates different frequency components:<br />
the HL band isolates vertical edge components, the LH<br />
subband isolates horizontal edges, the HH band captures the<br />
diagonal high-frequency components and the LL band contains<br />
the lowpass-filtered version of the original. Consequently,<br />
to capture these oriented texture components, the GCM is<br />
computed at 0° in the HL band, 90° in the LH subband,<br />
45° and 135° in the HH band, and 0°, 45°, 90° and 135° in<br />
the LL band to account for any directional elements which<br />
may still be present in the low-frequency subband.<br />
From these GCMs, homogeneity h(θ) and entropy e(θ) are<br />
computed for each decomposition level using Equations 1 and<br />
2. For each decomposition level j, more than one directional<br />
feature is generated for the HH and LL subbands. The<br />
features in these subbands are averaged so that the features<br />
are not biased to a particular orientation of texture and<br />
the representation offers some rotational invariance. The<br />
features generated in these subbands (HH and LL) are<br />
shown below; the quantity in parentheses is the<br />
angle at which the GCM was computed.<br />
h^j_HH = (1/2)[h^j_HH(45°) + h^j_HH(135°)],  (3)<br />
e^j_HH = (1/2)[e^j_HH(45°) + e^j_HH(135°)],  (4)<br />
h^j_LL = (1/4)[h^j_LL(0°) + h^j_LL(45°) + h^j_LL(90°) + h^j_LL(135°)],  (5)<br />
e^j_LL = (1/4)[e^j_LL(0°) + e^j_LL(45°) + e^j_LL(90°) + e^j_LL(135°)].  (6)<br />
Fig. 1. Typical images of the small bowel captured by the PillCam TM SB, which exhibit textural characteristics. (a) Small bowel lymphoma, (b) GIST<br />
tumour, (c) polypoid mass, (d) healthy small bowel, (e) normal small bowel, (f) normal colonic mucosa.<br />
As a result, for each decomposition level j, two feature sets<br />
are generated:<br />
F^j_h = {h^j_HL(0°), h^j_LH(90°), h^j_HH, h^j_LL},  (7)<br />
F^j_e = {e^j_HL(0°), e^j_LH(90°), e^j_HH, e^j_LL},  (8)<br />
where h^j_HH, h^j_LL, e^j_HH and e^j_LL are the averaged texture<br />
descriptions from the HH and LL bands previously described<br />
and h^j_HL(0°), e^j_HL(0°), h^j_LH(90°) and e^j_LH(90°) are homogeneity<br />
and entropy texture measures extracted from the HL<br />
and LH bands. Since directional GCMs are used to compute<br />
the features in each subband, the final feature representation<br />
is not biased for a particular orientation of texture and may<br />
provide a semi-rotational invariant representation.<br />
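Assembling Eqs. (3)–(8) for one decomposition level can be sketched as follows; the per-angle measurements here are hypothetical toy values standing in for the wavelet-domain GCM features.<br />

```python
def subband_features(h, e):
    """Assemble the level-j feature sets of Eqs. (3)-(8) from per-angle
    homogeneity h[band][angle] and entropy e[band][angle] measurements."""
    h_HH = (h["HH"][45] + h["HH"][135]) / 2               # Eq. (3)
    e_HH = (e["HH"][45] + e["HH"][135]) / 2               # Eq. (4)
    h_LL = sum(h["LL"][a] for a in (0, 45, 90, 135)) / 4  # Eq. (5)
    e_LL = sum(e["LL"][a] for a in (0, 45, 90, 135)) / 4  # Eq. (6)
    F_h = [h["HL"][0], h["LH"][90], h_HH, h_LL]           # Eq. (7)
    F_e = [e["HL"][0], e["LH"][90], e_HH, e_LL]           # Eq. (8)
    return F_h, F_e

# Hypothetical per-angle measurements for one decomposition level.
h = {"HL": {0: 0.8}, "LH": {90: 0.7}, "HH": {45: 0.6, 135: 0.4},
     "LL": {0: 0.9, 45: 0.8, 90: 0.7, 135: 0.6}}
e = {"HL": {0: 1.1}, "LH": {90: 1.3}, "HH": {45: 2.0, 135: 1.0},
     "LL": {0: 0.5, 45: 0.7, 90: 0.9, 135: 1.1}}
F_h, F_e = subband_features(h, e)
print(F_h)
print(F_e)
```

Averaging the HH features over {45°, 135°} and the LL features over all four angles is what removes the orientation bias and yields the semi-rotation-invariant representation.<br />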
E. Classifier<br />
A large number of test samples is required to evaluate a<br />
classifier with low error (misclassification) rates, since a small<br />
database will cause the parameters of the classifier to be estimated<br />
with low accuracy. This requires the biomedical image<br />
database to be large, which may not always be the case, since<br />
acquiring the images is not always easy and the number<br />
of pathologies may be limited. If the extracted features are<br />
strong (i.e. the features are mapped into nonoverlapping<br />
clusters in the feature space), the use of a simple classification<br />
scheme will be sufficient for discriminating between classes.<br />
Therefore, linear discriminant analysis (LDA) will be the<br />
classification scheme used in conjunction with the Leave One<br />
Out Method (LOOM). LOOM combats the small-sample-size<br />
scenario by removing one sample from the whole set and<br />
generating the discriminant functions from the remaining<br />
N − 1 data samples. Using these discriminant functions, the<br />
left-out sample is classified. This procedure is repeated<br />
for all N samples. LOOM allows classifier parameters to be<br />
estimated with the least bias [8].<br />
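The LDA-plus-LOOM scheme can be sketched as follows. This is a generic two-class LDA with pooled covariance run on synthetic feature clusters, not the paper’s implementation; the toy data dimensions and regularization constant are assumptions.<br />

```python
import numpy as np

def lda_classify(X_train, y_train, x):
    """Two-class LDA: assign x to the class with the larger linear
    discriminant score under a pooled within-class covariance."""
    mu = [X_train[y_train == c].mean(axis=0) for c in (0, 1)]
    # Pooled within-class covariance (lightly regularized for stability).
    Sw = sum(np.cov(X_train[y_train == c].T, bias=True) for c in (0, 1))
    Sw += 1e-6 * np.eye(X_train.shape[1])
    Swi = np.linalg.inv(Sw)
    scores = [x @ Swi @ mu[c] - 0.5 * mu[c] @ Swi @ mu[c]
              + np.log(np.mean(y_train == c)) for c in (0, 1)]
    return int(np.argmax(scores))

def leave_one_out(X, y):
    """LOOM: train on N-1 samples, classify the held-out one, repeat."""
    correct = 0
    for i in range(len(y)):
        mask = np.arange(len(y)) != i
        correct += lda_classify(X[mask], y[mask], X[i]) == y[i]
    return correct / len(y)

# Hypothetical well-separated feature clusters (40 "normal", 35 "abnormal").
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (40, 4)), rng.normal(2, 1, (35, 4))])
y = np.array([0] * 40 + [1] * 35)
acc = leave_one_out(X, y)
print(round(acc, 2))  # well-separated toy clusters classify near-perfectly
```

Because every sample is classified by a model that never saw it, the LOOM accuracy is an almost unbiased estimate even at this small sample size.<br />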
III. RESULTS AND DISCUSSION<br />
The objective of the proposed system is to automatically<br />
classify various pathologies from normal regions throughout<br />
the small bowel tract. The small intestine images used<br />
are 256×256, 24 bpp, lossy JPEG images. Forty-one normal<br />
and 34 abnormal images (including various-sized lesions such<br />
as submucosal masses, lymphomas, jejunal carcinomas, polypoid<br />
masses, Kaposi’s sarcomas, multifocal carcinomas, etc.)<br />
were used for experimentation; ground truth is supplied<br />
with the database and the images were acquired from<br />
various patients.<br />
TABLE I<br />
CONFUSION MATRIX CONTAINING THE NUMBER OF CORRECTLY<br />
CLASSIFIED SMALL BOWEL IMAGES AS EITHER NORMAL OR ABNORMAL.<br />
| Classified Normal | Classified Abnormal<br />
Normal | 35 (85%) | 6 (15%)<br />
Abnormal | 5 (15%) | 29 (85%)<br />
The images were converted to grayscale<br />
prior to any processing to examine the feature set in this<br />
domain. Features were extracted for the first five levels of<br />
decomposition. Further decomposition levels will result in<br />
subbands of 8×8 or smaller, which will result in skewed<br />
probability distribution (GCM) estimates and thus were not<br />
included in the analysis. Therefore, the extracted features are<br />
F^j_h and F^j_e for j = {1, 2, 3, 4, 5}. The block diagram of the<br />
proposed system is shown in Figure 2.<br />
In order to find the optimal sub-feature set, an exhaustive<br />
search was performed (i.e. all possible feature combinations<br />
were tested using the proposed classification scheme). The<br />
optimal classification performance was achieved by combining<br />
homogeneity features from the first and third decomposition<br />
levels with entropy from the first decomposition level.<br />
These three feature sets are shown below:<br />
F^1_h = {h^1_HL(0°), h^1_LH(90°), h^1_HH, h^1_LL},  (9)<br />
F^3_h = {h^3_HL(0°), h^3_LH(90°), h^3_HH, h^3_LL},  (10)<br />
F^1_e = {e^1_HL(0°), e^1_LH(90°), e^1_HH, e^1_LL}.  (11)<br />
Using the above features in conjunction with LOOM and<br />
LDA, the classification results for the small bowel images<br />
are shown as a confusion matrix in Table I. A total of 75<br />
abnormal and normal bowel images were classified with<br />
85% specificity and 85% sensitivity. The classification<br />
rates are high even though: (1) the angle of the<br />
camera (or viewing angle) is different from image to image,<br />
(2) the images came from various patients and different<br />
regions within the gastrointestinal tract, (3) the pathologies<br />
were not restricted to a specific type, but in fact included<br />
many diseases and (4) the masses and lesions were of<br />
various sizes and shapes. The success of the system can be<br />
attributed to several factors. Firstly, the utilization of the<br />
DWT was important to gain a space-localized representation<br />
of the images’ nonstationary properties. Secondly, the choice<br />
of wavelet-based statistical texture measures (entropy and<br />
homogeneity) was critical in differentiating between the<br />
localized texture properties of the images, since abnormal<br />
images contain localized heterogeneous texture elements,<br />
whereas normal images are smooth (uniform). Utilization of<br />
the SIDWT allowed for the extraction of consistent (i.e. shift-invariant)<br />
features. Furthermore, due to the scale-invariant<br />
basis functions of the DWT, pathologies of varying sizes<br />
were captured within one transformation (i.e. the features<br />
were scale-invariant).<br />
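The rates reported above follow directly from the confusion counts in Table I; a quick check:<br />

```python
# Table I counts: rows = actual class, columns = assigned class.
tn, fp = 35, 6   # normal images: 35 classified normal, 6 classified abnormal
fn, tp = 5, 29   # abnormal images: 5 classified normal, 29 classified abnormal

specificity = tn / (tn + fp)   # fraction of normals correctly classified
sensitivity = tp / (tp + fn)   # fraction of abnormals correctly classified
print(f"{specificity:.0%} {sensitivity:.0%}")  # 85% 85%
```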
The system is relatively robust to the different camera<br />
angles by design. Since the viewing angle is different from<br />
image to image, features were collected at various angles (0°,<br />
45°, 90°, 135°) in the respective subbands in order to account<br />
for the texture properties, regardless of the orientation. The<br />
feature set thus offered a semi-rotational invariant representation<br />
which could account for oriented textural properties at<br />
various angles within the gastrointestinal tract.<br />
Since this is the first work in the area of small bowel<br />
image classification, the results are promising and show<br />
great potential for applications such as CAD and CBIR. This<br />
is especially true since all features were extracted in a fully<br />
automated manner, without any intervention or assistance<br />
from a gastroenterologist. This means that such a system<br />
could in fact be used as a tool which could either (1) sort<br />
the 8 hours of film and highlight suspicious regions or (2)<br />
automatically retrieve a specific region or mass, without<br />
having to use text annotations.<br />
Although the classification results are high, misclassifications<br />
can be attributed to cases where there is a lack<br />
of statistical differentiation between the texture uniformity<br />
of the abnormal and normal small bowel images. Additionally,<br />
normal tissue can sometimes assume the properties<br />
of abnormal regions; for example, consider a normal small<br />
bowel image which has more than the average amount of<br />
folds. This may be characterized as non-uniform texture and<br />
consequently would be misclassified.<br />
Another important consideration arises from the sizes of<br />
the databases. As was stated in Section II-E, the number of<br />
images used for classification can determine the accuracy<br />
of the estimated classifier parameters. Since only a modest<br />
number of images was used, misclassification could result<br />
from improper estimation of the classifier’s parameters<br />
(although the scheme tried to combat this with LOOM).<br />
Additionally, finding the right trade-off between the number of<br />
features and the database size is an ongoing research topic and<br />
has yet to be perfectly defined [8].<br />
A final point for discussion is that the features were<br />
successfully extracted from the compressed domain. Since<br />
many forms of multimedia are stored in lossy formats,<br />
it is important that classification systems also remain<br />
effective when utilized in the compressed domain.<br />
Fig. 2. System block diagram for the classification of small bowel images.<br />
IV. CONCLUSIONS<br />
A unified feature extraction and classification scheme was<br />
developed using the DWT for small bowel images and this<br />
is the first reported work in the area. Textural features<br />
were extracted from the wavelet domain in order to obtain<br />
localized numerical descriptors of the relative homogeneity<br />
of the small bowel images. To ensure the DWT representation<br />
was suitable for the consistent extraction of features, a shift-invariant<br />
discrete wavelet transform (SIDWT) was computed.<br />
To combat the small database size, a small number of<br />
features and LDA classification were used in conjunction<br />
with the LOOM to gain a more accurate approximation of<br />
the classifier’s parameters.<br />
Seventy-five abnormal and normal bowel images were<br />
classified with 85% specificity and 85%<br />
sensitivity. The success of the system can be attributed<br />
to the semi-rotation-invariant, scale-invariant and shift-invariant<br />
features, which permitted the extraction of discriminating<br />
features for multiple camera angles and various-sized<br />
pathologies. Due to the success of the proposed work, it may<br />
be used in a CAD scheme or a CBIR application to assist<br />
gastroenterologists in diagnosing and sorting 8 hours of footage.<br />
REFERENCES<br />
[1] B. Kim, S. Park, C. Jee, and S. Yoon, “An earthworm-like locomotive<br />
mechanism for capsule endoscopes,” in Proc. International Conference on<br />
Intelligent Robots and Systems, Aug. 2005, pp. 2997–3002.<br />
[2] Given Imaging Ltd., PillCam TM SB Capsule Endoscopy. [ONLINE],<br />
2006, http://www.givenimaging.com/.<br />
[3] M. Unser and A. Aldroubi, “A review of wavelets in biomedical<br />
applications,” Proceedings of the IEEE, vol. 84, no. 4, pp. 626–638,<br />
Apr. 1996.<br />
[4] S. Mallat, A Wavelet Tour of Signal Processing. USA: Academic Press,<br />
1998.<br />
[5] J. Liang and T. Parks, “Image coding using translation invariant wavelet<br />
transforms with symmetric extensions,” IEEE Transactions on Image<br />
Processing, vol. 7, pp. 762–769, May 1998.<br />
[6] A. Khademi, “Multiresolutional analysis for classification and compression<br />
of medical images,” Master’s thesis, Ryerson University, Canada, 2006.<br />
[7] B. Julesz, “Textons, the elements of texture perception, and their<br />
interactions,” Nature, vol. 290, no. 5802, pp. 91–97, Mar. 1981.<br />
[8] K. Fukunaga and R. Hayes, “Effects of sample size in classifier design,”<br />
IEEE Transactions on Pattern Analysis and Machine Intelligence,<br />
vol. 11, no. 8, pp. 873–885, Aug. 1989.<br />
This full text paper was peer reviewed at the direction of IEEE Communications Society subject matter experts for publication in the ICC 2007 proceedings.<br />
Interference Detection in Spread Spectrum<br />
Communication Using Polynomial Phase Transform<br />
Randa Zarifeh and Nandini Alinier<br />
School of Electronics, Communication<br />
and Electrical Engineering<br />
<strong>University</strong> of Hertfordshire<br />
Hertfordshire, UK<br />
Email: rzarifeh@ee.ryerson.ca, n.d.alinier@herts.ac.uk<br />
Abstract—In this paper we propose an interference detection<br />
technique for detecting time-varying jamming signals in spread<br />
spectrum communication systems. The technique is based on<br />
the Discrete Polynomial Phase Transform (DPPT), where the jamming<br />
signal is synthesized from the modulated spread spectrum<br />
signal using the DPPT. The technique shows good performance<br />
under low-interference conditions: at a signal-to-jammer ratio (SJR) of 2 dB, the<br />
correlation coefficient between the synthesized chirp signal and<br />
the reference chirp is 0.9. The computational complexity of the<br />
proposed technique is low compared to that of other techniques such as<br />
the Hough-Radon Transform. This interference detection technique<br />
can be applied with different interference excision methods in<br />
military and wireless communication applications.<br />
I. INTRODUCTION<br />
The most commonly used type of spread spectrum signal is<br />
the direct sequence (DS/SS) spread spectrum signal, where a<br />
pseudorandom (PN) sequence is superimposed upon the data<br />
bits to achieve data spreading over a wider bandwidth. This<br />
increase in the bandwidth yields a processing gain, defined<br />
as the ratio of the bandwidth of the transmitted signal to the<br />
bandwidth of the message signal. The spread spectrum signal<br />
is not easily detected since it appears to be noise-like except<br />
at the intended receiver where the PN sequence is known.<br />
The DS spread spectrum signals are often used for their<br />
interference rejection capabilities in military and wireless communications.<br />
While SS systems can strongly reject narrowband<br />
interference, they fail in rejecting wideband interference. In<br />
practical systems, it is not possible to transmit a high-power<br />
wideband jamming signal due to power limitations. For this<br />
reason, most jamming signals are considered to be wideband<br />
signals with a narrowband instantaneous frequency, such as<br />
chirp signals and linear or nonlinear FM signals. The<br />
performance of the SS system can be further improved by<br />
detecting the interference/jamming and excising it prior to data<br />
despreading and detection.<br />
Several research efforts have addressed the area of<br />
interference (chirp) detection; some of the proposed methods<br />
use adaptive filters [1], evolutionary algorithms [2], and<br />
maximum likelihood estimation. The optimal method is the<br />
maximum likelihood technique which integrates along the<br />
Instantaneous Frequency (IF) ridge in the time frequency<br />
distribution. But if the initial information on the position of<br />
1-4244-0353-7/07/$25.00 ©2007 IEEE<br />
Sridhar Krishnan and Alagan Anpalagan<br />
Department of Electrical<br />
and Computer Engineering<br />
<strong>Ryerson</strong> <strong>University</strong><br />
Toronto, Canada<br />
Email: (krishnan)(alagan)@ee.ryerson.ca<br />
the IF is not available, the integration will be taken along all<br />
possible lines in the TF domain. Maximum likelihood can<br />
also be applied to the IF ridge obtained from wavelet<br />
transforms, as done in [3]. This method has high computational<br />
complexity, especially when an initial estimate of the<br />
IF is not available. Another technique was proposed by Amin<br />
et al. [12], where they evaluated the Wigner-Ville Distribution<br />
(WVD) of the observed signal and estimated the IF parameters<br />
from the WVD. Once the parameters have been estimated,<br />
an adaptive time varying filter can be set up to suppress the<br />
interference. One of the problems related to this method is<br />
that, if the interference is low with respect to the SS signal or<br />
the noise, the estimation of the interference parameter can fail<br />
and the suppression filter can track the useful signal instead<br />
of the interference.<br />
A linear chirp signal was detected by applying Hough-<br />
Radon Transform (HRT) on the WVD or the Spectrogram<br />
of the signal [4], [5]. The HRT is an optimal technique for<br />
detecting directional lines in an image, but it requires a high<br />
degree of computational complexity. Other techniques for<br />
chirp detection are based on signal synthesis. Previous work<br />
on signal synthesis from the bilinear Time Frequency Distribution<br />
(TFD) was done by Bartel and Parks [6]. In their work<br />
the signal was synthesized from the WVD using a least-squares<br />
approximation. Krattenthaler and Hlawatsch extended<br />
the work in [6] and synthesized the chirp signal from<br />
smoothed versions of the WVD [7]. These techniques are<br />
based on the least-squares approximation and hence have high<br />
computational complexity.<br />
The motivation of this work is to detect a jamming/interfering<br />
chirp signal in a spread spectrum communication system. The<br />
interfering signal could come from an intentional jammer (hostile<br />
source) or from multipath effects in a multipath<br />
channel. In this paper we propose a chirp detection technique<br />
using a signal synthesis approach, where a parametric signal<br />
analysis approach is used to represent the time-domain chirp<br />
signal. The proposed solution based on the DPPT will detect the<br />
chirp jammer/interferer even under low jamming power with<br />
low computational complexity, and hence is better than the<br />
existing approaches. The proposed technique is a good interference<br />
detection tool that can be applied prior to interference<br />
2979<br />
Authorized licensed use limited to: <strong>Ryerson</strong> <strong>University</strong> Library. Downloaded on July 7, 2009 at 11:52 from IEEE Xplore. Restrictions apply.<br />
14
excision blocks in communications.<br />
The paper is organized as follows: Section II describes<br />
the signal and system model, the spread spectrum system, and<br />
chirp signals. Section III defines the discrete polynomial<br />
phase transform technique. Section IV presents numerical and<br />
simulation results, and Section V concludes the paper.<br />
II. SIGNAL AND SYSTEM MODEL<br />
A. Spread Spectrum System<br />
This full text paper was peer reviewed at the direction of IEEE Communications Society subject matter experts for publication in the ICC 2007 proceedings.<br />
Assuming Binary Phase Shift Keying (BPSK) modulation,<br />
the transmitted spread spectrum signal $s(t)$ consists of the<br />
message signal $m(t)$ and the spreading signal $p(t)$,<br />
$$s(t) = m(t)\,p(t), \qquad (1)$$<br />
where<br />
$$m(t) = \sum_{k} b_k\,\mathrm{rect}_{T_m}(t - kT_m), \qquad (2)$$<br />
$b_k \in \{+1,-1\}$ are the message bits, $\mathrm{rect}_{T_m}$ is a rectangular<br />
pulse of duration $T_m$, and<br />
$$p(t) = \sum_{n=0}^{L-1} c_n\,\mathrm{rect}_{T_p}(t - nT_p), \qquad (3)$$<br />
where $c_n \in \{+1,-1\}$ is the $n$th chip of the $L$-element PN<br />
sequence. The transmitted signal can therefore be written as<br />
$$s(t) = \sum_{k} b_k\,p(t - kT_m). \qquad (4)$$<br />
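As an illustration of (1)-(4), the spreading operation can be sketched at chip rate (one sample per chip; the variable names below are ours, not the paper's):<br />

```python
import numpy as np

rng = np.random.default_rng(0)

L = 128                                  # chips per data bit, as in Section IV
n_bits = 4
bits = rng.choice([-1, 1], size=n_bits)  # message bits b_k in {+1, -1}
pn = rng.choice([-1, 1], size=L)         # PN chips c_n in {+1, -1}

# Eq. (4): s(t) = sum_k b_k p(t - k T_m), one PN period per message bit
s = np.concatenate([b * pn for b in bits])
```

Each message bit is simply a signed copy of the PN sequence, which is what the correlator at the receiver exploits.<br />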
During the transmission of the modulated signal, additive<br />
white Gaussian noise $n(t)$ (with zero mean and variance $\sigma^2$)<br />
and interference $i(t)$ are added to the signal in the channel,<br />
and the following signal is received:<br />
$$r(t) = s(t) + n(t) + i(t). \qquad (5)$$<br />
At the receiver, the received signal $r(t)$ is synchronized and<br />
correlated with the same PN sequence (known to the intended<br />
receiver), and an estimate $\hat{m}_k$ of the message signal is made<br />
based on the polarity of the recovered message bits,<br />
$$\hat{m}_k = \langle r(t),p(t)\rangle = m_k\langle p(t),p(t)\rangle + \langle n(t),p(t)\rangle + \langle i(t),p(t)\rangle, \qquad (6)$$<br />
where $\langle\cdot,\cdot\rangle$ is the correlation operator. From (6) it can be<br />
seen that correlating the received signal with the PN sequence<br />
$p(t)$ recovers the message signal but spreads both the<br />
noise and the interference. If the ratio of interference power<br />
to signal power is so large that the processing gain cannot<br />
suppress the interference, the message bit will be estimated<br />
incorrectly. The SS system (shown in Fig. 1) is able to<br />
recover the correct data bit at low interference, but when the<br />
interference is high and time varying the SS system will fail.<br />
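A minimal despreading sketch of (5)-(6), with noise only and no jammer (names are illustrative):<br />

```python
import numpy as np

rng = np.random.default_rng(1)
L, n_bits = 128, 50
bits = rng.choice([-1, 1], size=n_bits)
pn = rng.choice([-1, 1], size=L)
s = np.concatenate([b * pn for b in bits])   # Eq. (4)

r = s + 0.5 * rng.standard_normal(s.size)    # Eq. (5), with i(t) = 0

# Eq. (6): correlate each bit interval with the PN sequence;
# the processing gain of L = 128 suppresses the white noise
corr = r.reshape(n_bits, L) @ pn
recovered = np.sign(corr)
ber = np.mean(recovered != bits)
```

With no interference every bit is recovered; a strong time-varying jammer would push `corr` across zero for some bits, which is exactly the failure mode the DPPT detector targets.<br />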
[Fig. 1. Spread Spectrum System Block Diagram.]<br />
B. Chirp Signal<br />
Chirp signals are present in many areas of science and<br />
engineering. They are present in natural signals such as animal<br />
sounds and whistling sounds. Because of their ability to<br />
reject interference, they are widely used in spread spectrum<br />
communications, military communications, radar and sonar<br />
applications.<br />
Mathematically, chirp signals are modeled as nonstationary<br />
signals with polynomial phase parameters. A polynomial phase<br />
signal $y(n)$ can be expressed as:<br />
$$y(n) = b_0\,\exp\{j\phi(n)\} = b_0\,\exp\left\{j\sum_{m=0}^{M} a_m (n\Delta)^m\right\}, \qquad (7)$$<br />
where $\phi(n)$ is the phase of the signal, $M$ is the polynomial<br />
order, $N$ is the total signal length, $\Delta$ is the sampling<br />
interval, and $b_0$ is the signal amplitude.<br />
In this paper we deal with linear and parabolic (nonlinear)<br />
chirp signals as interferences, whose phases are<br />
second- and third-order polynomial functions ($M = 2, 3$).<br />
Figures 2 and 3 show the time-frequency representations of<br />
the linear and parabolic chirp signals, respectively.<br />
[Fig. 2. TF representation of Linear Chirp. Axes: normalized frequency 0-1 vs. time 0-6000.]<br />
III. DISCRETE POLYNOMIAL PHASE TRANSFORM (DPPT)<br />
The DPPT is a parametric signal analysis approach for<br />
estimating the phase parameters of a polynomial phase signal<br />
[10][11][14]. Normally, the phase parameters of a signal<br />
are determined by applying a least-squares approximation to fit<br />
[Fig. 3. TF representation of Parabolic Chirp. Axes: normalized frequency 0-1 vs. time 0-6000.]<br />
a polynomial to the phase curve. This process poses some<br />
difficulty, especially when the phase is not available. The<br />
DPPT, on the other hand, is applied directly to the signal,<br />
and it works quite well even in the presence of noise.<br />
The principle of the DPPT is as follows: when the DPPT of<br />
order $M$ is applied to a signal with a polynomial phase of order<br />
$M$, it produces a spectral peak. The position of this spectral<br />
peak at frequency $\omega_0$ provides an estimate of the coefficient<br />
$\hat{a}_M$. After the estimation of $\hat{a}_M$, the order of the polynomial<br />
is reduced from $M$ to $M-1$ by multiplying the signal with<br />
the conjugate of the estimated phase term. The coefficient<br />
$\hat{a}_{M-1}$ is then estimated in the same way by applying the DPPT<br />
of order $M-1$ to the signal. The procedure is repeated until<br />
all of the coefficients are estimated.<br />
The DPPT of order $M$ of a continuous-phase signal $y(n)$<br />
is the Fourier transform of the higher-order operator $\mathrm{DP}_M[y(n),\tau]$:<br />
$$\mathrm{DPPT}_M[y(n),\omega,\tau] = \sum_{n=(M-1)\tau}^{N-1} \mathrm{DP}_M[y(n),\tau]\,\exp(-j\omega n\Delta), \qquad (8)$$<br />
where $\tau$ is a positive number and:<br />
$$\mathrm{DP}_1[y(n),\tau] := y(n), \qquad (9)$$<br />
$$\mathrm{DP}_2[y(n),\tau] := y(n)\,y^{*}(n-\tau), \qquad (10)$$<br />
$$\mathrm{DP}_M[y(n),\tau] := \mathrm{DP}_2\big[\mathrm{DP}_{M-1}[y(n),\tau],\,\tau\big]. \qquad (11)$$<br />
The coefficient $a_M$ is estimated based on the following formula:<br />
$$\hat{a}_M = \frac{1}{M!\,(\tau\Delta)^{M-1}}\,\arg\max_{\omega}\big|\mathrm{DPPT}_M[y(n),\omega,\tau]\big|, \qquad (12)$$<br />
where $\mathrm{DPPT}_M[y(n),\omega,\tau]$ is calculated as in Equation (8).<br />
The formulas for the DPPT of orders one to three are shown<br />
below:<br />
$$\mathrm{DPPT}_1[y(n),\omega,\tau] = \mathrm{fft}\{y(n)\}, \qquad (13)$$<br />
$$\mathrm{DPPT}_2[y(n),\omega,\tau] = \mathrm{fft}\{y(n)\,y^{*}(n-\tau)\}, \qquad (14)$$<br />
$$\mathrm{DPPT}_3[y(n),\omega,\tau] = \mathrm{fft}\{y(n)\,[y^{*}(n-\tau)]^{2}\,y(n-2\tau)\}. \qquad (15)$$<br />
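To make (10) and (14) concrete, here is a sketch (with $\Delta = 1$ and hypothetical coefficient values of our choosing) of estimating $a_2$ of a linear chirp from the peak of the order-2 DPPT:<br />

```python
import numpy as np

N, tau = 1024, 512                  # tau = N/M for a linear chirp (M = 2)
n = np.arange(N)
a1, a2 = 0.1, 2e-4                  # hypothetical phase coefficients, Delta = 1
y = np.exp(1j * (a1 * n + a2 * n**2))

# Eq. (10): DP2 turns the linear chirp into a pure tone at w = 2*a2*tau
dp2 = y[tau:] * np.conj(y[:-tau])

# Eq. (14): the order-2 DPPT is the FFT of DP2; zero-pad for a fine grid
nfft = 1 << 16
spec = np.abs(np.fft.fft(dp2, nfft))
k = int(np.argmax(spec[:nfft // 2]))
w0 = 2 * np.pi * k / nfft

# Eq. (12) with M = 2, Delta = 1: a2_hat = w0 / (M! * tau)
a2_hat = w0 / (2 * tau)
```

Applying the same DP2/FFT pair to a cubic-phase (parabolic) chirp would show no sharp peak; the order-3 operator of Eq. (15) is needed instead, which is the behavior illustrated in Figs. 4 and 5.<br />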
After the estimation of $a_M$, the order of the signal<br />
phase is reduced by multiplying the signal $y(n)$ by<br />
$\exp\{-j\hat{a}_M (n\Delta)^M\}$:<br />
$$y^{(M-1)}(n) = y(n)\,\exp\{-j\hat{a}_M (n\Delta)^M\}. \qquad (16)$$<br />
To determine $a_{M-1}$, the DPPT of order $M-1$ is applied to the<br />
signal $y^{(M-1)}(n)$ from Equation (16). The process is repeated<br />
until all the remaining coefficients are calculated. The coefficients<br />
$a_0$ and $b_0$ are determined by the following formulas:<br />
$$\hat{a}_0 = \mathrm{phase}\left\{\sum_{n=0}^{N-1} y(n)\,\exp\left(-j\sum_{m=1}^{M}\hat{a}_m (n\Delta)^m\right)\right\}, \qquad (17)$$<br />
$$\hat{b}_0 = \frac{1}{N}\left|\sum_{n=0}^{N-1} y(n)\,\exp\left(-j\sum_{m=1}^{M}\hat{a}_m (n\Delta)^m\right)\right|. \qquad (18)$$<br />
The final synthesized signal is<br />
$$\hat{y}(n) = \hat{b}_0\,\exp\left\{j\sum_{m=0}^{M}\hat{a}_m (n\Delta)^m\right\}. \qquad (19)$$<br />
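The full estimate-and-peel loop of (12)-(19) can be sketched for the linear-chirp case ($M = 2$, $\Delta = 1$; the function and all names are ours, and a dense zero-padded FFT stands in for the argmax):<br />

```python
import numpy as np

def dppt_estimate_linear(y, tau, nfft=1 << 16):
    """Estimate (a2, a1, a0, b0) of y(n) = b0*exp(j(a0 + a1*n + a2*n^2)),
    assuming Delta = 1. A sketch of the order-reduction loop around
    Eqs. (12)-(19); names are illustrative, not from the paper."""
    freqs = 2 * np.pi * np.arange(nfft) / nfft
    n = np.arange(y.size)

    # M = 2: the peak of the order-2 DPPT gives a2 (Eqs. (12), (14))
    dp2 = y[tau:] * np.conj(y[:-tau])
    w0 = freqs[np.argmax(np.abs(np.fft.fft(dp2, nfft))[:nfft // 2])]
    a2 = w0 / (2 * tau)

    # Reduce the order (Eq. (16)); M = 1 is then a plain FFT (Eq. (13))
    y1 = y * np.exp(-1j * a2 * n**2)
    a1 = freqs[np.argmax(np.abs(np.fft.fft(y1, nfft))[:nfft // 2])]

    # Eqs. (17)-(18): residual phase a0 and amplitude b0
    z = y * np.exp(-1j * (a1 * n + a2 * n**2))
    a0 = np.angle(np.sum(z))
    b0 = np.abs(np.sum(z)) / y.size
    return a2, a1, a0, b0
```

Resynthesizing $\hat{y}(n)$ via (19) from these estimates and correlating it with the reference chirp yields correlation coefficients of the kind reported in Section IV.<br />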
Figures 4 and 5 show the result of applying the second-order<br />
and third-order DPPT to a nonlinear (parabolic) chirp with a<br />
third-order polynomial phase. The spectral peak appears only<br />
for the third-order DPPT, corresponding to the third-order<br />
polynomial phase.<br />
[Fig. 4. Second Order DPPT. Axes: amplitude 0-25 vs. normalized frequency 0-0.5.]<br />
For the DPPT algorithm to work, the location of the spectral<br />
peak $\omega_0$ obtained when taking the Fourier transform of the DP<br />
operator has to be smaller than half the sampling frequency $\omega_s$:<br />
$$|\omega_0| = M!\,(\tau\Delta)^{M-1}\,|a_M| \le \frac{\omega_s}{2}. \qquad (20)$$<br />
[Fig. 5. Third Order DPPT. Axes: amplitude 0-60 vs. normalized frequency 0-0.5.]<br />
This condition is translated into the following requirement on<br />
the range of the coefficient $a_M$:<br />
$$|a_M| \le \frac{\pi}{M!\,\tau^{M-1}\,\Delta^{M}}. \qquad (21)$$<br />
When $M = 1$, this reduces to $|a_1| \le \omega_s/2$, which is the Nyquist<br />
criterion.<br />
The accuracy of the DPPT method depends on many factors<br />
such as the level of noise present, the type of noise, the length<br />
of the chirp signal and the chosen value of τ [10] [11]. The<br />
best signal estimation is achieved when τ = N/M, where<br />
N is the signal length and M is the order of the polynomial<br />
phase of the chirp (M=2 for linear chirp, M=3 for parabolic<br />
chirp).<br />
The SNR of the jamming signal at the spectral peak $\omega_0$<br />
should be at least 14 dB for good detection. The number<br />
of points used when taking the Fourier transform also affects<br />
the accuracy of the estimated spectral peak position.<br />
Increasing the order of the polynomial also affects the<br />
estimation error. For example, for a third-order<br />
polynomial, any error in the coefficient $a_3$ makes it<br />
impossible to remove this coefficient completely from<br />
the polynomial during the phase unwrapping step. The<br />
estimation of the next coefficient, $a_2$, therefore suffers and<br />
carries error as well. Similarly, the error in $a_2$ affects the<br />
precision of $a_1$ and $a_0$.<br />
The computational complexity of the DPPT is determined<br />
based on the number of multiplications needed to synthesize<br />
the chirp having a length of N. The DPPT process<br />
involves the calculation of the ambiguity function and then<br />
taking the Fourier transform of the ambiguity function. The<br />
computational complexity of the ambiguity function calculation<br />
is $O(N)$ and the complexity of the fast Fourier transform<br />
is $O(N \log_2 N)$. Therefore the total complexity is only<br />
$O(N \log_2 N)$.<br />
IV. NUMERICAL AND SIMULATION RESULTS<br />
We used 128 chips per data bit for spreading the message<br />
signal and we assumed a Gaussian channel. We considered<br />
a constant-amplitude linear and a parabolic chirp as the interference<br />
source. We first evaluated the bit error rate (BER)<br />
resulting from the presence of linear chirp at different jamming<br />
ratios between [0, 60] dB. We assumed SNR (with Gaussian<br />
noise) to be $-10$ dB in each case. As seen in Figure 6, the bit<br />
error rate increases as the jamming ratio increases. The spread<br />
spectrum system is able to recover the data bits at a low jamming<br />
ratio of 10 dB, but as the ratio increases the system fails to<br />
recover the correct data bits.<br />
[Fig. 6. BER vs. JSR results for a self-excised SS system. Axes: BER $10^{0}$ to $10^{-4}$ (log scale) vs. JSR 0-60 dB.]<br />
Next we used the proposed DPPT technique to synthesize<br />
the jamming chirp in the previous spread spectrum system.<br />
The chirp signal used was a linear chirp with normalized<br />
instantaneous frequency varying from 0 to 0.5 Hz. We also<br />
used 128 chips per bit and 100 data bits, for a total of 12800<br />
chips. Table I shows the correlation coefficient between the<br />
reference linear chirp and the synthesized chirp using second<br />
order DPPT. The first simulation used a $-0.5$ dB signal-to-<br />
noise ratio (SNR) with the signal-to-jamming ratio (SJR) ranging<br />
over [6, $-20$] dB, and the second simulation used a $-5$ dB SNR<br />
over the same SJR range.<br />
The DPPT showed good results, since it was able to detect the<br />
chirp even at low SJR (a correlation coefficient of 0.9879 at a<br />
$-2$ dB signal-to-jamming ratio).<br />
In the next simulation we used a parabolic chirp signal as<br />
the interference source. Figure 7 shows the spectrogram of the<br />
received signal $r(t)$ with interference and noise added.<br />
Table II shows the correlation coefficient between the reference<br />
chirp and the synthesized chirp; a third-order DPPT<br />
was applied to the signal. The first simulation used a $-0.5$<br />
dB signal-to-noise ratio with the signal-to-jamming ratio changing<br />
in the range [6, $-20$] dB, and the second simulation used a $-5$ dB<br />
signal-to-noise ratio with the jamming ratio in the same range.<br />
TABLE I. RESULTS WITH LINEAR CHIRP<br />
SJR (dB) | Corr-Coeff (SNR = -0.5 dB) | Corr-Coeff (SNR = -5 dB)<br />
6 | 0.7916 | 0.0469<br />
4 | 0.9878 | 0.9480<br />
2 | 0.9879 | 0.9874<br />
0 | 0.9878 | 0.9879<br />
-2 | 0.9879 | 0.9879<br />
-4 | 0.9881 | 0.9880<br />
-6 | 0.9880 | 0.9880<br />
-8 | 0.9880 | 0.9880<br />
-10 | 0.9880 | 0.9880<br />
-15 | 0.9880 | 0.9880<br />
-20 | 0.9880 | 0.9880<br />
[Fig. 7. Received signal with chirp interference and Gaussian noise. Axes: normalized frequency 0-1 vs. time 0-6000.]<br />
In these simulations the DPPT performed better in the<br />
parabolic chirp case than in the linear chirp case because the<br />
frequency variation of the parabolic chirp ([0.6, 0.9] Hz) was<br />
smaller than that of the linear chirp ([0, 0.5] Hz). The<br />
experimental results show that the proposed technique can be<br />
successfully used for the detection of chirp-like interference in<br />
spread spectrum systems.<br />
TABLE II. RESULTS WITH PARABOLIC CHIRP<br />
SJR (dB) | Corr-Coeff (SNR = -0.5 dB) | Corr-Coeff (SNR = -5 dB)<br />
6 | 0.0943 | 0.0482<br />
4 | 0.1117 | 0.0895<br />
2 | 0.9969 | 0.1672<br />
0 | 0.9981 | 0.9670<br />
-2 | 0.9987 | 0.9979<br />
-4 | 0.9991 | 0.9986<br />
-6 | 0.9994 | 0.9993<br />
-8 | 0.9995 | 0.9993<br />
-10 | 0.9997 | 0.9994<br />
-15 | 0.9997 | 0.9994<br />
-20 | 0.9997 | 0.9996<br />
Figure 8 shows the TF representation of the synthesized<br />
parabolic chirp (cf. Fig. 3) under a 6 dB signal-to-jamming ratio<br />
(low correlation coefficient = 0.0015).<br />
[Fig. 8. Detected parabolic chirp with 6 dB SJR. Axes: normalized frequency 0-1 vs. time 0-6000.]<br />
Previous work [5] on chirp detection using the Hough-Radon<br />
Transform (HRT) showed good performance but was computationally<br />
expensive. In that work, the received signal was decomposed<br />
into its TF functions using an adaptive signal decomposition<br />
algorithm; the TF functions were mapped onto the TF plane,<br />
treated as an image, and the chirps present in the TF plane<br />
were detected using the HRT. The HRT is an optimal technique<br />
for detecting lines in an image, but it requires a high degree of<br />
computational complexity.<br />
The proposed technique outperforms previous TF distribution<br />
techniques, which provide a distribution of the signal spectrum<br />
over a period of time but, unlike the DPPT, do not inherently<br />
provide chirp parameters. In addition, TF distribution functions<br />
always suffer from a tradeoff between resolution and<br />
interference (cross) terms, which can result in incorrect synthesis<br />
and detection of the interfering signal.<br />
V. CONCLUSION<br />
A new technique is introduced for modulated interference<br />
detection in spread spectrum systems. The simulation results<br />
show that the new method provides accurate detection and<br />
estimation of linear and parabolic chirp interference. This<br />
technique does not suffer from the limitations of the previous<br />
techniques: it detects the chirp signals even at low jamming<br />
ratios and has low computational complexity. The method can<br />
be extended to include the excision of the detected interference,<br />
which will be addressed in future work.<br />
REFERENCES<br />
[1] G. Wang and X.-G. Xia, "An adaptive filtering approach to chirp estimation and ISAR imaging of maneuvering targets," IEEE 2000 International Radar Conference, pp. 481-486, May 2000.<br />
[2] J. S. Dhanoa, E. J. Hughes, and R. F. Ormondroyd, "Simultaneous detection and parameter estimation of multiple linear chirps," Proc. IEEE Int. Conf. Acoustics, Speech, and Signal Processing (ICASSP), vol. 6, pp. VI-129-32, Apr. 2003.<br />
[3] M. Morvidone and B. Torresani, "Time scale approach for chirp detection," International Journal of Wavelets, Multiresolution and Information Processing, vol. 1, no. 1, pp. 19-49, 2003.<br />
[4] A. Ramalingam and S. Krishnan, "A novel robust image watermarking using a chirp based technique," Proc. Canadian Conference on Electrical and Computer Engineering, pp. 1889-1892, May 2004.<br />
[5] S. Erkucuk, S. Krishnan, and M. Zeytinoglu, "Robust audio watermarking using a chirp based technique," Proc. International Conference on Multimedia and Expo (ICME '03), pp. II-513-16, July 2003.<br />
[6] G. F. Boudreaux-Bartels and T. Parks, "Time-varying filtering and signal estimation using Wigner distribution synthesis techniques," IEEE Signal Processing Magazine, vol. 9, no. 2, pp. 21-67, Apr. 1992.<br />
[7] W. Krattenthaler and F. Hlawatsch, "Bilinear signal synthesis," IEEE Transactions on Signal Processing, vol. 40, no. 2, pp. 352-363, Feb. 1992.<br />
[8] W. Krattenthaler and F. Hlawatsch, "General signal synthesis algorithms for smoothed versions of the Wigner distribution," Proc. IEEE Int. Conf. Acoustics, Speech, and Signal Processing (ICASSP), no. 3, pp. 1611-1614, Apr. 1990.<br />
[9] A. Francos and M. Porat, "Analysis and synthesis of multicomponent signals using positive time-frequency distributions," IEEE Transactions on Signal Processing, vol. 47, no. 2, pp. 493-504, Feb. 1999.<br />
[10] S. Peleg and B. Friedlander, "Multicomponent signal analysis using the polynomial-phase transform," IEEE Transactions on Aerospace and Electronic Systems, vol. 32, no. 1, pp. 378-387, Jan. 1996.<br />
[11] S. Peleg and B. Friedlander, "The discrete polynomial-phase transform," IEEE Transactions on Signal Processing, vol. 43, no. 8, pp. 1901-1914, Aug. 1995.<br />
[12] M. G. Amin, "Interference mitigation in spread spectrum communication systems using time-frequency distributions," IEEE Transactions on Signal Processing, vol. 45, no. 1, pp. 90-101, Jan. 1997.<br />
[13] J. D. Laster and J. H. Reed, "Interference rejection in digital wireless communication," IEEE Signal Processing Magazine, pp. 37-62, May 1997.<br />
[14] L. Lee, "Time-frequency signal synthesis and its application in multimedia watermark detection," Master's thesis, Ryerson University, 2005.<br />
Emotion Recognition Using Novel Speech Signal<br />
Features<br />
Talieh Seyed Tabatabaei, Sridhar Krishnan<br />
Department of Electrical and Computer Engineering<br />
Ryerson University<br />
Toronto, Canada<br />
{tseyedta, krishnan}@ee.ryerson.ca<br />
Abstract—Automatic Emotion Recognition (AER) is a very<br />
recent research topic in the Human-Computer Interaction (HCI)<br />
field which still has much room to grow. In this contribution a<br />
set of novel acoustic features and Least Square-Support Vector<br />
Machines (LS-SVMs) are proposed to set up a speaker-<br />
independent Automatic Human Emotion Recognition system.<br />
Six discrete emotional states are classified throughout this work:<br />
happiness, sadness, anger, surprise, fear, and disgust. Different<br />
multi-class SVM methods are implemented in order to get the<br />
best result. The result achieved by the LS-SVM is then compared<br />
with that of a linear classifier. We achieved an overall accuracy of<br />
81.3%.<br />
I. INTRODUCTION<br />
Research on emotion has been conducted in the fields of<br />
psychology and physiology for a long time. More recently it has<br />
become a subject of interest to engineers. Its most important<br />
application is in intelligent human-machine interaction. As<br />
computers have become an integral part of our lives, the need<br />
for a more natural communication interface between humans<br />
and machines has arisen. To accomplish this goal, a computer<br />
would have to be able to perceive its present situation and<br />
respond in a different manner depending on its perception. To<br />
make Human-Computer Interaction (HCI) more natural and<br />
friendly, it would be beneficial to give computers the ability to<br />
recognize situations the same way a human does.<br />
In today's HCI systems, machines can recognize the<br />
speaker and also the content of the speech, using speech<br />
recognition and speaker identification techniques. If machines<br />
are equipped with emotion recognition techniques, they can<br />
also know "how it is said" and react more appropriately,<br />
making the interaction more natural. Other potential applications<br />
of Automatic Emotion Recognition (AER) include psychiatric<br />
diagnosis, intelligent toys, lie detection, learning environment,<br />
customer service, educational software, and detection of the<br />
emotional state in telephone call center conversations to<br />
provide feedback to an operator or a supervisor for monitoring<br />
purposes.<br />
1-4244-0921-7/07 $25.00 © 2007 IEEE.<br />
Aziz Guergachi<br />
Department of Information Technology Management<br />
Ryerson University<br />
Toronto, Canada<br />
a2guerga@ryerson.ca<br />
One of the most important human communication<br />
channels is the auditory channel, which carries speech and vocal<br />
intonation. In fact people can perceive each other’s emotional<br />
state by the way they talk. Therefore in this work we are<br />
analyzing the speech signal in order to set up an automatic<br />
system to recognize the human emotional state. Different<br />
researchers have settled on different numbers and kinds of<br />
emotional states, such as 3 categories of positive (joy),<br />
negative (anger, irritation), and neutral in [7]; 4 categories of<br />
neutral, anger, Lombard, and loud in [9]; and 5 categories of<br />
neutral, happiness, sadness, anger, and fear in [8]. In this work<br />
we automatically categorize six different human<br />
emotional states: anger, happiness, fear, surprise, sadness,<br />
and disgust.<br />
Some researchers have developed speaker-dependent<br />
speech emotion recognition systems [7, 12]. We think that<br />
speaker independency is one of the intrinsic characteristics of<br />
an AER system. When a system is person-dependent the<br />
accuracy increases but on the other hand, for each new person<br />
we have to train our system all over again and that is a major<br />
drawback. Here we therefore try to reach a satisfying<br />
accuracy with a person-independent system by choosing the right<br />
acoustic features and a powerful classifier. While some<br />
researchers have utilized both acoustic characteristics and<br />
textual content of an emotional spoken utterance [10, 12], we<br />
are conducting our work using commonly used and newly<br />
proposed acoustic features of the speech signal only.<br />
Various classifiers have been considered for categorizing<br />
the emotional states. The most common are Hidden Markov<br />
Models (HMMs) [13] and Neural Networks (NNs) [15], whereas<br />
works that use Support Vector Machines (SVMs) are<br />
relatively few [12]. The SVM is a relatively new approach in<br />
the field of machine learning and has a large number of<br />
advantages over conventional and popular classifiers such as<br />
NNs. In this contribution we use Least Squares Support<br />
Vector Machines (LS-SVMs), which are reformulations of<br />
the original SVM.<br />
The paper is organized as follows: Section II explains the<br />
emotion database used in this research. Section III<br />
demonstrates the structure of the AER system proposed in this<br />
work and the corresponding steps. Section IV briefly discusses<br />
the theory of SVMs. Section V presents the experimental<br />
results, and Section VI concludes the paper.<br />
II. THE EMOTION DATABASE<br />
The database used in this research is the one created in<br />
[16]. We believe that the results obtained in different emotion<br />
recognition experiments are strongly related to the database<br />
used. The lack of a common, good-quality database therefore<br />
makes it hard for researchers to compare the performance of<br />
their proposed systems.<br />
This audio-visual emotion database presented in [16] is a<br />
professional reference database for testing and evaluating<br />
video, audio or joint audio-visual emotion recognition<br />
algorithms. The final version of the database contains 42<br />
subjects, coming from 14 different nationalities. Among the<br />
42 subjects, 81% are men, while the remaining 19% are<br />
women. First, each subject is asked to listen carefully to a<br />
short story for each of the six emotions (happiness, sadness,<br />
surprise, disgust, fear, and anger) and to immerse themselves<br />
in the situation. Once the subject is ready, he or she may<br />
read, memorize, and pronounce the five proposed utterances<br />
(one at a time), which constitute five different reactions to<br />
the given situation. The subjects are asked to speak with as much<br />
expressiveness as possible, producing a message that contains<br />
only the emotion to be elicited. All the subjects speak<br />
English, although they may have different accents. All the<br />
utterances were approved as genuine by two experts.<br />
III. SPEECH EMOTION RECOGNITION SYSTEM<br />
The structure of the speech emotion recognition system<br />
used in this paper is depicted in Fig. 1.<br />
A. Preprocessing<br />
In the preprocessing stage, each signal is first de-noised by<br />
soft-thresholding its wavelet coefficients. Since the silent<br />
parts of the signals do not carry any useful information, those<br />
parts, including the leading and trailing edges, are eliminated by<br />
thresholding the energy of the signal. The signals are then divided<br />
into frames using a Hamming window of length 23 ms.<br />
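A sketch of the framing step (the paper gives the 23 ms window length but not the sampling rate or hop size, so 16 kHz and non-overlapping frames are assumed here):<br />

```python
import numpy as np

def frame_signal(x, fs, frame_ms=23.0):
    """Split a signal into Hamming-windowed frames of `frame_ms` ms
    (non-overlapping frames assumed; the paper does not state a hop)."""
    flen = int(round(fs * frame_ms / 1000.0))
    n_frames = len(x) // flen
    win = np.hamming(flen)
    return x[:n_frames * flen].reshape(n_frames, flen) * win

frames = frame_signal(np.ones(16000), fs=16000)   # 1 s of dummy audio
```

At 16 kHz a 23 ms window is 368 samples, so one second of audio yields 43 full frames.<br />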
B. Feature Extraction<br />
We are proposing a set of novel acoustic features in this<br />
experiment. Most researchers use prosodic features and their<br />
statistical characteristics to classify the emotions [8, 11, 13,<br />
14, 15]. In this contribution we use the set of features<br />
listed in Table I. Among these features, only the Mel Frequency<br />
Cepstral Coefficients (MFCC) and the Zero Crossing Rate<br />
(ZCR) have been used for speech emotion recognition in the<br />
past [9, 10, 11]; the rest are used for the first time<br />
in this application. All the features are extracted from each<br />
frame, and the mean and standard deviation of each<br />
feature are then used to constitute the feature vector.<br />
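Collapsing the frame-level features into one utterance-level vector via their mean and standard deviation is simply (the feature values below are random placeholders, not real data):<br />

```python
import numpy as np

# Rows: frames; columns: per-frame features (e.g. ZCR, spectral
# centroid, ...). The numbers here are random placeholders.
frame_feats = np.random.default_rng(2).random((43, 6))

# Utterance-level vector: mean and std of every frame-level feature
feature_vector = np.concatenate([frame_feats.mean(axis=0),
                                 frame_feats.std(axis=0)])
```

With 6 frame-level features this yields a fixed-length 12-dimensional vector per utterance, regardless of the utterance's duration.<br />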
C. Feature Selection<br />
The performance of a pattern recognition system depends<br />
strongly on the discriminative ability of its features. By selecting<br />
the most relevant subset of the original feature set, we can<br />
increase the performance of the classifier and, at the same time,<br />
decrease the computational complexity.<br />
[Figure 1. The structure of the speech emotion recognition system: the speech signal passes through preprocessing, feature extraction, and per-classifier feature selection into a bank of binary classifiers that produce the final result.]<br />
We use the forward selection method for each single binary classifier in<br />
our system in order to select the most efficient subset of<br />
features. At each step, the variable that increases the<br />
performance of the classifier the most is added to the feature<br />
subset. Fig. 3 illustrates the concept.<br />
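Greedy forward selection as described above can be sketched as follows; `score` stands in for whatever per-classifier performance measure is used (the naming is ours, not the paper's):<br />

```python
def forward_select(score, n_features, n_keep):
    """Greedy forward selection: at each step add the feature that
    most improves score(subset); stop when no feature helps."""
    selected, remaining = [], list(range(n_features))
    while remaining and len(selected) < n_keep:
        best = max(remaining, key=lambda f: score(selected + [f]))
        if score(selected + [best]) <= score(selected):
            break                      # no remaining feature improves the score
        selected.append(best)
        remaining.remove(best)
    return selected
```

In the paper's setting `score` would be the cross-validated accuracy of one binary LS-SVM, so each of the n binary classifiers ends up with its own feature subset.<br />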
D. Classification<br />
The recognition of human emotion is essentially a pattern<br />
recognition problem. We use the LS-SVM (described in the next<br />
section) as the classifier in this research. Since we are dealing<br />
with a multi-class classification problem, we need a method to<br />
extend the two-class support vector classification<br />
methodology to a multi-class problem. Different approaches to<br />
multi-category SVM are mentioned in the literature,<br />
among which one-against-all and one-against-one (pairwise)<br />
are the most popular. In this paper we compare the<br />
results achieved by one-against-all, fuzzy one-against-all,<br />
pairwise, and fuzzy pairwise classification [17].<br />
For the purpose of comparative study we also apply<br />
a linear classifier with a gradient descent optimization<br />
algorithm.<br />
IV. SUPPORT VECTOR MACHINES<br />
The SVM was first introduced by Vapnik and co-workers, and<br />
it is such a powerful method that, in the few years since its<br />
introduction, it has outperformed most other systems in a wide<br />
variety of applications. The SVM is used for both<br />
regression and classification; however, it is mostly used as a<br />
binary classifier. The SVM is based on the principle of structural<br />
risk minimization: the optimal boundary is found in such a<br />
way that maximizes the margin between two classes of data-<br />
TABLE I. LIST OF ACOUSTIC FEATURES USED FOR SPEECH EMOTION<br />
RECOGNITION<br />
Spectral Features<br />
• Shannon entropy<br />
• Renyi entropy<br />
• Spectral bandwidth<br />
• Spectral centroid<br />
• Spectral flux<br />
• Spectral roll-off<br />
frequency<br />
Authorized licensed use limited to: <strong>Ryerson</strong> <strong>University</strong> Library. Downloaded on July 7, 2009 at 11:29 from IEEE Xplore. Restrictions apply.<br />
346<br />
21<br />
Cepstral Features<br />
• MFCC<br />
Time-domain Features<br />
• Zero crossing rate<br />
[Figure 1. Block diagram of the proposed system: preprocessing, feature<br />
extraction (audio features), feature selection, and a bank of binary<br />
classifiers (Classifier 1, Classifier 2, …, Classifier n).]<br />
points [1, 2, 3] (Fig. 2). SVM relies on kernel functions,<br />
which map data points into a higher-dimensional feature<br />
space in which they become linearly separable. The resulting<br />
optimization problem is a dual optimization problem, solved<br />
by the Lagrangian method together with the Karush-Kuhn-Tucker<br />
(KKT) conditions. Equation (1) shows the optimization<br />
problem for SVM classifiers:<br />
minimize over w, b:   (1/2)‖w‖² + C ∑_{i=1}^{n} ξᵢ    (1)<br />
subject to<br />
yᵢ(⟨w · xᵢ⟩ + b) ≥ 1 − ξᵢ,   ξᵢ ≥ 0,   i = 1, …, n<br />
where C is a regularization parameter that trades off the<br />
empirical risk (reflected by the second term in (1)) against<br />
model complexity (reflected by the first term in (1)), and the ξᵢ<br />
are slack variables introduced to relax the constraints and<br />
make the system more noise-tolerant.<br />
The corresponding dual representation is:<br />
W(α) = ∑_{i=1}^{n} αᵢ − (1/2) ∑_{i,j=1}^{n} yᵢ yⱼ αᵢ αⱼ K(xᵢ, xⱼ)    (2)<br />
subject to the constraints<br />
∑_{i=1}^{n} αᵢ yᵢ = 0,   αᵢ ≥ 0,   i = 1, …, n<br />
where the αᵢ ≥ 0 are the Lagrange multipliers and K(xᵢ, xⱼ) is<br />
the kernel function. Note that we do not need to know the<br />
underlying mapping function explicitly; it is only necessary to<br />
define the kernel function. Among the many possibilities, the<br />
most common kernels are the polynomial, Gaussian radial<br />
basis function (RBF), and multi-layer perceptron (MLP)<br />
kernels.<br />
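For concreteness, the Gaussian RBF kernel (the one used for the binary classifiers in Section V) can be evaluated on whole data matrices at once; σ here is an illustrative bandwidth parameter:

```python
import numpy as np

def rbf_kernel(X, Y, sigma=1.0):
    """Gaussian RBF kernel K(x, y) = exp(-||x - y||^2 / (2 sigma^2)),
    computed for all pairs of rows of X and Y at once."""
    sq = (np.sum(X**2, axis=1)[:, None]
          + np.sum(Y**2, axis=1)[None, :]
          - 2.0 * X @ Y.T)
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma**2))
```

The resulting kernel matrix is symmetric with a unit diagonal, as required of a similarity measure between data points and themselves.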
Our final decision rule can be expressed as<br />
f(x, α*, b) = ∑_{i=1}^{N_sv} yᵢ αᵢ* K(xᵢ, x) + b    (3)<br />
where N_sv and αᵢ* denote the number of support vectors and<br />
the non-zero Lagrange multipliers corresponding to the<br />
support vectors, respectively. This result reveals the important<br />
fact that only the support vectors contribute to the final<br />
boundary. This is one way to mitigate the curse of<br />
dimensionality, which afflicts most classifiers: the dimension<br />
of the input space can be as high as needed without the<br />
number of free parameters growing in a way that usually<br />
leads to overfitting.<br />
In this paper, we train the SVMs using the LS-SVM (Least<br />
Squares Support Vector Machine) MATLAB toolbox. LS-SVMs<br />
are reformulations of the original SVMs that lead to<br />
solving linear KKT systems [6]. In LS-SVMs, the inequality<br />
constraints of the SVM are replaced with equality constraints.<br />
As a result, the solution follows from solving a set of linear<br />
equations instead of the quadratic programming problem of<br />
Vapnik's original formulation, which yields a faster algorithm.<br />
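The linear-system character of LS-SVM training can be illustrated with a minimal sketch of the dual system of [6] (our own toy code, not the toolbox's API): with Ω_ij = y_i y_j K(x_i, x_j), the classifier follows from a single call to a linear solver.

```python
import numpy as np

def lssvm_train(X, y, gamma=10.0, sigma=1.0):
    """Train a binary LS-SVM with an RBF kernel by solving one linear
    KKT system (no quadratic programming):
        [ 0    y^T            ] [b]       [0]
        [ y    Omega + I/gamma] [alpha] = [1]
    where Omega_ij = y_i y_j K(x_i, x_j)."""
    n = len(y)
    sq = np.sum(X**2, 1)[:, None] + np.sum(X**2, 1)[None, :] - 2 * X @ X.T
    K = np.exp(-np.maximum(sq, 0) / (2 * sigma**2))
    A = np.zeros((n + 1, n + 1))
    A[0, 1:], A[1:, 0] = y, y
    A[1:, 1:] = np.outer(y, y) * K + np.eye(n) / gamma
    sol = np.linalg.solve(A, np.r_[0.0, np.ones(n)])
    b, alpha = sol[0], sol[1:]

    def predict(Z):
        sqz = np.sum(Z**2, 1)[:, None] + np.sum(X**2, 1)[None, :] - 2 * Z @ X.T
        Kz = np.exp(-np.maximum(sqz, 0) / (2 * sigma**2))
        return np.sign(Kz @ (alpha * y) + b)
    return predict
```

On well-separated toy data this recovers the training labels exactly; the hyperparameters γ and σ play the roles of C and the kernel width in the standard SVM.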
The primal problem of the LS-SVM is defined as:<br />
min_{w,b,e} J_p(w, b, e) = (1/2)‖w‖² + γ (1/2) ∑_{i=1}^{d} eᵢ²    (4)<br />
subject to<br />
yᵢ[wᵀφ(xᵢ) + b] = 1 − eᵢ,   i = 1, …, d<br />
Figure 2. A linear SVM classifier. Support vectors are those<br />
elements of the training set which lie on the boundary hyperplanes of the two<br />
classes.<br />
where γ is a parameter analogous to SVM’s regularization<br />
parameter (C).<br />
The main characteristic of LS-SVMs is their lower<br />
computational complexity compared to SVMs, with no loss of<br />
quality in the solution.<br />
V. EXPERIMENTAL RESULTS<br />
Our database consists of 1260 utterance instances, 60% of<br />
which were used exclusively for the training phase and the<br />
remaining 40% for evaluating the trained classifiers (the<br />
division was done at random). All the binary LS-SVM<br />
classifiers were trained using the RBF kernel function with<br />
different regularization and kernel parameters. The linear<br />
classifiers were trained using the gradient descent algorithm with the perceptron<br />
Figure 3. The performance of one of the binary LS-SVMs as a new<br />
feature is added at each iteration of the Forward Selection algorithm<br />
criterion function. The confusion matrix and the final<br />
recognition results are presented in Table II and Table III<br />
respectively. The abbreviations in Table II stand for the six<br />
different emotions: anger, fear, disgust, happiness, sadness,<br />
and surprise, and FS in Table III means Feature Selection.<br />
As shown in Table III, the best performance (81.3%)<br />
belongs to the fuzzy-pairwise LS-SVM using the features<br />
selected by the Forward Selection algorithm. Table II shows<br />
that the most difficult emotion to recognize in our experiment<br />
is surprise and the easiest are sadness and happiness, while<br />
fear and sadness have the highest probability of being<br />
confused with each other.<br />
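The Forward Selection wrapper referred to above can be sketched generically as follows; in practice `score` would be a classifier's validation accuracy on a candidate feature subset (the additive toy score in the test is purely illustrative):

```python
def forward_select(score, n_features, k):
    """Greedy forward selection: start from an empty feature set and,
    at each iteration, add the single feature that maximizes `score`
    of the enlarged subset (cf. Fig. 3)."""
    chosen = []
    for _ in range(k):
        candidates = [f for f in range(n_features) if f not in chosen]
        best = max(candidates, key=lambda f: score(chosen + [f]))
        chosen.append(best)
    return chosen
```

Each iteration re-evaluates the classifier once per remaining feature, so the curve of accuracy versus iteration (as in Fig. 3) falls out of the loop directly.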
VI. CONCLUSION<br />
In this contribution, we introduced a set of new acoustic<br />
features used for the first time in the application of AER. For<br />
classification we used the LS-SVM, a recent and powerful<br />
classifier with many advantages over conventional and<br />
popular classifiers such as neural networks. We also<br />
implemented different schemes to adapt our binary classifiers<br />
to a multi-category problem, and compared the performance<br />
of a linear classifier with that of the LS-SVM. We achieved an<br />
overall classification accuracy of 81.3% with the fuzzy-pairwise<br />
LS-SVM.<br />
TABLE II. CONFUSION MATRIX OF THE LS-SVM CLASSIFIER<br />
(FUZZY PAIRWISE WITH FEATURE SELECTION)<br />
Recognized Emotions (%)<br />
Ang Fea Dis Hap Sad Sur<br />
Ang 83.3 0 2.7 6.4 2.7 4.6<br />
Fea 1.8 71.9 7.4 1.8 13 3.7<br />
Dis 4.6 5.5 79.6 0 3.7 6.4<br />
Hap 1.8 1.8 0 92.4 1.8 1.8<br />
Sad 0 6.1 0.9 0 90.5 2.3<br />
Sur 11.1 9.2 5.5 4.6 13.8 55.5<br />
TABLE III. FINAL RECOGNITION RESULTS<br />
Recognition Rate<br />
One-Vs-All SVM 44.9%<br />
fuzzy One-Vs-All SVM 53.6%<br />
Pairwise SVM 74.5%<br />
fuzzy pairwise SVM 78.4%<br />
fuzzy pairwise SVM, FS 81.3%<br />
fuzzy pairwise LDA 37.7%<br />
REFERENCES<br />
[1] N. Cristianini and J. Shawe-Taylor, An Introduction to Support Vector Machines and Other Kernel-based Methods. Cambridge, UK: Cambridge University Press, 2000.<br />
[2] C. J. C. Burges, "A tutorial on support vector machines for pattern recognition," Data Mining and Knowledge Discovery, vol. 2, pp. 121-167, June 1998.<br />
[3] I. El Naqa, Y. Yang, M. N. Wernick, N. P. Galatsanos, and R. M. Nishikawa, "A support vector machine approach for detection of microcalcifications," IEEE Trans. Med. Imag., vol. 21, no. 12, December 2002.<br />
[4] P.-H. Chen, C.-J. Lin, and B. Schölkopf, "A tutorial on ν-support vector machines," unpublished.<br />
[5] B. Schölkopf and A. J. Smola, Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. Cambridge, MA: MIT Press, 2002.<br />
[6] J. A. K. Suykens, T. Van Gestel, J. De Brabanter, B. De Moor, and J. Vandewalle, Least Squares Support Vector Machines. Singapore: World Scientific Publishing Co. Pte. Ltd., 2002.<br />
[7] S. Hoch, F. Althoff, G. McGlaun, and G. Rigoll, "Bimodal fusion of emotional data in an automotive environment," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 2, pp. 1085-1088, 18-23 March 2005.<br />
[8] C. A. Martinez and A. B. Cruz, "Emotion recognition in non-structured utterances for human-robot interaction," IEEE International Workshop on Robot and Human Interactive Communication, pp. 19-23, 13-15 Aug. 2005.<br />
[9] T. Nguyen and I. Bass, "Investigation of combining SVM and decision tree for emotion classification," Seventh IEEE International Symposium on Multimedia, pp. 540-544, 2005.<br />
[10] Z.-J. Chuang and C.-H. Wu, "Emotion recognition using acoustic features and textual content," IEEE International Conference on Multimedia and Expo, vol. 1, pp. 53-56, June 2004.<br />
[11] Y.-L. Lin and G. Wei, "Speech emotion recognition based on HMM and SVM," Proceedings of the International Conference on Machine Learning and Cybernetics, vol. 8, pp. 4898-4901, 18-21 Aug. 2005.<br />
[12] B. Schuller, G. Rigoll, and M. Lang, "Speech emotion recognition combining acoustic features and linguistic information in a hybrid support vector machine-belief network architecture," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 1, pp. I-577-580, 17-21 May 2004.<br />
[13] B. Schuller, G. Rigoll, and M. Lang, "Hidden Markov model-based speech emotion recognition," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, pp. I-401-404, 6-10 April 2003.<br />
[14] J. Nicholson, K. Takahashi, and R. Nakatsu, "Emotion recognition in speech using neural networks," Proceedings of the 6th International Conference on Neural Information Processing, vol. 2, pp. 495-501, 1999.<br />
[15] V. A. Petrushin, "Creating emotion recognition agents for speech signal," unpublished.<br />
[16] O. Martin, I. Kotsia, B. Macq, and I. Pitas, "The eNTERFACE'05 audio-visual emotion database," Proceedings of the 22nd International Conference on Data Engineering Workshops, 3-7 April 2006.<br />
[17] D. Tsujinishi, Y. Koshiba, and S. Abe, "Why pairwise is better than one-against-all or all-at-once," Proceedings of the IEEE International Joint Conference on Neural Networks, vol. 1, pp. 693-698, July 2004.<br />
A WATERMARKING METHOD FOR SPEECH SIGNALS BASED ON THE TIME–WARPING<br />
SIGNAL PROCESSING CONCEPT<br />
Cornel Ioana (1) , Arnaud Jarrot (2) , André Quinquis (2) , Sridhar Krishnan (3)<br />
(1) LIS Laboratory<br />
BP 46, 961 rue de la Houille Blanche<br />
38402 Saint-Martin d’Hères cedex, FRANCE<br />
phone: +33(0) 476 826 422<br />
email: cornel.ioana@lis.inpg.fr<br />
ABSTRACT<br />
This paper deals with the watermarking of audio speech signals,<br />
which consists in introducing an imperceptible mark into a<br />
signal. To this end, we suggest using an amplitude-modulated<br />
signal that mimics a formantic structure present in the signal.<br />
This exploits the time-masking effect that occurs when two<br />
signals are close in the time–frequency plane. From this<br />
embedding scheme, a watermark extraction method based on<br />
nonstationary linear filtering and matched-filter detection is<br />
proposed in order to recover the information carried by the<br />
watermark. Numerical results on a real speech signal show<br />
that the watermark is likely not audible and that the information<br />
carried by the watermark is easily retrievable.<br />
Index Terms— Watermarking, Time–warping signal processing,<br />
Time–frequency analysis.<br />
1. INTRODUCTION<br />
Today’s digital media have opened the door to an information<br />
era where the true value of a product is generally dissociated<br />
from any physical medium. While this enables a high degree<br />
of flexibility in distribution, the commerce of data without<br />
any physical medium raises serious copyright issues: data can<br />
be easily duplicated, turning piracy into a simple data copy<br />
process.<br />
In order to secure the identity of the owner of a medium, one<br />
solution consists in hiding digital subcodes inside the data,<br />
since no physical medium can be used for this purpose. This<br />
problem is generally referred to as watermarking [1]. The main<br />
rules in the watermarking context are:<br />
• The watermark should not be discernible from the<br />
media, in order to preserve the integrity of the media.<br />
• The watermark should be easily retrievable: given the<br />
appropriate a priori information, the inserted watermark<br />
should be recoverable, as well as the digital subcodes it<br />
carries.<br />
(2) E 3 I 2 Laboratory (EA 3876) – ENSIETA,<br />
2 Rue François Verny, 29806, Brest,<br />
FRANCE<br />
phone: +33(0) 298 348 720<br />
emails: [jarrotar, quinquis]@ensieta.fr<br />
(3) Department of Electrical Engineering –<br />
Ryerson University<br />
350 Victoria Street, Toronto, CANADA<br />
phone: 416.979.5000 x6086<br />
email: krishnan@ee.ryerson.ca<br />
• The watermark should be robust to attacks (e.g., compression<br />
or noise insertion), since these phenomena<br />
often occur in media transmissions.<br />
In this paper we propose a watermarking procedure that<br />
exploits the time–frequency region available between two<br />
formants. We suggest using, as the watermark, an amplitude-modulated<br />
signal whose carrier frequency is modulated<br />
according to the modulation law of a formant. In this way,<br />
the time–frequency content of the watermark follows the<br />
time–frequency content of the formant, which allows the<br />
watermark signal to be placed very close to the formant. As<br />
will be seen, this embedding strategy makes the watermark<br />
largely imperceptible from an acoustical point of view. The<br />
recovery of the watermark is ensured by a nonstationary linear<br />
filtering and matched filtering method. Numerical results show<br />
that the watermark can be easily recovered, as well as the<br />
coded sequence it carries.<br />
The paper is organized as follows. Section 2 is devoted<br />
to a short presentation of the time–warping signal processing<br />
concept. Based on this concept, a new watermarking procedure<br />
is proposed in Section 3. Numerical results presented<br />
in Section 4 illustrate the benefits of the proposed technique.<br />
Concluding remarks are given in Section 5.<br />
2. TIME–WARPING SIGNAL PROCESSING<br />
CONCEPT<br />
2.1. Non-unitary Time–Warping Operators<br />
Let x(t) ∈ L²(ℝ) be a square-integrable signal. The set of<br />
unitary time–warping operators {W : w(t) ∈ C¹, ẇ(t) ≥ 0,<br />
x(t) → (Wx)(t)} is defined in [2] by<br />
(Wx)(t) = |ẇ(t)|^{1/2} x(w(t)),    (1)<br />
where ẇ(t) stands for the derivative of the warping function<br />
w(t) with respect to t. Properties of this transformation<br />
include linearity and unitary equivalence, since the envelope<br />
|ẇ|^{1/2} preserves the energy of the signal at the output of W.<br />
1-4244-0728-1/07/$20.00 ©2007 IEEE<br />
ICASSP 2007<br />
In what follows, we deal with a modified version of time–warping<br />
operators that no longer fulfills the unitary equivalence<br />
property.<br />
We define the class of non-unitary time–warping operators<br />
by the set {W̆ : w(t) ∈ C¹, ẇ(t) ≥ 0, x(t) → (W̆x)(t)}<br />
for which<br />
(W̆x)(t) = ∫_ℝ x(t′) δ(w(t) − t′) dt′    (2)<br />
Because ẇ(t) ≥ 0, w⁻¹(t) exists, and we can define the inverse<br />
projector by<br />
(W̆⁻¹x)(t) = ∫_ℝ x(t′) δ(w⁻¹(t) − t′) dt′    (3)<br />
2.2. Time–warping convolution operator<br />
The stationary convolution operator applied to x(t), h(t) ∈<br />
L²(ℝ) is given by<br />
x(t) ∗ h(t) = ∫_ℝ x(t′) h(t′ − t) dt′    (4)<br />
From this definition, it is natural to ask whether the convolution<br />
operator has an equivalent expression in the time-warped<br />
space. We define the time–warping convolution operator by<br />
x(t) ∗_{w(.)} h(t) = W̆⁻¹[(W̆x)(t) ∗ h(t)]    (5)<br />
where ∗_{w(.)} stands for the time–warping convolution operator<br />
along the warping function w(t). Using Eq. 2, Eq. 3 and<br />
Eq. 4, some straightforward algebraic manipulations lead to<br />
x(t) ∗_{w(.)} h(t) = ∫_ℝ x(t′) (dW̆t/dt) h(w⁻¹(t) − w⁻¹(t′)) dt′    (6)<br />
2.3. Time–warping filter<br />
From Eq. 2, one can show that any signal x(t) of the form<br />
x(t) = exp(2iπf₀w⁻¹(t)), f₀ ∈ ℝ, is transformed via non-unitary<br />
time–warping operators into<br />
(W̆x)(t) = exp(2iπf₀ w⁻¹(w(t)))    (7)<br />
= exp(2iπf₀t)    (8)<br />
which is a pure harmonic signal with frequency f₀. One can<br />
exploit this stationarisation effect to design efficient time-varying<br />
filters. Let h^H_{fc}(t) be the impulse response of a linear<br />
time-invariant highpass filter, and h^L_{fc}(t) be the impulse<br />
response of a linear time-invariant lowpass filter. Both filters<br />
are designed to have a cutoff frequency equal to fc. Using<br />
the time–warping convolution operator defined in Eq. 6, we<br />
define x^H(t) and x^L(t) by<br />
x^H(t) = x(t) ∗_{w(.)} h^H_{fc}(t)    (9)<br />
x^L(t) = x(t) ∗_{w(.)} h^L_{fc}(t).    (10)<br />
Then, Eq. 9 and Eq. 10 define a non-stationary filtering<br />
procedure for which<br />
e(t) = fc ẇ⁻¹(t)    (11)<br />
is the time-varying cutoff frequency of the time-varying filter.<br />
3. TIME–WARPING–BASED<br />
AUDIO–WATERMARKING<br />
3.1. Watermark embedding<br />
Fig. 1. Watermark embedding procedure: the watermark<br />
m(t) = a(t) e^{j2πf₀t} is time-warped by the operator W̆ built from<br />
the warping function w(t), then added to the signal x(t) to produce<br />
the watermarked signal xm(t).<br />
The proposed watermark embedding scheme is depicted<br />
in Fig. 1. Roughly speaking, the embedding of the watermark<br />
is carried out in two steps. First, the watermark is<br />
matched to the specificity of the audio signal by means of an<br />
adapted warping operator. Then, the watermark is added to<br />
the original signal.<br />
The human ear is sensitive to frequency-spread signals,<br />
which are perceived as a shuffle [3]. For this reason we suggest<br />
using a watermark m(t) that belongs to the class of frequency-coherent<br />
signals expressed by<br />
m(t) = a(t) e^{j2πf₀t},   f₀ ∈ ℝ⁺    (12)<br />
where a(t) is assumed to be a positive, slowly time-varying<br />
signal. This class of signals is concentrated around the carrier<br />
frequency f₀.<br />
In the proposed method, the rule of insertion of the watermark<br />
is based on the fact that two close signals with similar<br />
instantaneous frequency laws sound very similar from an auditory<br />
point of view [3]. Therefore one can exploit this time-masking<br />
effect by choosing an area of the time–frequency<br />
plane where the watermark is designed to mimic some frequency-concentrated<br />
component present in the signal.<br />
In what follows, we denote such a component by the term “masking<br />
component”. In the case of speech signals, a natural<br />
choice for the masking component is to select a formant<br />
that has a long enough time duration.<br />
Let f(t) be the model of a formant described by<br />
f(t) = a_f(t) e^{j2πφ_f(t)},   t ∈ [t_i, t_f],   t_i < t_f    (13)<br />
In order to exploit the masking effect provided by the masking<br />
component f(t), the time–warped watermark (W̆m)(t) should<br />
be as close as possible to the formant in the time–frequency<br />
plane. Therefore, we define the time–warped watermark by<br />
(W̆m)(t) = a(w(t)) e^{j2π(φ_f(t)+εt)},   t ∈ [t_i, t_f],    (14)<br />
where ε ∈ ℝ is the frequency shift of the watermark. The<br />
choice of ε depends on a trade-off between the separability<br />
of the watermark and the performance of the masking effect. If<br />
ε is too large, the masking effect decreases. If ε is too small,<br />
the watermark cannot be retrieved because of the proximity<br />
of the formant.<br />
Beyond the stealthiness of the watermark, another topic<br />
of the watermarking concept is the coding of specific<br />
information into the signal. To this end, we suggest<br />
using the amplitude of the watermark for information coding.<br />
Let the atom g(t) ≥ 0, t ∈ [−T/2, T/2], be a positive, compactly<br />
supported function whose duration T is small compared to<br />
the time duration t_f − t_i of the masking component. Based<br />
on this definition, we construct the amplitude of<br />
the watermark a(t) as a superposition of time-delayed versions<br />
of the atom g(t).<br />
The choice of the function g(t) can be guided by physiological<br />
aspects of the human ear. It is generally accepted<br />
that the ear is very sensitive to fast variations of signals, since<br />
they produce a large spread in the frequency domain [3]. For<br />
this reason, we force the atom g(t) to be as smooth as possible,<br />
which can be translated into mathematical notation by<br />
requiring the atom g(t) to be of class C∞, the class of infinitely<br />
differentiable functions. In the remainder of this paper we<br />
define g(t) as a scaled version of the mother atom g_m(t):<br />
g_m(t) = exp( −(t/a)² / (1 − (t/a)²) ),   t ∈ [−1, 1];   g_m(t) = 0,   t ∉ [−1, 1],    (15)<br />
where a ∈ ℝ⁺ is the scaling factor. From empirical evidence,<br />
we found that, for detection reasons, the atoms g(t) have<br />
to be separated from each other by at least 5σ_g, where σ_g² is the<br />
variance of g(t).<br />
Let τ be the digital information that has to be watermarked<br />
into the audio signal, expressed in binary as (τ)₂ =<br />
τ₀τ₁…τ_N, where the τᵢ are the bits of τ. Then, the amplitude<br />
a(t) of the watermark is encoded as follows:<br />
a(t) = ∑_{i=0}^{N} τᵢ g(t − 5iσ_g),    (16)<br />
which is known as an amplitude modulation coding scheme.<br />
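Eqs. 15-16 can be sketched directly; the atom spacing 5σ_g is hard-coded to 5 time units and a = 1 for illustration:

```python
import numpy as np

def mother_atom(t, a=1.0):
    """C-infinity bump g_m(t) of Eq. 15: smooth and vanishing where
    |t/a| >= 1, so its spectrum decays faster than any power of f."""
    u = np.asarray(t, dtype=float) / a
    out = np.zeros_like(u)
    inside = np.abs(u) < 1.0
    out[inside] = np.exp(-u[inside]**2 / (1.0 - u[inside]**2))
    return out

def encode_amplitude(bits, t, spacing=5.0, a=1.0):
    """a(t) = sum_i tau_i g(t - 5 i sigma_g)  (Eq. 16): on-off keying of
    time-shifted atoms, one non-overlapping slot per bit."""
    return sum(b * mother_atom(t - spacing * i, a) for i, b in enumerate(bits))
```

Sampling a(t) at the slot centres 0, 5, 10, … returns exactly the bit pattern, since the atoms do not overlap.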
3.2. Watermark recovery<br />
Once a signal has been watermarked, the next step is to deal<br />
with the recovery of the watermark. However, because<br />
of different aspects related to the transmission of the signal<br />
(compression, quantization, noise, ...), this recovery is generally<br />
performed on a modified version x̃m(t) of xm(t). In the<br />
proposed method, the watermark is said to be recovered if the<br />
digital information τ has been estimated from x̃m(t) without<br />
error. The recovery procedure is depicted in Fig. 2, where<br />
the symbol (ˆ.) denotes an estimate of the quantity (.). As<br />
seen, the watermark recovery is carried out in three steps.<br />
Fig. 2. Watermark extraction procedure: the watermarked signal<br />
x̃m(t) is passed through a time-warped highpass filter (block ➊,<br />
warping function w₁(t) = [φf(t) + (ε − Δ)t]⁻¹) and a time-warped lowpass<br />
filter (block ➋, warping function w₂(t) = [φf(t) + (ε + Δ)t]⁻¹),<br />
unwarped via δ(w⁻¹(t) − t′) (block ➌), and finally correlated with the<br />
shifted atoms g(t − 5iσg) and thresholded against ‖g‖²/2 to estimate<br />
each bit τi (block ➍).<br />
The first step corresponds to the extraction of the time–warped<br />
watermark (W̆m)(t) by means of time-warped filters (blocks<br />
➊ and ➋). Two time-varying filters are necessary to extract<br />
the watermark: one highpass (block ➊), and one lowpass<br />
(block ➋). This filtering stage defines a time-varying pass-band<br />
filter expressed by<br />
φ̇f(t) + ε + Δ, the upper cutoff frequency,    (17)<br />
φ̇f(t) + ε − Δ, the lower cutoff frequency.<br />
It is well known that the frequency spread of a time-varying<br />
signal around its instantaneous frequency law depends<br />
on the regularity of its amplitude. Because the amplitude of<br />
the watermark is of class C∞, the frequency decay is faster<br />
than any power of f. Therefore, only a small Δ value is necessary<br />
to extract the time–warped watermark.<br />
The second step corresponds to the unwarping of the estimated<br />
time–warped sequence (block ➌) in order to recover an estimate<br />
m̂(t) of the original sequence m(t).<br />
The last step corresponds to the estimation of the bits τi by<br />
matched filtering (block ➍). The estimation is performed as<br />
follows:<br />
τ̂i = ∫_ℝ m̂(t) g(t − 5iσg) dt  ≷  ‖g‖²/2,   i = 1, …, N,    (18)<br />
deciding τ̂i = 1 if the integral exceeds the threshold and τ̂i = 0<br />
otherwise, where ‖g‖ is the norm of g(t).<br />
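The decision rule of Eq. 18 amounts to correlating m̂(t) with each shifted atom and comparing against half the atom energy. A self-contained round-trip sketch (illustrative parameters, atom spacing fixed to 5, and perfect recovery m̂ = a assumed; the time-warping and filtering stages are omitted):

```python
import numpy as np

def atom(t, a=1.0):
    # C-infinity bump of Eq. 15, vanishing where |t/a| >= 1
    u = np.asarray(t, dtype=float) / a
    out = np.zeros_like(u)
    inside = np.abs(u) < 1.0
    out[inside] = np.exp(-u[inside]**2 / (1.0 - u[inside]**2))
    return out

def decode_bits(m_hat, t, n_bits, spacing=5.0, a=1.0):
    """tau_i = 1  iff  integral m_hat(t) g(t - 5 i sigma_g) dt > ||g||^2 / 2
    (Eq. 18); the integrals are approximated by Riemann sums on the grid t."""
    dt = t[1] - t[0]
    g_energy = np.sum(atom(t - t[len(t) // 2], a)**2) * dt     # ||g||^2
    return [int(np.sum(m_hat * atom(t - spacing * i, a)) * dt > g_energy / 2.0)
            for i in range(n_bits)]

# Round trip: encode the bits as in Eq. 16, then decode them with Eq. 18.
bits = [0, 1, 0, 0, 1, 1]
t = np.linspace(-2.0, 28.0, 30001)
m_hat = sum(b * atom(t - 5.0 * i) for i, b in enumerate(bits))
```

Since the atoms are disjoint, a clean correlation is either ≈ ‖g‖² (bit 1) or 0 (bit 0), leaving a comfortable margin for noise.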
4. NUMERICAL RESULTS<br />
The test signal is a male utterance of the word “bingo” sampled<br />
at 8 kHz. The Log–spectrogram of the test signal is depicted<br />
in the Fig. 3. The selected masking component is the<br />
formant referenced by the black arrow. The watermark is em-<br />
Fig. 3. Log-spectrogram of the test signal: a male utterance of<br />
the word “bingo” with a sampling rate of 8 kHz (normalized time<br />
vs. normalized frequency).<br />
bedded as described in Sec. 3.1. First, the data (τ)₂ = 010011<br />
is used to generate the amplitude of the watermark by means<br />
of Eq. 16. Then the insertion zone is manually chosen in order<br />
to define the warping operator used to generate<br />
the time–warped watermark. Finally, the time–warped watermark<br />
is added to the original signal. The result of the watermark<br />
embedding is shown in Fig. 4. As can be seen, the time–<br />
Fig. 4. Zoomed log-spectrograms of (a) the original signal and<br />
(b) the watermarked test signal.<br />
warped watermark is very close to the original formant. As<br />
expected, the frequency spread decreases very fast, thanks to<br />
the smoothness of the amplitude of the watermark sequence.<br />
We found the embedding strategy satisfactory, since we<br />
were not able to guess whether the signal was watermarked or<br />
not during blind tests. In order to provide a more objective<br />
comparison criterion, we make use of the “Auditory Toolbox”<br />
[4] to generate auditory representations of the original and<br />
watermarked signals, a pseudo-time–frequency representation<br />
based on physiological aspects of the human ear.<br />
Auditory representations of the original and watermarked signals<br />
are depicted in Fig. 5. Both representations are very similar,<br />
which confirms the stealthiness of the watermark.<br />
The next step consists in the recovery of the watermark sequence.<br />
We tested the proposed approach on the true watermarked<br />
signal, and on two different deteriorated versions of<br />
the watermarked signal: the first under an MP3 compression attack,<br />
and the second under an additive Gaussian noise attack with<br />
a signal-to-noise ratio of 0 dB.<br />
Results of the matched filtering estimation are presented<br />
in Tab. 1. The estimation step shows that the watermark<br />
is perfectly extracted and has resisted the MP3 attack<br />
as well as the white-noise attack.<br />
Fig. 5. Auditory representations of (a) the original signal and<br />
(b) the watermarked signal (normalized time vs. channels).<br />
τ τ1 τ2 τ3 τ4 τ5 τ6<br />
True 0 1 0 0 1 1<br />
No attack 0 1 0 0 1 1<br />
MP3 attack 0 1 0 0 1 1<br />
Noise attack 0 1 0 0 1 1<br />
Table 1. Results of the estimation of the set {τi} by matched<br />
filtering.<br />
5. CONCLUSION<br />
In this paper we have proposed a new watermarking method<br />
for speech signals, based on the time–warping signal processing<br />
concept. We have shown that it is possible to exploit physiological<br />
aspects of the human ear in order to carry information<br />
while keeping the inserted watermark stealthy.<br />
We have also developed a complete extraction method based<br />
on time-varying filters, time–warping operators and matched filtering<br />
to recover the watermark sequence. Numerical results<br />
show that the watermark is likely inaudible and that the numerical<br />
information it carries is retrievable. Future<br />
work will include a close study of the robustness of the method<br />
against various attacks. For real applications, another topic<br />
is the unsupervised embedding of the watermark according to<br />
the position of the formant. This issue is left for future work.<br />
6. REFERENCES<br />
[1] M. Arnold, “Audio watermarking: Features, applications<br />
and algorithms,” in IEEE International Conference on<br />
Multimedia and Expo, New York, USA, July 2000.<br />
[2] R. Baraniuk, “Unitary equivalence: A new twist on signal<br />
processing,” IEEE Trans. on <strong>Signal</strong> Processing, vol. 43,<br />
no. 10, pp. 2269–2282, Oct. 1995.<br />
[3] M. C. Botte, G. Canevet, L. Demany, and C. Sorin, Psychoacoustique<br />
et perception auditive, Inserm, 1989.<br />
[4] M. Slaney, “Auditory Toolbox, version 2.0,” available at<br />
http://www.slaney.org/malcolm/pubs.html, 1994.<br />
Chirp-based image watermarking as error-control coding<br />
Behnaz Ghoraani, and Sridhar Krishnan<br />
Dept. of Elec. & Comp. Eng., <strong>Ryerson</strong> <strong>University</strong>, Toronto, Canada<br />
E-mail: bghoraan@rnet.ryerson.ca, and krishnan@ee.ryerson.ca<br />
Abstract<br />
In this paper, we use post-processing methods to compensate<br />
for the bit errors that occur during watermark embedding<br />
and extraction. Forward error correction (FEC)-based and<br />
chirp-based techniques are applied to encode and shape<br />
the embedded watermark message so that, even in the presence<br />
of some bit error rate (BER) in the extracted watermark,<br />
the watermarking algorithm is able to successfully<br />
estimate the correct embedded watermark message.<br />
Repetition and Bose-Chaudhuri-Hocquenghem (BCH) coding<br />
are used as two well-known FEC schemes, and the discrete<br />
polynomial transform (DPPT) and the Hough-Radon transform<br />
(HRT) are utilized as two chirp detectors in chirp-based watermarking.<br />
The robustness of all the proposed post-processing<br />
methods is tested against the Checkmark benchmark attacks,<br />
and we found that chirp-based watermarking using the<br />
DPPT chirp detector offers the highest watermark extraction<br />
rate and the best bit error compensation, even at BERs<br />
higher than 17%.<br />
1. Introduction<br />
The worldwide trend of using the Internet to electronically<br />
distribute multimedia offers many advantages, such as<br />
huge cost reductions and considerable time savings in the<br />
distribution process, to both owners and consumers. However,<br />
the available methods for distributing multimedia lack<br />
privacy and proof of ownership. One of the suggested solutions<br />
for protecting copyrights and preventing illegal use of<br />
multimedia is watermarking. Watermarking embeds a<br />
hidden signature into the multimedia signal containing information<br />
for content authentication, access control<br />
and copy protection, and identification and traitor<br />
tracing in the case of illegal distribution. The embedded<br />
watermark signal should be imperceptible and should not affect<br />
the quality of the multimedia content. Also, since users normally<br />
apply many signal manipulations, such as lossy compression,<br />
to a multimedia signal, the watermark should be<br />
robust to these typical signal operations. However, even using<br />
the most robust embedding techniques, there will be some<br />
bit errors in the received watermark message. Therefore, the<br />
watermark detection process encounters difficulties in<br />
extracting the exact watermark message and retrieving the<br />
hidden information. Shaping the embedded signature so that<br />
the extractor can compensate for the bit error rate (BER) of<br />
the message, using prior knowledge about the watermark<br />
structure, can be useful in estimating the exact watermark<br />
message.<br />
Proceedings of the 2006 International Conference on Intelligent<br />
Information Hiding and Multimedia <strong>Signal</strong> Processing (IIH-MSP'06)<br />
0-7695-2745-0/06 $20.00 © 2006<br />
In this study, we focus on utilizing chirps as watermark<br />
message structures [1][7], and on comparing their results with<br />
forward error correction (FEC) schemes. For the experiments,<br />
we use a spread-spectrum method to embed the watermark<br />
messages into the discrete cosine transform (DCT) coefficients<br />
of the image. After extracting the watermark message<br />
from the watermarked image, there is a post-processing<br />
stage. We utilize BCH and repetition codings for FEC-based<br />
post-processing, and discrete polynomial-phase transform<br />
(DPPT) and Hough-Radon transform (HRT) detectors in<br />
chirp-based watermarking. In this paper, we present the results<br />
of chirp-based and FEC-based post-processing, and<br />
show that chirp-based watermarking is comparable<br />
to the FEC schemes.<br />
2. Watermarking method<br />
In this study we use a spread-spectrum watermarking<br />
scheme, a correlation method that embeds a pseudo-random<br />
sequence and detects the watermark by calculating the correlation<br />
between the pseudo-random noise sequence and the watermarked<br />
signal. The spread-spectrum scheme is the most popular<br />
scheme and has been studied well in the literature [2]. The<br />
spread-spectrum method can be applied in the time domain or<br />
in a transform domain. We utilized the DCT coefficients,<br />
which are widely used in compression applications<br />
and on which it is easier to impose human visual system (HVS)<br />
constraints. Figure 1 shows the block diagram of the<br />
watermark embedding and extraction schemes [7].<br />
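The correlation idea behind spread-spectrum embedding and detection can be sketched as follows. This is an illustrative numpy sketch, not the authors' implementation: the DCT step is abstracted away (we embed directly into a generic vector of host coefficients), and `alpha` is a hypothetical embedding strength.

```python
import numpy as np

def ss_embed(coeffs, bits, alpha=2.0, seed=7):
    """Spread-spectrum embedding: each message bit antipodally
    modulates its own pseudo-random (PN) chip sequence, which is
    added to the host transform coefficients."""
    rng = np.random.default_rng(seed)
    chips = rng.choice([-1.0, 1.0], size=(len(bits), len(coeffs)))
    wm = coeffs.astype(float).copy()
    for b, pn in zip(bits, chips):
        wm += alpha * (1.0 if b else -1.0) * pn
    return wm, chips

def ss_extract(wm_coeffs, chips):
    """Correlation detector: the sign of the correlation between the
    received coefficients and each PN sequence recovers each bit."""
    return (chips @ wm_coeffs > 0).astype(int)
```

With a few thousand host coefficients the correlation margin is large, so the bits come back error-free in the noiseless case; attacks perturb the correlations, which is exactly what the post-processing stage discussed next must absorb.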
As mentioned earlier, because of intentional and unintentional<br />
signal processing operations, there will be some bit errors in<br />
the received message. In the next section we try to compensate<br />
for these bit errors by concentrating on the structure<br />
of the watermark message.<br />
Figure 1. Detection and extraction block diagram<br />
of the watermarking method<br />
3. Post processing in watermarking<br />
In the case of severe signal manipulations, the extracted watermark<br />
has some bit errors. One of the post-processing<br />
methods that can be useful in correcting the bit errors is encoding<br />
the watermark signature with an FEC scheme before<br />
embedding it into the multimedia signal.<br />
3.1 Forward Error Correction schemes<br />
FEC schemes, or channel codings, are used to protect<br />
digital communication by inserting redundant bits into the<br />
data. These additional bits help in detecting and correcting<br />
errors that occur in the data. Due to the similarity<br />
between watermarking and communication systems,<br />
FEC methods have been commonly used to increase the bit<br />
error compensation capacity of watermarking techniques.<br />
BCH, turbo and repetition codings are the most commonly<br />
used FEC schemes in watermarking applications [4]. In this<br />
study, we utilized BCH and repetition codings as two well-known<br />
FEC schemes.<br />
3.1.1 Bose-Chaudhuri-Hocquenghem (BCH) coding<br />
BCH is a block coding scheme. A binary BCH<br />
code (n, k) segments the data into blocks of k bits, and transforms<br />
each k-bit block into an n-bit block. The (n − k) additional<br />
bits are called redundant bits, and the code rate is k/n.<br />
Since our target in this study is to compare different types of<br />
post-processing methods, all the watermark messages used<br />
in each method have almost the same number of bits and<br />
redundancy rates. The values of n and k for the BCH code<br />
that give a watermark message length closest to 180 bits<br />
and the highest redundancy rate are 63 and 7, respectively.<br />
BCH (63, 7) encodes a 21-bit watermark signature into a<br />
189-bit embedded message with a redundancy rate of 10.7/12.<br />
3.1.2 Repetition coding<br />
Repetition coding is a very simple and well-known coding<br />
technique. Repetition coding with a repetition number of<br />
n repeats each bit n additional times, resulting in a redundancy<br />
rate of n/(n+1). We choose n = 11 to encode a 15-bit watermark<br />
into a 180-bit embedded watermark with a redundancy<br />
rate of 11/12.<br />
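The encode/majority-vote-decode cycle can be sketched as a minimal illustration (function names are ours); with n = 11, a bit survives as long as fewer than half of its 12 copies are flipped:

```python
import numpy as np

def rep_encode(bits, n=11):
    """Repeat each bit n additional times (n + 1 copies in total),
    giving the redundancy rate n/(n+1) used in the paper."""
    return np.repeat(np.asarray(bits, dtype=int), n + 1)

def rep_decode(received, n=11):
    """Majority vote over each group of n + 1 received copies."""
    groups = np.asarray(received, dtype=int).reshape(-1, n + 1)
    return (groups.sum(axis=1) > (n + 1) // 2).astype(int)
```

For example, flipping every sixth coded bit of a 180-bit message (30 flips, a BER of about 16.7%) leaves two errors per 12-copy group, and the majority vote still recovers the 15-bit watermark exactly.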
Having prior knowledge about the structure of the watermark<br />
message can also be useful in compensating for the<br />
BER of the extracted watermark. The proposed<br />
structure is to embed chirp signals<br />
as the watermark message in chirp-based watermarking.<br />
3.2. Chirp-based watermarking<br />
In chirp-based watermarking, the idea is to embed<br />
chirps in the multimedia signal as watermark signatures.<br />
Before embedding the watermark signal into the image, the<br />
watermark message is encoded to the embedded chirp according<br />
to a predefined codebook. Chirps are signals with time-varying<br />
frequency, and they are best detected in the time-frequency (TF)<br />
plane; also, different chirp rates represent different watermark<br />
messages. Because the extracted watermark message<br />
should be in the form of a chirp, by applying a post-processing<br />
step such as a chirp detector to the extracted watermark,<br />
the embedded chirp can be estimated successfully even in<br />
the presence of some bit errors. The HRT and DPPT are the two<br />
chirp detection tools used in our experiments.<br />
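A toy version of the codebook idea, with messages mapped to chirp rates and matching done by correlation coefficient, might look like this; the rate values, codebook size, and linear-frequency term are invented for illustration and are not the paper's actual codebook:

```python
import numpy as np

def make_codebook(n_chirps=8, length=182):
    """Hypothetical chirp codebook: each message index maps to a chirp
    exp(j(a1*n + a2*n^2)) with a distinct chirp rate a2."""
    n = np.arange(length)
    rates = np.linspace(1e-3, 8e-3, n_chirps)   # assumed rate spacing
    return [np.exp(1j * (0.2 * n + a2 * n**2)) for a2 in rates], rates

def match_chirp(received, codebook):
    """Pick the codebook chirp with the highest magnitude correlation
    coefficient against the received (possibly corrupted) signal."""
    scores = [abs(np.vdot(c, received)) / len(received) for c in codebook]
    return int(np.argmax(scores))
```

Because distinct chirp rates decorrelate quickly over 182 samples, the correct codebook entry keeps a large correlation margin even when the received chirp is noticeably corrupted.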
3.2.1 Hough-Radon transform (HRT)<br />
The HRT is a parametric tool to detect the pixels that belong<br />
to a parametric constraint of either a line or a curve in a<br />
gray-level image [8]. The HRT divides the Hough-Radon parameter<br />
space into cells, then calculates the accumulator value<br />
for each cell in the parameter space. The cell with the highest<br />
accumulator value gives the parameters of the detected<br />
constraint. Since in the post-processing of<br />
chirp-based watermarking we are looking for the embedded<br />
chirp as a straight line in the TF plane, we can apply<br />
the HRT to detect the embedded chirp. First, the<br />
extracted watermark bits are transformed to the TF plane;<br />
then the HRT detects the line representing the chirp in the TFD.<br />
In order to achieve good detection performance, the Wigner-<br />
Ville distribution (WVD) is used as the TFD representation of<br />
the signal. In this study, the HRT space has 182 × 182 cells, which<br />
supports a 15-bit watermark message. HRT-based post-<br />
processing has a redundancy rate of 11/12.<br />
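The accumulator voting described above can be illustrated with a toy slope-intercept Hough transform; this simplifies the Hough-Radon parameterization, and the grid sizes are ours, not the 182 × 182 space used in the paper:

```python
import numpy as np

def hough_line(points, slopes, intercepts):
    """Toy Hough accumulator over a (slope, intercept) grid: every
    time-frequency point votes for the parameter cell consistent
    with it, and the peak cell gives the detected line, i.e. the
    chirp rate and its starting frequency."""
    step = intercepts[1] - intercepts[0]
    acc = np.zeros((len(slopes), len(intercepts)))
    for t, f in points:
        for i, m in enumerate(slopes):
            b = f - m * t                        # intercept implied by (t, f)
            j = int(np.argmin(np.abs(intercepts - b)))
            if abs(intercepts[j] - b) <= step / 2:
                acc[i, j] += 1                   # vote
    i, j = np.unravel_index(np.argmax(acc), acc.shape)
    return slopes[i], intercepts[j]
```

Points lying on a single TF line concentrate all their votes in one cell, while off-slope cells spread their votes across many intercept bins; the paper's observation that too fine a grid shifts the peak into neighboring cells can be reproduced by shrinking `step`.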
3.2.2 Discrete Polynomial Phase Transform (DPPT)<br />
The DPPT is a parametric signal analysis tool for estimating<br />
the phase parameters of constant-amplitude polynomial-phase<br />
signals, even in the presence of some bit errors in the<br />
signal [5]. The embedded watermark in chirp-based watermarking<br />
is of the form:<br />
chirp(n) = exp(j(a0 + a1·n + a2·n²))   (1)<br />
The DPPT gives an estimate of a0, a1 and a2, which enables<br />
us to synthesize the original chirp. The DPPT algorithm<br />
defines ambiguity functions; applying the order-2<br />
function to a constant-amplitude chirp transforms the<br />
broadband signal into a single-tone signal with frequency<br />
related to a2. The position of this spectral peak provides<br />
an estimate of the coefficient â2. Multiplying the signal<br />
by exp(−j·â2·n²) reduces the order of the polynomial<br />
to 1, and repeating the procedure gives an estimate of all the<br />
parameters. The judgment about successful watermark<br />
extraction is made based on the following methods:<br />
Threshold-based (DPPT[T]) [3] - We decide<br />
on the correct detection of the watermark based on<br />
the correlation between the estimated chirp and the<br />
embedded watermark. The threshold for a 182-bit<br />
embedded chirp and a 15-bit watermark signature<br />
is experimentally set to 0.9.<br />
Correlation-based (DPPT[C]) - Searches for the chirp<br />
in the codebook which has the highest correlation coefficient<br />
with the estimated chirp. In this case, to achieve<br />
a better watermark extraction rate, the correlation between<br />
the chirps in the codebook is limited to a maximum<br />
of 0.93, which offers a redundancy rate of 11.08/12<br />
for a chirp length of 182 bits.<br />
Initial and final frequency-based (DPPT[F]) - Finds the<br />
chirp whose initial and final frequencies are closest<br />
to the estimated initial and final frequencies of<br />
the recovered chirp. Due to the BER in the received watermark<br />
signal, the DPPT estimates the initial and final<br />
frequencies with some variation from the original<br />
ones. To increase the watermark extraction rate, the<br />
minimum difference between the initial and final frequencies<br />
of chirps in the codebook is defined as 4 Hz<br />
for a 182-bit chirp. This setting gives a redundancy<br />
rate of 11.09/12.<br />
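The order-2 estimation procedure described above (multiply by a conjugate-delayed copy to collapse the chirp into a tone, pick the spectral peak, demodulate, repeat) can be sketched in numpy. This is a minimal illustration of the DPPT idea from [5], not the authors' code; the zero-padding factor is an assumption made here for finer peak location:

```python
import numpy as np

def dppt_estimate(s, pad=8):
    """Order-2 DPPT sketch for s(n) = exp(j(a0 + a1*n + a2*n^2)):
    s(n)*conj(s(n - tau)) collapses the chirp to a single tone at
    2*a2*tau; its spectral peak yields a2, and after demodulating
    the quadratic phase a second peak search yields a1."""
    N = len(s)
    tau = N // 2
    prod = s[tau:] * np.conj(s[:-tau])           # tone at 2*a2*tau
    freqs = 2 * np.pi * np.fft.fftfreq(pad * N)  # zero-padded grid, rad/sample
    a2_hat = freqs[np.argmax(np.abs(np.fft.fft(prod, pad * N)))] / (2 * tau)
    demod = s * np.exp(-1j * a2_hat * np.arange(N) ** 2)
    a1_hat = freqs[np.argmax(np.abs(np.fft.fft(demod, pad * N)))]
    return a1_hat, a2_hat
```

Because the estimate comes from the location of a spectral peak rather than from individual samples, it degrades gracefully when a fraction of the samples is corrupted, which is what makes DPPT attractive for the high-BER regime reported in the results.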
4. Results<br />
To measure the robustness of the post-processing algorithms,<br />
we perform the Checkmark benchmark attacks [6]<br />
on 10 different images of size 512 × 512. The PN sequence<br />
is 100,000 samples long and the watermark sampling frequency<br />
is 1 kHz. For a fair comparison of all the post-<br />
processing techniques, the methods have been set up with<br />
almost equal message lengths and redundancy rates. Figure<br />
2 shows the important features of the applied schemes.<br />
Figure 2. Features of each coding scheme<br />
used to code the watermark message<br />
Figure 3. Watermark detection results for<br />
Checkmark benchmark attacks<br />
Figure 3 shows the results of all the post-processing techniques.<br />
The first column shows the different types of attacks applied<br />
to the watermarked image, and the number shows the<br />
number of attacks. The number under each column represents<br />
the percentage of successful watermark detections under<br />
each class of attack. Although the DPPT[T] method shows<br />
higher results compared to DPPT[C] and DPPT[F], it results<br />
in a 13% false-positive error rate, which is too high to<br />
be applicable to a multi-user watermarking system. Therefore,<br />
DPPT[T] is useful in watermarking applications in which<br />
we are interested in detecting the watermark rather than extracting<br />
the watermark message. The DPPT[F]-based method<br />
offers higher, or in some cases equal, results compared<br />
to the DPPT[C]-based method. In addition, the DPPT[F]-based<br />
method does not require the long process of correlating the<br />
estimated chirp with all the chirps in the codebook and is faster<br />
than DPPT[C]. Thus, we conclude that the initial and final<br />
frequency comparison is the best DPPT-based method<br />
for finding the embedded message in the codebook.<br />
Although the HRT is an optimal line detection tool, having<br />
a large number of cells in the Hough-Radon space shifts<br />
the peak of the accumulator to neighboring cells and consequently<br />
results in a wrong detection of the slope in the TFD.<br />
Therefore, we see in Figure 3 that the DPPT-based method outperforms<br />
the HRT-based algorithm for most of the attack types,<br />
with a total detection rate of 92% compared to 87%. Also, comparing<br />
the complexity order of the HRT-based and DPPT-based<br />
techniques, we conclude that the DPPT-based method is<br />
more practical for real-time applications. Figure 4 shows<br />
the complexity order and running time on a Pentium<br />
IV with a 2.66 GHz CPU and 512 MB of RAM. DPPT-based watermark<br />
extraction is about 55 times faster than the HRT-based<br />
algorithm.<br />
As we observe in Figure 3, the detection results for the<br />
BCH and repetition codings are almost the same,<br />
but the DPPT-based method offers better, or in<br />
some cases equal, results when compared to the repetition and BCH<br />
codings. Figure 5 shows the detection results considering the<br />
BER in the received message. As we see in this figure, both<br />
DPPT and BCH detect 100% of watermark messages successfully<br />
up to a BER of 17%. However, the maximum BER at<br />
which BCH detects a watermark correctly is 22%, with a 17%<br />
detection rate, while DPPT shows a 50% detection rate at a BER<br />
of 28%. To highlight the outstanding performance of DPPT<br />
at high BERs, we calculate the watermark detection rate<br />
for BERs greater than 17%. We see that the DPPT-based<br />
method offers a 52% detection rate at higher BERs, while the<br />
BCH and repetition codings have 47% and 41% detection<br />
rates, respectively.<br />
Figure 4. Order of complexity of each coding<br />
scheme used to code the watermark message<br />
5. Conclusions<br />
In this paper, we compared FEC-based and chirp-based<br />
post-processing methods in watermarking. The robustness<br />
of the proposed techniques was tested against Checkmark<br />
benchmark attacks. The DPPT-based and BCH-based methods<br />
were able to compensate for BERs of up to 17%. The<br />
DPPT-based post-processing offered the highest detection<br />
rate of 92%, and showed the highest detection rate for BERs<br />
higher than 17%. We also compared the computational<br />
complexity of the proposed methods; the BCH-, repetition-<br />
and DPPT-based methods have almost the same complexity,<br />
while the HRT has a complexity about 55 times higher than the<br />
other methods, because the HRT operates on the TF<br />
plane and calculates the accumulator value for all the cells<br />
in the Hough-Radon plane.<br />
Figure 5. Watermark detection under different bit error rates<br />
References<br />
[1] S. Erkucuk, S. Krishnan, and M. Zeytinoglu. Robust audio<br />
watermarking using a chirp based technique. IEEE Intl. Conf.<br />
on Multimedia and Expo, 2:513–516, July 2002.<br />
[2] D. Kirovski and H. Malvar. Spread-spectrum watermarking<br />
of audio signals. IEEE Transactions on <strong>Signal</strong> Processing,<br />
special issue on data hiding, 51:1020–1034, April 2003.<br />
[3] L. Le and S. Krishnan. Time-frequency signal synthesis<br />
and its application in multimedia watermark detection.<br />
EURASIP Journal on Applied <strong>Signal</strong> Processing, 2006:Article<br />
ID 86712, 14 pages, 2006.<br />
[4] J. Lee, H. Kim, and J. Lee. Information extraction method<br />
without original image using turbo code. Proc. International<br />
Conference on Image Processing, Greece, 3:880–883, October<br />
2001.<br />
[5] S. Peleg and B. Friedlander. The discrete polynomial-phase<br />
transform. IEEE Transactions on <strong>Signal</strong> Processing,<br />
43:1901–1914, August 1995.<br />
[6] S. Pereira, S. Voloshynovskiy, M. Madueno, S. Marchand-<br />
Maillet, and T. Pun. Second generation benchmarking and<br />
application oriented evaluation. In Information Hiding Workshop<br />
III, Pittsburgh, PA, USA, April 2001.<br />
[7] A. Ramalingam and S. Krishnan. Robust image watermarking<br />
using a chirp detection-based technique. IEE Proceedings on<br />
Vision, Image and <strong>Signal</strong> Processing, 152:771–778, December<br />
2005.<br />
[8] R. Rangayyan and S. Krishnan. Feature identification in the<br />
time-frequency plane by using the Hough-Radon transform.<br />
Pattern Recognition, 34:1147–1158, 2001.<br />
2006 International Joint Conference on Neural Networks<br />
Sheraton Vancouver Wall Centre Hotel, Vancouver, BC, Canada<br />
July 16-21, 2006<br />
Automatic Content-Based Image Retrieval Using Hierarchical<br />
Clustering Algorithms<br />
Kambiz Jarrah, Student Member, IEEE, Sri Krishnan, Senior Member, IEEE, and Ling Guan, Senior<br />
Member, IEEE<br />
Abstract - The overall objective of this paper is to present a<br />
methodology for guiding adaptations of an RBF-based<br />
relevance feedback network, embedded in automatic content-<br />
based image retrieval (CBIR) systems, through the principle of<br />
unsupervised hierarchical clustering. The self-organizing tree<br />
map (SOTM) is essentially attractive for our approach since it<br />
not only extracts global intuition from an input pattern space<br />
but also injects some degree of localization into the<br />
discriminative process such that maximal discrimination<br />
becomes a priority at any given resolution. The main focus of<br />
this paper is twofold: introducing a new member of the SOTM<br />
family, the Directed SOTM (DSOTM), that not only provides a<br />
partial supervision on cluster generation by forcing divisions<br />
away from the query class, but also presents a flexible verdict<br />
on resemblance of the input pattern as its tree structure grows;<br />
and modifying the current structure of the normalized graph<br />
cuts (Ncut) process by enabling the algorithm to determine the<br />
appropriate number of clusters within an unknown dataset<br />
prior to its recursive clustering scheme through the principle of<br />
self-organizing normalized graph cuts (SONcut).<br />
Comprehensive comparisons with the Self-Organizing Feature<br />
Map (SOFM), SOTM, and Ncut algorithms demonstrate the<br />
feasibility of the proposed methods.<br />
I. INTRODUCTION<br />
Content-based image retrieval (CBIR) relies on<br />
characterization of images based on their visual contents.<br />
These visual contents consist of low-level features,<br />
including colour, texture, and shape, that offer a multi-<br />
dimensional vector representation of an image within the<br />
feature space.<br />
One of the major requirements for designing an effective<br />
CBIR system is to reduce the gap between low-level features<br />
and high-level concepts by tailoring the human perceptual<br />
subjectivity to the retrieval process. Human-computer<br />
interaction (HCI) systems have demonstrated a successful<br />
behavior for this purpose [1]. In such systems, users directly<br />
supervise the learning process by constantly providing and<br />
training the system with suitable samples (images). This<br />
dependency of the system on users' inputs may add<br />
excessive human errors to the adaptation process due to<br />
subjective interpretations of image contents by each<br />
individual.<br />
To overcome this problem, an unsupervised learning<br />
approach with a hierarchical architecture is required to guide<br />
these adaptations automatically and more toward relevant<br />
samples. The SOTM has shown effective behavior in<br />
minimizing human interactions and automating the search<br />
process by efficiently classifying an unknown and non-<br />
uniform data space into more meaningful clusters.<br />
The main focus of this paper is as follows: a) introducing<br />
a new member of the SOTM family, the Directed SOTM<br />
(DSOTM), that dynamically controls the generation of new<br />
centres and decides on the resemblance of input samples, with<br />
respect to the query, during the learning phase of the<br />
algorithm; and b) modifying the current structure of the<br />
normalized graph cuts (Ncut) algorithm [2] to make it more<br />
adaptive to the nature of the input pattern by adding a self-<br />
determination mechanism to decide on the<br />
appropriate number of clusters prior to its iterative clustering<br />
process. We call the modified Ncut the self-organizing Ncut<br />
(SONcut).<br />
This paper provides some details on both the DSOTM and<br />
SONcut algorithms in Section 2; Section 3 presents an<br />
overall description of the structure of the CBIR system used in<br />
this work; a comprehensive comparison between the<br />
proposed classifiers and their conventional counterparts is<br />
presented in Section 4; Section 5 concludes the paper<br />
with some remarks.<br />
II. UNSUPERVISED CLUSTERING APPROACHES<br />
In this section, an overview of the proposed unsupervised<br />
and hierarchical clustering algorithms using DSOTM and<br />
SONcut is presented.<br />
This work was supported by the Natural Sciences and Engineering<br />
<strong>Research</strong> Council of Canada (NSERC) and the <strong>Ryerson</strong> Graduate Scholarship Program.<br />
K. Jarrah, barrak@e.ryerson.ca, is affiliated with the Multimedia<br />
<strong>Research</strong> Laboratory (RML) and <strong>Signal</strong> <strong>Analysis</strong> <strong>Research</strong> <strong>Group</strong> (<strong>SAR</strong>),<br />
<strong>Ryerson</strong> <strong>University</strong> (www.ryerson.ca), Toronto, Canada.<br />
S. Krishnan, krishnan@ee.ryerson.ca, is the chair of the Department of<br />
Electrical and Computer Engineering, <strong>Ryerson</strong> <strong>University</strong>, Toronto,<br />
Canada, and is affiliated with the <strong>Signal</strong> <strong>Analysis</strong> <strong>Research</strong> <strong>Group</strong> (<strong>SAR</strong>) at<br />
the same university.<br />
L. Guan, dguan@e.ryerson.ca, is the Canada <strong>Research</strong> Chair, <strong>Ryerson</strong><br />
<strong>University</strong>, Toronto, Canada, and is affiliated with the Multimedia <strong>Research</strong><br />
Laboratory (RML) at the same university.<br />
A. Directed Self-Organizing Tree Map (DSOTM)<br />
The SOTM [3] is an unsupervised machine learning algorithm<br />
inspired by principles found in Kohonen's self-<br />
organizing feature map (SOFM) [4].<br />
The tree structure of the SOTM is constructed by<br />
randomly selecting an isolated root node (centre) and<br />
repeatedly presenting the remaining patterns to the<br />
network. The pattern (sample) which is found to be closest<br />
to the centre with respect to the current similarity measurement<br />
0-7803-9490-9/06/$20.00/©2006 IEEE<br />
Fig. 1. Two-dimensional mapping: (Left) clustering using SOFM and (Right) clustering using SOTM. It is evident that redundant nodes in the<br />
lattice topology of SOFM can produce unnecessary boundaries by having some of the centres trapped in low-density regions of the input pattern.<br />
is declared to be the winner. Every such presentation of<br />
input patterns slightly modifies the winning node's position<br />
in the network: a position that eventually evolves toward the<br />
centre of mass of the current class. This gradual adaptation<br />
of the node's position is controlled by an exponentially<br />
decaying function called the learning rate. The learning rate<br />
resets to its initial value each time a new centre is generated.<br />
Therefore, sufficient time is given to the network to adapt<br />
itself to the presence of new samples; thus, as the tree grows<br />
larger, the similarity measurement tends to be more<br />
accurate. The generation of new centres (branches of the<br />
tree) is controlled by a hierarchy function, called the<br />
threshold function, which decreases over time. If an input<br />
sample is encountered whose distance to the existing nodes<br />
exceeds this threshold function (i.e. it is significantly<br />
different from all nodes in the current SOTM map), a new<br />
node is generated. The new node is attached as a leaf node of<br />
its closest representation in the current SOTM mapping; thus,<br />
over time, a tree structure evolves [3].<br />
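The growth procedure described above (winner adaptation under a decaying learning rate, and node spawning when a sample exceeds the decaying threshold) can be sketched as follows. All parameter values and the exact decay schedule here are illustrative, not those of the SOTM papers:

```python
import numpy as np

def sotm_fit(X, h0=2.0, decay=0.95, lr0=0.1, seed=0):
    """Minimal SOTM-style growth sketch: the winning centre moves
    toward each presented sample; a sample farther than the decaying
    threshold H(t) spawns a new centre (a new branch of the tree)."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    centres = [X[rng.integers(len(X))].copy()]   # random root node
    H, lr = h0, lr0
    for x in X[rng.permutation(len(X))]:
        d = [np.linalg.norm(x - c) for c in centres]
        w = int(np.argmin(d))                    # winning centre
        if d[w] > H:
            centres.append(x.copy())             # grow the tree
            lr = lr0                             # reset the learning rate
        else:
            centres[w] += lr * (x - centres[w])  # adapt the winner
        H *= decay                               # threshold decreases over time
        lr *= decay                              # learning rate decays
    return np.array(centres)
```

On two well-separated blobs, the first sample presented from the far blob necessarily exceeds the threshold and spawns a second centre, which is the behavior that lets the SOTM allocate one centre per high-density region instead of trapping nodes between regions as in the SOFM example of Fig. 1.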
Similar to the SOFM, the SOTM aims at preserving the<br />
topological relationships between patterns in the original<br />
input space. However, unlike the SOFM, the SOTM classifies a<br />
large group of patterns by building and evolving a tree<br />
structure that tends to form neighborhood relationships<br />
reflecting the degree of similarity between new and<br />
already classified patterns.<br />
Although image indexing with the SOFM was perceived<br />
to be a robust and effective solution that tolerates even very<br />
high input vector dimensionalities [5], the lattice topology of<br />
the SOFM network makes it essentially undesirable for<br />
clustering purposes due to the concentration of a fraction of<br />
nodes in the map resulting from the best-matching unit<br />
computation [6]. The SOFM has nodes that can easily get<br />
trapped in regions of low density and, therefore, can simply<br />
lose the ability to represent the underlying topology of the<br />
input pattern. For instance, let us assume there are two high-<br />
density regions in the input space, representing two distinct<br />
clusters. Let us also assume that there are at most two<br />
nodes in the structure of the SOFM to correctly allocate both<br />
regions. If those two nodes were separated by a third node<br />
and each converged to the two adjacent regions of high<br />
density, then the third node could easily get trapped<br />
between the regions. As a result, it can change the true<br />
boundaries of the high-density classes by pulling some of the<br />
samples from the two real clusters and allocating them to the<br />
middle node, as illustrated in Fig. 1. In this figure, the<br />
second node of the SOFM network has dragged some of the<br />
data points from the first cluster and has generated a new but<br />
redundant class. The tree structure of the SOTM, however, is<br />
successful in determining the high-density regions.<br />
The problem with the SOTM algorithm is twofold: it<br />
unsuitably decides on the relevant number of classes, and it<br />
often loses track of the true query position. The decision on<br />
which clusters are relevant in the SOTM is postponed until<br />
after the algorithm has converged. This is because there is<br />
no innate controlling process available for the algorithm to<br />
influence cluster generation around the query centre (the<br />
SOTM clusters entirely independently). Losing the sense of<br />
the query location within the input space can have an undesired<br />
effect on the true structure of the relevant class and can<br />
force the SOTM algorithm to spawn new clusters and form<br />
unnecessary boundaries within the query class, as<br />
illustrated in Fig. 2. In this figure, the SOTM forms a<br />
boundary near the query, contaminating relevant samples,<br />
whereas some supervision is maintained in the DSOTM<br />
case, preventing unnecessary boundaries from forming.<br />
Therefore, retaining some degree of supervision on<br />
cluster generation around the query class seems to be vital.<br />
Due to the limitations of the SOTM, we propose the<br />
Directed SOTM (DSOTM) algorithm in this work. In the<br />
DSOTM algorithm, the decision on the association of an input<br />
pattern with the query image is made gradually, as each sample is<br />
presented to the system. It also contains a controlling<br />
mechanism that keeps track of the query centre by forcing<br />
the centre of the relevant class to remain in the vicinity of the<br />
query position. Therefore, it can dynamically control the<br />
generation of new centres and can determine the relevance<br />
of input samples, with respect to the query, as the tree<br />
structure grows. On the other hand, it limits the synaptic<br />
vector adjustments according to its reinforced learning rules<br />
and constrains cluster generation by preventing the<br />
spawning of redundant centres around the query position,<br />
since this part of the map is already occupied by the relevant<br />
class centre.<br />
The DSOTM algorithm is summarized as follows:<br />
Initialization: Choose a root node, {wj}, from the<br />
available set of input vectors, {xk}, in a random manner.<br />
J is the total number of centroids (initially set to 1) and K is<br />
the total number of inputs (i.e. images);<br />
Fig. 2. Two-dimensional mapping: (Left) input pattern with 5 distinct clusters, (Middle) 14 centres are generated using SOTM, and (Right) 5 centres are<br />
generated using DSOTM. Over-classification around the query (triangle) will result in erroneous relevance identification.<br />
Similarity Measurement: Randomly select a new data point,<br />
x, and find the best-matching (winning) centroid, j*, by<br />
minimizing a predefined Euclidean distance criterion in (1).<br />
Updating: If the winning centroid is the closest<br />
centre to the query, then mark x(t) as a relevant sample and<br />
update its centroid (winning neuron) toward the query<br />
position according to the degree of resemblance of the<br />
sample. If |x(t) − wj*(t)| ≤ H(t), where H(t) is the<br />
hierarchy function used to control the levels of the tree and<br />
decays exponentially over time from its initial<br />
value, H(t0) > α, according to H(t+1) = λ·H(t)·exp(−t/μ),<br />
where μ = max(t)/log10[H(t)] and λ is the threshold<br />
constant, 0 < λ < 1<br />
Fig. 3: Two-dimensional mapping: (Left) input pattern with 7 distinct clusters, (Middle) 8 centres are generated using Ncut, and (Right) 7 centres are<br />
generated using SONcut. Over-classification around the query (triangle) will result in erroneous classification of the relevant class.<br />
nodes in the input pattern; assoc(A, A) and assoc(B, B) measure the total intra-cluster similarity (association) in A and B; assoc(A, V) is the total connection from nodes in cluster A to all nodes in the graph; and assoc(B, V) is defined similarly. w_pq is a nonnegative weight function measuring the degree of similarity between two samples of the input pattern and is defined as:

w_pq = exp(−d(p, q)/k),   (8)

where d(p, q) is a predefined distance metric (e.g., the Euclidean distance) and k is a user-defined constant that controls the decreasing rate of the weight function; it was empirically set to 0.2. By using this function, the smallest eigenvector remains constant and Ncut can find relatively correct partitions [2].
As Shi and Malik have also discussed, the optimal partitioning (minimum possible Ncut) can be computed by solving a generalized eigenvalue system. The eigenvector with the second smallest eigenvalue of the generalized eigensystem is then used to partition the graph.
In this paper we have used the Ncut algorithm [2] for unsupervised data clustering. The Ncut partitioning method can be applied recursively on the input pattern to generate more than two clusters. Deciding on the maximum number of centres in the input pattern, and hence on when to stop the clustering process, is a challenging problem. In this work, we have integrated the Ncut algorithm with the principles found in DSOTM to automatically estimate the appropriate number of clusters in the input pattern and to set the maximum allowed Ncut accordingly. We call this Ncut algorithm with self-oriented centre detection the Self-Organizing Normalized cuts (SONcut).
The proposed SONcut algorithm is as follows:
Initialization: Choose a root node, {r_n}, from the available set of input vectors, {x_k}, in a random manner. N is the maximum allowed Ncut (initially set to 1) and K is the total number of inputs;
Similarity Measurement: Randomly select a new data point, x, and find the winning centroid, n*, by minimizing a predefined distance criterion in (1);
Maximum Allowed Ncut Estimation: If ||x(t) − r_n*|| > H(t), where H(t) is defined similarly to the hierarchy function used in the DSOTM algorithm, then increment the maximum allowed Ncut by 1;
Continuation: Continue with the Similarity Measurement step until no noticeable changes in the feature map are observed;
Graph Generation: Given the input pattern, set up a weighted graph, G = (V, E), compute the weight, w_pq, on each edge, E_pq, using (8), and create the affinity, W, and diagonal, D, matrices;
Eigensystem Transformation: Solve (D − W)x = λDx for the eigenvectors with the smallest eigenvalues;
Graph Bipartition: Use the eigenvector with the second smallest eigenvalue to bipartition the graph;
Partitioning Continuation: Consider the current partitions for further subdivision, and continue repartitioning until the Ncut value reaches its maximum allowed number.
In summary, we have proposed an unsupervised hierarchical Ncut algorithm that is able to estimate the maximum number of allowed Ncuts by training the algorithm using the principles found in the DSOTM architecture. Thus, by dynamically adapting the Ncut algorithm to the nature of the input pattern, the problem of over-partitioning the relevant class can be prevented. Fig. 3 depicts the importance of adding such a predictive mechanism to the Ncut clustering algorithm and illustrates its effectiveness in avoiding over-classification around the query centre.
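The Graph Generation, Eigensystem Transformation, and Graph Bipartition steps can be sketched as follows. The point set, the exponential weight, and the median split are illustrative choices, not the paper's implementation:

```python
import numpy as np
from scipy.linalg import eigh

def ncut_bipartition(points, k=0.2):
    """One normalized cut: build the affinity W and degree D matrices,
    solve the generalized eigensystem (D - W) x = lambda * D x, and
    threshold the eigenvector with the second smallest eigenvalue."""
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    W = np.exp(-d / k)                  # decaying pairwise weight function
    D = np.diag(W.sum(axis=1))          # diagonal degree matrix
    vals, vecs = eigh(D - W, D)         # generalized eigensystem, eigenvalues ascending
    fiedler = vecs[:, 1]                # eigenvector with the second smallest eigenvalue
    return fiedler > np.median(fiedler) # split the graph at the median value
```

Applying this recursively to each resulting partition, until the maximum allowed Ncut is reached, gives the hierarchical clustering described above.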
Previously in [7], we proposed an automatic CBIR engine<br />
that was structured around an unsupervised learning<br />
algorithm, the DSOTM. To reduce the gap between high-<br />
level concepts (semantics) and low-level statistical features<br />
and to evolve the search process according to what the<br />
system believes to be the significant content within the<br />
query, the above engine was integrated with a process of<br />
feature weight detection using genetic algorithms (GA) as<br />
illustrated in Fig. 4b. In this paper we use a relatively simpler CBIR architecture (see Fig. 4a and Fig. 5) to solely compare the abilities of the proposed hierarchical clustering algorithms for data classification with those of three other techniques: SOTM, SOFM, and Ncut.
[Figure: block diagrams of the two CBIR systems, showing feature extraction, initial and retrieval search units, query/metric adaptation, relevance feedback, and genetic algorithm blocks.]
Fig. 4a: Machine-controlled CBIR system [1]. Fig. 4b: Machine-controlled GA-based CBIR system [7].
[Figure: the Initial Search (Feature Extraction 1), Interface, and Automatic Search (Feature Extraction 2, Unsupervised Classifier, Query Modification) modules operating on the query image and database.]
Fig. 5: Another representation of the CBIR system of Fig. 4a.
The whole idea of the automatic image retrieval process is to unsupervisedly tailor the retrieval process to the user's notion of similarity by utilizing an unsupervised learning technique to perform the required decision making about the relevance of retrieved images instead of the human user.
This automatic refinement mechanism is made possible by adopting a flexible architecture that allows the classifier to learn from, and re-adjust to, the nature of the input pattern to a greater extent using predefined competitive learning algorithms: a method that is capable of giving the network the conforming ability to perform a variety of computational tasks [5].
In Fig. 5, the second unit of the Initial Search module, Feature Extraction 1, deals with calculating features from a high-volume image database. A standard set of content descriptors (an example might be MPEG-7) is extracted in this module to provide a more generic and rapid interface to existing databases based on the chosen standard. The extracted features are used to retrieve the most similar images (in the relative sense) based on a predefined distance metric. The top Q images are then displayed back to the user through an Interface block. Subsequently, the user may request an automatic search, which operates independently. Upon this request, control of the system is switched to the Automatic Search module, wherein another set of features with higher perceptual quality is extracted from the top Q images retrieved by the initial search, using Feature Extraction 2. Although computation of these features can be intensive, this module allows for the use of more proprietary or specialty features, and may enhance perceptual discrimination beyond what is possible through standard features alone. These features are then used as seeds to train the Unsupervised Classifier. A new query, based upon the images selected in the previous iterations, is then formed to substantially represent the relevant class through the Query Modification module.
In this work, Colour Histograms, Colour Moments, Wavelet Moments, and Fourier Descriptors were used in Feature Extraction 1, whereas Hu's seven moment invariants (HSMI) and Gabor Descriptors, accompanied by Colour Histograms and Colour Moments, were used in Feature Extraction 2. Colour histograms and colour moments were computed in the HSV and RGB colour spaces, respectively. Wavelet Moments were extracted from the mean, μ, and standard deviation, σ, of a three-level wavelet transform applied to an image. Boundary-based Fourier shape parameters were extracted by converting the edge parameters from Cartesian to polar coordinates and, subsequently, applying the Fast Fourier Transform (FFT) to obtain the top 10 low-frequency components. Texture features were computed from the μ and σ of Gabor-filtered images to construct 48-dimensional feature vectors, and finally, region-based HSMI shape parameters were extracted by converting the colour images into binary segmented ones and then extracting the shape parameters from those segmented images.
A number of experiments were conducted to compare the behaviour of the automatic CBIR engine of Fig. 5 in the presence of various unsupervised clustering algorithms, namely the Ncut, SONcut, SOFM, SOTM, and DSOTM classifiers. The simulations were carried out using a subset of the Corel image database consisting of nearly 5,100 JPEG
TABLE I
EXPERIMENTAL RESULTS IN TERMS OF RR FOR THE CBIR SYSTEM WITH NO FEATURE WEIGHT DETECTION MECHANISM (FIG. 4)
Classifier^a | Set A | Set B | Set C | Average
Ncut | 41.3% | 47.3% | 51.9% | 46.8%
SOFM | 51.1% | 44.3% | 56.6% | 50.7%
SOTM | 51.5% | 51.1% | 58.4% | 53.7%
SONcut | 51.9% | 52.8% | 57.0% | 53.9%

TABLE II
EXPERIMENTAL RESULTS IN TERMS OF RR FOR THE CBIR SYSTEM WITH GA-BASED FEATURE WEIGHT DETECTION ALGORITHM (FIG. 5) [7]
Classifier^a | Set A | Set B | Set C | Average
Ncut | 67.3% | 64.5% | 65.1% | 65.6%
SOFM | 65.1% | 66.7% | 68.3% | 66.7%
SOTM | 66.8% | 72.1% | 74.4% | 71.1%
SONcut | 68.8% | 72.8% | 73.9% | 71.8%
DSOTM | 78.3% | 76.7% | 80.5% | 78.5%

^a Ncut: Normalized Graph Cuts; SONcut: Self-Organizing Normalized Graph Cuts; SOFM: Self-Organizing Feature Maps; SOTM: Self-Organizing Tree Maps; DSOTM: Directed Self-Organizing Tree Maps.
colour images, covering a wide range of real-life photos, from 51 different categories. Each category consisted of 100 visually associated objects to simplify measurement of the retrieval accuracy (RR) during the experiments. Three sets of 51 images were drawn from the database to form sets A, B, and C. Each set consists of randomly selected images such that no two images were from the same class. Retrieval results were statistically calculated from each of the three sets. In the simulations, a total of 16 most-relevant images were retrieved to evaluate the precision of the retrieval.
The experimental results are illustrated in Table I. In the Ncut, SOFM, and SOTM algorithms, the maximum number of allowed cluster generations was set to P, P < Q, where Q is the total number of images retrieved by the initial search; P was empirically set to 8. A 4×2 grid topology was used in the SOFM structure to locate a maximum of 8 possible cluster centres. A hard decision on the resemblance of the input samples was made: if a sample is closer to one centre than to any other centre, in terms of a predefined distance metric, it belongs to that centre. Table II illustrates the results after feature weight detection using the GA-based method described in [7].
Although the Ncut algorithm is a top-down classification process that aims to extract global impressions of the input pattern and present a hierarchical description of it, employing a predictive mechanism to estimate the true number of clusters prior to spawning new neurons proves to be beneficial. This predictive mechanism enforces a frontier on the classification process and inhibits the generation of unnecessary centres around the query position. As a result, a more accurate impression of relevance can be achieved by using the SONcut algorithm.
The SOTM algorithm not only extracts global intuitions of the input pattern, it also introduces some degree of localization into the discriminative process to achieve maximal discrimination at any given resolution (or number of classes). Moreover, the ability of SOTM to span and force
division at the extremes of the data early on, delaying the division of the most similar aspects until later stages of learning, together with its flexible tree-like topology (more plastic than SOFM), makes it essentially sensitive to the most dominant differences in the data and, thus, less prone to classification errors and more attractive for retrieval applications.
Despite all the above advantages of using SOTM-based classifiers, retaining some degree of supervision to prevent unnecessary boundaries from forming around the query class seems to be crucial. The DSOTM algorithm not only provides partial supervision of cluster generation by forcing divisions away from the query class, it also makes a soft decision on the resemblance of the input patterns by constantly modifying each sample's membership during the learning phase of the algorithm. As a result, a more robust tree structure as well as a better sense of likeness can be achieved.
V. CONCLUSION<br />
The framework for a novel unsupervised clustering algorithm based on DSOTM was introduced in this work. A modification to the current structure of the Ncut algorithm was also proposed. This modification provides a priori knowledge for the algorithm, based on principles found in DSOTM, to determine the appropriate number of clusters prior to its hierarchical clustering operation. The performance of the proposed methods was compared with that of other conventional clustering methods (i.e., Ncut, SOFM, and SOTM) using a computer-controlled CBIR system.
SOTM outperforms both Ncut and SOFM and performs<br />
fairly close to SONcut even with its blind top-down data<br />
exploration. This is due to its flexible tree-shape structure as<br />
well as its competitive learning algorithm that injects some<br />
degree of localization into the discriminative process. The<br />
experimental results also illustrate the promising performance of utilizing DSOTM in the structure of automatic CBIR engines.
REFERENCES
[1] P. Muneesawang and L. Guan, "Minimizing user interaction by automatic and semiautomatic relevance feedback for image retrieval," Proc. IEEE Int. Conf. on Image Processing, Rochester, USA, vol. 2, pp. 601-604, Sept. 2002.
[2] J. Shi and J. Malik, "Normalized cuts and image segmentation," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 22, no. 8, pp. 888-905, Aug. 2000.
[3] H. S. Kong, "The Self-Organizing Tree Map and its Applications in Digital Image Processing," Ph.D. Thesis, University of Sydney, Australia, 1998.
[4] T. Kohonen, "The self-organizing map," Proc. of the IEEE, vol. 78, no. 9, pp. 1464-1480, Sept. 1990.
[5] S. Haykin, Neural Networks: A Comprehensive Foundation, 2nd ed., Prentice Hall, Inc., 1999.
[6] J. Randall, L. Guan, X. Li, and W. Zhang, "Investigations of the self-organizing tree map," Proc. of 6th International Conference on Neural Information Processing, vol. 2, pp. 724-728, Nov. 1999.
[7] K. Jarrah, M. Kyan, S. Krishnan, and L. Guan, "Computational intelligence techniques and their applications in content-based image retrieval," IEEE Int. Conf. on Multimedia & Expo (ICME), submitted for publication, 2006.
1424403677/06/$20.00 ©2006 IEEE
ICME 2006
DISCRETE POLYNOMIAL TRANSFORM FOR DIGITAL IMAGE WATERMARKING<br />
APPLICATION<br />
Lam Le, Sridhar Krishnan, and Behnaz Ghoraani<br />
Dept. of Elec. & Comp. Eng., <strong>Ryerson</strong> <strong>University</strong>, Toronto, Canada<br />
E-mail: {lle, krishnan}@ee.ryerson.ca and bghoraan@rnet.ryerson.ca
ABSTRACT<br />
In this study, we propose a new way to detect image watermark messages modulated as linear chirp signals. The spread spectrum image watermarking algorithm embeds linear chirps as watermark messages. The phase of the chirp represents the watermark message, such that each phase corresponds to a different message. We extract the watermark message using a phase detection algorithm based on the Discrete Polynomial Phase Transform (DPT). The DPT models the signal as a polynomial and uses an ambiguity function to estimate the signal parameters. The proposed method not only detects the presence of the watermark, but also extracts the embedded watermark bits and ensures the message is received correctly. The robustness of the proposed detection scheme has been evaluated using the checkmark benchmark attacks, and we found a guaranteed maximum bit error rate of 15% below which the watermark message is correctly detected using the DPT.
Keywords: Image Watermarking, Spread Spectrum, Data Hiding, Discrete Polynomial Phase Transform, Hough-Radon Transform, Chirp Modulation
1. INTRODUCTION<br />
Chirp signals are ubiquitous in many areas of science and engineering, so the Discrete Polynomial Phase Transform (DPT) [1][2] has been extensively studied in recent years as a means of estimating the phase parameters of chirp signals. One of the recent applications of chirp signals is in data watermarking.
The huge success of the Internet allows for the transmission, wide distribution, and access of electronic data in an effortless manner. Content providers are therefore faced with the challenge of protecting their electronic data, and watermarking is one possible solution: multimedia data creators and distributors are able to prove ownership of intellectual property rights without forbidding other individuals from copying the multimedia content itself. In
this study, we propose a chirp-based detection method to detect<br />
watermark messages in an image watermarking scheme<br />
[3][4] which embeds linear chirps as imperceptible and statistically<br />
undetectable watermark messages. Different chirp<br />
rates, i.e., phases, represent watermark messages such that<br />
each phase corresponds to a different message. The narrowband<br />
watermark messages are spread with a watermark key<br />
(PN sequence) across a wider range of frequencies before embedding.<br />
The resulting wideband noise is added to the perceptually<br />
significant regions of the original image. We use<br />
the block-based discrete cosine transform (DCT) scheme for<br />
inserting the watermark. As a result of image manipulations, some message bits extracted by the detector may be in error, potentially resulting in the detection of the wrong watermark message. The proposed watermarking detection algorithm detects the presence of the watermark and extracts the embedded watermark message bits even in the presence of bit errors in the received watermark message. Our motivation for using the DPT technique as the watermark detector is to achieve high estimation accuracy with less computational complexity.
2. DISCRETE POLYNOMIAL PHASE TRANSFORM<br />
(DPT)<br />
DPT is a parametric signal analysis approach for estimating<br />
the phase parameters of polynomial phase signals. The phase<br />
of many man-made signals such as those used in radar, sonar<br />
and communications can be modeled as a polynomial. The<br />
discrete version of a polynomial phase signal can be expressed<br />
as:<br />
<br />
M<br />
x(n) =b0exp j am(n∆) m<br />
<br />
(1)<br />
m=0<br />
where M is the polynomial order (M =2for chirp signal),<br />
0 ≤ n ≤ N − 1, N is the signal length and ∆ is the sampling<br />
interval. The principle of DPT is as follow. When DPT is<br />
applied to a mono-component signal with polynomial phase<br />
of order M, it produces a spectral line [1] . The position of<br />
this spectral line at frequency ω0 provides an estimate of the<br />
coefficient âM .AfterâM is estimated, the order of the polynomial<br />
is reduced from M to M −1 by multiplying the signal<br />
with exp −jâM (n∆) M . This reduction of order is called<br />
phase unwrapping. The next coefficient âM−1 is estimated<br />
the same way by taking DPT of the polynomial phase signal<br />
of order M − 1 above. The procedure is repeated until all the<br />
coefficients of the polynomial phase are estimated. DPT order<br />
1424403677/06/$20.00 ©2006 IEEE
ICME 2006
M of a continuous phase signal x(n) is defined as the Fourier transform of the higher-order operator DP_M[x(n), τ]:

DPT_M[x(n), ω, τ] ≡ F{DP_M[x(n), τ]} = Σ_{n=(M−1)τ}^{N−1} DP_M[x(n), τ] exp(−jωnΔ),   (2)

where τ is a positive number and:

DP_1[x(n), τ] := x(n)   (3)
DP_2[x(n), τ] := x(n) x*(n − τ)   (4)
DP_M[x(n), τ] := DP_2[DP_{M−1}[x(n), τ], τ]   (5)

The coefficients a_M (a_1 and a_2) are estimated by applying the following formula [1]:

â_M = (1 / (M! (τΔ)^{M−1})) argmax_ω {|DPT_M[x(n), ω, τ]|},   (6)

where

DPT_1[x(n), ω, τ] = F{x(n)},   (7)
DPT_2[x(n), ω, τ] = F{x(n) x*(n − τ)},   (8)

and

â_0 = phase[ Σ_{n=0}^{N−1} x(n) exp(−j Σ_{m=1}^{M} â_m (nΔ)^m) ],   (9)

b̂_0 = (1/N) | Σ_{n=0}^{N−1} x(n) exp(−j Σ_{m=1}^{M} â_m (nΔ)^m) |.   (10)

The estimated coefficients are used to synthesize the polynomial phase signal:

x̂(n) = b̂_0 exp( j Σ_{m=0}^{M} â_m (nΔ)^m ).   (11)
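For the chirp case (M = 2), the estimation above reduces to two FFT peak searches separated by a phase-unwrapping step. A minimal numerical sketch follows; the chirp parameters, FFT grid size, and lag τ used here are illustrative assumptions:

```python
import numpy as np

def dpt_chirp_estimate(x, delta, tau):
    """Estimate (a1, a2) of x(n) = exp(j*(a1*(n*delta) + a2*(n*delta)**2))
    with the DPT for M = 2: one FFT peak gives a2; phase unwrapping and a
    second FFT peak give a1."""
    N = len(x)
    nfft = 1 << 16
    w = 2 * np.pi * np.fft.fftfreq(nfft)          # frequency grid in rad/sample
    # DP2[x, tau] = x(n) * conj(x(n - tau)) is a pure tone at 2*a2*tau*delta^2
    dp2 = x[tau:] * np.conj(x[:-tau])
    a2_hat = w[np.argmax(np.abs(np.fft.fft(dp2, nfft)))] / (2 * tau * delta**2)
    # phase unwrapping: remove the quadratic term, then a1 is a plain frequency
    n = np.arange(N)
    x1 = x * np.exp(-1j * a2_hat * (n * delta) ** 2)
    a1_hat = w[np.argmax(np.abs(np.fft.fft(x1, nfft)))] / delta
    return a1_hat, a2_hat
```

Because each stage is a single FFT followed by a peak search, the cost is O(N log N) per coefficient, which is the source of the speed advantage reported later in the paper.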
3. CHIRP-BASED WATERMARKING<br />
The watermarking method used in this study is a linear-chirp-based technique applied to image and audio signals [3][4]. The chirp signal x(t) (or m) is quantized to the values −1 and 1, giving m^q, which is then embedded into the multimedia files. The quantization process introduces harmonics in the time-frequency representation, but the slope of the quantized chirp is the same as that of the chirp signal x(t). The details of watermark embedding and extraction follow.
3.1. Watermark embedding
Each bit m_k^q of m^q is spread with a cyclically shifted version, p_k, of a binary PN sequence called the watermark key. The results are summed together to generate the wideband noise vector w:

w = Σ_{k=0}^{N−1} m_k^q p_k,   (12)

where N is the number of watermark message bits in m^q.
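The spreading of (12), and the correlation-based despreading used later at the detector, can be sketched together. The PN length and message bits below are illustrative, not the paper's parameters:

```python
import numpy as np

def spread(bits, pn):
    """Spread each +/-1 message bit with a cyclically shifted copy of the
    PN key and sum the results into one wideband noise vector w (eq. 12)."""
    return sum(b * np.roll(pn, k) for k, b in enumerate(bits))

def despread(w_hat, pn, n_bits):
    """Recover each bit from the sign of the correlation <w_hat, p_k>."""
    return np.array([1 if np.dot(w_hat, np.roll(pn, k)) > 0 else -1
                     for k in range(n_bits)])
```

Because distinct cyclic shifts of a long random PN sequence are nearly orthogonal, each correlation is dominated by the single bit spread with that shift.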
The wideband noise w is then carefully shaped and added to the audio signal or to the DCT blocks of the image so that it causes an imperceptible change in signal quality. In the audio watermarking application, to make the watermark message imperceptible, the amplitude level of the wideband noise w is scaled down to about 0.3 of the dynamic range of the signal. In the image watermarking application, the length of w to be embedded depends on the perceptual entropy of the image. To embed the watermark into the image, a model based on the just-noticeable-difference (JND) paradigm was utilized. The JND model based on the DCT was used to find the perceptual entropy of the image and to determine the perceptually significant regions in which to embed the watermark. In this method, the image is decomposed into 8×8 blocks. Taking the DCT of block b results in the matrix X_{u,v,b} of DCT coefficients.
The watermark encoder for the DCT scheme is described as

X*_{u,v,b} = { X_{u,v,b} + t^C_{u,v,b} w_{u,v,b},  if X_{u,v,b} > t^C_{u,v,b};
               X_{u,v,b},                           otherwise,   (13)

where X_{u,v,b} refers to the DCT coefficients, X*_{u,v,b} refers to the watermarked DCT coefficients, w_{u,v,b} is obtained from the wideband noise vector w, and the threshold t^C_{u,v,b} is the computed JND determined for various viewing conditions such as minimum viewing distance, luminance sensitivity, and contrast masking. Fig. 1 shows the block diagram of the described watermark embedding scheme.
[Figure: block diagram — the original image x_{i,j} passes through a block-based DCT and JND calculation; the linear chirp message m^q is modulated with cyclically shifted PN sequences p_k to form the wideband noise w, which the watermark insertion block adds to the DCT coefficients X_{u,v} to produce the watermarked image X*_{u,v}.]
Fig. 1. Watermark embedding scheme.
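The encoder rule of (13), and the matching per-coefficient decoder, can be sketched per 8×8 block. The constant JND threshold used here is a placeholder rather than the viewing-condition model the paper computes:

```python
import numpy as np

def embed_block(X, t_jnd, w):
    """Watermark encoder of eq. (13): add t*w only where the DCT
    coefficient exceeds its JND threshold; leave the rest untouched."""
    mask = X > t_jnd
    X_star = X.copy()
    X_star[mask] = X[mask] + t_jnd[mask] * w[mask]
    return X_star, mask

def extract_block(X, X_star, t_jnd):
    """Matching decoder: recover the wideband noise from the coefficient
    difference, scaled back by the JND threshold; zero elsewhere."""
    mask = X > t_jnd
    w_hat = np.zeros_like(X)
    w_hat[mask] = (X_star[mask] - X[mask]) / t_jnd[mask]
    return w_hat
```

Restricting the insertion to coefficients above the JND threshold is what keeps the watermark in the perceptually significant regions while leaving low-energy coefficients untouched.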
3.2. Watermark detection
Fig. 2 shows the block diagram of the watermark decoding scheme. The detection scheme for the DCT-based watermarking can be expressed as

ŵ_{u,v,b} = (X̂*_{u,v,b} − X_{u,v,b}) / t^C_{u,v,b},   (14)

ŵ = { ŵ_{u,v,b},  if X_{u,v,b} > t^C_{u,v,b};
      0,          otherwise,   (15)

where X̂*_{u,v,b} are the coefficients of the received watermarked image and ŵ is the received wideband noise vector. Due to intentional and non-intentional attacks such as lossy compression, shifting, and down-sampling, the received chirp message m̂^q will differ from the original message m^q by a bit error rate (BER). We use the watermark key, p_k, to despread ŵ, and integrate the resulting sequence to generate a test statistic ⟨ŵ, p_k⟩. The sign of the expected value of this statistic depends only on the embedded watermark bit m_k^q. Hence the watermark bits can be estimated using the decision rule:

m̂_k^q = { +1, if ⟨ŵ, p_k⟩ > 0;
           −1, if ⟨ŵ, p_k⟩ < 0.   (16)

Fig. 2. The proposed watermark detection scheme.
We repeat the bit estimation process until we have an estimate of all the transmitted bits. Though it is possible to form an estimate of the chirp sequence from the received bits, we improve the robustness of the detection algorithm by detecting the chirp using the Discrete Polynomial Phase Transform (DPT), a phase detection algorithm.
3.3. DPT-based watermark estimation method
The embedded watermarks in this algorithm are linear chirps, and the received watermark can be represented as

x(n) = exp( j(a_1(nΔ) + a_2(nΔ)^2) ).   (17)

Since the DPT method is able to estimate the polynomial coefficients of chirp signals with a very short computation time,
[Figure: two panels, each overlaying the original and estimated quantized chirp watermarks over about 160 samples on a ±1 amplitude scale.]
Fig. 3. Original and estimated watermark at (a) BER of 13.6% and correlation coefficient of 0.9891, and (b) BER of 19.3% and correlation coefficient of 0.9516.
we apply the DPT to estimate the a_1 and a_2 coefficients. Fig. 3 shows the original and estimated watermark messages at bit error rates of 13.6% and 19.3%; the correlation coefficients between the original and estimated watermarks are 0.9891 and 0.9516, respectively. Our computer simulations show that the required calculation time is 630 times shorter than that of other, similar chirp watermark detection schemes [3][4][5].
4. RESULTS AND DISCUSSION<br />
We evaluated the proposed scheme using 10 different images<br />
of size 512×512. The sampling frequency fsb of the watermarks<br />
is 1 kHz. Hence the initial and final frequencies,<br />
f0b and f1b, of the linear chirps representing all watermark<br />
messages are constrained to [0, 500] Hz. We embed these<br />
chirps into the images for a chip length of 10000 samples.<br />
In our tests, we used a single watermark sequence having 182<br />
message bits. To measure the robustness of the watermarking<br />
algorithm, we performed the attacks specified in the checkmark<br />
benchmark [6]. Table 1 shows the watermark<br />
detection results for ten watermarked images after performing<br />
these attacks.<br />
The numbers in the brackets under category ‘Attack’ represent<br />
the number of attacks in that particular class of attacks.<br />
The ‘Detection Average’ represents the percentage of attacks<br />
for which the watermark is detected under each class of attacks.<br />
We make a decision on the correct detection of the watermark<br />
based on the correlation between the estimated chirp<br />
and the embedded watermark. Experimentally, the threshold<br />
for the correlation coefficient is set to 0.9. The maximum BER<br />
for the MAP attack with a 100% detection rate is 15%, and for the<br />
JPEG attack, in which the maximum BER is 19.9%,<br />
the DPT detected 100% of the watermark messages. Table<br />
2 shows the performance of the DPT-based technique for one of the<br />
images under the specified attacks. The results demonstrate<br />
that the proposed DPT-based scheme reliably estimates<br />
the watermark messages up to a BER of 15%; in many<br />
cases it also detects the watermark up to a BER of 19%.<br />
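The correlation-threshold decision just described can be sketched as below; the function name is illustrative, and 0.9 is the paper's experimentally chosen threshold:

```python
import numpy as np

def watermark_detected(estimated, embedded, threshold=0.9):
    """Declare detection when the correlation coefficient between the
    estimated and embedded chirps exceeds the chosen threshold."""
    r = np.corrcoef(estimated, embedded)[0, 1]
    return r >= threshold
```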
Attack (count)     DPT    HRT<br />
Remodulation (4)    65     57.5<br />
Copy (1)           100     97<br />
MAP (6)            100    100<br />
Wavelet (10)        84     84<br />
JPEG (12)          100     97.5<br />
ML (7)              57     56<br />
Filtering (3)      100    100<br />
Resampling (2)      85     90<br />
Colour Reduce (2)   35     45<br />
Table 1. Detection averages (%) of 10 images for the checkmark<br />
benchmark attacks, for the DPT- and HRT-based methods.<br />
Attack             BER (%)   Correlation (DPT)<br />
dpr(3)              2.84      0.9988<br />
dpr(5)             11.93      0.9954<br />
dprcorr(3)          6.25      0.9871<br />
dprcorr(5)         14.77      0.9916<br />
medfilt(2)          1.70      0.9950<br />
medfilt(3)          3.40      0.9884<br />
medfilt(4)         23.30      0.1355<br />
trimmedmean(3)     13.63      0.9891<br />
trimmedmean(5)     31.82      0.1274<br />
midpoint(3)         3.98      0.9965<br />
midpoint(5)        23.40     −0.0008<br />
dither              6.25      0.9936<br />
thresh             19.31      0.9516<br />
Table 2. Bit error rates and correlation coefficients of the<br />
proposed method for image 1 under the specified attacks.<br />
The performance of the algorithm is compared with a<br />
similar approach based on the Hough-Radon Transform (HRT) [3][4][5].<br />
Table 1 compares the detection results for the DPT- and HRT-based<br />
methods with the same watermarking capacity. The DPT-based<br />
algorithm achieves a higher or equal detection rate in<br />
seven of the nine attack classes, and has lower computational complexity<br />
than the HRT-based method. Typically, the running time of the DPT-based<br />
method is 6000 times less than that of the HRT-based method in<br />
Matlab. The watermarking capacity of the DPT-based technique<br />
depends on the values of the coefficients a1 and a2. As expected,<br />
using a higher resolution for the coefficients a1 and a2 increases<br />
the capacity of the watermarking. However, this also<br />
reduces the robustness of the method. Compared to the previous<br />
HRT-based method, the proposed method has a high<br />
capacity of 182×182; it is also more robust than the HRT-based<br />
method, as indicated in Table 1.<br />
5. CONCLUSION<br />
In this paper, we proposed a watermark detection method for<br />
an image watermarking algorithm that embeds linear<br />
chirps as watermark messages. The watermark message<br />
is added to the perceptually significant regions of the image<br />
to ensure robustness of the watermark to common image<br />
processing attacks. A phase detection algorithm based on the<br />
DPT detects the phase of the watermark message. The proposed<br />
technique detects the chirp message embedded in signals<br />
subjected to different BERs caused by attacks on the image<br />
watermark, and provides fast detection with high accuracy.<br />
Our studies confirm the robustness of the algorithm to the<br />
checkmark benchmark attacks.<br />
6. REFERENCES<br />
[1] S. Peleg and B. Friedlander, "The discrete polynomial-phase<br />
transform," IEEE Transactions on Signal Processing,<br />
vol. 43, no. 8, pp. 1901–1914, 1995.<br />
[2] S. Peleg and B. Friedlander, "Multicomponent signal<br />
analysis using the polynomial-phase transform," IEEE<br />
Transactions on Aerospace and Electronic Systems,<br />
vol. 32, no. 1, pp. 378–387, 1996.<br />
[3] S. Erkucuk, S. Krishnan and M. Zeytinoglu, "Robust<br />
Audio Watermarking Using a Chirp Based Technique,"<br />
IEEE Intl. Conf. on Multimedia and Expo, vol. 2, pp.<br />
513–616, 2002.<br />
[4] A. Ramalingam and S. Krishnan, "A Novel Robust Image<br />
Watermarking Using a Chirp Based Technique,"<br />
IEEE Canadian Conf. on Electrical and Computer Engineering,<br />
vol. 4, pp. 1889–1892, 2004.<br />
[5] R.M. Rangayyan and S. Krishnan, "Feature identification<br />
in the time-frequency plane by using the Hough-Radon<br />
transform," Pattern Recognition, vol. 34,<br />
pp. 1147–1158, 2001.<br />
[6] S. Pereira, S. Voloshynovskiy, M. Madueno, S.<br />
Marchand-Maillet, and T. Pun, "Second Generation<br />
Benchmarking and Application Oriented Evaluation,"<br />
Information Hiding Workshop III, Pittsburgh, PA, USA,<br />
April 2001.<br />
IMPROVING POSITION ESTIMATES FROM A STATIONARY<br />
GNSS RECEIVER USING WAVELETS AND CLUSTERING<br />
Mohammad Aram, Baice Li, Sridhar Krishnan, Alagan Anpalagan<br />
Ryerson University Multipath Mitigation (RUMM) Lab<br />
Department of Electrical and Computer Engineering, Ryerson University, Toronto, ON, M5B 2K3<br />
{maram, bli, krishnan, alagan}@ee.ryerson.ca<br />
Bern Grush<br />
Applied Location Corporation, 34 Dodge Rd, Toronto, ON, M1N 2A7; bgrush@appliedlocation.com<br />
Abstract<br />
Many positioning applications utilize global navigation<br />
satellite systems (GNSS) derived position estimates for<br />
stationary positions. Inexpensive navigation-grade receivers<br />
provide estimates within a few meters in relatively open skies,<br />
while more specialized devices, typically distinguished by<br />
specialized antenna design and additional post processing can<br />
achieve sub-meter accuracy. These latter devices can be two<br />
orders of magnitude more expensive than navigation-grade<br />
receivers but are still subject to measurement error due to<br />
severe multipath in built-up areas.<br />
In our experiments we post-process positions computed by<br />
an inexpensive receiver by applying wavelet filtering followed<br />
by clustering and characterization. This produces a reliable<br />
and significant reduction in the variance of the estimate, a<br />
normalization of the data scatter-distribution and a<br />
characterization of the estimate that is amenable to a wider<br />
range of statistical comparisons and tests than would be<br />
possible for unfiltered, highly non-Gaussian distributions,<br />
especially as occur in urban canyon circumstances.<br />
Keywords. GNSS; Urban canyon; Multipath mitigation;<br />
Wavelets; k-means; RAIM<br />
1. Introduction<br />
Ongoing developments in GNSS space segment (Galileo and<br />
GPS modernization) are poised to provide significantly more<br />
and better ranging signals for positioning applications. Recent<br />
innovation in high-sensitivity receiver technology (HSGNSS)<br />
enables the acquisition of attenuated and obstructed signals.<br />
These additional signals dramatically lower the probability of a<br />
gap [5,6,7] (loss of lock on enough signals to compute a<br />
position) in challenging signal environments such as in "urban<br />
canyon", heavy foliage, indoors, etc. While inertial navigation<br />
may fill in those gaps in dynamic applications (navigation,<br />
logistics tracking), it cannot help stationary or near-stationary<br />
applications such as survey, E911, asset or personnel location,<br />
or metered parking.<br />
HSGNSS signal measurements are biased and especially<br />
noisy due to excessive multipath and low-power signals [2].<br />
Taken together, GPS modernization, Galileo and HSGNSS<br />
mean the potential for many more applications, but<br />
generally in harsh signal environments. Specific noise sources<br />
are entirely dependent on conditions local to the antenna of the<br />
IEEE CCECE/CCGEI, Ottawa, May 2006<br />
receiver in question and are not addressable by augmentation<br />
such as differential GPS (DGPS) or wide area augmentation<br />
systems (WAAS), or broad-area correction, such as atmospheric,<br />
etc. Even traditional receiver autonomous integrity<br />
monitoring (RAIM) has diminished utility since it was<br />
developed for signal environments with an assumption of zero<br />
or one fault in a field of 5 to 11 signals. We can now project<br />
near-future, integrated GPS/Galileo applications with 4 to 22<br />
signals in harsh environments where many or all signals are<br />
disturbed.<br />
To tackle these harsher signal environments, new antenna<br />
designs [2] and new fault detection and elimination techniques<br />
(FDE) that extend RAIM approaches [6,7,8] are being<br />
developed. Specialized antennas add system costs and the FDE<br />
techniques are computationally complex so that they may be<br />
impractical for larger signal sets.<br />
This paper describes an alternate approach: a process that<br />
includes wavelet filtering, weighted clustering and<br />
characterization of position estimates from a stationary<br />
receiver. This approach results in reduced variance of the<br />
estimate and a normalization of the data-scatter which, in turn,<br />
provides an inexpensive method for applications that require<br />
accuracy of 1-2m for short-dwell readings (under ten minutes)<br />
in many multipath circumstances. As space segment<br />
improvements (Galileo, GPS modernization) and receiver<br />
design improvements (high sensitivity) continue to come on-stream,<br />
multipath mitigation such as we propose here tends to<br />
reduce the relative difference in accuracy between open skies<br />
and urban canyon.<br />
The next section of this paper describes our experimental<br />
methods, including data collection and processing algorithms.<br />
The third section describes and demonstrates our results for<br />
each of wavelet filtering, windowing and clustering.<br />
2. Experimental Methods<br />
2.1. Data Collection<br />
To support a variety of experiments, we gathered street-level,<br />
urban canyon, carrier phase and position data at multiple<br />
locations in downtown Toronto (Canada). Four sites were<br />
selected to represent distinct levels of urban canyon effects<br />
ranging from moderate to extreme multipath interference. At<br />
each location we collected ten 15-minute samples over five<br />
sidereal days, for a total of forty 900-second data sets.<br />
Figure 1 shows the data collection setup we used.<br />
For this particular experiment, we simply used the 3D<br />
position estimates generated by the receiver without<br />
consideration for outlier removal. Fig 2 shows a typical sample<br />
showing high positioning variability. Fig 3 demonstrates that<br />
even the geometric mean of a 15-minute sample can be highly<br />
variable in severe multipath. We wish to mitigate both forms of<br />
variability.<br />
Figure 1: Data collection equipment consisted of: u-blox TM-LP 15<br />
(not HS) evaluation kit with u-center ANTARIS software and a<br />
laptop. An active antenna was mounted on a portable antenna mount<br />
1 m above the ground. We did not use an external ground plane.<br />
[Figure 2 scatter plot: northing deviation (m) vs. easting deviation (m).]<br />
[Figure 3 plot: Easting deviation of geometric means of 15-min samples (m).]<br />
Figure 3: We sampled the same locations in 15-minute samples over<br />
5 sidereal days. The geometric mean of each of these samples can<br />
drift considerably. At this location, a spread of about 40 meters in<br />
both Easting and Northing is apparent over the 10 samples taken.<br />
2.2. Processing Algorithms<br />
Our process comprises two fundamental steps: filtering<br />
using wavelet analysis, and an inverse-variance weighted<br />
estimate of the mean position using either a moving window or<br />
a k-means algorithm to cluster the data.<br />
[Block diagram: data from receiver → wavelet filtering → moving-window variance weighting or k-means variance weighting → Gaussified data scatter with lower variance.]<br />
Since multipath error is a time varying process, wavelet<br />
analysis can be used effectively to mitigate its effects. We<br />
tested various wavelets including Daubechies, Coiflets,<br />
Symlets, Morlet, and Meyer. Although the results from<br />
Symlets and Daubechies were very similar, the analysis was<br />
carried out using the 'Daubechies order 7 (db7)' filter and<br />
wavelet coefficients were modified based on thresholding [4].<br />
Outlier removal was applied to the wavelet output by<br />
excluding all points exceeding 3σ from the mean of the filtered<br />
data (where σ is the standard deviation).<br />
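A toy version of this denoise-then-trim step is sketched below. It assumes a one-level Haar transform with Donoho soft thresholding in place of the paper's multi-level db7 filter, plus 3σ trimming; it is an illustration of the idea, not the authors' code:

```python
import statistics

def haar_denoise(x, thresh):
    """One-level Haar DWT, soft-threshold the detail coefficients,
    then invert. Assumes len(x) is even."""
    s2 = 2 ** 0.5
    approx = [(a + b) / s2 for a, b in zip(x[::2], x[1::2])]
    detail = [(a - b) / s2 for a, b in zip(x[::2], x[1::2])]
    # Soft thresholding (Donoho): shrink detail coefficients toward zero
    detail = [max(abs(d) - thresh, 0.0) * (1 if d >= 0 else -1) for d in detail]
    out = []
    for a, d in zip(approx, detail):
        out += [(a + d) / s2, (a - d) / s2]
    return out

def remove_outliers(x, k=3.0):
    """Drop points farther than k standard deviations from the mean."""
    m = statistics.fmean(x)
    s = statistics.pstdev(x)
    return [v for v in x if abs(v - m) <= k * s]
```

With the threshold at zero the transform reconstructs the input exactly, which makes the round-trip easy to check.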
Rather than simply computing the geometric mean of the<br />
wavelet filtered data as a new position estimate, a subsequent,<br />
independent process was applied to the position data output<br />
from the wavelet filter. Noting that the variance of positioning<br />
data, especially in urban canyon, is non-stationary (varies with<br />
time), we reasoned that weighting each datum inversely with<br />
its local variance would tend to suppress the contribution to the<br />
mean estimated position from high-velocity data segments.<br />
Such data segments can be caused by a satellite rising or<br />
falling at the horizon or changing from line-of-sight to non-line-of-sight<br />
multipath (or vice versa) and other biasing effects.<br />
Figure 2: 15 minutes of data collected with the equipment in Fig 1.<br />
This is typical of about half of our street-level readings in downtown Toronto.<br />
We tried two ways of estimating local variance: temporal<br />
and spatial. Temporal variance weighting is easily achieved by<br />
computing the variance of short temporal data segments<br />
(windows) and then by inversely weighting the local means of<br />
those temporal windows relative to their local variance.<br />
Spatial variance weighting can be achieved via spatial data<br />
clustering. Over a 15 minute sample in a harsh signal<br />
environment one can observe the spatial non-stationarity of the<br />
position estimate as two or more clusters of points in the<br />
scatter (fig 4). If we use a statistical clustering algorithm, such<br />
as k-means, we would tend to group spatially similar estimates<br />
regardless of whether they are temporally adjacent. The mean<br />
of each such cluster can then be weighted by the inverse of its<br />
variance. k-means is more computationally intensive than a<br />
moving temporal window, but it can be expected to perform<br />
somewhat better. This is because a cluster is unlikely to span a<br />
positioning discontinuity, while a temporal window is more<br />
likely to do so.<br />
3. Experimental Results<br />
3.1. Wavelet filtering<br />
The effect of our wavelet filtering was always to reduce<br />
variance (fig 4) and often to Gaussify a sample (normalize its<br />
data scatter) by reducing both skew and kurtosis (table 1).<br />
The output of this wavelet filtering is consistent: lower<br />
variance (fig 5) and Gaussified data (fig 6), centered very<br />
nearly at the same geometric mean. However we know that the<br />
mean itself "wanders" over time (fig 3), so that the first<br />
moment still retains a bias effect that we now wish to reduce.<br />
3.2. Windowing<br />
Recognizing that the variance process for these data sets is<br />
non-stationary, we wish to weigh more heavily data segments<br />
that are momentarily stationary (low instantaneous velocity)<br />
and weigh less heavily data segments that exhibit high<br />
instantaneous velocity. While such a process cannot<br />
necessarily discriminate between multipath and non-multipath<br />
contaminated data, it does take advantage of the fact that<br />
ranging signals undergoing rapid change in multipath<br />
circumstances exhibit more high-velocity bursts, hence we can<br />
reduce the impact of these momentary data subsets for a<br />
stationary receiver.<br />
For our temporal windowing process, we experimented with<br />
several window lengths and window overlaps. We report here<br />
using windows of 20 seconds that overlap by 10 seconds. We<br />
then inversely weighted each local mean by the local variance<br />
and computed a new weighted mean for the full sample as:<br />
x̄ = ( Σ_{i=1}^{N} x̄_i / σ_i² ) / ( Σ_{i=1}^{N} 1 / σ_i² )<br />
This process has the effect of causing a set of means from a<br />
single location to show reduced scatter. In other words, this<br />
process tends to remove noise from the mean of a 15-minute<br />
data set collected in urban canyon (fig 7).<br />
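The windowed inverse-variance weighting can be sketched as follows: a 1-D sketch assuming one sample per second, so a window/hop of 20/10 samples mirrors the paper's 20 s / 10 s choice:

```python
import statistics

def inv_variance_mean(samples, win=20, hop=10):
    """Inverse-variance weighted mean over overlapping windows:
    each window's local mean is weighted by 1/variance."""
    num = den = 0.0
    for start in range(0, len(samples) - win + 1, hop):
        w = samples[start:start + win]
        v = statistics.pvariance(w) or 1e-12   # guard against zero variance
        num += statistics.fmean(w) / v
        den += 1.0 / v
    return num / den
```

A low-variance segment dominates the weighted mean, so a quiet stretch near the true position pulls the estimate toward itself even when a noisy stretch sits far away.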
[Figure 7 plot: Northing (m) vs. Easting (m).]<br />
Figure 7: Shows the relative shift in final position estimates for all 40<br />
15-minute datasets in our experiment. There is one black and one red<br />
ellipse representing the 3σ bounds for each of four locations, with 10<br />
means each calculated from 900 per-second samples for each<br />
location. The black points and ellipses are for the wavelet output and<br />
the red are the same for the output after the windowing process.<br />
3.3. k-means clustering<br />
When examining raw GPS data plots, especially the noisier<br />
ones, one often sees two or more clusters of data<br />
connected by sparse, high-velocity segments. We reasoned that<br />
if we could isolate those clusters, calculate local means and<br />
once again weight them by their inverse variance we would see<br />
an even greater improvement in the ability to reduce the spread<br />
in the geometric means, sample-over-sample.<br />
To do this, we applied a k-means clustering algorithm<br />
(k=15), calculated the mean and variance for each of these 15<br />
clusters and computed a weighted mean for the entire dataset,<br />
as before.<br />
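This clustering step can be sketched as below, assuming plain Lloyd's k-means (the paper does not specify the initialization) and a scalar pooled variance per cluster:

```python
import numpy as np

def kmeans_weighted_mean(points, k=15, iters=50, seed=0):
    """Cluster 2-D position fixes with Lloyd's k-means, then combine
    cluster means weighted by inverse cluster variance (Sec. 3.3 idea).
    Illustrative sketch only."""
    rng = np.random.default_rng(seed)
    pts = np.asarray(points, float)
    centers = pts[rng.choice(len(pts), size=k, replace=False)]
    for _ in range(iters):
        # Assign each point to its nearest center, then move the centers
        d = np.linalg.norm(pts[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = pts[labels == j].mean(axis=0)
    num, den = np.zeros(pts.shape[1]), 0.0
    for j in range(k):
        c = pts[labels == j]
        if len(c) == 0:
            continue
        var = c.var() + 1e-12          # pooled scalar variance of the cluster
        num += c.mean(axis=0) / var
        den += 1.0 / var
    return num / den
```

Because a tight cluster has a far smaller variance than a diffuse one, its mean dominates the final estimate, which is exactly the behavior the paper exploits.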
The overall result of this latter approach (fig 8) provides a<br />
further improvement over the windowing approach (fig 7),<br />
reducing the variance in the "wandering means" (fig 3). By<br />
examining the concentric black-red 3σ ellipses one can see a<br />
reduction of 20-35%.<br />
4. Conclusions<br />
Wavelet filtering can be used to reduce variance, skew and<br />
kurtosis in GPS position data collected by a stationary receiver.<br />
Temporal windowing and spatial clustering of that output<br />
can be used to further reduce data biases in urban canyon that<br />
tend to make even aggregated mean estimates "wander" about<br />
their true position.<br />
These experiments, while successful, leave considerable<br />
room for refinement. Future work includes: setting dynamic<br />
thresholds for wavelet filtering, non-linear treatment of the<br />
inverse-weighting for the moving windows, determining k<br />
dynamically for the k-means algorithm, or using fuzzy c-means.<br />
Indeed, the fixed, 15-minute sampling period of this<br />
experiment can also be dynamic allowing greater accuracy<br />
when time/cost permits and more rapid results in locations of<br />
modest multipath.<br />
[Figure 8 plot: Northing (m) vs. Easting (m).]<br />
Figure 8: Shows the same information as in fig 7, except that the red<br />
data and ellipses represent the output after the k-means process. It is<br />
evident in individual results, as in these summary plots, that k-means<br />
out-performs the moving window process in our experiment.<br />
Acknowledgements<br />
This work of <strong>Ryerson</strong> <strong>University</strong> Multipath Mitigation<br />
(RUMM) labs was supported by <strong>Ryerson</strong> <strong>University</strong> (Toronto,<br />
Canada) and a grant from the Natural Sciences and<br />
Engineering <strong>Research</strong> Council of Canada (NSERC).<br />
References<br />
[1] M. Aram, S. Krishnan, A. Anpalagan, and B. Grush, "Wavelet<br />
<strong>Analysis</strong> and Data Processing of GPS <strong>Signal</strong>s for High Precision<br />
Position Computation", unpublished.<br />
[2] T. H. D. Dao, H. Kuusniemi, and G. Lachapelle, "HSGPS<br />
Reliability Enhancements Using a Twin-Antenna System",<br />
Proceedings of The European Navigation Conference GNSS<br />
2004, Rotterdam, 17-19 May 2004.<br />
[3] I. Daubechies, "Ten Lectures on Wavelets", CBMS-NSF Regional<br />
Conference Series in Applied Mathematics, vol. 91, SIAM,<br />
Philadelphia, 1992.<br />
[4] D. L. Donoho, "De-noising by Soft-Thresholding" IEEE Trans.<br />
on Information Theory, Volume 41, Issue 3, May 1995 p.613.<br />
[5] Y. Feng, "Predictions Using GPS with a Virtual Galileo Constellation<br />
- Future GNSS Performance", GPS World, March 2005<br />
[6] H. Kuusniemi, "User-Level Reliability and Quality Monitoring<br />
in Personal Satellite Navigation", PhD Thesis, Tampere<br />
<strong>University</strong> of Technology, Finland, 2005.<br />
[7] A. Morrison, S. Krishnan, A. Anpalagan, and B. Grush,<br />
"Receiver Autonomous Mitigation of GPS Non Line-of-Sight<br />
Multipath Errors", Institute of Navigation, National Technical<br />
Meeting (ION-NTM) 2006<br />
[8] R. Puri, A. El Kaffas, A. Anpalagan, S. Krishnan, and B. Grush,<br />
"Multipath Mitigation of GNSS Carrier Phase <strong>Signal</strong>s for an On-<br />
Board Unit for Mobility Pricing," CCECE, 2004.<br />
KEYSTROKE IDENTIFICATION BASED ON GAUSSIAN MIXTURE MODELS<br />
Danoush Hosseinzadeh, Sridhar Krishnan, April Khademi<br />
Department of Electrical and Computer Engineering<br />
<strong>Ryerson</strong> <strong>University</strong>, Toronto, ON - M5B 2K3 Canada<br />
Email: (danoushh@hotmail.com) (krishnan@ee.ryerson.ca) (akhademi@ieee.org)<br />
ABSTRACT<br />
Many computer systems rely on the username and password<br />
model to authenticate users. This method is widely used, yet it<br />
can be highly insecure if a user’s login information has been<br />
compromised. To increase security, some authors have proposed<br />
keystroke patterns as a biometric tool for user authentication;<br />
they can be used to recognize users based on how<br />
they type. This paper introduces a novel method that applies<br />
GMMs to keystroke identification. The major benefit of this<br />
method is the ability to update the user’s model each time he<br />
or she is authenticated. Therefore, as time goes on, each user<br />
model accurately reflects the changes in that user’s keystroke<br />
pattern. Using this method, a false acceptance rate (FAR) and a<br />
false rejection rate (FRR) of approximately 2% were achieved.<br />
However, it should be noted that 50% of the test subjects were<br />
traditional "two-finger" typists, which had a disproportionately<br />
negative impact on the results.<br />
1. INTRODUCTION<br />
Undeniably, computers have become an essential part of daily<br />
life for many people around the world. One of the main reasons<br />
for this trend is that computers allow us to access information<br />
from any part of the globe. Additionally, they allow<br />
us to perform many functions that would otherwise require a<br />
physical presence elsewhere, such as banking, shopping and<br />
personal tasks such as online chatting.<br />
Despite their importance, computer systems are generally<br />
protected with primitive techniques, such as usernames and<br />
passwords. Since passwords can be stolen, accidentally revealed<br />
or even cracked by dictionary programs, there have been<br />
a great number of electronic crimes in recent years. In fact,<br />
reports indicate that in 2002, online retailers lost an estimated<br />
US$1.64 billion in fraudulent sales and an additional<br />
US$1.82 billion in legitimate sales that looked suspicious [1].<br />
To prevent crime and increase security, access should only<br />
be given to the correct users. To achieve this goal, some authors<br />
have suggested the use of keystroke identification as a<br />
method of preventing unauthorized users from accessing a<br />
computer system [2][3][4][5]. Keystroke identification is a<br />
biometric tool based on the principle that every person has<br />
a unique typing pattern, similar to a handwritten signature<br />
[2][5]. Particularly, for regularly typed strings, this pattern<br />
can be very consistent and therefore, it can be effective for<br />
user identification. Furthermore, we argue that a person’s<br />
keystroke pattern would be harder to duplicate than a signature<br />
because an intruder does not have an unlimited number of<br />
tries to practice it. In a commercial system, a user who cannot<br />
successfully log in after a predetermined number of attempts,<br />
could be locked out from the system, therefore, limiting the<br />
intruder’s practice time. Studies have also shown that even<br />
among professional typists there is a great deal of variability<br />
in the keystroke patterns [6]. This makes user forgery very<br />
difficult.<br />
By exploiting these keystroke patterns, we can add an additional<br />
layer of security to the username/password model.<br />
Even if authorized persons reveal their passwords, no unauthorized<br />
user can gain access to the system. This idea has<br />
many internet-based applications, especially for online banking,<br />
email and user account protection, just to name a few. In<br />
fact, we can completely change the username/password security<br />
model to a model which only relies on keystroke patterns.<br />
Aside from increased security, this model would benefit users<br />
because they will not have to remember many different<br />
username/password pairs for different accounts. Also, the possibility<br />
of a user forgetting their password or a user having a<br />
password that is easy to decipher would be reduced.<br />
In this paper, a brief review will be presented on what features<br />
could be extracted from keystroke patterns and under<br />
what conditions good features can be acquired. Also, a new<br />
method for modeling these features based on Gaussian Mixture<br />
Models (GMMs) is proposed. For completeness, a brief<br />
review of GMMs is presented before describing the novel<br />
algorithm used. Lastly, the results and conclusions are presented.<br />
2. KEYSTROKE FEATURES<br />
2.1. Features From Keystrokes<br />
It has been shown that for a given user at least two unique features<br />
can be extracted from keystroke patterns [6]. Keystroke<br />
patterns, which are produced by the user during typing, exhibit<br />
unique timing characteristics. One such characteristic is<br />
the keystroke latency (KL), which is the time between striking<br />
two consecutive keys.<br />
ICASSP 2006<br />
Another characteristic (feature) is<br />
the key down time (KD), which is the time a particular key<br />
is held down. These features have been used in previous research<br />
to produce good results in user identification.<br />
For a string of length N, there are N − 1 KL data points<br />
and N KD data points. These data points can be used to create<br />
two feature vectors. Fig. 1 shows the KL and KD plot<br />
for a particular user (one of the authors) that has typed his<br />
name repeatedly. Fig. 1 is included to illustrate the stability<br />
and strong correlation that exists between each of the feature<br />
vectors, KD and KL.<br />
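The two feature vectors can be computed from raw key events as sketched below. The event tuple format and the down-to-down definition of latency are assumptions for illustration; the paper does not pin down the exact convention:

```python
def keystroke_features(events):
    """Extract the two timing feature vectors from a typed string.
    events: list of (key, t_down, t_up) tuples in milliseconds, in order.
    KD = hold time per key (N values); KL = down-to-down latency
    between consecutive keys (N - 1 values)."""
    kd = [up - down for _key, down, up in events]
    kl = [events[i + 1][1] - events[i][1] for i in range(len(events) - 1)]
    return kl, kd
```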
[Figure 1 plots: KL feature vector (ms) and KD feature vector (ms) for repeated typings of "danoush hosseinzadeh".]<br />
Fig. 1. Several plots of the keystroke latency (KL) and key<br />
down time (KD) feature vectors for one user. The bold line is<br />
the average of the vectors. The space character is represented<br />
by “ ”.<br />
2.2. Designing Good Features<br />
For keystroke identification, a robust feature pattern is one<br />
that is stable over repeated trials. To produce a stable feature<br />
pattern, the typist should be able to type the given text<br />
without any hesitation. Strings that require the typist to stop<br />
and think about the next letter or cause them to pause and<br />
search for a certain key, will result in an unstable pattern. As<br />
mentioned before, research has shown that the best results are<br />
obtained when users type familiar text, such as their first and<br />
last names. Such strings are intuitively easy to type because<br />
they have been used for many years. Therefore, a distinct pattern<br />
can be seen when users type their name.<br />
Another important consideration when selecting appropriate<br />
text, is the number of characters. Shorter text tends to<br />
increase classification error because it can be more easily<br />
reproduced by others [5]. This is true because fewer<br />
characters produce less complex patterns that can be imitated<br />
more easily by imposters. The same problem exists with<br />
handwritten signatures, where short and simple signatures are<br />
often easy to copy.<br />
In previous work, it has been suggested that no less than<br />
ten characters should be used for keystroke identification [5].<br />
In this work, the user is required to enter at least ten characters,<br />
which can be easily accomplished with the first and last<br />
name of the individual. It should be noted that no<br />
additional effort was made to increase the minimum character<br />
length, because a longer requirement might be difficult or<br />
annoying for some users to meet. A stricter requirement would<br />
also be a problem if the user's full name falls short of the<br />
minimum character count, or if the user chooses a different<br />
string. These factors could have a negative impact on false<br />
acceptance rates (FAR) and false rejection rates (FRR).<br />
2.3. Data Acquisition Model<br />
To collect timing information, a data acquisition application<br />
named 'KbApp' was designed for the Windows operating system.<br />
With this application, keystroke timing error was kept below<br />
0.5 milliseconds, with the option of reducing it to 100 nanoseconds.<br />
This error does not have a significant impact on the results<br />
because the average feature has a time value on the order of<br />
100 milliseconds.<br />
2.4. Review of GMMs<br />
GMMs are a well-known method for modeling the probability<br />
distribution of random events. With a weighted sum of $L$-dimensional<br />
Gaussian functions, it is possible to closely approximate<br />
any distribution, provided that enough training data is<br />
available. The complete GMM can be expressed by the mean<br />
vectors $\mu_i$, covariance matrices $\Sigma_i$ and mixture weights $w_i$ as<br />
given below:<br />
$$\lambda = \{w_i, \mu_i, \Sigma_i\}, \quad i = 1, \ldots, K \quad (1)$$<br />
Using the model $\lambda$, we can obtain the likelihood that<br />
$x$ belongs to the model $\lambda$ by<br />
$$p(x\,|\,\lambda) = \sum_{i=1}^{K} w_i\, b_i(x), \quad (2)$$<br />
where $b_i$ is an $L$-dimensional Gaussian PDF as shown<br />
below:<br />
$$b_i(x) = \frac{1}{(2\pi)^{L/2}\,|\Sigma_i|^{1/2}} \exp\!\left(-\tfrac{1}{2}(x - \mu_i)^t\, \Sigma_i^{-1}\, (x - \mu_i)\right) \quad (3)$$<br />
GMMs can be very effective in modeling the type of distributions<br />
found in keystroke patterns, which are shown in Fig. 1.<br />
To evaluate the likelihood that a given feature vector $x$ belongs<br />
to a model $\lambda$, the natural logarithm of the associated<br />
probability is used. This value, which we call the Log-Likelihood (LL), is given below:<br />
$$LL = \log\{p(x\,|\,\lambda)\} = \log\left(\sum_{i=1}^{K} w_i\, b_i(x)\right) \quad (4)$$<br />
Authorized licensed use limited to: <strong>Ryerson</strong> <strong>University</strong> Library. Downloaded on July 7, 2009 at 11:14 from IEEE Xplore. Restrictions apply.<br />
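Equations (1)-(4) can be sketched in code as follows. A diagonal covariance is assumed for simplicity, and the mixture parameters below are illustrative, not values from the paper:

```python
import math

# Sketch of Eqs. (2)-(4): mixture likelihood and log-likelihood of a
# GMM, assuming diagonal covariances.  All parameter values are toy
# examples, not taken from the paper.

def gaussian_pdf(x, mu, var):
    """L-dimensional Gaussian with diagonal covariance (Eq. 3)."""
    L = len(x)
    norm = (2 * math.pi) ** (L / 2) * math.prod(v ** 0.5 for v in var)
    quad = sum((xi - mi) ** 2 / vi for xi, mi, vi in zip(x, mu, var))
    return math.exp(-0.5 * quad) / norm

def log_likelihood(x, weights, means, variances):
    """Eq. 4: log of the weighted mixture likelihood p(x | lambda)."""
    p = sum(w * gaussian_pdf(x, mu, var)
            for w, mu, var in zip(weights, means, variances))
    return math.log(p)

# Toy 2-component, 2-D model lambda = {w_i, mu_i, Sigma_i}
lam = dict(weights=[0.6, 0.4],
           means=[[100.0, 90.0], [150.0, 120.0]],
           variances=[[400.0, 400.0], [900.0, 900.0]])
print(log_likelihood([110.0, 95.0], **lam))
```

In practice the per-component densities are often evaluated in the log domain with a log-sum-exp to avoid underflow; the direct form above mirrors the equations as written.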
3. A NOVEL KEYSTROKE IDENTIFICATION<br />
METHOD<br />
The novel method proposed in this paper uses GMMs to model<br />
keystroke timing information and uses the log-likelihood measure<br />
to authenticate the user based on a threshold.<br />
3.1. GMM Training and Verification<br />
To produce a GMM, the user is first required to enroll into the<br />
system by typing their full name ten times. These ten samples<br />
produce twenty feature vectors; ten KL vectors and ten<br />
KD vectors. From these two sets of ten sample vectors, two<br />
GMMs can be trained, one for the KD feature and one for the<br />
KL feature. The expectation maximization (EM) algorithm<br />
was used to train the GMMs.<br />
Upon verification, the user is required to re-enter their full<br />
name. From this test vector, the KL and KD feature vectors<br />
are extracted and compared with the user’s model, which is<br />
obtained from the enrolment session. Equation 4 is used to<br />
calculate the log-likelihood that the test vector $x$ belongs to<br />
the given model. This result is then compared with the user's<br />
threshold before access is granted or denied.<br />
The results show the statistics for the system when access<br />
is based on the KD feature, the KL feature, and a combination<br />
of KL and KD features. In the latter case, the test<br />
vector is compared with both user models (KL model and KD<br />
model) before access is granted. Also, each time the user is<br />
authenticated successfully, both GMM models and thresholds<br />
are updated with the new information.<br />
3.2. Calculating Model Thresholds<br />
To obtain the user’s threshold, the Leave-One-Out-Method<br />
(LOOM) is used. The LOOM is as follows: for N feature<br />
vectors, N − 1 vectors are used to train the model and the<br />
last vector is used to test the likelihood that it belongs to that<br />
model, using Equation 4. This test can be performed N times,<br />
where each time a different vector is used to test the model.<br />
The final result of the LOOM is N likelihood measures,<br />
which can be expressed by<br />
$$LL_j = \log\{p(x_j\,|\,\lambda)\}, \quad j = 1, 2, \ldots, N \quad (5)$$<br />
where λ is a GMM that has been trained with N − 1 vectors<br />
not including the jth vector and xj is the test vector.<br />
These N log-likelihood results are further processed before<br />
selecting the model's threshold. From these likelihood<br />
values, the minimum value that falls within three<br />
standard deviations of the mean is selected as the model<br />
threshold, as given below:<br />
$$\text{Threshold} = \min_{\forall j}\left\{ LL_j \;\big|\; |LL_j - \overline{LL}| < 3\sigma \right\} \quad (6)$$<br />
where $\overline{LL}$ is the mean and $\sigma$ is the standard deviation of the<br />
LL values obtained from the leave-one-out method.<br />
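Given the N log-likelihoods from the leave-one-out method, the threshold rule of Eq. (6) reduces to a short computation. The LL values below are made up for illustration:

```python
import statistics

# Sketch of Eq. (6): keep the minimum log-likelihood that lies within
# three standard deviations of the mean.  The LL values are invented.

def loom_threshold(lls):
    mean = statistics.mean(lls)
    sigma = statistics.stdev(lls)
    inliers = [ll for ll in lls if abs(ll - mean) < 3 * sigma]
    return min(inliers)

lls = [-8.1, -7.9, -8.4, -8.0, -8.6, -8.2, -7.8, -8.3, -8.1, -8.0]
print(loom_threshold(lls))  # -8.6
```

The 3-sigma filter guards against grossly atypical enrolment samples; on well-behaved data such as the toy set above, the threshold is simply the smallest of the N values.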
The model generation and threshold calculation procedures<br />
are repeated every time the user has been verified so that<br />
the model and threshold are adaptive and can change with the<br />
user over time.<br />
3.3. Authenticating The User<br />
User authentication is the main goal of this work. To achieve<br />
this, the keystroke model should be robust enough to produce<br />
a low false rejection rate (FRR) and a low false acceptance<br />
rate (FAR). FAR is the rate at which intruders can gain access<br />
to a valid user's account, and FRR is the rate at which valid<br />
users are denied access to their own accounts. Obviously, both<br />
these measures should be as low as possible.<br />
In this approach, authentication is performed in two stages.<br />
In the first stage, if the user is denied access, they are given<br />
a second chance to enter their name. With this method a<br />
significant improvement was seen in the FRR, as discussed<br />
in the results section.<br />
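The two-stage rule can be sketched as a simple retry loop. The `score` and `threshold` arguments stand in for Eq. (4) and the LOOM threshold; the names and data are illustrative, not from the paper:

```python
# Sketch of the two-stage authentication rule: a user denied on the
# first attempt gets one more try before being rejected outright.

def authenticate(attempts, score, threshold, max_tries=2):
    """Grant access if any of the first `max_tries` samples scores
    at or above the user's threshold."""
    for sample in attempts[:max_tries]:
        if score(sample) >= threshold:
            return True
    return False

# Toy scoring function: higher is better
score = lambda s: 1.0 if s == "good" else 0.0
print(authenticate(["bad", "good"], score, 0.5))  # True: second try passes
```

Because each extra attempt also gives an imposter another try, widening `max_tries` trades FRR against FAR; the paper stops at two attempts.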
4. EXPERIMENTAL RESULTS<br />
Before presenting the results, the reader is reminded that the<br />
number of initial training vectors used to calculate the model<br />
thresholds was ten. Because it is desired to have an accurate<br />
threshold based on the training vectors, the LOOM was used,<br />
as described in Section 3.2. It has been shown that the LOOM<br />
provides the least biased estimate for small databases [7].<br />
Therefore, the model thresholds used to authenticate the users<br />
are optimal given the size of the database.<br />
The results for FRRs and FARs for three different cases<br />
are presented in Table 1. It should be noted by the reader<br />
that the algorithm should also function well in terms of FRR<br />
and FAR, over time. The main reason for this behavior is that<br />
the proposed method adaptively selects the threshold that best<br />
suits the individual user, based on the LOOM. The results<br />
also show that using a two-stage verification process<br />
(i.e., the user is given two chances for authentication) decreases<br />
the FRR significantly.<br />
To perform imposter tests, two typists were chosen to observe<br />
and imitate the other users’ typing pattern. The results<br />
indicate an average FAR and FRR of about 2% using both<br />
features. These figures are comparable to other techniques;<br />
however, a direct comparison with other methods cannot be<br />
justified because in each experiment a different database has<br />
been used. In our database, four out of the eight typists were<br />
traditional "two-finger" typists. We believe this led to<br />
poor performance in both the FAR and the FRR because these<br />
types of users do not produce a very stable keystroke pattern<br />
and at the same time can be copied easily. Therefore, because<br />
their finger patterns can be easily seen and imitated by the<br />
imposter users, the FAR results presented here are skewed. In<br />
terms of FRR, these users also do not perform well because<br />
they have a lot of variation in their typing pattern. In fact,<br />
Table 1. Experimental Results for FRR and FAR<br />
<br />
        |  KL Feature    |  KD Feature    | KL&KD Features<br />
User    | FRR(%)  FAR(%) | FRR(%)  FAR(%) | FRR(%)  FAR(%)<br />
1       |  0       0     |  0       0     |  0       0<br />
2       |  0       0     |  0       0     |  0       0<br />
3       |  5.3    14.3   |  0      14.3   |  5.3     7.1<br />
4       |  0       9.5   |  0       0     |  0       0<br />
5       |  0       0     |  0       0     |  0       0<br />
6       |  5.6     0     |  5.6     0     |  8.3     0<br />
7       |  0      50     |  0      10     |  0      10<br />
8       |  0       0     |  5.9    20     |  5.9     0<br />
Average |  1.4     9.2   |  1.4     5.5   |  2.4     2.1<br />
more users should be enrolled before the performance can be<br />
fully evaluated.<br />
This experiment obtained two features from the keystroke<br />
data and performed three similarity tests. The combination<br />
of the KL and KD features should produce a lower FAR and<br />
higher FRR compared to using either of the features individually.<br />
This is due to the fact that a user must correctly produce<br />
both features simultaneously. These trends were observed in<br />
the results, as can be seen from Table 1. A major benefit of<br />
this method over existing techniques is the ability to update<br />
the user model each time he/she is successfully authenticated.<br />
Therefore, as time goes on, each user’s model accurately<br />
reflects the changes in that person’s keystroke pattern.<br />
5. CONCLUSIONS<br />
A novel method for authenticating computer users based on<br />
keystroke identification was presented. Upon verification, the<br />
keystroke latencies and key hold-down times for the user’s<br />
keyboard inputs were recorded and compared with a predefined<br />
individualistic model. Access was granted if the user’s<br />
input reached a certain threshold. A new method for calculating<br />
the model threshold was also introduced, using the LOOM<br />
and log-likelihood of the feature vectors.<br />
Ideally the FAR and the FRR should be very small, with<br />
more emphasis given to the former because a security breach is<br />
more critical than a valid user being forced to re-authenticate.<br />
Based on this logic, the best results were obtained using both<br />
the KL and KD features simultaneously; which produced a<br />
FRR of 2.4% and a FAR of 2.1%.<br />
Despite the fact that these results are based on a small<br />
database, it has been shown by this work that GMMs can be<br />
used effectively to identify users based on their keystroke patterns.<br />
Furthermore, despite the fact that 100% classification<br />
accuracy was not achieved, more users should be enrolled<br />
using this approach before a definitive answer can be given<br />
about the capability of the system. As mentioned earlier, the<br />
results presented are skewed because of the type of users enrolled<br />
(50% of users were two-finger typists). This technique<br />
could be further improved to incorporate the varied nature of<br />
the different typists.<br />
GMMs may be used with other metrics to improve both<br />
the FAR and FRR, or the threshold procedure can be modified<br />
to produce more accurate results. In future work, we<br />
intend to investigate these areas with a larger database.<br />
6. REFERENCES<br />
[1] Alen Peacock, Xian Ke, and Matthew Wilkerson, “Typing<br />
patterns: A key to user identification,” IEEE Security<br />
& Privacy Magazine, vol. 2, no. 5, pp. 40–47, Oct. 2004.<br />
[2] Rick Joyce and Gopal Gupta, “Identity authentication<br />
based on keystroke latencies,” Communications of the<br />
ACM, vol. 33, no. 2, pp. 168–176, February 1990.<br />
[3] Oscar Coltell, Jose M. Dabia, and Guillermo Torres,<br />
“Biometric identification system based on keyboard filtering,”<br />
in Proc. IEEE 33rd Int. Carnahan Conf. on<br />
Security Technology, Madrid, Oct. 1999, pp. 203–209.<br />
[4] Saleh Bleha, Charles Slivinsky, and Bassam Hussien,<br />
“Computer-access security systems using keystroke dynamics,”<br />
Pattern <strong>Analysis</strong> and Machine Intelligence,<br />
IEEE Transactions on, vol. 12, no. 12, pp. 1217–1222,<br />
December 1990.<br />
[5] Livia C. F. Araujo, Luiz H. R. Sucupira Jr., Miguel G.<br />
Lizarraga, Lee L. Ling, and Joao B. T. Yabu-Uti, “User<br />
authentication through typing biometrics features,” <strong>Signal</strong><br />
Processing, IEEE Transactions on, vol. 53, no. 2, pp.<br />
851–855, Feb. 2005.<br />
[6] R. Gaines, W. Lisowski, S. Press, and N. Shapiro, “Authentication<br />
by keystroke timing: Some preliminary results,”<br />
Tech. Rep. R-256-NSF, Rand Corporation, Santa<br />
Monica, CA, USA, May 1980.<br />
[7] Keinosuke Fukunaga, Introduction to Statistical Pattern<br />
Recognition (2nd ed.), Academic Press Professional Inc.,<br />
San Diego, CA, USA, 1990.<br />
SOCCER VIDEO RETRIEVAL USING ADAPTIVE TIME-FREQUENCY METHODS<br />
Jonathan Marchal*, Cornel Ioana*, Emanuel Radoi*, André Quinquis*, Sridhar Krishnan**<br />
* : ENSIETA, E3I2 Laboratory, 2 rue François Verny, Brest - FRANCE<br />
E-mails : marchajo@ensieta.fr, ioanaco@ensieta.fr, radoiem@ensieta.fr, quinquis@ensieta.fr<br />
** : Dept. Electrical and Computer Engineering, <strong>Ryerson</strong> <strong>University</strong>, Toronto – CANADA<br />
E-mail : krishnan@ee.ryerson.ca<br />
ABSTRACT<br />
The retrieval of soccer highlights is a suitable technique<br />
for video indexing, required for multimedia database<br />
management and for the development of television on<br />
demand. For these purposes, it would be useful to have<br />
automatic annotation of the events that happen in soccer<br />
games. One solution consists in analyzing the audio<br />
soundtrack associated with the soccer video to detect the<br />
interesting frames.<br />
In this paper we use the adaptive time-frequency<br />
decomposition of the soundtrack as a feature extraction<br />
procedure. This decomposition is based on the Matching<br />
Pursuit concept and a dictionary composed of Gabor<br />
functions. The parameters provided by these<br />
transformations constitute the input of the classification<br />
stage. The results obtained on real soccer video demonstrate<br />
the efficiency of the adaptive time-frequency representation<br />
as a feature extraction stage.<br />
1. INTRODUCTION<br />
Soccer video highlight retrieval is not only a subject of<br />
research but also a need, considering the huge amount of<br />
data that can be found on the internet. Most video<br />
archives are not indexed, so an automatic parsing approach<br />
is a marketable issue. The development of television on<br />
demand also creates a need for video indexing. Viewers could<br />
then access the information they need without having to<br />
watch hours and hours of video.<br />
Several methods have been proposed, based on the<br />
information provided by video frames, such as camera<br />
motion, court lines detection, motion vectors, location and<br />
movements of the players as in [1], others on audio/video<br />
features extraction [2] or only on audio features, for instance<br />
dominant and excited speech [3].<br />
In this paper, we propose a method based on audio<br />
feature extraction. There is typically a tremendous amount<br />
of crowd activity, which differs depending on the type of<br />
highlight in a game. For instance, when a goal is scored, the<br />
crowd cheers increase progressively before it and<br />
continue for a few seconds after. For a penalty or free-kick<br />
goal, the crowd cheers are sudden, whereas when a goal is<br />
missed, crowd cheers begin and stop soon after. Finally,<br />
during a normal game sequence, crowd activity is usually<br />
not particularly intense. Considering these observations, we<br />
assert that if the human ear is able to distinguish the crowd<br />
reaction, signal processing tools should be able to do so as well.<br />
The idea behind this work is to use an adaptive time<br />
frequency decomposition (ATFD) [4,5] on the audio<br />
soundtrack of the sequences as a starting point for the<br />
feature extraction and classification.<br />
The paper is organized as follows. Section 2 gives a brief<br />
presentation of the adaptive time-frequency decomposition<br />
concept. The classification of the soccer sequences,<br />
based on the parameters provided by the ATFD, is described<br />
in Section 3. The efficiency of the proposed method is<br />
analyzed through the results in Section 4. We conclude<br />
our discussions in Section 5.<br />
2. ADAPTIVE TIME-FREQUENCY TECHNIQUES<br />
Most natural signals are non-stationary. Since<br />
their structure is generally complex, transformations<br />
into more intuitive representation spaces are usually well<br />
suited. Linear expansions in a single-parameter basis,<br />
whether it is a Fourier, wavelet, or any other basis, are not<br />
flexible enough. A Fourier basis provides a poor<br />
representation of functions well localized in time, and<br />
wavelet bases are not well adapted to represent functions<br />
whose Fourier transforms have a narrow high-frequency<br />
support. Thus, a flexible decomposition technique should be<br />
considered for representing signal components whose<br />
localization in time and frequency varies widely.<br />
Matching pursuit (MP), introduced in [4], is a technique<br />
that decomposes a signal into a linear expansion<br />
of waveforms that belong to a redundant dictionary of<br />
functions. These waveforms, selected in order to best match<br />
the signal structure, are drawn from a dictionary of time-frequency<br />
atoms. The aim of the algorithm is to obtain a<br />
parsimonious description, estimating the original<br />
signal with as few coefficients as possible. Generally,<br />
considering a signal x and a dictionary<br />
1-4244-0469-X/06/$20.00 ©2006 IEEE, ICASSP 2006<br />
Authorized licensed use limited to: <strong>Ryerson</strong> <strong>University</strong> Library. Downloaded on July 7, 2009 at 11:16 from IEEE Xplore. Restrictions apply.
$D = \{g_\gamma : \gamma \in \Gamma,\ \|g_\gamma\| = 1\}$, the signal decomposition is expressed as<br />
$$x = \sum_{\gamma \in \Gamma} \lambda_\gamma\, g_\gamma, \quad (1)$$<br />
where the decomposition coefficients $\lambda_\gamma$ are obtained by<br />
the inner product between the signal $x$ and the function $g_\gamma$:<br />
$\lambda_\gamma = \langle x, g_\gamma \rangle$.<br />
The MP builds up the signal decomposition one element<br />
at a time, picking the most energy-dominant component<br />
first. The MP begins by projecting the signal $x$ on a function<br />
$g_{\gamma_0} \in D$ and computes the residue $Rx = x - \langle x, g_{\gamma_0} \rangle g_{\gamma_0}$.<br />
Thus, $Rx$ is orthogonal to $g_{\gamma_0}$. The MP algorithm<br />
chooses $g_{\gamma_0} \in D$ such that $|\langle x, g_{\gamma_0} \rangle|$ is maximal:<br />
$$|\langle x, g_{\gamma_0} \rangle| \geq \alpha \sup_{\gamma \in \Gamma} |\langle x, g_\gamma \rangle|, \quad (2)$$<br />
where $\alpha \in (0,1]$ is an optimality factor. The MP iterates this<br />
procedure by decomposing the residue. If we suppose the<br />
$m$-th order residue $R^m x$ has been computed, the next iteration<br />
chooses $g_{\gamma_m} \in D$ such that:<br />
$$|\langle R^m x, g_{\gamma_m} \rangle| \geq \alpha \sup_{\gamma \in \Gamma} |\langle R^m x, g_\gamma \rangle|, \quad (3)$$<br />
and deduces $R^{m+1} x$ by:<br />
$$R^{m+1} x = R^m x - \langle R^m x, g_{\gamma_m} \rangle g_{\gamma_m}. \quad (4)$$<br />
Summing for $m$ between 0 and $M-1$ yields<br />
$$x = \sum_{m=0}^{M-1} \langle R^m x, g_{\gamma_m} \rangle g_{\gamma_m} + R^M x = \sum_{m=0}^{M-1} a_m g_{\gamma_m} + R^M x, \quad (5)$$<br />
where $a_m$ ($m = 0, \ldots, M-1$) are the decomposition coefficients.<br />
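The greedy iteration of Eqs. (2)-(5) can be sketched as follows. The dictionary here is a random unit-norm matrix rather than the Gabor dictionary of this section, and the optimality factor is taken as α = 1 (pure greedy selection); everything else is illustrative:

```python
import numpy as np

# Sketch of matching pursuit (Eqs. 2-5) over a generic unit-norm
# dictionary.  A random dictionary stands in for the Gabor atoms.

def matching_pursuit(x, dictionary, n_iter):
    """Greedy decomposition: at each step pick the atom with the
    largest inner product with the residue (Eq. 3, alpha = 1) and
    subtract its projection (Eq. 4)."""
    residue = x.astype(float).copy()
    coeffs, indices = [], []
    for _ in range(n_iter):
        products = dictionary @ residue            # <R^m x, g_gamma>
        idx = int(np.argmax(np.abs(products)))
        a = products[idx]
        residue = residue - a * dictionary[idx]    # Eq. (4)
        coeffs.append(a)
        indices.append(idx)
    return coeffs, indices, residue

rng = np.random.default_rng(0)
atoms = rng.standard_normal((64, 32))
atoms /= np.linalg.norm(atoms, axis=1, keepdims=True)  # ||g_gamma|| = 1
x = 3.0 * atoms[5] + 0.5 * atoms[17]                   # sparse test signal
coeffs, indices, residue = matching_pursuit(x, atoms, n_iter=10)
approx = sum(a * atoms[i] for a, i in zip(coeffs, indices))  # Eq. (5)
print(np.linalg.norm(residue))  # residual energy shrinks with iterations
```

By construction the approximation plus the residue reconstructs x exactly at every iteration (Eq. 5), while the residual norm is non-increasing.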
There are two major factors which guarantee the<br />
success of an MP algorithm. The first one is the choice of<br />
the stop criterion. Since the MP is an iterative decomposition,<br />
establishing the number of iterations, $M$, is very important<br />
for the considered application. One of the most used criteria<br />
is to choose $M$ such that the residual energy is smaller<br />
than a fraction $\varepsilon$ of the signal energy:<br />
$$\| R^{M+1} x \|^2 \leq \varepsilon\, \| x \|^2 \quad (6)$$<br />
This criterion is not well adapted when the signal-to-noise<br />
ratio (SNR) is relatively small [5]. In this case, the signal<br />
energy contains the noise contribution and, consequently, the<br />
correct setup of $M$ is almost impossible.<br />
However, since in our application the signals of interest<br />
are the soccer soundtracks we can assume that the noise is<br />
relatively small and, more importantly, its level and<br />
properties are almost the same for all signals. In these<br />
conditions, we can use the criterion (6) whose ratio ε is<br />
empirically setup.<br />
The second factor which guarantees the efficiency of the<br />
MP is the choice of the elementary functions $g_\gamma$. Intuitively,<br />
the parameters of these functions should ensure a good<br />
match with the signal's time-frequency structures. A<br />
common choice is to design a function with as many<br />
degrees of freedom (i.e., control parameters) as possible. On<br />
the other hand, the time-frequency resolution is another<br />
property of interest, especially for feature extraction<br />
applications. According to these requirements, we consider<br />
for our application an elementary function defined as<br />
$$g_{\gamma_m}(t) = \frac{1}{\sqrt{s_m}}\, g\!\left(\frac{t - u_m}{s_m}\right) e^{j(2\pi f_m t + \phi_m)} \quad (7)$$<br />
These atoms are issued by dilation ($s_m$), modulation<br />
($f_m$) and translation ($u_m$) of the Gaussian window<br />
$$g(t) = 2^{1/4} e^{-\pi t^2} \quad (8)$$<br />
The fourth parameter, $\phi_m$, stands for the initial phase.<br />
According to this definition, inspired from [4], the atoms<br />
(called Gabor functions) are characterized by four<br />
parameters: $u_m$, $s_m$, $f_m$, $\phi_m$. This type of elementary function<br />
allows an adaptive time-frequency tiling to be defined, unlike<br />
the Gabor or wavelet transforms (figure 1).<br />
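A discrete Gabor atom per Eqs. (7)-(8) can be generated as below. The sampling grid, parameter values and function name are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

# Sketch of Eqs. (7)-(8): a discrete Gabor atom obtained by dilating,
# translating and modulating the Gaussian window g(t) = 2^(1/4) e^(-pi t^2).

def gabor_atom(n, u, s, f, phi, fs=1.0):
    """Complex Gabor atom of length n: scale s, time shift u (samples),
    frequency f (relative to sampling rate fs) and initial phase phi."""
    t = np.arange(n)
    window = 2 ** 0.25 * np.exp(-np.pi * ((t - u) / s) ** 2)
    carrier = np.exp(1j * (2 * np.pi * f * t / fs + phi))
    atom = window * carrier / np.sqrt(s)
    return atom / np.linalg.norm(atom)  # unit norm, as the MP dictionary requires

g = gabor_atom(n=256, u=128, s=32, f=0.1, phi=0.0)
print(np.linalg.norm(g))  # 1.0 after normalization
```

The final normalization makes the atom exactly unit-norm on the discrete grid, which the continuous 1/sqrt(s) factor only approximates.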
Fig. 1. T-F tiling: MP versus wavelet decompositions<br />
The freedom to choose an arbitrary time-frequency tiling<br />
is the main property of interest for characterization<br />
purposes. It will be used in the next section for the<br />
separation of soccer events based on the MP analysis of the<br />
corresponding soundtracks.<br />
3. CLASSIFICATION OF SOCCER HIGHLIGHTS<br />
Most of the time, what interests a soccer watcher are the<br />
highlights: goals, of course, but also missed goals<br />
and scored free-kicks and penalties. Hence,<br />
we have chosen sequences of these three types, adding a<br />
"normal game" class, which is relevant in order to<br />
differentiate an interesting sequence (from the 3 classes<br />
above) from an "uninteresting" one, in terms of highlight<br />
retrieval. This defines 4 classes as illustrated in Fig. 2:<br />
goals, missed goals, penalties/free-kicks, normal game.<br />
Fig. 2. Video sequences isolated for soccer retrieval experiments<br />
Once the class definitions were established rigorously,<br />
the sequences were grabbed from the Internet and from TV<br />
recordings. We retained a duration of 5 seconds to analyse<br />
each highlight video sequence. Indeed, when watching these<br />
sequences, we can note that usually the crowd cheers<br />
grow during the 2 seconds before the ball crosses the<br />
goal line and continue until 3 seconds after, in most cases.<br />
All the audio soundtracks of the video sequences have been<br />
extracted in a mono, 8-bit, 44.1 kHz format, which corresponds<br />
to 220500 samples per sequence. The scheme for the feature<br />
extraction and classification of the audio soccer sequences is<br />
shown in Fig. 3 and is as follows:<br />
Soccer sequences → Soundtrack extraction → MP decomposition → Dimensionality reduction → Classification<br />
Fig. 3. Scheme of the soccer event classification<br />
The first step is the extraction of the soccer sequence<br />
soundtrack. The extracted signals are inputs of the MP<br />
decomposition. Knowing that we consider N=220500<br />
samples per sequence we setup the parameters of the<br />
elementary function dictionary as follows :<br />
- the time parameter, un, ranges from 0 to 220499;<br />
- the scale parameter, sn, ranges from 2 to the closest<br />
integer of log2(N) (17 in our case);<br />
- the frequency parameter, fn, ranges from 0 to 22050 Hz<br />
(half of the sampling frequency). The number of frequency<br />
parameters is given by the required spectral resolution; in<br />
our application we consider 8192 values, which corresponds<br />
to a spectral resolution of 2.65 Hz.<br />
With the parameters ranging over the intervals given<br />
previously, the experimental results obtained on real data<br />
show that the stop criterion (6) needs fewer than 2200<br />
iterations. For this reason we limit the number of<br />
iterations to this value. The decomposition parameters are<br />
organized in a matrix structure as indicated in (9).<br />
Iteration index | Energy | Octave | Frequency | Time | Phase<br />
1               |   E1   |   s1   |    f1     |  t1  |  φ1<br />
2               |   E2   |   s2   |    f2     |  t2  |  φ2<br />
...             |   ...  |   ...  |    ...    |  ... |  ...<br />
(9)<br />
Experimentally, we observed that these parameters have<br />
different discrimination efficiency when applied to audio<br />
signals. For example, the time parameter is difficult to<br />
use in our case since it is impossible to synchronize the<br />
crowd reaction of all sequences at a fixed sample.<br />
Therefore, only the frequency and scale parameters will be<br />
used for the classification of audio sequences. This<br />
constitutes a first step of data size reduction. Nevertheless,<br />
since we work with up to 2500 iterations, the number of<br />
classification parameters is about 5000. In order to reduce<br />
the dimensionality of input data, the linear discriminant<br />
analysis (LDA) technique is applied [6]. LDA is a<br />
supervised learning projection that uses information on the<br />
within-class scatter and between-class scatter to construct a<br />
projection matrix in the reduced space. It maximizes the<br />
ratio of between-class variance to the within-class variance<br />
in any particular data set thereby guaranteeing maximal<br />
separability. As will be shown in the next section, LDA<br />
improves the classification performance compared to<br />
classification in the original space.<br />
Finally, using the feature vectors provided by LDA, we<br />
used the nearest-neighbour classifier for the classification<br />
task [6]. This operation is performed in two phases.<br />
The first, the learning stage, consists of processing a training<br />
set of features with a priori known classes. The second step,<br />
testing, is based on computing the distance between<br />
a new unknown feature vector and each feature vector from<br />
the training set; the shortest distance identifies the<br />
nearest neighbour. This algorithm becomes more<br />
computationally intensive as the size of the training set<br />
grows, but the performance improves.<br />
The method proposed in this section has been used for<br />
the classification of the soccer sequences for a significant<br />
dataset. The most important results will be presented in the<br />
next section.<br />
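The nearest-neighbour classification with leave-one-out cross-validation can be sketched as follows. A Euclidean distance is used here for simplicity (the paper couples NN with the Mahalanobis distance), and the feature vectors and class labels are toy data:

```python
import numpy as np

# Sketch of 1-NN classification evaluated with leave-one-out (LOO)
# cross-validation, as used for the soccer-event features.

def loo_nearest_neighbour(features, labels):
    """Classify each sample by its nearest neighbour among the
    remaining samples; return the overall accuracy."""
    features = np.asarray(features, dtype=float)
    correct = 0
    for i in range(len(features)):
        dists = np.linalg.norm(features - features[i], axis=1)
        dists[i] = np.inf                      # leave sample i out
        nearest = int(np.argmin(dists))
        correct += labels[nearest] == labels[i]
    return correct / len(features)

X = [[0.0, 0.1], [0.1, 0.0], [0.2, 0.1],     # one cluster ("goal")
     [5.0, 5.1], [5.1, 5.0], [5.2, 4.9]]     # another cluster ("normal")
y = ["goal", "goal", "goal", "normal", "normal", "normal"]
print(loo_nearest_neighbour(X, y))  # 1.0 on this separable toy set
```

A Mahalanobis variant would replace the Euclidean norm with a distance weighted by the inverse covariance of the training features.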
4. RESULTS<br />
In this section we present the results obtained for a data<br />
set consisting of 47 sequences: 10 goal<br />
sequences, 9 missed goals, 21 normal game sequences and 7<br />
penalties. The sequences are decomposed by applying the MP<br />
algorithm, returning parameters as illustrated in matrix<br />
(9).<br />
The main idea behind the classification process is to<br />
compare the modulation frequencies of the Gabor functions<br />
with comparable scales for each sequence. This principle<br />
was established by comparing the histograms of the<br />
frequency parameters issued from the MP algorithm applied<br />
to the data corresponding to each class. This is illustrated in<br />
Fig. 4 for scale 13.<br />
Fig. 4. Mean frequency distributions for the 4 classes<br />
As feature parameters we use the vectors of bin centers.<br />
Empirically, we found that the best bin-centers vector is of<br />
size 12. Applying the LDA algorithm to the vectors<br />
obtained for our dataset, three non-zero eigenvalues are<br />
obtained. We choose these 3 eigenvectors as a 3D<br />
projection space. The classification is then performed using<br />
the nearest-neighbours (NN) method coupled with the<br />
Mahalanobis distance. The LOO (Leave-One-Out) cross-validation<br />
technique [7] has been used because of the<br />
reduced number of examples in the database. The results are<br />
shown in Fig. 5.<br />
Fig. 5. Clusters given by classification in the reduced space<br />
Note that the four classes are close but properly<br />
separated in Fig. 5.a, whereas the points corresponding to<br />
penalty sequences are spread among the other classes in<br />
Fig. 5.b. This comes from the fact that all the penalty<br />
sequences chosen in the set are also scored penalties, so<br />
although the crowd cheers are more sudden than for a goal<br />
sequence, they are very similar. Table 1 provides the<br />
classification results with and without dimensionality<br />
reduction (provided by LDA and PCA – principal<br />
component analysis).<br />
Table 1. Classification rates<br />
The classification accuracies obtained clearly show that<br />
LDA is well adapted to transform the data provided by<br />
the MP algorithm.<br />
5. CONCLUSION<br />
In this paper, we have proposed a new technique for<br />
soccer event classification based on the Matching Pursuit<br />
algorithm. The dictionary of elementary functions has been<br />
adapted to the application at hand.<br />
After MP decomposition, the feature parameters have<br />
been projected via a dimensionality reduction stage. The<br />
LDA technique combined with the nearest-neighbour method yields<br />
better classification performance and improves the<br />
computational efficiency of the classification stage. In<br />
future work, we intend to use other parameters of the<br />
functions with a larger dataset.<br />
ACKNOWLEDGEMENTS<br />
The authors would like to thank Lastwave Software<br />
developers and Karthi Umapathy of <strong>Ryerson</strong> <strong>University</strong> for<br />
providing the Matching Pursuit routines.<br />
6. REFERENCES<br />
[1] Y. Gong, L. T. Sin, C. H. Chuan, H. Zhang, M. Sakauchi, "Automatic parsing of TV soccer programs", Proc. ICMCS95, Washington DC, USA, 1995.<br />
[2] K. Wan, C. Xu, "Efficient multimodal features for automatic soccer highlight generation", 17th ICPR, 2004.<br />
[3] K. Wan, C. Xu, "Robust soccer highlight generation with a novel dominant speech feature extractor", IEEE International Conference on Multimedia and Expo (ICME), 2004.<br />
[4] S. Mallat and Z. Zhang, "Matching pursuits with time-frequency dictionaries", IEEE Trans. Signal Processing, vol. 41, pp. 3397-3415, Dec. 1993.<br />
[5] S. Mallat, A Wavelet Tour of Signal Processing, Academic Press, 1998.<br />
[6] R. O. Duda, P. E. Hart, D. G. Stork, Pattern Classification (2nd ed.), Wiley Interscience, 2000.<br />
[7] K. Fukunaga, Introduction to Statistical Pattern Recognition (2nd ed.), Academic Press Professional, Inc., San Diego, CA, USA, 1990.<br />
Authorized licensed use limited to: <strong>Ryerson</strong> <strong>University</strong> Library. Downloaded on July 7, 2009 at 11:16 from IEEE Xplore. Restrictions apply.
SUPPORT VECTOR MACHINES BASED APPROACH FOR CHEMICAL<br />
PHOSPHORUS REMOVAL PROCESS IN WASTEWATER TREATMENT PLANT<br />
Talieh Seyed Tabatabaei<br />
Department of Electrical and Computer Engineering, Ryerson University, Toronto<br />
tseyedta@ee.ryerson.ca<br />
Tahir Farooq<br />
Department of Electrical and Computer Engineering, Ryerson University, Toronto<br />
tfarooq@ee.ryerson.ca<br />
Aziz Guergachi<br />
Department of Information Technology Management, Ryerson University, Toronto<br />
a2guerga@ee.ryerson.ca<br />
Sridhar Krishnan<br />
Department of Electrical and Computer Engineering, Ryerson University, Toronto<br />
krishnan@ee.ryerson.ca<br />
1-4244-0038-4 2006 IEEE CCECE/CCGEI, Ottawa, May 2006<br />
Abstract<br />
In this research, the support vector machine (SVM) is investigated to model the uncertainty in chemical phosphorus removal processes in wastewater treatment plants. SVM is a machine-learning method based on the principle of structural risk minimization, which performs well when applied to data outside the training set. The prediction of whether or not the concentration of total phosphorus as P in the effluent will exceed the maximum allowable limit (1.0 mg/L) for a certain input is considered a supervised-learning problem. The Least Squares Support Vector Machines (LS-SVMs) algorithm, which is a reformulation of standard SVMs, is used to design the classifier. The performance of radial basis function (RBF), polynomial and multi-layer perceptron (MLP) kernels has been evaluated, and a high classification rate of 88.52% was achieved using the RBF kernel.<br />
Keywords: Wastewater, phosphorus removal, SVM<br />
1. Introduction<br />
Nature has an amazing ability to cope with small amounts of water wastes and pollution, but it would be overwhelmed if we did not treat the billions of gallons of wastewater and sewage produced every day before releasing it back to the environment. Treatment plants reduce pollutants in wastewater to a level that nature can handle.<br />
Wastewater can be defined as the liquid, or water-carried, wastes removed from residences, institutions, and commercial and industrial establishments, together with such ground water, surface water, and storm water [1].<br />
Collecting, treating and reusing wastewater are receiving increasing interest these days. In addition to its aesthetic and sanitary advantages, it can be viewed as a significant financial aid, since treated wastewater can be reused in many applications (e.g. agricultural irrigation, urban irrigation, industrial reuse, groundwater recharge, street cleaning, car washing, toilet flushing, and many more [2]).<br />
Wastewater consists of physical, chemical, and biological components. Some of the contaminants of concern in<br />
wastewater to be removed are suspended solids, biodegradable<br />
organics, pathogens, nutrients, priority pollutants, refractory<br />
organics, heavy metals, and dissolved inorganics. Nutrients (i.e. nitrogen and phosphorus) are among the most important contaminants of wastewater. Both nitrogen and phosphorus are essential nutrients for growth [1, 2]. When discharged into the aquatic environment, these nutrients can lead to the growth of undesirable aquatic life. When discharged in excessive amounts on land, they can also lead to the pollution of groundwater.<br />
Phosphorus is essential to the growth of algae and other<br />
biological organisms. The usual forms of phosphorus found in<br />
aqueous solutions include the orthophosphate, polyphosphate,<br />
and organic phosphate [1, 2]. Due to the negative effects of the phosphorus present in wastewater, along with the stringent discharge limits imposed on wastewater treatment plants, there has recently been an increasing demand to achieve very low effluent total phosphorus. According to the phosphorus removal requirements that have been imposed in Ontario (by the International Joint Commission's Phosphorus Management Strategies Task Force), the typical effluent concentration limit should be 1.0 mg/L, based on total phosphorus [3]. However, in all provinces, site-specific conditions may dictate more stringent requirements on the effluent total phosphorus limit.<br />
The process of phosphorus removal can be done either<br />
biologically or chemically. The data used in this paper is based<br />
on Ashbridges Bay Treatment Plant in Toronto, which uses<br />
the chemical method. Chemicals that are used in chemical<br />
phosphorus removal process include metal salts and lime. The<br />
most commonly used metal salts are ferric chloride, ferrous<br />
chloride, and aluminum sulfate. In the mentioned treatment plant, ferrous chloride (FeCl2) is used as the chemical precipitant for phosphorus removal.<br />
The theory of chemical precipitation reactions is very<br />
complex. There are many uncertainties that underlie all the<br />
chemical reactions. Due to the existence of numerous other particles, other concurrent side reactions may happen in the wastewater as well [1]. All these uncertainties bring about the need for prediction and control, and therefore for some kind of intelligent system.<br />
In the last few years, numerous studies have been carried out on applications of artificial neural networks and fuzzy neural networks for modeling biological nutrient removal systems [18], fuzzy-logic-based control strategies for biological nitrogen removal and dynamic enhanced biological phosphorus removal [20, 21], and a fuzzy controller for the level of biogas in the treated wastewater [19], whereas the amount of work targeting the chemical processes in wastewater treatment, especially chemical phosphorus removal, has been insufficient.<br />
In this paper a novel approach based on support vector<br />
machines (SVMs) is proposed to control and classify the<br />
quality of the final effluent of wastewater treatment plants<br />
according to the International Joint Commission (IJC)<br />
phosphorus concentration standards.<br />
The paper is organized as follows: in section 2 the theory of support vector machines in both the linear and the non-linear case is discussed. Section 3 explains the data set preparation for classification. Section 4 demonstrates the classification results and graphs, and section 5 gives the conclusion.<br />
2. Support Vector Machines<br />
SVM was first introduced by Vapnik and co-workers; it is such a powerful method that, in the few years since its introduction, it has outperformed most other systems in a wide variety of applications. SVM has different applications; however, it is mostly used as a binary classifier. Given a class-labeled training set, which in this work is a set of labeled input feature vectors composed of input and control variables, the boundary between the two classes is learnt using SVM.<br />
2.1. Linear Support Vector Machine<br />
Consider a binary classification problem with $x_i \in \mathbb{R}^d$ as its feature vectors and $y_i \in \{-1, +1\}$ as the class labels (i.e. $(x_1, y_1), \ldots, (x_n, y_n)$ is the training set). The hyperplane which separates the two classes is<br />
$$f(x) = w^T x + b = 0. \qquad (1)$$<br />
SVM is based on choosing the hyperplane which maximizes the margin between the two classes (Figure 1) [4, 5, 6]. Thus, the hyperplane $(w, b)$ that solves the optimization problem<br />
$$\min_{w,b} \ \frac{1}{2}\|w\|^2 \quad \text{subject to} \quad y_i(\langle w \cdot x_i \rangle + b) \ge 1, \; i = 1, \ldots, n \qquad (2)$$<br />
realizes the maximal margin hyperplane, with geometric margin $1/\|w\|$. The primal Lagrangian is<br />
$$L(w, b, \alpha) = \frac{1}{2}\|w\|^2 - \sum_{i=1}^{n} \alpha_i \bigl[ y_i(\langle w \cdot x_i \rangle + b) - 1 \bigr], \qquad (3)$$<br />
where $\alpha_i \ge 0$ are the Lagrange multipliers. The corresponding dual is found by differentiating with respect to $w$ and $b$:<br />
$$W(\alpha) = \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{n} y_i y_j \alpha_i \alpha_j \langle x_i \cdot x_j \rangle \quad \text{subject to} \quad \sum_{i=1}^{n} \alpha_i y_i = 0, \; \alpha_i \ge 0, \; i = 1, \ldots, n. \qquad (4)$$<br />
But in many real-world problems the data are noisy, so in general there will be no linear separation. In this case, instead of the hard margin we use the soft margin (the noise-tolerant version), and slack variables denoted by $\xi_i$ can be introduced to relax the constraints [4, 5, 6]. The optimization problem then becomes<br />
$$\min_{w,b} \ \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \xi_i \quad \text{subject to} \quad y_i(\langle w \cdot x_i \rangle + b) \ge 1 - \xi_i, \; \xi_i \ge 0, \; i = 1, \ldots, n, \qquad (5)$$<br />
where $C$ is a regularization parameter which trades off the empirical risk (reflected by the second term in (5)) against the model complexity (reflected by the first term in (5)). The dual form of this case is the same as (4) except that the constraint becomes $0 \le \alpha_i \le C$, and the resulting decision function is<br />
$$f(x) = \operatorname{sign}\Bigl( \sum_{i=1}^{N_s} y_i \alpha_i \langle x_i \cdot x \rangle + b_0 \Bigr), \qquad (6)$$<br />
where $N_s$ is the number of support vectors. This result shows that points that are not support vectors have no influence on the solution.<br />
2.2. Non-linear Support Vector Machines<br />
In most real-world cases the data points are not linearly separable. In this case we use a non-linear operator $\varphi(\cdot)$ to map the data to a higher-dimensional space $F$ (the feature space), where they can be classified linearly (Figure 2).<br />
Figure 1. A linear SVM classifier. Support vectors are those elements of the training set which lie on the boundary hyperplanes of the two classes.<br />
The hypothesis in this case is<br />
$$f(x) = w^T \varphi(x) + b, \qquad (7)$$<br />
which is linear in terms of the mapped data $\varphi(x)$. All the optimization problems presented for the linear case can now be extended to the transformed data in the feature space.<br />
If we define the kernel function as<br />
$$K(x, z) = \langle \varphi(x) \cdot \varphi(z) \rangle, \qquad (8)$$<br />
where $\varphi$ is a mapping from the input space to an (inner product) feature space $F$, then the corresponding dual form is<br />
$$W(\alpha) = \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{n} y_i y_j \alpha_i \alpha_j K(x_i, x_j) \quad \text{subject to} \quad \sum_{i=1}^{n} \alpha_i y_i = 0, \; \alpha_i \ge 0, \; i = 1, \ldots, n, \qquad (9)$$<br />
and our final decision rule can be expressed as<br />
$$f(x) = \operatorname{sign}\Bigl( \sum_{i=1}^{N_s} y_i \alpha_i K(x_i, x) + b_0 \Bigr), \qquad (10)$$<br />
where $N_s$ and $\alpha_i$ denote the number of support vectors and the non-zero Lagrange multipliers corresponding to the support vectors, respectively. Note that we do not have to know the underlying mapping function; however, it is necessary to define the kernel function. Among the different kernel functions, the most common kernels are the polynomial, Gaussian radial basis function (RBF) and multi-layer perceptron (MLP) kernels.<br />
Figure 2. Mapping data from the input space to a higher-dimensional feature space by a non-linear operator $\varphi(\cdot)$, in order to classify them with a linear function.<br />
In LS-SVMs the inequality constraints in SVM are replaced with equality constraints. As a result, the solution follows from solving a set of linear equations instead of the quadratic programming problem of the original SVM formulation of Vapnik, and we obviously obtain a faster algorithm.<br />
The primal problem of the LS-SVMs is defined as<br />
$$\min_{w,b,e} \ J_p(w, b, e) = \frac{1}{2}\|w\|^2 + \gamma \frac{1}{2} \sum_{i=1}^{n} e_i^2 \quad \text{subject to} \quad y_i \bigl[ w^T \varphi(x_i) + b \bigr] = 1 - e_i, \; i = 1, \ldots, n, \qquad (11)$$<br />
where $\gamma$ is a parameter analogous to SVM's regularization parameter. The main characteristic of LS-SVMs is their low computational complexity compared to SVMs, without quality loss in the solution.<br />
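In the Suykens formulation, the LS-SVM training step above reduces to one linear system, $[[0, y^T], [y, \Omega + I/\gamma]]\,[b; \alpha] = [0; \mathbf{1}]$ with $\Omega_{ij} = y_i y_j K(x_i, x_j)$. A minimal pure-Python sketch follows; the 2-D data and the hyperparameters $\sigma = 1$, $\gamma = 10$ are illustrative assumptions, not the paper's settings.

```python
import math

def rbf(x, z, sigma):
    """Gaussian RBF kernel K(x, z) = exp(-||x - z||^2 / sigma^2)."""
    d2 = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-d2 / sigma ** 2)

def solve(A, rhs):
    """Solve A v = rhs by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [r] for row, r in zip(A, rhs)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(n):
            if r != c and M[r][c]:
                f = M[r][c] / M[c][c]
                M[r] = [a - f * b for a, b in zip(M[r], M[c])]
    return [M[i][n] / M[i][i] for i in range(n)]

def lssvm_train(X, y, sigma=1.0, gamma=10.0):
    """Build and solve the LS-SVM system [[0, y^T], [y, Omega + I/gamma]] [b; alpha] = [0; 1]."""
    n = len(X)
    A = [[0.0] * (n + 1) for _ in range(n + 1)]
    rhs = [0.0] + [1.0] * n
    for i in range(n):
        A[0][i + 1] = A[i + 1][0] = float(y[i])
        for j in range(n):
            A[i + 1][j + 1] = y[i] * y[j] * rbf(X[i], X[j], sigma)
        A[i + 1][i + 1] += 1.0 / gamma          # ridge term from the equality constraints
    sol = solve(A, rhs)
    return sol[0], sol[1:]                      # bias b, multipliers alpha

def lssvm_predict(X, y, b, alpha, x, sigma=1.0):
    s = b + sum(a * yi * rbf(xi, x, sigma) for a, yi, xi in zip(alpha, y, X))
    return 1 if s >= 0 else -1

# Hypothetical 2-D data: class +1 near the origin, class -1 further out.
X = [(0.0, 0.1), (0.2, -0.1), (2.0, 2.1), (2.2, 1.9)]
y = [1, 1, -1, -1]
b, alpha = lssvm_train(X, y)
print([lssvm_predict(X, y, b, alpha, x) for x in X])  # -> [1, 1, -1, -1]
```

Because the training step is a single linear solve rather than a quadratic program, this is exactly where the speed advantage over standard SVM comes from.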
3. Dataset Preparation and Processing<br />
The dataset used in this study was obtained from<br />
Ashbridges Bay Wastewater Treatment Plant, Toronto. This<br />
dataset consists of 123 records. Each record is an observation<br />
of the input, control and output variables. Every record<br />
represents the average values of the variables over a period of<br />
one day. The input and control variables used in this study<br />
were selected after consultation with senior plant<br />
management. Total daily volume treated, peak flow rate,<br />
carbonaceous biochemical oxygen demand (CBOD),<br />
suspended solids (SS) and total phosphorus as P in influent are<br />
used as input variables. Ferrous chloride is used as the control variable and is included in the input feature vector for training and testing of the LS-SVM and LDA classifiers. The concentration of total phosphorus as P in the effluent is used as the output variable. The dataset was randomly divided into two separate subsets. One subset, with 62 examples, was used exclusively for training the algorithms, and the other, with 61 examples, was used exclusively for testing; no example from the training set was ever used during the testing phase, and vice versa. A class label $y_i \in \{-1, +1\}$ was assigned to every output<br />
value based on the threshold value of 1.0 mg/L. If the output<br />
variable exceeds the threshold, {+1 } class label is assigned to<br />
the output value, otherwise {-1 } class label is assigned. Class<br />
label assignment was done for both of the training and testing<br />
datasets before designing the LS-SVM and LDA classification<br />
algorithms.<br />
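The labeling and splitting procedure can be sketched as follows; the daily records below are hypothetical, not the Ashbridges Bay data.

```python
import random

THRESHOLD = 1.0  # mg/L limit on total phosphorus as P in the effluent

def label(effluent_p):
    """Assign +1 if the effluent concentration exceeds the limit, otherwise -1."""
    return 1 if effluent_p > THRESHOLD else -1

# Hypothetical daily records: (input feature vector, effluent total phosphorus in mg/L).
records = [((10.0, 2.1), 0.6), ((12.5, 3.0), 1.4),
           ((9.8, 1.7), 0.9), ((14.0, 3.3), 1.8)]
data = [(x, label(p)) for x, p in records]

random.seed(0)                                   # reproducible split
random.shuffle(data)
half = len(data) // 2
train_set, test_set = data[:half], data[half:]   # disjoint train/test subsets
print(len(train_set), len(test_set))             # -> 2 2
```

Labeling before the split guarantees that both classifiers see identical targets, and the disjoint slices mirror the paper's 62/61 train/test partition.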
4. Experimental Results<br />
The objective of the LS-SVM and LDA classification algorithms is to correctly classify whether or not the concentration of total phosphorus as P in the effluent will exceed the threshold for a given set of yet-to-be-seen input patterns.<br />
Classification rate was used as a figure of merit. The<br />
classification rate was defined as the total number of correctly<br />
classified examples divided by the total number of examples<br />
classified, times one hundred. The results of the LS-SVM classification have been obtained using three different kernel functions: the polynomial kernel, $K(x_i, x_j) = (x_i^T x_j + 1)^d$, where $d$ is the degree of the polynomial; the radial basis function kernel, $K(x_i, x_j) = \exp(-\|x_i - x_j\|^2 / \sigma^2)$, where $\sigma$ is the width of the RBF kernel; and the multilayer perceptron (MLP) kernel, $K(x_i, x_j) = \tanh(k\, x_i^T x_j + \theta)$. The MLP kernel does not satisfy Mercer's condition for all $k$ and $\theta$.<br />
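The three kernels can be written directly; this is a sketch, and the default parameter values are assumptions rather than the paper's settings.

```python
import math

def poly_kernel(x, z, d=3):
    """Polynomial kernel K(x, z) = (x^T z + 1)^d."""
    return (sum(a * b for a, b in zip(x, z)) + 1) ** d

def rbf_kernel(x, z, sigma=0.5):
    """RBF kernel K(x, z) = exp(-||x - z||^2 / sigma^2)."""
    return math.exp(-sum((a - b) ** 2 for a, b in zip(x, z)) / sigma ** 2)

def mlp_kernel(x, z, k=1.0, theta=0.0):
    """MLP kernel K(x, z) = tanh(k x^T z + theta); Mercer's condition holds only for some (k, theta)."""
    return math.tanh(k * sum(a * b for a, b in zip(x, z)) + theta)

x, z = (1.0, 0.0), (1.0, 0.0)
print(poly_kernel(x, z), rbf_kernel(x, z), mlp_kernel(x, z))
```

Each function consumes two feature vectors and returns the scalar that would populate one entry of the kernel matrix used by the classifier.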
Fig. 4 shows the estimated classification rate achieved by the LS-SVM classifier using the RBF kernel with kernel widths $\sigma = 0.5$, 1 and 2.5. The best classification rate achieved was 88.52%, when $\sigma = 0.5$ and $C$ is between 0.5 and 0.8. A similar classification rate was achieved when $\sigma = 1$ and $C = 0.5$. The classification rate dropped to 86.88% when the value of $\sigma$ was changed to 2.5 and $C$ to 0.1.<br />
For the polynomial kernel a consistent classification rate of 86.88% was achieved for a wide range of parameter settings. Although the polynomial kernel did not achieve as good a performance as the RBF kernel, its performance was insensitive over a very wide range of parameter settings.<br />
Fig. 5 presents the classification rate obtained by the MLP kernel with $k = 0.5$, 1 and 2.5; the value of $\theta$ was kept constant at 1. The MLP kernel achieved its best classification rate of 86.88% for all three values of $k$, at different values of $C$. However, the results obtained by the MLP kernel were very sensitive to the<br />
Figure 4. Plot of LS-SVM classification rate versus regularization parameter $C$ using the RBF kernel with $\sigma = 0.5$, 1 and 2.5.<br />
Figure 5. Plot of LS-SVM classification rate versus regularization parameter $C$ using the MLP kernel with $k = 0.5$, 1 and 2.5.<br />
parameter settings. Hence the polynomial kernel could be a better choice than the MLP kernel.<br />
The same training and testing datasets were used to design and test the LDA classifier, and the best classification rate achieved with optimal parameter settings over the testing dataset was 68.85%. These results indicate the strong generalization ability of the LS-SVM classifier.<br />
5. Conclusion<br />
We have presented an SVM-based approach that utilizes the principle of structural risk minimization to model the uncertainty that underlies the chemical phosphorus removal process in wastewater treatment plants. A real dataset of 123 examples was obtained from Ashbridges Bay Wastewater Treatment Plant, Toronto. A classifier based on LS-SVM has been designed through supervised learning to classify whether or not the concentration of total phosphorus as P in the effluent will exceed the maximum allowable limit. The performance of different kernel functions has been evaluated; all three kernel functions performed well, and in particular the RBF kernel achieved a very promising classification rate of 88.52% over the unseen testing dataset. For comparison, the LDA classifier was also used in the study. The classification results showed that the LS-SVM based approach outperformed the LDA method.<br />
Acknowledgements<br />
We are thankful to Mark Rupke, Chris Monteith, Colin<br />
Marshall and Filemon Basa at Ashbridges Bay Treatment<br />
Plant, Toronto for providing us with valuable information.<br />
References<br />
[1] Metcalf &amp; Eddy, Wastewater Engineering: Treatment and Reuse. New York: McGraw-Hill, 1991.<br />
[2] M. J. Hammer and M. J. Hammer Jr., Water and Wastewater Technology. New Jersey, Columbus: Prentice Hall, 2003.<br />
[3] N. W. Schmidtke and Assoc. Ltd. and D. I. Jenkins and Assoc. Inc., Retrofitting Municipal Wastewater Treatment Plants for Enhanced Biological Phosphorus Removal. Canada: Minister of Supply and Services Canada, 1986.<br />
[4] N. Cristianini and J. Shawe-Taylor, An Introduction to Support Vector Machines and Other Kernel-Based Methods. United Kingdom: Cambridge University Press, 2000.<br />
[5] C. J. Burges, "A tutorial on support vector machines for pattern recognition," Knowledge Discovery and Data Mining, vol. 2, pp. 121-167, June 1998.<br />
[6] I. El Naqa, Y. Yang, M. N. Wernick, N. P. Galatsanos, and R. M. Nishikawa, "A support vector machine approach for detection of microcalcifications," IEEE Trans. Med. Imag., vol. 21, no. 12, December 2002.<br />
[7] P. H. Chen, C. J. Lin, and B. Scholkopf, "A tutorial on ν-support vector machines," unpublished.<br />
[8] J. Salmon, S. King, and M. Osborne, "Framewise phone classification using support vector machines," unpublished.<br />
[9] S. Z. Li and G. Guo, "Content-based audio classification and retrieval using SVM learning," unpublished.<br />
[10] K. R. Muller, S. Mika, G. Ratsch, K. Tsuda, and B. Scholkopf, "An introduction to kernel-based learning algorithms," IEEE Trans. Neural Networks, vol. 12, pp. 181-201, Mar. 2001.<br />
[11] B. Scholkopf and A. J. Smola, Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. Cambridge, MA: MIT Press, 2002.<br />
[12] V. Kecman, Learning and Soft Computing: Support Vector Machines, Neural Networks, and Fuzzy Logic. Cambridge, MA: MIT Press, 2001.<br />
[13] U. Jeppsson, Modelling Aspects of Wastewater Treatment Processes. Lund, Sweden: Reprocentralen, Lund University, 1996.<br />
[14] J. C. Principe, N. R. Euliano, and W. C. Lefebvre, Neural and Adaptive Systems: Fundamentals through Simulation. United States of America: John Wiley &amp; Sons Inc., 1999.<br />
[15] S. Haykin, Neural Networks: A Comprehensive Foundation. Hamilton, ON, Canada: Prentice Hall, 1999.<br />
[16] K. Pelckmans et al., "LS-SVMlab toolbox user's guide," unpublished.<br />
[17] J. A. K. Suykens, T. V. Gestel, J. D. Brabanter, B. D. Moor, and J. Vandewalle, Least Squares Support Vector Machines. Singapore: World Scientific Publishing Co. Pte. Ltd., 2002.<br />
[18] D. S. Lee and J. M. Park, "Neural networks modeling for on-line estimation of nutrient dynamics in a sequentially-operated batch reactor," Journal of Biotechnology, vol. 75, pp. 229-239, June 1999.<br />
[19] O. C. Pires, C. Palma, J. C. Costa, I. Moita, M. M. Alves, and E. C. Ferreira, "Knowledge-based fuzzy system for diagnosis and control of an integrated biological wastewater treatment process," 2nd IWA Conference on Instrumentation, Control, and Automation, June 2005.<br />
[20] S. T. Yordanova, "Fuzzy two-level control for an aerobic wastewater treatment," Proceedings of the 2nd International IEEE Conference, vol. 1, pp. 348-352, June 2004.<br />
[21] S. Marsili-Libelli and L. Giunti, "Fuzzy predictive control for nitrogen removal in biological wastewater treatment," IWA Conference on Water Science and Technology, vol. 45, pp. 37-44, June 2002.<br />
This full text paper was peer reviewed at the direction of IEEE Communications Society subject matter experts for publication in the IEEE GLOBECOM 2005 proceedings.<br />
Data Embedding in µ-law Speech with Spread<br />
Spectrum Techniques<br />
Libo Zhang, Heping Ding<br />
Institute for Microstructural Sciences,<br />
National <strong>Research</strong> Council,<br />
Ottawa, Ontario, Canada<br />
heping.ding@nrc-cnrc.gc.ca<br />
Abstract—This paper explores data embedding in G.711 µ-law speech signals with spread spectrum techniques. Based on an optimized spread spectrum scheme, a simple but effective solution<br />
is presented for high-capacity embedding. Simulations show that<br />
the proposed scheme, when incorporated with the measure of the<br />
frequency masking effects, can achieve an embedding rate of<br />
about 100 bits per second with a 7% Bit Error Rate (BER), or<br />
1000 bps with a 10% BER.<br />
Keywords- µ-law speech, data embedding, speech coding, spread<br />
spectrum communication<br />
I. BACKGROUND<br />
The techniques to embed additional digital information into<br />
host signals imperceptibly can have many applications. For<br />
example, in digital watermarking, the digital copyright<br />
information is embedded into audio signals imperceptibly to<br />
protect the intellectual property. In another example shown in<br />
[1], the wide-band components are embedded into narrow-band<br />
speech signals to enhance the quality and intelligibility.<br />
The µ-law companded signal format, which is defined in<br />
ITU-T G.711, is the telephony standard in North America. For<br />
high capacity embedding in such signals, it is required to<br />
reliably transmit the embedded information, along with the host<br />
speech signal, across both the analog and digital telephony<br />
channels. Thus, the embedding should be robust against both<br />
band-pass filtering and Additive White Gaussian Noise<br />
(AWGN), which occur in normal telephony channels.<br />
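As context for the host format, the µ-law companding curve that G.711 quantization approximates can be sketched in pure Python. This is the continuous companding law with a unit-amplitude input assumed, not the segmented 8-bit codec defined in the standard.

```python
import math

MU = 255  # mu-law parameter used in North American telephony

def compress(x):
    """Continuous mu-law compression F(x) = sgn(x) ln(1 + mu|x|) / ln(1 + mu), for |x| <= 1."""
    return math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)

def expand(y):
    """Inverse mu-law expansion: sgn(y) ((1 + mu)^|y| - 1) / mu."""
    return math.copysign(((1 + MU) ** abs(y) - 1) / MU, y)

sample = 0.25
print(round(expand(compress(sample)), 6))  # the round trip recovers the sample
```

The logarithmic curve allocates resolution to small amplitudes, which is also why companding adds its own quantization noise floor to any embedded data.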
In general, three conflicting criteria are used to evaluate such<br />
embedding systems. Imperceptibility means that the composite<br />
signal should be perceptually equivalent to the host signal;<br />
robustness refers to a reliable extraction even if the composite<br />
signal is degraded; and embedding rate is a measure of how<br />
much information can be embedded and transmitted. For our research in µ-law embedding, the embedding rate is emphasized more than in other research areas.<br />
Little has been published on this research topic. Currently two categories of techniques could be used for this kind of data embedding, namely, those based on spread spectrum techniques [2] and those based on quantization-bin techniques [3]. Usually the conventional spread spectrum techniques cannot achieve high embedding rates, because a long spreading sequence is required just to reduce the host impact. [4] proposed a modified spread spectrum embedding algorithm that can<br />
Sridhar Krishnan<br />
Electrical and Computer Engineering Department,<br />
<strong>Ryerson</strong> <strong>University</strong>,<br />
Toronto, Ontario, Canada<br />
krishnan@ee.ryerson.ca<br />
inherently suppress the host impact. The scheme shows a very<br />
high robustness when applied to digital audio watermarking.<br />
In this paper, we optimize this modified scheme for the<br />
purpose of high capacity embedding in µ-law speech signals.<br />
The rest of the paper is organized as follows. Section II presents<br />
a generalized view of spread spectrum embedding schemes,<br />
with the modified scheme and its optimization being special<br />
cases. Section III incorporates the frequency masking effect to<br />
implement the proposed scheme. Section IV presents the<br />
simulation results and Section V gives a summary.<br />
II. SPREAD SPECTRUM SCHEMES<br />
Supposing that one bipolar information bit $b \in \{\pm 1\}$ is to be embedded into $x$, an $N$-sample time or transform domain sequence of the host signal, the generalized spread spectrum embedding can be expressed as<br />
$$y = x - \beta (x \bullet w)\, w + \alpha b\, w, \quad 0 \le \beta \le 1, \qquad (1)$$<br />
where $y$ represents the composite signal; the pseudo-random spreading sequence $w$ is of length $N$ and zero-mean; the scalar $\alpha > 0$ controls the embedding strength; the settings $\beta = 0$ and $\beta = 1$ result in the conventional and the modified schemes, respectively; and the dot operator represents the normalized correlation of two length-$N$ sequences, defined as<br />
$$u \bullet v \equiv \frac{1}{N} \sum_{i=1}^{N} u_i v_i. \qquad (2)$$<br />
Degraded by the additive noise $n$ during transmission, the received signal can be expressed as<br />
$$r = y + n = x - \beta (x \bullet w)\, w + \alpha b\, w + n. \qquad (3)$$<br />
The normalized correlation between the received signal and the spreading sequence can be found as<br />
$$c = r \bullet w = \alpha b + (1 - \beta)(x \bullet w) + n \bullet w. \qquad (4)$$<br />
Assume that both the host signal and the noise are Gaussian, with $x \sim N(0, \sigma_x^2)$ and $n \sim N(0, \sigma_n^2)$. According to (3), the correlation is also Gaussian, with<br />
$$c \sim N\!\left( \alpha b,\ \frac{(1 - \beta)^2 \sigma_x^2 + \sigma_n^2}{N} \right). \qquad (5)$$<br />
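The embedding of (1) and extraction by the sign of the correlation (4) can be sketched as a toy simulation; the parameters $N = 64$ and $\alpha = 0.1$ are illustrative assumptions. With $\beta = 1$ and no channel noise, the correlation reduces exactly to $\alpha b$.

```python
import random

def corr(u, v):
    """Normalized correlation u . v = (1/N) * sum_i u_i v_i, as in (2)."""
    return sum(a * b for a, b in zip(u, v)) / len(u)

def embed(x, w, b, alpha, beta):
    """Composite signal y = x - beta (x . w) w + alpha b w, as in (1)."""
    c = corr(x, w)
    return [xi - beta * c * wi + alpha * b * wi for xi, wi in zip(x, w)]

def extract(r, w):
    """Estimate the embedded bit as sign(r . w), as in (4)."""
    return 1 if corr(r, w) >= 0 else -1

random.seed(1)
N = 64
x = [random.gauss(0, 1) for _ in range(N)]      # host segment
w = [random.choice((-1, 1)) for _ in range(N)]  # +/-1 spreading sequence, so w . w = 1
y_pos = embed(x, w, +1, 0.1, 1.0)               # beta = 1: host impact fully removed
y_neg = embed(x, w, -1, 0.1, 1.0)
print(extract(y_pos, w), extract(y_neg, w))     # -> 1 -1
```

Because the $\pm 1$ sequence satisfies $w \bullet w = 1$ exactly, the host term cancels and both bits are recovered without error in the noiseless case.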
IEEE Globecom 2005 2160 0-7803-9415-1/05/$20.00 © 2005 IEEE<br />
Thus, the embedded information bit $b$ can be estimated by $\hat{b} = \operatorname{sign}(c)$, and the performance, in terms of Bit Error Rate (BER), is<br />
$$p = Q\!\left( \frac{\mu_c}{\sigma_c} \right) = Q\!\left( \sqrt{ \frac{N \alpha^2}{(1 - \beta)^2 \sigma_x^2 + \sigma_n^2} } \right), \qquad (6)$$<br />
where $Q(x) = \frac{1}{\sqrt{2\pi}} \int_x^{\infty} e^{-u^2/2}\, du$ is the tail error function.<br />
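Equation (6) can be evaluated directly through the complementary error function, since $Q(x) = \tfrac{1}{2}\operatorname{erfc}(x/\sqrt{2})$. The parameter values in this sketch are illustrative assumptions.

```python
import math

def Q(x):
    """Tail error function Q(x) = 0.5 * erfc(x / sqrt(2))."""
    return 0.5 * math.erfc(x / math.sqrt(2))

def ber(N, alpha, beta, sigma_x, sigma_n):
    """Extraction BER from (6): Q(sqrt(N alpha^2 / ((1-beta)^2 sigma_x^2 + sigma_n^2)))."""
    arg = N * alpha ** 2 / ((1 - beta) ** 2 * sigma_x ** 2 + sigma_n ** 2)
    return Q(math.sqrt(arg))

# With beta = 1 the host term vanishes from the denominator, so the BER
# drops sharply compared with the conventional beta = 0 scheme.
print(ber(64, 0.1, 1.0, 1.0, 0.1) < ber(64, 0.1, 0.0, 1.0, 0.1))  # -> True
```

Varying beta between 0 and 1 traces the trade-off between host suppression and embedding power discussed next.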
As shown in (6), in the conventional scheme ($\beta = 0$) both the host and the external noise degrade the extraction, which results in the poor performance of this scheme. In the modified scheme ($\beta = 1$) the host impact is totally suppressed. However, the total embedding power, which determines the perceptual distortion, grows from $\alpha^2$ to $\alpha^2 + \sigma_x^2 / N$, as can be deduced directly from (1). In high capacity embedding, where a small $N$ is preferred, even the minimal embedding power $\sigma_x^2 / N$, obtained by setting $\alpha = 0$, may not be small enough to guarantee imperceptibility.<br />
In this paper, we propose to use a less-than-unity $\beta$ to reduce the total embedding power to $P = \alpha^2 + \beta^2 \sigma_x^2 / N$. The optimal $\beta$ should minimize $p$, the extraction BER, while satisfying the power constraints to assure imperceptibility.<br />
We start with<br />
$$0 \le \beta \le 1; \qquad p = Q\!\left( \sqrt{ \frac{N \alpha^2}{(1 - \beta)^2 \sigma_x^2 + \sigma_n^2} } \right) = Q\!\left( \sqrt{ \frac{N P - \beta^2 \sigma_x^2}{(1 - \beta)^2 \sigma_x^2 + \sigma_n^2} } \right), \qquad (7)$$<br />
and, with the "embedded data to signal" and "signal to noise" ratios defined as $\mathrm{DSR} = P / \sigma_x^2$ and $\mathrm{SNR} = \sigma_x^2 / \sigma_n^2$, respectively, the BER in (7) can be expressed as<br />
$$p = Q\!\left( \sqrt{ \frac{N \cdot \mathrm{DSR} - \beta^2}{(1 - \beta)^2 + 1/\mathrm{SNR}} } \right). \qquad (8)$$<br />
Next, we want to find $\beta^*$, the $\beta$ that minimizes (8) or, equivalently, maximizes what is in the square root sign in (8). Since the noise $n$ is not known at the time of embedding and normally $\sigma_n^2/\sigma_x^2 \ll 1$, the noise term can be ignored, which gives
$$p = Q\!\left(\sqrt{\frac{N \cdot DSR - \beta^2}{(1-\beta)^2}}\right). \quad (9)$$
When $N \cdot DSR \ge 1$, one can choose $\beta = 1$. When $N \cdot DSR < 1$, $\beta^*$ can be found by letting
$$\frac{\partial}{\partial\beta}\!\left[\frac{N \cdot DSR - \beta^2}{(1-\beta)^2}\right] = \frac{2\,(N \cdot DSR - \beta)}{(1-\beta)^3} = 0; \quad (10)$$
therefore, $\beta^* = N \cdot DSR$. To summarize, we have
$$\beta^* = \min(N \cdot DSR,\ 1). \quad (11)$$
The corresponding $\alpha^*$ is then
$$\alpha^* = \sqrt{P - \frac{(\beta^*)^2\,\sigma_x^2}{N}}. \quad (12)$$
When $N \cdot DSR < 1$, the best achievable BER with no noise considered is, according to (9),
$$p^* = Q\!\left(\sqrt{\frac{\beta^*}{1-\beta^*}}\right). \quad (13)$$
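The closed-form choices in (11)–(13) are easy to check numerically. The sketch below uses illustrative values only, with $Q$ computed from the complementary error function, and reproduces the operating point $N \cdot DSR = 0.8$ discussed next:

```python
import math

def Q(x):
    """Tail error function Q(x) = (1/sqrt(2*pi)) * integral_x^inf exp(-u^2/2) du."""
    return 0.5 * math.erfc(x / math.sqrt(2))

def optimal_embedding(N, DSR, P, sigma_x2):
    """Return (beta*, alpha*, p*) per equations (11)-(13), noise ignored."""
    beta = min(N * DSR, 1.0)                                   # (11)
    alpha = math.sqrt(max(P - beta ** 2 * sigma_x2 / N, 0.0))  # (12)
    if beta < 1.0:
        p = Q(math.sqrt(beta / (1.0 - beta)))                  # (13)
    else:
        p = 0.0  # host fully suppressed and no noise considered
    return beta, alpha, p

# Composite embedding power N*DSR = 0.8, as in the paper's 3% BER example.
sigma_x2 = 1.0
DSR = 0.01              # -20 dB embedded-data-to-signal ratio
N = 80                  # so that N*DSR = 0.8
P = DSR * sigma_x2      # DSR = P / sigma_x^2
beta, alpha, p = optimal_embedding(N, DSR, P, sigma_x2)
print(beta, p)          # beta* = 0.8, p* = Q(2) ~ 2.3%, within the 3% target
```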
Equation (13) can be used to estimate the maximal embedding rate of the proposed scheme. For example, given a required BER $p \le 3\%$, $\beta^* = N \cdot DSR$ must be at least 0.8 as per Fig. 1 (approximate to this "no noise" case, to be discussed later). Thus, the maximum rate is limited by $\frac{f_s}{N} \le \frac{f_s}{0.8} \cdot DSR$ (bps), with $f_s$ being the sampling frequency.
For example, the rate limit is 100 bps when DSR=-20 dB. It will<br />
be decreased by the inherent noise from µ-law companding and<br />
external noise in the telephony channel. In the sequel, the term<br />
N ⋅ DSR is called the composite embedding power for<br />
simplicity.<br />
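The rate-limit arithmetic can be reproduced directly. This is a sketch; `f_s = 8000` Hz, the standard telephony sampling rate, is an assumption since the excerpt does not restate it:

```python
# Maximum embedding rate f_s/N <= (f_s/0.8)*DSR for a required BER of 3%.
f_s = 8000                   # telephony sampling rate (assumed)
DSR_dB = -20.0
DSR = 10 ** (DSR_dB / 10.0)  # power ratio: 0.01
N_min = 0.8 / DSR            # smallest N satisfying N*DSR >= 0.8
rate_limit = f_s / N_min     # = f_s * DSR / 0.8
print(rate_limit)            # ~100 bps, matching the example in the text
```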
Figure 1. BER of spread spectrum embedding (SNR=30 dB)<br />
To show the improvement due to the optimization of $\beta$, (8) is plotted in Fig. 1 for SNR = 30 dB and several composite embedding powers. It can be seen that when the power takes intermediate values, the performance can be improved significantly, e.g., from $p = 18\%$ for the conventional spread spectrum scheme to $p = 3\%$ when $N \cdot DSR = 0.8$. In the case of
watermarking, where a large N can be used, the composite<br />
embedding power is normally large enough such that the<br />
optimization is not necessary. However in high capacity<br />
embedding, the composite power is often smaller because N<br />
IEEE Globecom 2005 2161 0-7803-9415-1/05/$20.00 © 2005 IEEE<br />
Authorized licensed use limited to: Ryerson University Library. Downloaded on July 7, 2009 at 11:28 from IEEE Xplore. Restrictions apply.
This full text paper was peer reviewed at the direction of IEEE Communications Society subject matter experts for publication in the IEEE GLOBECOM 2005 proceedings.<br />
could not be very large; therefore, such optimization is<br />
necessary to achieve high capacity.<br />
By observing the optimizations in Fig. 1, we can see that, although derived with the noise ignored, $\beta^*$ given by (11) is still a simple and reasonable approximation for the case of SNR = 30 dB.
III. MDCT DOMAIN IMPLEMENTATION<br />
As discussed in [5], frequency masking in the human auditory system refers to the masking phenomenon between two simultaneously occurring components that are close in frequency: the stronger component may make the weaker one imperceptible. A masking model uses this effect to derive a masking threshold from the signal power spectrum. The amplitude changes made by embedding are perceptually irrelevant as long as they stay under the threshold at each frequency. Thus, one can use the frequency masking effect to maximize the embedding power imperceptibly.
The frequency masking effect is normally described in<br />
Fourier frequency domain. The Modified Discrete Cosine<br />
Transform (MDCT) with 50% overlapping between successive<br />
frames can perfectly reconstruct the original signal. It was<br />
shown in [6] that MDCT coefficients can be approximated by<br />
the corresponding Fourier ones with a modulating term. This<br />
similarity indicates that a masking model based on MDCT can<br />
be borrowed from that with the Fourier transform without<br />
causing too much error.<br />
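The 50% overlap and perfect-reconstruction property of the MDCT can be illustrated with a direct, O(N^2) implementation. This is a sketch under stated assumptions: it uses a sine window (the excerpt does not name the paper's window choice) and a small half-length N = 8 for brevity rather than the 64 implied by 128-sample frames:

```python
import math

def sine_window(N):
    # Sine window satisfies the Princen-Bradley condition w[n]^2 + w[n+N]^2 = 1.
    return [math.sin(math.pi / (2 * N) * (n + 0.5)) for n in range(2 * N)]

def mdct(frame, w):
    """Windowed MDCT of a 2N-sample frame -> N coefficients."""
    N = len(frame) // 2
    return [sum(w[n] * frame[n] *
                math.cos(math.pi / N * (n + 0.5 + N / 2) * (k + 0.5))
                for n in range(2 * N)) for k in range(N)]

def imdct(X, w):
    """Windowed inverse MDCT -> 2N samples, to be overlap-added."""
    N = len(X)
    return [(2.0 / N) * w[n] *
            sum(X[k] * math.cos(math.pi / N * (n + 0.5 + N / 2) * (k + 0.5))
                for k in range(N)) for n in range(2 * N)]

def analyze_synthesize(x, N):
    """MDCT with hop N and 50% overlap-add; returns the reconstruction of x."""
    w = sine_window(N)
    padded = [0.0] * N + list(x) + [0.0] * N   # zero-pad so boundary frames reconstruct
    out = [0.0] * len(padded)
    for start in range(0, len(padded) - N, N):
        frame = padded[start:start + 2 * N]
        for n, v in enumerate(imdct(mdct(frame, w), w)):
            out[start + n] += v                # time-domain aliasing cancels on overlap-add
    return out[N:N + len(x)]

x = [math.sin(0.3 * n) + 0.5 * math.cos(1.1 * n) for n in range(32)]
y = analyze_synthesize(x, 8)
err = max(abs(a - b) for a, b in zip(x, y))
print(err)   # close to machine precision: perfect reconstruction
```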
In this research, the MDCT domain is chosen for embedding<br />
and the frequency masking effect is used. Being a scaled-down<br />
version of Model 1 of Layer 3 in MPEG-1 [5], our model<br />
consists of merely 18 non-uniform critical bands – to<br />
accommodate the 0~4 kHz range only.<br />
The block diagram of embedding/extraction is shown in Fig.<br />
4. Each 128-sample frame of the µ-law signal is first expanded<br />
to 16-bit linear PCM and then transformed into MDCT<br />
coefficients.<br />
The global masking threshold is computed from the MDCT<br />
power spectrum using the masking model discussed above. One<br />
further modification to that model is to relax the threshold in [5]<br />
by flattening the slopes of each component’s spreading function<br />
on both sides; therefore, the global threshold is raised and the<br />
embedding capacity is increased. As a result, we come up with<br />
the following two settings with different aggressiveness:
• Perceptible but not annoying embedding artifacts with SDR ≈ 17.0 dB; and
• Imperceptible embedding artifacts with SDR ≈ 22.5 dB.
Each to-be-embedded bipolar bit is spread by a<br />
pseudo-random spreading sequence of length N, which is<br />
determined by the required embedding rate, e.g., with a higher<br />
rate required, we need to embed more bits into a frame; therefore,<br />
a smaller N is adopted so that more N-sample spread sequences<br />
can be fitted into the frame. The resultant spread sequences are<br />
embedded into MDCT coefficients between 0.3~3.3 kHz<br />
according to (1). For each of the 18 critical bands, $\beta^*$ and $\alpha^*$ are computed by using (11) and (12), respectively, where $P$
is the masking threshold in that critical band. The inverse MDCT and the linear-to-µ compression result in a µ-law signal that carries the embedded information but is impaired by the µ-law quantization. The extraction is simply based on the polarity of (4), as discussed earlier.
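Equations (1) and (4) fall outside this excerpt. The sketch below therefore uses an embedding rule that is consistent with the statistics in (6) and with the power expression $P = \alpha^2 + \beta^2\sigma_x^2/N$, namely $s = x + (\alpha b - \beta\langle x, c\rangle/N)\,c$ with bipolar chips $c_i = \pm 1$; this form is an assumption, not a quote of the paper. Extraction takes the polarity of the correlation:

```python
import random

def embed(x, b, c, alpha, beta):
    """Host-interference-reduced spread spectrum: s = x + (alpha*b - beta*<x,c>/N) * c."""
    N = len(c)
    xc = sum(xi * ci for xi, ci in zip(x, c)) / N   # host projection onto the chips
    return [xi + (alpha * b - beta * xc) * ci for xi, ci in zip(x, c)]

def extract(s, c):
    """Correlator; the bit estimate is the polarity of the correlation."""
    corr = sum(si * ci for si, ci in zip(s, c))
    return (1 if corr >= 0 else -1), corr

rng = random.Random(0)
N, alpha, beta, b = 64, 0.05, 1.0, -1
c = [rng.choice((-1, 1)) for _ in range(N)]   # pseudo-random spreading sequence
x = [rng.gauss(0, 1) for _ in range(N)]       # host frame (stands in for MDCT coefficients)
bit, corr = extract(embed(x, b, c, alpha, beta), c)
print(bit)   # -1; with beta = 1 the host term cancels and corr equals N*alpha*b
```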
IV. SIMULATIONS<br />
The measured relationship between the BER and the<br />
embedding rate can characterize the embedding capability and<br />
the robustness of the scheme. All information sequences are at least 200 bits long and the results are averaged over 10 runs, so each BER is computed from at least 2000 bits to assure high accuracy. The telephony channel is simulated by AWGN
with SNR=35 dB and 0.3~3.3 kHz band-pass filtering.<br />
Simulation results are shown in Fig. 2 and Fig. 3, for<br />
SDR≈17.0 dB and SDR≈22.5 dB, respectively. It can be seen<br />
that the optimization of $\beta$ does improve the performance of both the conventional and modified schemes. With slightly
perceptible embedding artifacts, i.e., the case in Fig. 3, the<br />
proposal, with an optimal β , can achieve 100 bps with a BER<br />
less than 7%.<br />
Figure 2. Rate-Distortion at SDR=17.0 dB<br />
Figure 3. Rate-Distortion at SDR=22.5 dB<br />
V. CONCLUSIONS<br />
In this research, we explored the possibility of using spread<br />
spectrum techniques for high capacity data embedding in µ-law<br />
speech signals. Our proposal can achieve about 7% and 10%<br />
BERs at 100 and 1000 bps, respectively.<br />
We would like to make two observations here. First, the rate-distortion curves are quite flat, especially for the low embedding power case in Fig. 3: the BER decreases by less than 5% for a large rate decrease from 1000 bps to 100 bps, as shown in both Fig. 2 and Fig. 3. In other words,
increasing the spreading length N does not improve the BER<br />
significantly. Second, it is understood that the large quantization<br />
noise caused by the µ-law compression plays a major role in<br />
limiting the performance. Thus, it can be a future research topic<br />
to quantitatively study this signal-dependent noise and to find<br />
ways to compensate for its adverse impact in data embedding.<br />
ACKNOWLEDGMENT
L. Zhang thanks the Institute for Microstructural Sciences, National Research Council of Canada, for its generous support while he carried out this research as a visiting researcher at the Acoustics & Signal Processing Group. He would also like to thank the Electrical and Computer Engineering Department of Ryerson University for its continuous support during his master's program.

Figure 4. Block diagram of speech embedding (µ-law speech → µ-to-linear expansion → MDCT decomposition → masking analysis and embedding with the spreading sequence and extra information → MDCT reconstruction → linear-to-µ compression → channel noise → µ-to-linear expansion → MDCT decomposition → extraction → estimated information)
REFERENCES<br />
[1] H. Ding, "Backward compatible wideband voice over narrowband low-resolution media," Acoustics Research Letters Online (http://scitation.aip.org/ARLO), vol. 6, issue 1, pp. 41-47, January 2005.
[2] D. Kirovski and H. S. Malvar, "Spread spectrum watermarking of audio signals," IEEE Transactions on Signal Processing, vol. 51, no. 4, pp. 1020-1033, April 2003.
[3] J. Eggers, R. Bäuml, R. Tzschoppe and B. Girod, "Scalar Costa scheme for information embedding," IEEE Transactions on Signal Processing, vol. 51, no. 4, pp. 1003-1019, April 2003.
[4] L. Zhang, "Perceptual data embedding in audio and speech signals," Master's Thesis, Ryerson University, Toronto, September 2004.
[5] T. Painter and A. Spanias, "Perceptual coding of digital audio," Proceedings of the IEEE, vol. 88, no. 4, pp. 451-515, April 2000.
[6] H. V. Azghandi and P. Kabal, "Improving perceptual coding of narrowband audio signals at low rates," IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 2, pp. 913-916, March 1999.
Proceedings of the 2005 IEEE<br />
Engineering in Medicine and Biology 27th Annual Conference<br />
Shanghai, China, September 1-4, 2005<br />
COMPARISON OF JPEG 2000 AND OTHER LOSSLESS COMPRESSION SCHEMES FOR<br />
DIGITAL MAMMOGRAMS<br />
April Khademi and Sridhar Krishnan<br />
Department of Electrical and Computer Engineering<br />
Ryerson University, Toronto, ON M5B 2K3 Canada
E-mail: akhademi@ieee.org, krishnan@ee.ryerson.ca<br />
Abstract<br />
In this study, we propose JPEG 2000 as an algorithm for the compression of digital mammograms; the proposed work is the first real-time implementation of JPEG 2000 on a mammogram image database. Only the lossless compression
mode of JPEG 2000 was examined to ensure that the<br />
mammogram is delivered without distortion. The performance<br />
of JPEG 2000 was compared against several other<br />
lossless coders: JPEG-LS, lossless-JPEG, adaptive Huffman,<br />
arithmetic with a zero order and a first order probability<br />
model and Lempel-Ziv Welch (LZW) with 12 and 15<br />
bit dictionaries. Each compressor was supplied the identical<br />
set of 50 mammograms, each having a resolution of 8 bits/pixel and dimensions of 1024×1024. Experimental
results indicate that JPEG 2000 and JPEG-LS provide comparable compression performance, since their compression ratios differed by only 0.72%, and both compressors surpassed the results of the other coders. Although JPEG 2000 suffered
from a slightly longer encoding and decoding delay than<br />
JPEG-LS (0.8s on average), it is still preferred for mammogram<br />
images due to the wide variety of features that aid<br />
in reliable image transmission, provide an efficient mechanism<br />
for remote access to digital libraries and contribute to<br />
fast database access.<br />
Keywords: JPEG 2000, mammogram image compression,<br />
lossless compression, medical images<br />
1. INTRODUCTION<br />
A particular technology which has proved to be a vital diagnostic<br />
tool for doctors and other healthcare workers is<br />
mammography, which provides x-ray images of the breast.<br />
Mammogram images allow the trained interpreter to detect<br />
any abnormal growths or changes within the breast tissue,<br />
which could be an indication of breast cancer [1]. Since<br />
early detection of breast cancer is the leading way to reduce<br />
mortality rates, it is imperative that the diagnosing professional<br />
has efficient means of accessing and viewing a patient’s<br />
mammogram [2].<br />
0-7803-8740-6/05/$20.00 ©2005 IEEE.<br />
By digitizing mammograms and applying a series of signal<br />
processing techniques to them, it is possible to utilize<br />
technological devices and methods to make the necessary<br />
diagnostic tools more readily available to healthcare workers,<br />
potentially speeding up the diagnosis.<br />
Since digital mammograms are used for diagnosis, high<br />
resolution images are required to ensure that even the smallest<br />
irregularities are represented. As a consequence, mammogram files are large, requiring significant bandwidth for transmission and substantial memory for storage. To accommodate this large file size, it is imperative to identify and make use of an optimal source encoding scheme dedicated to medical images.
Primarily, this paper investigates JPEG 2000, the latest<br />
data compression technology, and applies it to mammogram<br />
images to provide lossless compression in a novel way.<br />
2. JPEG 2000<br />
This paper investigates the compression performance of<br />
JPEG 2000 on mammographic images and its rich feature<br />
set for a medical imaging application.<br />
Only JPEG 2000's lossless compression mode was used, since the application of interest concerns mammograms used for diagnosis. For lossless compression
of grayscale mammograms, JPEG 2000’s encoder and decoder<br />
are shown in Fig.1.<br />
A) Tiling: Tiling is performed to significantly reduce the<br />
computational overhead and memory requirements of some<br />
of the more demanding components within the JPEG 2000<br />
codec, since future processing is performed on the smaller<br />
tile components. This allows maximum interchange between<br />
devices with limited memory resources, like a Personal<br />
Digital Assistant (PDA), giving healthcare workers<br />
more versatility to manage, transmit and receive mammograms<br />
with little effort. Furthermore, each tile component<br />
can be extracted and reconstructed independently, permitting<br />
random access to portions of the bitstream. This is<br />
useful to doctors if a specific region within a mammogram<br />
GAUSSIAN MIXTURE MODELING USING SHORT TIME FOURIER TRANSFORM<br />
FEATURES FOR AUDIO FINGERPRINTING<br />
Arunan Ramalingam and Sridhar Krishnan
Department of Electrical and Computer Engineering
Ryerson University, Toronto, ON, Canada, M5B 2K3
E-mail: (aramalin)(krishnan)@ee.ryerson.ca
We would like to acknowledge Micronet for their financial support.
0-7803-9332-5/05/$20.00 ©2005 IEEE

ABSTRACT
In audio fingerprinting, an audio clip must be recognized by matching an extracted fingerprint to a database of previously computed fingerprints. The fingerprints should significantly reduce the dimensionality of the input, provide discrimination among different audio clips, and, at the same time, be invariant to distorted versions of the same audio clip. In this paper, we design fingerprints addressing the above issues by modeling an audio clip with Gaussian mixture models (GMM) using a wide range of easy-to-compute short time Fourier transform features such as Shannon entropy, Renyi entropy, spectral centroid, spectral bandwidth, spectral flatness measure, spectral crest factor, and Mel-frequency cepstral coefficients. We test the robustness of the fingerprints under a large number of distortions. To make the system robust, we use some of the distorted versions of the audio for training. However, we show that the audio fingerprints modeled using GMM are robust not only to the distortions used in training but also to distortions not used in training. Using the spectral centroid as the feature, we obtain the highest identification rate of 99.1% with a false positive rate of $10^{-4}$.

1. INTRODUCTION
An audio fingerprint is a compact representation of the perceptually relevant portion of the audio content. An audio fingerprint should be able to identify audio files even if they are severely distorted by perceptual coding or common signal processing operations. The type of distortions a fingerprint should withstand depends on the application. For example, audio fingerprints designed for broadcast monitoring should withstand distortions such as time compression, dynamic range compression, and equalization. An audio fingerprinting system has two principal components: fingerprint extraction and a matching algorithm. The fingerprint requirements include computational simplicity, robustness to distortions, small size, and discrimination power over a large number of other fingerprints [1]. The matching algorithm should be efficient enough to identify an audio item from a database of hundreds of thousands of songs in a few seconds. A large number of fingerprinting schemes have been proposed; for some recent work, please see [2]–[5].
The overview of the proposed fingerprinting scheme is<br />
shown in Fig. 1. First the incoming audio clip is preprocessed<br />
and features are extracted from them. Then using<br />
these features, the audio clip is modeled using Gaussian<br />
mixtures. In the training phase, the mixture models of all the<br />
audio clips are stored in the database along with the metadata<br />
information. In the identification phase, the features<br />
from an unknown audio clip are used to evaluate the likelihood<br />
of all the models in the database. Then the model<br />
that is most likely to generate the features is identified as<br />
the correct audio clip.<br />
Fig. 1. Proposed fingerprinting system (audio input → preprocessing → framing → feature extraction; during training, GMM modeling populates the fingerprint database; during identification, likelihood estimation against the database produces the identification result)

2. FEATURE EXTRACTION
In this work, we use the following features extracted from<br />
the short time Fourier transform (STFT) of the signal for<br />
fingerprint extraction. Let $F_i = f_i(u),\ u \in (0, M)$ be the Fourier transform of the $i$-th frame, where $M$ is the index of the highest frequency band. To increase the robustness
of the fingerprint, the features are not extracted on<br />
the whole spectrum but on non-overlapping logarithmically<br />
spaced bands. Let $F_{i,b} = f_i(u),\ u \in (l_b, u_b)$, where $l_b$ and $u_b$ are the lower and upper edges of the band $b$. In each of
the frame, the following features are extracted. These features<br />
have been used successfully in audio fingerprinting [6]<br />
and music classification [7].<br />
1. Spectral Centroid (SC): The spectral centroid is the<br />
center of gravity of the magnitude spectrum of the<br />
STFT and is a measure of spectral shape and “brightness”<br />
of the spectrum. Spectral centroid is defined as<br />
$$SC_{i,b} = \frac{\sum_{u=l_b}^{u_b} u \cdot |f_i(u)|^2}{\sum_{u=l_b}^{u_b} |f_i(u)|^2}. \quad (1)$$
2. Spectral Bandwidth (SB): The spectral bandwidth is<br />
measured as the weighted average of the distances between<br />
the spectral components and the spectral centroid.<br />
Spectral bandwidth is defined as<br />
$$SB_{i,b} = \sqrt{\frac{\sum_{u=l_b}^{u_b} (u - SC_{i,b})^2 \cdot |f_i(u)|^2}{\sum_{u=l_b}^{u_b} |f_i(u)|^2}}. \quad (2)$$
3. Spectral Band Energy (SBE): The spectral band energy<br />
is the energy in the frequency bands normalized<br />
by the energy in the whole spectrum. Spectral band<br />
energy is calculated as<br />
$$SBE_{i,b} = \frac{\sum_{u=l_b}^{u_b} |f_i(u)|^2}{\sum_{u=0}^{M} |f_i(u)|^2}. \quad (3)$$
4. Spectral Flatness Measure (SFM): The spectral flatness<br />
measure quantifies the flatness of the spectrum<br />
and distinguishes between noise-like and tone-like signal.<br />
Spectral flatness measure is defined as<br />
$$SFM_{i,b} = \frac{\left(\prod_{u=l_b}^{u_b} |f_i(u)|^2\right)^{\frac{1}{u_b - l_b + 1}}}{\frac{1}{u_b - l_b + 1}\sum_{u=l_b}^{u_b} |f_i(u)|^2}. \quad (4)$$
5. Spectral Crest Factor (SCF): The spectral crest factor<br />
is also a measure of the tonality of the signal. Spectral<br />
crest factor is defined as<br />
$$SCF_{i,b} = \frac{\max_{u} |f_i(u)|^2}{\frac{1}{u_b - l_b + 1}\sum_{u=l_b}^{u_b} |f_i(u)|^2}. \quad (5)$$
6. Shannon Entropy (SE): The Shannon entropy of a signal<br />
is a measure of the spectral distribution of the signal. Shannon entropy is defined as
$$SE_{i,b} = \sum_{u=l_b}^{u_b} |f_i(u)| \log_2 |f_i(u)|. \quad (6)$$
7. Renyi Entropy (RE): The Renyi entropy of a signal is<br />
also a measure of its spectral distribution. Renyi entropy<br />
is defined as<br />
$$RE_{i,b} = \frac{1}{1-r}\log\!\left(\sum_{u=l_b}^{u_b} |f_i(u)|^r\right). \quad (7)$$
We used the Renyi entropy of order $r = 2$.
8. Mel-frequency Cepstral Coefficients (MFCC): MFCC<br />
are perceptually motivated features based on the STFT.<br />
After taking the log-amplitude of the magnitude spectrum,<br />
the FFT bins are grouped and smoothed according<br />
to the perceptually motivated Mel-frequency scaling.<br />
Finally, in order to decorrelate the resulting feature<br />
vectors a discrete cosine transform is performed.<br />
In this work, we used 13 coefficients since this parameterization<br />
has been shown to be quite effective for<br />
speech recognition and speaker identification [8].<br />
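The band-wise features (1), (2), (4), and (5) can be sketched directly from their definitions. The helper below takes one frame's magnitude spectrum and the band edges; the function and variable names are illustrative, not from the paper:

```python
import math

def band_features(mag, lb, ub):
    """Spectral centroid, bandwidth, flatness and crest factor over band [lb, ub]."""
    p = [mag[u] ** 2 for u in range(lb, ub + 1)]  # power spectrum in the band
    m = ub - lb + 1
    total = sum(p)
    sc = sum((lb + j) * pj for j, pj in enumerate(p)) / total            # (1)
    sb = math.sqrt(sum(((lb + j) - sc) ** 2 * pj
                       for j, pj in enumerate(p)) / total)               # (2)
    geo = math.exp(sum(math.log(pj) for pj in p) / m)                    # geometric mean
    sfm = geo / (total / m)                                              # (4)
    scf = max(p) / (total / m)                                           # (5)
    return sc, sb, sfm, scf

# A flat band is maximally noise-like: SFM = SCF = 1 and the centroid sits mid-band.
sc, sb, sfm, scf = band_features([1.0] * 8, 2, 5)
print(sc, sfm, scf)   # 3.5 1.0 1.0
```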
Let $X_i$ be the set of features extracted for frame $i$; $X_i$ can be any one of the features described above. In order to better characterize the temporal variations of the signal, the first derivatives of the features,
$$\delta_i = X_i - X_{i-1}, \quad (8)$$
are also included in the feature matrix. In an audio clip, successive frames are related in time. To include this time dependency, a time vector is added to the feature matrix; it is taken as an incremental counter from 0 to 1. Thus the feature matrix of the entire audio clip can be described as
$$F_M' = \begin{bmatrix} X_1, \delta_1, t_1 \\ X_2, \delta_2, t_2 \\ \vdots \\ X_N, \delta_N, t_N \end{bmatrix}, \quad (9)$$
where $N$ is the number of frames in the audio clip. Finally, the feature matrix is mean-subtracted and component-wise variance-normalized to obtain the normalized feature matrix $F_M$.
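Assembling the feature matrix of (8)–(9) might look like the sketch below. Two details are assumptions, since the paper does not specify them: the delta for the first frame is set to zero, and the time counter is spaced evenly from 0 to 1:

```python
import math

def feature_matrix(X):
    """Rows [X_i, delta_i, t_i] per (9); each column is then mean/variance normalized."""
    N = len(X)
    rows = []
    for i in range(N):
        delta = X[i] - X[i - 1] if i > 0 else 0.0  # (8); no predecessor for frame 0 (assumed)
        t = i / (N - 1) if N > 1 else 0.0          # incremental counter from 0 to 1
        rows.append([X[i], delta, t])
    for j in range(3):                             # component-wise normalization
        col = [r[j] for r in rows]
        mean = sum(col) / N
        sd = math.sqrt(sum((v - mean) ** 2 for v in col) / N) or 1.0  # guard constant columns
        for r in rows:
            r[j] = (r[j] - mean) / sd
    return rows

FM = feature_matrix([0.2, 0.4, 0.1, 0.9, 0.3])
print(len(FM), len(FM[0]))   # 5 3: one row per frame, columns [X, delta, t]
```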
3. GAUSSIAN MIXTURE MODELS<br />
Gaussian Mixture Models (GMM) have been successfully<br />
used in audio classification [7] and content based retrieval<br />
[9]. In this work, the technique is used to model an audio<br />
fingerprint as a probability density function (PDF), using a<br />
weighted combination of Gaussian component PDFs (mixtures).<br />
During the training phase, the GMM parameters of<br />
an audio fingerprint are estimated to maximize the probability<br />
of the audio frames present in the audio fingerprint.<br />
We use the Baum-Welch (Expectation-Maximization) algorithm to estimate the GMM parameters, with initialization by k-means clustering. As the feature vectors in this work have reasonably uncorrelated components, computationally convenient diagonal covariance matrices can be used. We used GMMs with 16 mixtures. Thus, in the fingerprint extraction phase, each audio clip is modeled by a GMM. During the matching phase, the fingerprint from an unknown recording is compared with the database of pre-computed GMMs, and the GMM that gives the highest likelihood for the fingerprint is identified as the correct match.
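The matching step, evaluating each stored diagonal-covariance GMM's log-likelihood on the unknown clip's frames and picking the best model, can be sketched as follows. This is a minimal illustration with toy one-component models, not the paper's trained 16-mixture fingerprints:

```python
import math

def gmm_loglik(frames, weights, means, variances):
    """Total log-likelihood of frames under a diagonal-covariance Gaussian mixture."""
    total = 0.0
    for x in frames:
        comp_logs = []
        for w, mu, var in zip(weights, means, variances):
            ll = math.log(w)
            for xd, md, vd in zip(x, mu, var):
                ll += -0.5 * math.log(2 * math.pi * vd) - (xd - md) ** 2 / (2 * vd)
            comp_logs.append(ll)
        m = max(comp_logs)   # log-sum-exp for numerical stability
        total += m + math.log(sum(math.exp(l - m) for l in comp_logs))
    return total

# Two toy one-mixture "fingerprints"; frames near (0, 0) should match model A.
model_a = ([1.0], [[0.0, 0.0]], [[1.0, 1.0]])
model_b = ([1.0], [[5.0, 5.0]], [[1.0, 1.0]])
frames = [[0.1, -0.2], [0.0, 0.3]]
best = max([("A", model_a), ("B", model_b)],
           key=lambda kv: gmm_loglik(frames, *kv[1]))[0]
print(best)   # A
```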
4. RESULTS<br />
We used a database containing 250 five-second audio clips<br />
chosen from the categories of rock, pop, country, classical,<br />
and jazz. The audio clips are chosen from random portions<br />
of songs from Compact Discs.<br />
4.1. Robustness to Distortions<br />
We used several distorted versions of the audio clips to test<br />
the robustness of the proposed scheme. We used the following<br />
distorted versions in our tests.<br />
I. Compression – 1) MP3 at 32 kbps, 2) AAC at 32<br />
kbps, 3) WMA at 32 kbps, 4) Real encoding at 32<br />
kbps.<br />
II. Amplitude distortion – 1) 3 : 1 Compression above<br />
30 dB, 2) 3 : 1 Expander below 10 dB, 3) 3 : 1 compression<br />
below 10 dB, 4) Limiter at 9 dB, 5) ‘Superloud’<br />
amplitude distortion, 6) Noise gate at 20 dB, 7)<br />
De-esser, 8) Nonlinear amplitude distortion.<br />
III. Frequency distortion – 1) Nonlinear bass distortion,<br />
2) Midrange frequency boost, 3) Notch Filter, 750 -<br />
1800 Hz, 4) Notch Filter 430 - 3400 Hz, 5) Telephone<br />
bandpass, 135 - 3700 Hz, 6) Bass cut, 7) Bass boost.<br />
IV. Change in pitch – 1) Lower pitch 2 - 6 %, 2) Raise<br />
pitch 2 - 6 %.<br />
V. Change in speed – 1) Linear speed increase 2 - 6%,<br />
2) Linear speed decrease 2 - 6%.<br />
VI. Resampling at 8 kHz<br />
VII. Echo addition<br />
To increase the robustness of the fingerprints, in addition<br />
to the original audio, some distorted versions of the<br />
audio are also used in training. We used the following distorted<br />
versions in our training: 1) Undistorted audio, 2) 3<br />
: 1 Compression above 30 dB, 3) Nonlinear amplitude distortion,<br />
4) Nonlinear bass distortion, 5) Midrange frequency<br />
boost, 6) Notch Filter, 750 - 1800 Hz, 7) Notch Filter 430<br />
- 3400 Hz, 8) Raise Pitch 1%, 9) Lower Pitch 1%. The<br />
log-likelihood of the test clips are evaluated for all the models<br />
in the database. Then the model that gives the highest<br />
log-likelihood is taken as the correct match. Table 1 shows<br />
the percentage of clips that are correctly identified for different<br />
features for distortions used in training as well as for<br />
distortions not used in training. The results show that it is<br />
not necessary to train the model for all possible distortions.<br />
By training the model to some representative distortions, we<br />
can obtain robustness to a wide variety of distortions.<br />
Table 1. Mean recognition rate (%) for distortions

Feature | Train | Test | Mean
MFCC | 99.0 | 98.5 | 98.7
Spectral centroid | 99.4 | 99.1 | 99.2
Spectral bandwidth | 99.4 | 98.9 | 99.1
Spectral band energy | 98.8 | 98.8 | 98.8
Spectral flatness measure | 99.4 | 98.6 | 98.9
Spectral crest factor | 99.2 | 98.6 | 98.8
Shannon entropy | 99.4 | 98.8 | 99.0
Renyi entropy | 99.4 | 98.9 | 99.0
4.2. False Positive Analysis
In the previous section it was assumed that the test clip is<br />
present in the database. Hence the model that gives the<br />
highest log-likelihood value is identified as the correct match.<br />
However, it is possible that the test clip may not be in the database, so there should be a criterion for rejecting audio clips that are not in the database. A suitable threshold on the log-likelihood can be used to vary the false positive and false negative rates. The false positive rates and the corresponding identification rates are shown in Figs. 2 and 3. The percentage of audio clips correctly identified at different false positive rates is shown in Table 2. Among the different features used, the spectral centroid gives the highest identification rate of 99.1% with a false positive rate of $10^{-4}$. MFCC performs poorly, with an identification rate of 13%. All the features except the spectral flatness measure give an identification rate of more than 90% with a false positive rate of $10^{-3}$.
Fig. 2. Identification rates at different false positive rates for MFCC, Spectral centroid, Spectral bandwidth, and Spectral band energy
Table 2. Identification rate (%) at different false positive rates

Feature | $10^{-4}$ | $10^{-3}$ | $10^{-2}$
MFCC | 13.5 | 98.4 | 99.3
Spectral centroid | 99.1 | 99.5 | 99.8
Spectral bandwidth | 93.2 | 98.4 | 99.3
Spectral band energy | 69.2 | 94.3 | 99.2
Spectral flatness measure | 31.8 | 56.4 | 96.6
Spectral crest factor | 93.0 | 98.4 | 99.3
Shannon entropy | 71.1 | 93.9 | 99.4
Renyi entropy | 64.0 | 99.3 | 99.7
5. CONCLUSION<br />
Gaussian Mixture Models have been successfully used in<br />
many classification and identification problems in audio. In<br />
this work, we modeled audio recordings for audio fingerprinting<br />
by Gaussian mixtures using features extracted from<br />
the STFT of the signal. Even though we use some distorted<br />
samples of the audio during training, the system is robust to<br />
distortions not used in training. Using spectral centroid as<br />
feature, we obtain the highest identification rate of 99.1 %<br />
with a false positive rate of 10 −4 .<br />
6. REFERENCES<br />
[1] P. Cano, E. Batlle, T. Kalker, and J. Haitsma, "A review of algorithms for audio fingerprinting," in IEEE Workshop on Multimedia Signal Processing, December 2002, pp. 169-173.
[2] J. Herre, O. Hellmuth, and M. Cremer, "Scalable robust audio fingerprinting using MPEG-7 content description," in IEEE Workshop on Multimedia Signal Processing, December 2002, pp. 165-168.

Fig. 3. Identification rates at different false positive rates for Spectral flatness measure, Spectral crest factor, Shannon entropy and Renyi entropy
[3] J. Haitsma and T. Kalker, “A highly robust audio fingerprinting<br />
system,” in Proc. of the 3rd Int. Symposium on<br />
Music Information Retrieval, October 2002, pp. 144–<br />
148.<br />
[4] V. Venkatachalam, L. Cazzanti, N. Dhillon, and<br />
M. Wells, “Automatic identification of sound recordings,”<br />
IEEE <strong>Signal</strong> Processing Magazine, vol. 21, no.<br />
2, pp. 92 – 99, March 2004.<br />
[5] C.J.C. Burges, J.C. Platt, and S. Jana, “Distortion<br />
discriminant analysis for audio fingerprinting,” IEEE<br />
Transactions on Speech and Audio Processing, vol. 11,<br />
no. 3, pp. 165–174, May 2003.<br />
[6] E. Allamanche, B. Fröba, J. Herre, T. Kastner,<br />
O. Hellmuth, and M. Cremer, “Content-based identification<br />
of audio material using MPEG-7 low level description,”<br />
in Proceedings of the International Symposium<br />
on Music Information Retrieval (ISMIR), Indiana,<br />
USA, October 2002.<br />
[7] G. Tzanetakis and P. Cook, “Musical genre classification<br />
of audio signals,” IEEE Tran. on Speech and Audio<br />
Processing, vol. 10, no. 5, pp. 293 – 302, July 2002.<br />
[8] L. R. Rabiner and B. H. Juang, Fundamentals of Speech<br />
Recognition, Prentice-Hall, Englewood Cliffs, NJ,<br />
1993.<br />
[9] D. Pye, “Content-based methods for the management of<br />
digital music,” in Proceedings of ICASSP, 2000, vol. 4,<br />
pp. 24–27.<br />
MULTIPATH MITIGATION OF GNSS CARRIER PHASE SIGNALS<br />
FOR AN ON-BOARD UNIT FOR MOBILITY PRICING<br />
Ronesh Puri, Ahmed El Kaffas, Alagan Anpalagan, Sridhar Krishnan<br />
Department of Electrical and Computer Engineering, <strong>Ryerson</strong> <strong>University</strong>, Toronto, ON, M5B 2K3<br />
rpuri | aelkaffa | alagan | krishnan @ee.ryerson.ca<br />
Bern Grush<br />
Applied Location Corporation, 34 Dodge Rd, Toronto, ON, M1N 2A7. bgrush@appliedlocation.com<br />
Abstract<br />
Inexpensive navigation-grade receivers are insufficiently<br />
accurate for the task of building a Global Navigation Satellite<br />
System [GNSS]-based parking meter for urban multipath<br />
conditions. Survey-grade instruments that demonstrate cm<br />
accuracy are inappropriate, and are two orders of magnitude<br />
too expensive, for this mass application. We identify three ways<br />
in which a digital signal processor added to a stationary,<br />
navigation-grade receiver can add considerable accuracy (in<br />
the range of 1-2 m, down from 5-10 m) for a parking meter.<br />
First, we apply a pseudo-multipath-based filter and a modified<br />
Receiver Autonomous Integrity Monitoring [RAIM]-derivative<br />
filter to the received carrier phase signals, allowing us to infer<br />
which signals are most affected by noise processes and to<br />
compute receiver position with the remaining signals for<br />
greater accuracy. Second, we take advantage of receiver<br />
stationarity to dwell on these signals for several minutes,<br />
allowing us to acquire a signal characterization metric that is<br />
more stable than might be possible with a non-stationary<br />
receiver. This is intended for non-repudiation. As a third step<br />
we will later experiment with ways to monitor the multipath<br />
behaviour of individual signals on approach to a parking event<br />
in a way that may allow us to more effectively weigh our initial<br />
signal selection criteria. Independent of these three<br />
opportunities, we also take advantage of dual GPS/Galileo<br />
receivers, a capability that we simulate in this experiment.<br />
Testing of the multipath mitigation filters described in this<br />
paper on two simulated GPS/Galileo datasets yielded<br />
reductions in the standard deviation of the position estimate<br />
that ranged from -4% to 61.6% (avg:34.4%) when compared<br />
to the control (unfiltered) position calculation.<br />
Keywords: GPS; Galileo; GNSS; Multipath Mitigation; RAIM;<br />
Urban Canyon; Parking; Parklog; Road-Pricing.<br />
1. INTRODUCTION<br />
A number of countries seek solutions for reliable and cost-effective<br />
metering for zone-based road pricing. For economic<br />
and other reasons, GNSS signals are the prime target for this<br />
solution [1-4]. An alternative to the commonly expected<br />
“tracklog” is the use of a “parklog”: a log of parking events<br />
with a minimal amount of data describing the intervening trip<br />
0-7803-8886-0/05/$20.00 ©2005 IEEE<br />
CCECE/CCGEI, Saskatoon, May 2005<br />
segments. The parklog is less data intensive, more accurate<br />
(i.e., more non-repudiable), and is a good proxy for a full<br />
tracklog in zone-pricing applications. In addition, whenever the<br />
accuracy of the endpoints of the trips (the parking events) is<br />
sufficient, this same meter could be used as a parking meter for<br />
that parking event, yielding a way to meter for any<br />
combination of road use, parking use and pay-as-you-drive<br />
insurance in a single system. The principal advantage of a<br />
three-in-one meter is the distribution of infrastructure costs<br />
over three sectors (road, parking and insurance) making it<br />
possible for a road-pricing meter to “pay for itself” in parking<br />
and insurance discounts from the motorist’s perspective [5].<br />
To enable a highly effective device, we have set a design<br />
goal of 1.5m-2m accuracy, 99% of the time in 75% of the<br />
parking lots in a city with the building density of Toronto,<br />
Canada. Even when a parking location cannot be known<br />
accurately enough to assess a fee, both road-pricing and<br />
insurance-pricing can still proceed. This gives the “parklog”<br />
the flavor of a disruptive technology, disrupting both dedicated<br />
short range communication [DSRC] and the tracklog for road-pricing.<br />
2. METHODS<br />
2.1 Multipath Mitigation<br />
Among the noise sources contributing to positioning error,<br />
multipath is the most difficult to characterize in a way that<br />
allows unambiguous mitigation. When other error sources are<br />
controlled, multipath can become the largest remaining<br />
contributor to unmodeled noise/interference. The causes and<br />
properties of this process are well described elsewhere [6-8].<br />
Of the four classes of mitigation techniques: antenna<br />
positioning, hardware compensation (antenna design), software<br />
mitigation and static antenna arrays with signal correlators [6],<br />
software mitigation is the only feasible approach for an onboard<br />
parking application. Antenna siting will seldom be<br />
optimal. Increased hardware size, complexity and expense are<br />
aesthetically, operationally and economically unacceptable,<br />
since the antenna for the target meter must be mounted in or on<br />
many millions of private vehicles.<br />
2.2 Simulating Galileo<br />
Collecting GNSS signals in densely built-up areas (“urban<br />
canyon”) often results in a diminished number of usable<br />
Authorized licensed use limited to: <strong>Ryerson</strong> <strong>University</strong> Library. Downloaded on July 7, 2009 at 11:06 from IEEE Xplore. Restrictions apply.<br />
76
signals. On some occasions, when using only a single system<br />
such as GPS, there may be too few to calculate a horizontal<br />
position (a minimum of four signals is required). This accounts for<br />
the frequent loss of position lock, requiring ancillary aids such<br />
as inertial navigation. In a parking application, we must rely on<br />
GNSS signals only, so that our process would frequently fail to<br />
fulfill our stated design goal without a redundant system,<br />
which in our case is the European Union’s Galileo, expected to<br />
be operational in 2008.<br />
Dual GPS/Galileo receivers are expected to improve<br />
position availability and accuracy considerably. As recently<br />
reported by Feng [9], dual receivers “will increase service<br />
coverage from 55% to 95% notably in the urban areas where<br />
most mass-market applications are developed.” The following<br />
table details the expected improvement.<br />
Analysis       | Availability of 20m    | Accuracy and availability – | Accuracy availability<br />
Scenario &     | 95% 2D accuracy        | satellites only             | differential<br />
Constellation  | 28 GPS  | 28 GPS       | 28 GPS    | 28 GPS          | 28 GPS<br />
               | only (%) | +27 Gal (%) | only (m/%) | +27 Gal (m/%)  | +27 Gal (m/%)<br />
Open Sky       | 90%     | 100%         | 7/95      | 4/95            | 1.5/95<br />
Suburban       | 70%     | 100%         | 32/90     | 8/95            | 4/95<br />
Low-rise       | 30%     | 90%          | 17/50     | 14/95           | 7/95<br />
High-rise      | 15%     | 80%          | -         | 42/90           | 25/90<br />
Table 1: Performance improvements resulting from both GPS<br />
and Galileo constellations for urban operations (Table adopted<br />
from [9])<br />
In our work we simulate a dual GPS/Galileo receiver by<br />
combining two sets of GPS data collected with a<br />
uBlox/ANTARIS TIM-LP receiver separated by three or more<br />
hours (i.e. not within three hours of an integer multiple of a<br />
sidereal day) so that the two visible satellite sub-constellations<br />
are essentially independent. A data sample from an actual dual<br />
receiver would exhibit at least as good a geometric distribution<br />
as would this “poor-man’s” simulation, hence we argue that<br />
this simulation technique does not unduly favor our approach.<br />
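The timing criterion above (captures at least three hours away from any integer multiple of a sidereal day, after which the GPS geometry roughly repeats) can be formalized as a small check. This is our own sketch of the rule, not code from the paper.<br />

```python
SIDEREAL_DAY_S = 86164.1  # GPS ground tracks roughly repeat once per sidereal day

def constellations_independent(t1_s, t2_s, min_offset_s=3 * 3600):
    """True when two capture epochs are far enough from an integer number
    of sidereal days apart for the visible sub-constellations to differ."""
    dt = abs(t2_s - t1_s)
    residual = dt % SIDEREAL_DAY_S
    # Distance to the nearest integer multiple of a sidereal day.
    offset = min(residual, SIDEREAL_DAY_S - residual)
    return offset >= min_offset_s

ok = constellations_independent(0, 3.5 * 3600)          # the 3.5 h case of Figure 1
repeat = constellations_independent(0, SIDEREAL_DAY_S)  # repeating geometry fails
```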
SW corner 10:25 and 13:58 (3.5 hr separation)<br />
Figure 1: An example of two GPS constellations viewed by a<br />
stationary receiver and separated by 3.5 hrs. See also Figure 3.<br />
2.3 Software Mitigation<br />
The key assumption in software mitigation is that it is<br />
possible to infer, in near realtime, which of the pseudo-range<br />
signals available at a given moment are contributing more error<br />
than others. Extensive work in this area, Receiver Autonomous<br />
Integrity Monitoring (RAIM), focuses on real-time<br />
determination of failure of one of several SVs (space vehicles)<br />
in safety-of-life applications [10]. This work has been extended<br />
to include multiple failures and has led to the development of<br />
related techniques to determine which signals may be more<br />
subject to multipath disturbance in a dynamic, unaided manner.<br />
Our work relies on some of these extensions.<br />
Bisnath and Langley [6] extend earlier methods to compute<br />
an inferred GPS observable they call pseudo-multipath,<br />
incorporating pseudorange multipath, tracking error and<br />
receiver noise, making it a good indicator of unmodeled error<br />
and noise for position estimation, the predominant component<br />
of which is multipath. Related work by Nayak, et al [7]<br />
develops this same measure, which they call code-carrier<br />
residual (r). We use this formulation to weigh each signal in a<br />
data sample to determine whether to include that signal in the<br />
position calculation.<br />
Misra and Bednarz [10] extend RAIM techniques to deal<br />
with multiple SV failures. Their method, referred to in this<br />
paper as Misra04, provides for randomly selecting numerous<br />
subsets of 6 or 7 signals from a larger set of available signals,<br />
such as would be available to a dual GPS/Galileo receiver.<br />
Pseudo-random selection is constrained to minimize dilution of<br />
precision (horizontal dilution of precision (HDOP) in our<br />
case), and repeated selection and position calculations are<br />
clustered and outliers are observed to de-weight SVs. We use<br />
this algorithmic approach to exclude noisy signals that were<br />
not filtered out by the code-carrier residual (pseudo-multipath).<br />
These two methods in combination select the least noisy<br />
signals, constrained by a requirement for a constellation subset<br />
yielding good geometry for subsequent position calculation.<br />
Merge two several-minute GPS readings, sufficiently<br />
separated in time to simulate a dual receiver<br />
↓<br />
Drop signal(s) based on code-carrier residual filter [6,7]<br />
↓<br />
Drop signal(s) based on Misra04 (RAIM-derivative) filter [10]<br />
↓<br />
Compute LAT, LON using remaining signals<br />
↓<br />
Compute associated characterization<br />
Figure 2: The filtering and position calculation process is set<br />
up as illustrated here and detailed in the following section.<br />
The reader might question the efficacy of this degree of<br />
filtering given that the receivers are stationary. However,<br />
consider that in a complex multipath environment in which<br />
signals are collected for several minutes, the movement of the<br />
SVs, the movement of tree crowns, and passing vehicles might<br />
each affect the relative degree of multipath of each SV from<br />
moment to moment as it impinges on the stationary antenna.<br />
We will be exploring this further in subsequent work.<br />
3. THE PROPOSED ALGORITHM<br />
Following from the previous section, we detail each stage<br />
in the process: two filter stages, signal selection and final<br />
position calculation.<br />
Input to this process is carrier-phase data, captured every<br />
second.<br />
3.1 Pseudo-Multipath based filter<br />
For the first stage of our dual-filter method, we compute the<br />
pseudo-multipath observable, r, and its standard deviation, σr,<br />
for each visible SV:<br />
r = 2d_ion − λN + ε(p) + ε(φ),<br />
where:<br />
d_ion  ionospheric delay (m)<br />
λ      wavelength of the L1 carrier (m)<br />
N      the integer cycle ambiguity (cycles)<br />
ε(p)   code noise (receiver noise + multipath) (m)<br />
ε(φ)   carrier phase noise (receiver noise + multipath) (m)<br />
The SVs are ranked in ascending order of the magnitude of<br />
σr. Since we suspect signals with higher variance, we simply<br />
discarded the single most suspect signal in this first<br />
experiment.<br />
The full derivation of r is developed in [7], and is described<br />
therein as containing:<br />
“twice the atmospheric error, the carrier phase ambiguity, code receiver<br />
noise, and code multipath. Carrier receiver noise and multipath can be<br />
neglected since they are very small compared to the code values. The<br />
ambiguity term is a constant if there are no cycle slips whereas the<br />
ionospheric error generally varies slowly over time. A piece-wise linear<br />
regression model can therefore be implemented to remove terms due to the<br />
ionosphere and ambiguity. Since the ionospheric error changes with time, a<br />
regression model [could be] implemented over predefined averaging<br />
intervals. … The resulting code-carrier residual, [r], will contain multipath<br />
and receiver noise and can be used for further analysis. Subtracting out the<br />
mean removes not only the integer ambiguity, but also the bias components<br />
present in all of the remaining terms. Code multipath is a nonzero mean<br />
process and this technique only isolates relative multipath effects and not<br />
the absolute multipath because the regression process removes the portion<br />
of multipath with nonzero mean.”<br />
In our application, we are using this as one of two<br />
“advisors” to help us select the signals least disturbed by<br />
multipath. Hence, the fact that this is only a relative indicator<br />
and that it also incorporates minor components of other error<br />
sources does not detract from its value as a way to identify the<br />
noisiest signals.<br />
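As a sketch of this first filter stage: form the code-minus-carrier series per SV, remove the slowly varying ionosphere/ambiguity trend with piecewise linear regression as described in [7], and flag the SV whose residual has the largest standard deviation. The window length, units, and data layout below are illustrative assumptions, not the paper's.<br />

```python
import numpy as np

def code_carrier_residual(code_m, carrier_m, win=60):
    """Code-minus-carrier series detrended by piecewise linear regression;
    the slow ionospheric drift and the constant ambiguity are removed, so
    what remains is dominated by code multipath and receiver noise."""
    r = code_m - carrier_m
    out = np.empty_like(r)
    for s in range(0, len(r), win):
        seg = r[s:s + win]
        t = np.arange(len(seg))
        if len(seg) >= 2:
            a, b = np.polyfit(t, seg, 1)      # linear trend over this interval
            out[s:s + win] = seg - (a * t + b)
        else:
            out[s:s + win] = 0.0
    return out

def drop_noisiest_sv(code_by_sv, carrier_by_sv):
    """Rank SVs by the std of their residual; name the most suspect one."""
    sigma = {sv: np.std(code_carrier_residual(code_by_sv[sv], carrier_by_sv[sv]))
             for sv in code_by_sv}
    return max(sigma, key=sigma.get), sigma

# Hypothetical example: SV "G07" carries much heavier code noise than "G01".
rng = np.random.default_rng(0)
n = 300
trend = 2e-3 * np.arange(n)                      # slowly varying ionospheric term
carrier = {sv: np.full(n, 100.0) for sv in ("G01", "G07")}
code = {"G01": 100 + trend + rng.normal(0, 0.5, n),
        "G07": 100 + trend + rng.normal(0, 3.0, n)}
worst, sigma = drop_noisiest_sv(code, carrier)   # worst == "G07"
```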
3.2 Modified Misra04 (RAIM-derivative) filter<br />
The steps we used in our adaptation of the Misra04<br />
algorithm [10], are as follows:<br />
1. Set K as the number of SVs in view less the one<br />
rejected by the pseudo-multipath filter;<br />
2. Divide the sky into six bins as shown in Figure 3;<br />
3. Characterize each SV (space vehicle) as belonging to<br />
one of the bins, based on its elevation and azimuth;<br />
4. Select 4K subsets of SVs from the original set of SVs<br />
as follows: select one SV randomly from each of the<br />
six bins; select two more SVs from the remaining SVs;<br />
if PDOP > 3, select one more from those remaining.<br />
5. Compute, then cluster, 4K positions using these<br />
selections.<br />
6. Compute the mean of the cluster of positions;<br />
7. Compute the distance of each computed position from<br />
the mean of the cluster;<br />
8. Divide the cluster into 5 concentric rings around the<br />
cluster mean, each ring incremented by d = 0.2M, where<br />
M is the distance of the farthest position from the<br />
cluster mean. Hence the rings are at d, 2d, … 5d from<br />
the cluster mean;<br />
9. For each of the 4K positions, assign a value from 1 to<br />
5 to every contributing SV, based on the concentric<br />
ring that the position falls in;<br />
10. Sum these assigned values for each of the K SVs;<br />
11. Discard the signal of the highest-ranked SV.<br />
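The clustering and scoring steps above can be sketched as follows. One detail is our own choice, not the paper's: we rank SVs by their mean ring value rather than the raw sum, so that SVs drawn into more subsets are not penalized merely for appearing more often. The subset positions and membership lists in the example are hypothetical.<br />

```python
import numpy as np

def ring_score_and_discard(positions, members):
    """Cluster the subset position estimates, ring-score each contributing
    SV, and name the SV to discard (steps 5-11 above).

    positions: (P, 2) array, one lat/lon estimate per random SV subset
    members:   list of P iterables naming the SVs used in each subset
    """
    centre = positions.mean(axis=0)                    # step 6: cluster mean
    dist = np.linalg.norm(positions - centre, axis=1)  # step 7: distances
    d = 0.2 * dist.max()                               # step 8: 5 rings of width d
    if d == 0:
        rings = np.ones(len(dist), dtype=int)
    else:
        rings = np.minimum((dist // d).astype(int) + 1, 5)  # step 9: values 1..5
    totals, counts = {}, {}
    for ring, svs in zip(rings, members):              # step 10 (mean, not sum)
        for sv in svs:
            totals[sv] = totals.get(sv, 0) + int(ring)
            counts[sv] = counts.get(sv, 0) + 1
    mean_ring = {sv: totals[sv] / counts[sv] for sv in totals}
    return max(mean_ring, key=mean_ring.get), mean_ring  # step 11

# Hypothetical data: the subsets containing "BAD" land far from the cluster.
positions = np.array([[0, 0], [0.001, 0.001], [-0.001, 0], [0, -0.001],
                      [1, 1], [1.001, 1.002]])
members = [{"A", "B"}, {"A", "C"}, {"B", "C"}, {"A", "B"},
           {"BAD", "A"}, {"BAD", "B"}]
worst, scores = ring_score_and_discard(positions, members)   # worst == "BAD"
```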
Figure 3: We divided the sky into six bins as described in [10].<br />
The two symbols represent SVs from the two constellation<br />
configurations shown in Figure 1.<br />
3.3 Static Position Calculation<br />
We are now left with the original, merged dataset (i.e., the<br />
dataset that simulates a dual GPS/Galileo receiver) less the<br />
signals from the two SVs that were rejected as the least<br />
trustworthy signals. We compute receiver position at each<br />
second from this filtered dataset, then compute mean and<br />
covariance for these position sets. The mean is our new<br />
position estimate for the position of the stationary receiver and<br />
the covariance matrix is an element of characterization data.<br />
4. EXPERIMENTAL RESULTS<br />
In order to gauge the efficacy of our processing we will<br />
need to compare position calculations with and without our<br />
process. Since we are reading carrier phase signals with a<br />
commercial receiver (TIM-LP) prior to the application of<br />
proprietary filters to which we have no access, we are required<br />
to perform our own position calculations, both for our<br />
approach and for the control approach. This means that our<br />
position estimates may not be as accurate as those produced by<br />
the commercial receiver. However, it is the relative<br />
improvement in which we are interested.<br />
For our first test, we recorded two 15-minute data sets from<br />
a stationary receiver, 3.5 hours apart. The location was an older<br />
neighborhood, 3 or 4 meters from two 2-storey houses with a<br />
large-canopied tree about 6 meters away and other houses and<br />
mature trees somewhat further away. The effect of filtering this<br />
first GPS dataset can be seen in Figure 4.<br />
Figure 4: The larger scatter represents unfiltered position<br />
calculations, while the smaller scatter represents positions after<br />
filtering. The two ellipses represent the 3σ distance (in<br />
degrees) from the mean of each cluster.<br />
Covariance: unfiltered scatter         Covariance: filtered scatter<br />
[ 0.3289×10⁻⁸  −0.5489×10⁻⁸ ]          [ 0.0838×10⁻⁸  −0.0516×10⁻⁸ ]<br />
[ −0.5489×10⁻⁸  1.3408×10⁻⁸ ]          [ −0.0516×10⁻⁸  0.1977×10⁻⁸ ]<br />
Table 2: Covariance matrices from the two scatters in Figure 4<br />
(each element represents σ²; hence, units are degrees squared).<br />
The covariance matrices from these two scatters are shown<br />
in Table 2. By comparing the ratios of standard deviations for<br />
LAT and LON taken from these matrices we get a sense of the<br />
percentage level of reduction achieved by these filters.<br />
To illustrate with the first element, the variance in degrees<br />
LAT (σ²LAT), we calculated the percentage change in standard<br />
deviation value as:<br />
1 − (σLAT-filtered / σLAT-unfiltered)<br />
Hence the percentage changes in standard deviation for LAT<br />
and LON are 49.5% and 61.6%, respectively, representing a<br />
considerable reduction in sigma values.<br />
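The quoted reductions follow directly from the diagonal (variance) elements of Table 2; a quick check, taking 0.3289/1.3408 (unfiltered) and 0.0838/0.1977 (filtered) as the LAT/LON variances:<br />

```python
import numpy as np

# LAT/LON variances (degrees^2, x1e-8) from Table 2.
var_unfiltered = np.array([0.3289, 1.3408]) * 1e-8
var_filtered = np.array([0.0838, 0.1977]) * 1e-8

# Percentage change in standard deviation: 1 - (sigma_filtered / sigma_unfiltered)
reduction_pct = 100 * (1 - np.sqrt(var_filtered / var_unfiltered))
lat_pct, lon_pct = reduction_pct  # approximately 49.5 and 61.6, as quoted
```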
A second test, recorded similarly, with the two data subsets<br />
7 hours apart and several meters away, endured less multipath<br />
effects and provided good, but less dramatic results, shown in<br />
Figure 5.<br />
Figure 5: The second data set<br />
unfiltered                             filtered<br />
[ 0.0330×10⁻⁸  −0.0051×10⁻⁸ ]          [ 0.0159×10⁻⁸  −0.0031×10⁻⁸ ]<br />
[ −0.0051×10⁻⁸  0.0519×10⁻⁸ ]          [ −0.0031×10⁻⁸  0.0561×10⁻⁸ ]<br />
Table 3: Covariance Matrices from two scatters in Figure 5.<br />
The percentage change in standard deviation values for<br />
LAT and LON are: 30.6% and -4%, respectively, representing<br />
a significant but mixed reduction in sigma values (LON did not<br />
improve, and may have worsened).<br />
5. CONCLUSIONS and FUTURE WORK<br />
We have shown that it is feasible to reduce positioning<br />
variance due to multipath in the case of a static GNSS receiver.<br />
In these two experiments the higher-multipath data showed the<br />
greatest improvement; of course, much more testing is needed.<br />
By gathering signals for a modest amount of time (we propose<br />
7 to 10 minutes) and using techniques to isolate signals that<br />
contribute relatively more noise than others, and by taking<br />
advantage of the expected dual GPS/Galileo receivers, we are<br />
optimistic we can specify a processor that would be the<br />
positioning engine for a reliable in-vehicle meter for road-pricing,<br />
pay-as-you-drive insurance, and most parking-pricing.<br />
For our first experiment with this approach to reduce<br />
variation in position error for a stationary GNSS receiver, we<br />
have successfully adapted and simplified two existing results<br />
from the literature. Clearly, making decisions to drop the least<br />
trustworthy signals helps, but it is also understood that which<br />
signals are best at any one moment can change, even for a<br />
stationary receiver. For this reason, we are currently exploring<br />
with good success several additional ideas. These include<br />
time-slicing the signals into numerous smaller windows,<br />
iterative removal of 0 or more SVs (rather than removal of<br />
exactly one SV per filter), dynamic thresholds, and others. We<br />
expect to be able to improve considerably on the current<br />
results.<br />
REFERENCES<br />
[1] “Feasibility Study of Road Pricing in the UK: A Report to the Secretary<br />
of State, UK,” Department for Transport 2004.<br />
[2] T. Grayling, J. Foley, and N. Sansom, “In the Fast Lane,” Institute for<br />
Public Policy <strong>Research</strong> (IPPR) – UK, June 2004.<br />
[3] “Fair Payment for Infrastructure Use,” Commission of European<br />
Communities, 1998<br />
[4] H. Appelbe, “Taking Charge,” Traffic Technology International,<br />
October/November 2004, pg 52.<br />
[5] B. Grush, “The Delicate Art of Tolling (Part 1),” Tolltrans, 2004, pg 52;<br />
and Part 2, Traffic Technology International, Dec ‘04/Jan ’05. pg 58.<br />
[6] S. Bisnath and R. Langley, “Pseudorange Multipath By Means of<br />
Multipath Monitoring and De-Weighting,” KIS 2001, June, 2001.<br />
[7] R. Nayak, M. Cannon, C. Wilson, and G. Zhang, “<strong>Analysis</strong> of Multiple<br />
GPS Antennas for Multipath Mitigation in Vehicular Navigation,”<br />
Institute of Navigation National Technical Meeting, Jan 2000.<br />
[8] P. Dana, “Global Positioning System (GPS) Time Dissemination for<br />
Real-Time Applications,” Real-Time Systems, vol. 12, pp. 9–40, 1997.<br />
[9] Y. Feng, “Combined Galileo and GPS: A Technical Perspective,”<br />
Journal of Global Positioning Systems Vol. 2, No.1: 67-72, (2003).<br />
[10] P. Misra and S. Bednarz, “Navigation for Precision Approaches”,<br />
GPSWorld, April 2004.<br />
A SIGNAL CLASSIFICATION APPROACH USING TIME-WIDTH VS FREQUENCY BAND<br />
SUB-ENERGY DISTRIBUTIONS<br />
Karthikeyan Umapathy<br />
Dept. of Electrical and Computer Engg.,<br />
The <strong>University</strong> of Western Ontario,<br />
London, ON N6A 5B8, Canada<br />
Email: kumapath@uwo.ca<br />
ABSTRACT<br />
Time-frequency (TF) signal decompositions provide us with ample<br />
information and extreme flexibility for signal analysis. By applying<br />
suitable processing on the TF decomposition parameters,<br />
even subtle signal characteristics can be revealed. In many real<br />
world applications, identification of these subtle differences makes<br />
a significant impact in signal analysis. Particularly in classification<br />
applications using TF approaches, there may be situations where<br />
a localized high discriminative signal structure is diluted due to<br />
the presence of other overlapping signal structures. To address<br />
this problem we propose a novel approach to construct multiple<br />
time-width vs frequency band mappings based on the energy decomposition<br />
pattern of the signal. These mappings are then analyzed<br />
to locate the highly discriminative features for classification.<br />
Initial results with two real-world biomedical signal databases,<br />
(1) Vibroarthrographic (VAG) signals and (2) Pathological speech signals,<br />
indicate high potential for the proposed technique.<br />
1. INTRODUCTION<br />
Time-frequency (TF) transformations have significantly contributed<br />
towards complex signal analysis and automatic classification. In<br />
classification applications using the TF approach, it is often a small<br />
area or pockets of areas in the TF plane that actually exhibit the<br />
difference between the classes of signals. Within these small areas,<br />
there may be overlapping multiple signal components with varying<br />
discriminative characteristics. The overall discriminative power of<br />
the area is normally decided by the high-energy signal components,<br />
which dilute the discriminative characteristics of lower-energy signal<br />
components. It may so happen that a highly discriminative but<br />
low-energy component is masked by a less discriminative but high-energy<br />
component. Typical biomedical signals contain a mixture of<br />
coherent and non-coherent signal structures with varying localized<br />
overlaps. Using some criteria, if we can separate these localized<br />
overlapping structures, it may lead to a better understanding of the<br />
signal and thereby to the extraction of highly discriminative features for classification<br />
applications.<br />
In general, all real world signals contain both coherent and<br />
non-coherent structures. Coherent structures have definite TF localization,<br />
unlike the non-coherent structures. Any iterative decomposition<br />
algorithm, such as matching pursuit with TF dictionaries,<br />
models the coherent structures during the initial iterations, as<br />
they correlate well with the dictionary elements. The non-coherent<br />
Thanks to NSERC for funding this research work.<br />
Sridhar Krishnan<br />
Dept. of Electrical and Computer Engg.,<br />
<strong>Ryerson</strong> <strong>University</strong>,<br />
Toronto, ON M5B 2K3, Canada<br />
Email: krishnan@ee.ryerson.ca<br />
structures on the other hand are broken into finer and finer structures<br />
till the information is diluted across the whole dictionary [1].<br />
The contributions of coherent and non-coherent structures in a signal<br />
decide the energy capture pattern of the decomposition algorithms.<br />
The previous work [2] of the authors introduced a novel time-width<br />
vs frequency band mapping (constructed from the decomposition<br />
parameters) to identify the highly discriminative TF tilings<br />
between different classes of signals using the Local Discriminant Bases<br />
(LDB) algorithm. The proposed work uses a similar mapping,<br />
however splitting it into multiple mappings to identify better<br />
discriminatory features.<br />
The paper is organized as follows: Section II covers the methodology,<br />
consisting of the adaptive time-frequency transform, multiple<br />
TFD slices, multiple sn vs fn mappings, databases, feature extraction,<br />
and pattern classification. Results and discussion are given in<br />
Section III and conclusions in Section IV.<br />
2. METHODOLOGY<br />
2.1. Adaptive Time-frequency Transform (ATFT)<br />
The signal decomposition technique used in this work is based on<br />
the matching pursuit (MP) [1] algorithm. MP is a general framework<br />
for signal decomposition. The nature of the decomposition<br />
varies according to the dictionary of basis functions used. When<br />
a dictionary of TF functions is used, MP yields an adaptive time-frequency<br />
transformation [1]. In MP, any signal x(t) is decomposed<br />
into a linear combination of K TF functions g(t) selected<br />
from a redundant dictionary of TF functions as given by:<br />
x(t) = Σ_{n=0}^{K−1} (a_n / √s_n) g((t − p_n) / s_n) exp{j(2π f_n t + φ_n)},  (1)<br />
where a_n is the expansion coefficient; the scale factor s_n, also<br />
called the octave or time-width parameter, is used to control the<br />
width of the window function; and the parameter p_n controls the<br />
temporal placement. The parameters f_n and φ_n are the frequency<br />
and phase of the exponential function, respectively. The signal<br />
x(t) is projected over a redundant dictionary of TF functions with<br />
all possible combinations of scaling, translations and modulations.<br />
The dictionary of TF functions can either suitably be modified or<br />
selected based on the application at hand. In our technique, we<br />
are using the Gabor dictionary (Gaussian functions) which has the<br />
best TF localization properties. At each iteration, the best correlated<br />
TF functions to the local signal structures are selected from<br />
0-7803-8874-7/05/$20.00 ©2005 IEEE<br />
ICASSP 2005<br />
[Figure 1 appears here: a time-width vs frequency band mapping with frequency bands F1–F4 on one axis and time-widths s1, s2, …, sn on the other; the mapping energy ME5 is split as ME5 = ME1 + ME2 + ME3 + ME4.]<br />
the dictionary. The remaining signal, called the residue, is further<br />
decomposed in the same way at each iteration, subdividing the signal<br />
into TF functions.<br />
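A toy sketch of the decomposition loop just described: a small Gabor dictionary is built, the best-correlated atom is selected at each iteration, and its contribution is subtracted from the residue. The dictionary parameters and test signal are illustrative only; a practical ATFT uses a far denser dictionary and many more iterations.<br />

```python
import numpy as np

def gabor_atom(n, s, p, f):
    """Unit-norm Gaussian-windowed complex exponential (a Gabor TF atom)
    with scale s, time centre p, and frequency f (cycles/sample)."""
    t = np.arange(n)
    g = np.exp(-np.pi * ((t - p) / s) ** 2) * np.exp(2j * np.pi * f * t)
    return g / np.linalg.norm(g)

def matching_pursuit(x, atoms, n_iter=5):
    """Greedy MP: at each iteration pick the atom best correlated with the
    current residue, record its coefficient, and subtract its contribution."""
    residue = x.astype(complex)
    picks, energy = [], []
    for _ in range(n_iter):
        corr = np.array([np.vdot(g, residue) for g in atoms])
        k = int(np.argmax(np.abs(corr)))
        picks.append(k)
        energy.append(np.abs(corr[k]) ** 2)   # energy captured this iteration
        residue = residue - corr[k] * atoms[k]
    return picks, np.array(energy), residue

# Illustrative dictionary and signal (not the paper's 1000-iteration setup).
n = 64
atoms = [gabor_atom(n, s, p, f)
         for s in (8, 16, 32) for p in (16, 32, 48) for f in (0.1, 0.2, 0.3)]
x = 2.0 * atoms[4].real + 0.3 * atoms[20].real  # two coherent structures
picks, energy, residue = matching_pursuit(x, atoms)
```

The per-iteration `energy` values form exactly the energy capture pattern discussed in the next subsection: coherent structures are absorbed in the first few iterations, after which the captured energy per iteration tails off.<br />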
2.2. Multiple TFD slices<br />
As explained in Section 1, in the initial iterations, the ATFT algorithm<br />
captures the coherent signal structures which have correlated<br />
TF dictionary elements, and then, as the number of iterations grows,<br />
it tries to model the non-coherent structures by breaking them into finer<br />
and finer structures till the information is diluted across the whole dictionary.<br />
The energy capture pattern can be extracted from the normalized<br />
decomposition parameter an. In order to explain how this energy<br />
capture pattern can be utilized to extract overlapping signal structures,<br />
let us take an example of a synthetic signal y(t) which is<br />
composed of a sinusoid, two chirps and random noise. The signal<br />
y(t) is given by:<br />
y(t) = w1 s(t) + w2 c1(t) + w3 c2(t) + w4 r(t) (2)
where s(t) represents a sinusoid at approximately Fs/4, c1(t) is a linear chirp with increasing frequency cutting across the sinusoid, c2(t) is another linear chirp with decreasing frequency cutting across both the sinusoid and c1(t), and r(t) represents the random noise. The weight factors w1,2,3,4 are (1, 0.1, 0.01, 0.001), respectively. We performed the ATFT decomposition (1000 iterations) of y(t) using a Gabor dictionary. Figures 3(a) and 3(b) show y(t) in the time domain and the TF domain (a spectrogram is used in order to show all three components at the same time). Here we deliberately introduced energy differences between the components to demonstrate the significance of the energy capture pattern. Most of the time, the first few iterations capture a significant amount of signal energy (coherent structures); thereafter, as the number of iterations increases, we move from modeling coherent structures to non-coherent structures. The energy capture pattern of the ATFT decomposition for y(t) is shown in Fig. 2 (the first 50 iterations). The curve represents the normalized energy captured per iteration, and we can see that the energy captured per iteration drops as we move along the iterations. In this work, as an example, we split the energy capture pattern into 4 parts: (E1) the number of iterations at which the energy captured per iteration drops to 10% of its initial value (initial value = 1), (E2) the number of iterations between 10% of
Fig. 1. Time-width vs. frequency band mapping (frequency bands vs. time-widths for the split mappings ME1–ME4).
the initial value and 1% of the initial value, (E3) the number of iterations between 1% and 0.1% of the initial value, and (E4) the number of iterations from 0.1% of the initial value to the end of the decomposition.
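Assuming the normalized energy captured per iteration is available as a sequence, the E1–E4 split at the 10%, 1% and 0.1% thresholds can be coded as below. The decaying curve used here is invented for illustration; it is not the curve of Fig. 2.

```python
import numpy as np

def split_iterations(energy, thresholds=(0.10, 0.01, 0.001)):
    """Partition iteration indices into E1..E4 by the normalized energy
    captured per iteration (the first iteration is taken as 1.0)."""
    energy = np.asarray(energy, dtype=float) / energy[0]
    parts, start = [], 0
    for th in thresholds:
        # first iteration whose captured energy falls below the threshold
        below = np.flatnonzero(energy < th)
        cut = below[0] if below.size else len(energy)
        parts.append(list(range(start, cut)))
        start = cut
    parts.append(list(range(start, len(energy))))  # E4: tail of the decay
    return parts  # [E1, E2, E3, E4]

# Hypothetical decaying energy-capture curve (not the paper's data).
curve = [1.0, 0.5, 0.2, 0.08, 0.05, 0.02, 0.008, 0.004, 0.0009, 0.0002]
E1, E2, E3, E4 = split_iterations(curve)
```

The TF functions of each part would then be used to build one of the four split TFDs.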
Fig. 2. Energy capture pattern of the sample signal y(t): normalized energy captured per iteration (log scale, 10^0 down to 10^-4) over the first 50 iterations, with the decomposition pattern partitioned into E1, E2, E3 and E4.
Following the energy capture pattern, we accumulate the TF functions into the four parts explained above (E1–E4). For this example, we had 5 TF functions for E1, 9 for E2, 16 for E3 and 970 for E4. These counts show that almost 99% of the signal energy needs only 30 TF functions (1 to E3), whereas the remaining 1% of the signal energy (mostly noise-like structures) needs 970 TF functions or more. Using these 4 sets of TF functions we construct 4 different TFDs, i.e. we split the original TFD of y(t) into 4 TFDs based on the energy capture pattern. The corresponding 4 TFDs are shown in Figs. 3(c), 3(d), 3(e) and 3(f). Looking closely at the TFDs, we see that the TFD in Fig. 3(c) shows the sinusoid s(t) alone, the TFD in Fig. 3(d) shows the disappearing sinusoid, the TFD in Fig. 3(e) shows the chirp c1(t) evolving from the sinusoid background, and the TFD in Fig. 3(f) shows a stronger but noisy chirp c1(t), a faint evolving chirp c2(t) and the random noise. TFDs 3(c) to 3(f) are clearly better individual representations of the signal components than the combined TFD 3(b). In this example, if it so happens that one of the components that was masked by the overlapping strong component is
Fig. 3. (a) Sample signal y(t) (amplitude vs. time samples); (b) TFD of the sample signal (normalized frequency up to Fs/2 vs. time samples); (c) TFD of the sample signal with the TF functions of E1; (d) TFD with the TF functions of E2; (e) TFD with the TF functions of E3; (f) TFD of the residue signal.
the discriminator that we are looking for, then the proposed technique of generating multiple TF mappings using the energy capture pattern will be of immense help. It should be noted that the energy split shown in this example is not the best one for showing all the components individually and separately; it is meant only to give an idea of the possibility of using the energy capture pattern for separating overlapping structures in complex situations. Also, this approach may not work in all situations unless there are hidden signal structures with (a) different energy contributions, (b) different contributions from coherent and non-coherent structures, or both (a) and (b). Extending this same concept of multiple TF mappings, we now apply it to a novel time-width vs. frequency band mapping, as explained in Section 2.3.
2.3. Multiple sn vs fn mappings<br />
In order to effectively analyze for classification applications, the<br />
ATFT signal decomposition parameters need to be rearranged in a<br />
pseudo dictionary format. There are five parameters as explained<br />
in Section 2.1 viz. an, sn, fn, pn and φn that represent the index<br />
of each of the dictionary element. After a signal is decomposed<br />
into TF functions, we group the TF functions with the time-width<br />
parameter sn in X axis and the the fn parameter in the Y axis.<br />
In order to reduce the computational complexity instead of using<br />
all the possible values of the fn parameter we break the frequency<br />
range into M bands only. whereas sn takes all the possible values<br />
(2 1..14 ) depending on the length of the signal. Each combination of<br />
sn with one of the M frequency bands forms a cell, which contains the cumulative normalized energy of all the TF functions falling into that particular combination of sn and frequency band. The left side of Fig. 1 shows a sample time-width vs. frequency band mapping. In the proposed work we used 4 frequency bands, which means we transform the decomposition parameters of a signal into a 14 time-width (sn) vs. 4 frequency band mapping.
From this time-width vs. frequency band mapping we can readily obtain the energy distribution of the signal in terms of the time-width and frequency band (center frequency) decomposition parameters. Depending upon the application, one can choose, say, K cells covering an area that corresponds to a certain amount of signal energy. This area provides the sn and fn ranges that are significant for that particular application, and can be arrived at by averaging the time-width vs. frequency band mappings of N sample signals. For classification applications this can be done using the LDB algorithm, as demonstrated in the authors' previous work [2]. Considering the benefits of multiple TFD slices in signal analysis, as explained in Section 2.2, instead of using one time-width vs. frequency band mapping that covers all the signal energy, we slice it into L time-width vs. frequency band mappings, as shown in Fig. 1 (L = 4). These L sliced time-width vs. frequency band mappings are expected to separate out the overlapping energy distributions of the TF functions based on the energy capture pattern and thereby enhance the discriminatory power of the cells. We performed classification on two biomedical signal databases to verify the effectiveness of the proposed technique of splitting the time-width vs. frequency band mapping.
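The cell-accumulation step can be sketched as follows, assuming dyadic scales sn = 2^1 … 2^14, normalized atom frequencies in [0, 0.5), and M = 4 equal-width bands; the atom list is hypothetical.

```python
import numpy as np

def tw_freq_map(a, s, f, n_scales=14, n_bands=4, f_max=0.5):
    """Accumulate normalized atom energies a_n^2 into a
    (frequency-band x time-width) grid; columns index the dyadic
    scales s_n = 2^1 .. 2^n_scales, rows the frequency bands."""
    grid = np.zeros((n_bands, n_scales))
    energies = np.asarray(a, dtype=float) ** 2
    energies /= energies.sum()            # normalized energy per atom
    for e, sn, fn in zip(energies, s, f):
        col = int(np.log2(sn)) - 1        # scale index: s = 2^(col + 1)
        row = min(int(fn / (f_max / n_bands)), n_bands - 1)
        grid[row, col] += e
    return grid

# Hypothetical atoms: (coefficient, scale, normalized frequency).
a = [2.0, 1.0, 1.0]
s = [4, 4, 256]
f = [0.05, 0.30, 0.45]
grid = tw_freq_map(a, s, f)
```

Each cell then holds the cumulative normalized energy of the atoms falling into that (sn, band) combination, and the grid rows can be read off directly as the per-band feature vectors used later in Section 2.5.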
2.4. Databases<br />
(1) Vibroarthrographic (VAG) signals: These are the vibration signals emitted from the human knee joints during an active movement of the leg and can be used to detect early joint degeneration or defects. Extensive work [3] has been done on analyzing these signals using time-frequency approaches. A few important characteristics of VAG signals that make them difficult to analyze are: (i) they are highly non-stationary, (ii) they have varying frequency dynamics, and (iii) they are multi-component signals. The database consists of 36 signals: 19 normal and 17 abnormal.
(2) Pathological speech signals: These are speech signals recorded from pathological and normal talkers in a sound-attenuating booth at the Massachusetts Eye and Ear Infirmary. All signals were sampled at 25 kHz. The signals were the first sentence of the Rainbow Passage: 'When the sunlight strikes raindrops in the air, they act like a prism and form a rainbow.' More details on the classification of this database can be found in the authors' previous work [4]. The database used in this study consists of 30 signals: 15 normal and 15 pathological.
2.5. Feature Extraction and Pattern Classification<br />
Signals from both databases were decomposed using the ATFT algorithm (5000 iterations) as explained in Section 2.1. For each signal, 4 time-width vs. frequency band mappings were created using the decomposition parameters. The energy split used for generating these 4 mappings was the same as the one used in the synthetic signal example (E1–4). In these 4 mappings, each row represents the signal energy distribution over all time-widths for a particular band of frequencies. Let us name the mappings ME1, ME2, ME3 and ME4 and the frequency bands
as F1, F2, F3 and F4, as shown in Fig. 1. For each combination of MEx and Fx we extract P × 14 energy values from the cells as a feature matrix, where P is the number of signals in the database. Of the 16 combinations of MEx and Fx, only the non-zero feature matrices are used for classification. In order to compare the results with the original non-split time-width vs. frequency mapping (call it ME5), another set of 4 feature matrices was generated using the same procedure. When tested with the Ho-Kashyap [5] algorithm, most of these 20 combinations (MEx and Fx) for both databases favored non-linear separability to achieve maximum classification accuracy. However, as the main focus of the proposed technique is to demonstrate the relative improvement in discrimination between the split and non-split time-width vs. frequency mappings, we restrict our analysis to a linear classifier. The extracted features were fed to a Linear Discriminant Analysis (LDA) based classifier using SPSS [6]. The classification accuracy was validated using the leave-one-out method, which is known to provide a least-biased estimate.
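The leave-one-out validation can be sketched with a hand-rolled two-class Fisher discriminant. The paper used SPSS; this numpy version and its toy data are only illustrative of the procedure, not of the reported experiments.

```python
import numpy as np

def fisher_lda_fit(X, y):
    """Two-class Fisher discriminant: w = Sw^-1 (m1 - m0), with the
    decision threshold at the projected midpoint of the class means."""
    m0, m1 = X[y == 0].mean(0), X[y == 1].mean(0)
    Sw = np.cov(X[y == 0].T) + np.cov(X[y == 1].T)
    w = np.linalg.solve(Sw + 1e-6 * np.eye(X.shape[1]), m1 - m0)
    thresh = w @ (m0 + m1) / 2
    return w, thresh

def leave_one_out_accuracy(X, y):
    """Hold out each sample in turn, train on the rest, and score it."""
    correct = 0
    for i in range(len(y)):
        mask = np.arange(len(y)) != i
        w, t = fisher_lda_fit(X[mask], y[mask])
        pred = int(w @ X[i] > t)
        correct += pred == y[i]
    return correct / len(y)

# Two well-separated synthetic classes in place of the real features.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (20, 3)), rng.normal(3, 1, (20, 3))])
y = np.array([0] * 20 + [1] * 20)
acc = leave_one_out_accuracy(X, y)
```

Each of the P signals is classified by a model trained on the other P - 1, so every prediction is made on unseen data, which is why the estimate is nearly unbiased.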
3. RESULTS AND DISCUSSION<br />
A two-stage classification was performed for the VAG database. In the first stage, we performed a two-group classification of the VAG signals into normal and abnormal. Table 1 shows the highest classification accuracy achieved out of the 20 combinations of MEx and Fx. We observed an overall classification accuracy of 88.9% using leave-one-out (Cross. V) based LDA for the combination of ME4 and F3. This is higher than the classification accuracies reported by existing works for this database. There is no difference in classification accuracy compared with the non-split combination ME5F3. This is because F3 is non-zero only for ME4, which means there is no overlap in F3, so ME4F3 and ME5F3 are effectively the same. The results also suggest that the discriminatory information between normal and abnormal lies in F3.
Table 1. Two-group classification accuracy for the VAG database (Cross.V = leave-one-out LDA; % = percentage of classification).

Method    Groups      Normal   Abnormal   Total
Cross.V   Normal        15        4         19
          Abnormal       0       17         17
%         Normal       78.9     21.1       100
          Abnormal       0      100        100
We then performed the second stage of classification on the 17 abnormal VAG signals. The abnormal VAG signals in the database come from different kinds of knee pathologies. Chondromalacia patella (CMP) [3] is one of these pathologies and has four grades (I, II, III and IV) based on severity; classifying the gradings from the VAG signals is a difficult task. Of the 17 abnormal signals, 10 were CMP signals. We performed a three-group classification on these 10 signals, viz. grades (I and II), grades (II and III) and grades (III and IV). We observed a perfect classification of 100% using leave-one-out based LDA for the combination of ME2 and F1. None of the other combinations, including the non-split ME5F1, could achieve 100% classification. This result shows that splitting the time-width vs. frequency band mappings does enhance the discriminatory power, and also indicates that the discriminatory features for CMP lie in the ME2 and F1 mapping.
Similarly, we performed a two-group classification (normal and pathological) for the pathological speech database. Table 2 shows the highest classification accuracy achieved out of the 20 combinations of MEx and Fx. An overall classification accuracy of 93.3% was achieved using leave-one-out based LDA. The reported classification is for the combination ME1F1 and the non-split ME5F1, in which case the classification accuracy remains the same with or without splitting the time-width vs. frequency mapping. However, the results suggest that the discriminatory information lies in ME1 and F1.
Table 2. Two-group classification accuracy for the pathological speech database (Cross.V = leave-one-out LDA; % = percentage of classification).

Method    Groups         Normal   Pathological   Total
Cross.V   Normal           13          2           15
          Pathological      0         15           15
%         Normal          86.7       13.3         100
          Pathological     0         100          100
4. CONCLUSIONS<br />
A TFD-splitting approach for enhancing the discriminatory power of TF representations was proposed. The technique was explained using a synthetic signal and two real-world signal databases. Applying the technique to the VAG database showed a significant improvement in the sub-classification of abnormal signals. Although the results are inconclusive for the real-world databases, this approach may be better suited to identifying finer discriminatory features within global classifications. Adaptively choosing the energy split might improve the significance of the proposed technique. Future work involves arriving at a suitable energy split ratio based on the nature of the signal, increasing the number of frequency bands, and extracting visual features by treating the time-width vs. frequency mapping as an image.
5. REFERENCES<br />
[1] S. G. Mallat and Z. Zhang, "Matching pursuits with time-frequency dictionaries," IEEE Trans. Signal Processing, vol. 41, no. 12, pp. 3397–3415, 1993.
[2] K. Umapathy, S. Krishnan, and A. Das, "Sub-dictionary selection using local discriminant bases algorithm for signal classification," in Proc. IEEE Canadian Conference on Electrical and Computer Engineering, Niagara Falls, Canada, May 2004, pp. 2001–2004.
[3] S. Krishnan, "Adaptive signal processing techniques for analysis of knee joint vibroarthrographic signals," Ph.D. dissertation, University of Calgary, June 1999.
[4] K. Umapathy, S. Krishnan, V. Parsa, and D. Jamieson, "Time-frequency modeling and classification of pathological voices," in Proc. IEEE Engineering in Medicine and Biology Society (EMBS) Conference, Houston, TX, USA, Oct. 2002, pp. 116–117.
[5] M. H. Hassoun and J. Song, "Adaptive Ho-Kashyap rules for perceptron training," IEEE Trans. Neural Networks, vol. 3, no. 1, pp. 51–61, 1992.
[6] SPSS Inc., SPSS Advanced Statistics User's Guide. Chicago, IL: SPSS Inc., 1990.
INDEXING OF NFL VIDEO USING MPEG-7 DESCRIPTORS AND MFCC FEATURES<br />
Syed G. Quadri, Sridhar Krishnan and Ling Guan<br />
Dept. of Electrical and Computer Engineering, Ryerson University
Toronto, Canada, M5B 2K3<br />
{squadri,krishnan,lguan}@ee.ryerson.ca<br />
ABSTRACT<br />
In this paper, we propose an application system to classify American football (NFL) video shots into 4 categories, namely: pass plays, run plays, field goal/extra point plays (FG/XP) and kickoff/punt plays (K/P). The proposed system consists of two stages: the first is responsible for play event localization, and the second for feature mapping and classification. For play event localization we propose an algorithm that uses the MPEG-7 motion activity descriptor and the mean of the motion vector magnitudes collaboratively to detect the starting point of a play event within a video shot with 83% accuracy. The indexing and classification stage uses MPEG-7 motion and audio descriptors along with Mel-Frequency Cepstral Coefficient (MFCC) features to classify the events into the 4 categories using Fisher's LDA. We obtain an indexing accuracy of 92.5% using the leave-one-out classification technique on a database of 200 video shots taken from 4 different games on 4 different networks.
1. INTRODUCTION<br />
The concept of On-Demand entertainment and programming is fast becoming a reality with the popularity of digital TV channels. Nearly every professional sports league and team in North America has a digital channel boasting On-Demand programming and statistics. The reality, however, is that it takes nearly three to four hours of post-production work to prepare the highlights for a game. For example, on NFL Sunday Ticket, Highlights-On-Demand for Sunday's games only becomes available on Monday morning. In order to minimize this delay, we need a system that can analyze the contents of the broadcast and derive semantics from the input. These semantics can be made available to users for querying, in order to create a true On-Demand experience. Recently, a lot of research has been conducted on automating the indexing and annotation of sports video streams, and nearly all the major sports have been used to test indexing and retrieval systems. One of the major projects generating semantic sports video annotations is the ASSAVID project. As detailed in [1], this project focuses on developing a system that can categorize different types of sports and provide users with an interface to query events in a particular sport.
In [2], Miyauchi et al. used audio, textual and visual information to classify NFL video into events such as touchdowns and field goals. In [3], Lazarescu et al. classified different types of formations within NFL games using the natural-language commentary from the game, geometrical information about the play, and domain knowledge. In [4], Nitta et al. used closed-caption text and audio-visual information to classify plays into 3 categories, namely scrimmage, FG/XP and K/P.
All of the works mentioned above rely on domain knowledge to classify different high-level concepts within American football. We instead propose a system that classifies recurring events of the game without using any domain knowledge. These recurring events are the most basic components of the game; by classifying them first, we can look for higher-level concepts contained within each of the basic events and thus generate a hierarchical graph of concepts ranging from low level to high level. In this work we focus on utilizing existing standard MPEG-7 descriptors as the basic feature set. In [5], the authors proposed applications for generating summary highlights in the sports domain using the MPEG-7 motion descriptor, but to our knowledge no one has used MPEG-7 audio and motion descriptors to index recurring events in the American football domain.
Section 2 details the algorithm proposed for localizing play events within NFL video shots, along with an analysis of its performance. Section 3 provides details on the feature set used for indexing and the classification scheme. Section 4 presents the results of the classification, and Section 5 provides concluding remarks and future directions.
2. LOCALIZATION OF PLAY EVENT<br />
Sports have a very well-defined structure: they have a set of rules that must be followed in order for the game to be
0-7803-8874-7/05/$20.00 ©2005 IEEE II - 429<br />
ICASSP 2005<br />
Figure 1. Motion vector magnitudes for various plays.
played properly. Many sports, such as golf, baseball, bowling and American football, require that the team or players be in a distinctive position before each play. In golf, the player positions himself by the ball in order to hit it in a certain direction. Likewise, in American football the two teams first line up face to face before the ball is snapped to begin the play. The common theme among these sports is that before the play starts, the level of motion activity in the video is lower than when the play is underway. This distinction in motion activity is exploited in the proposed algorithm to segment play events from non-play events. Figure 1 shows the magnitude of motion vectors in different types of NFL plays.
2.1. Proposed Play Event Detection Algorithm<br />
The primary objective of the algorithm is to detect the key frame that can be used as the starting point of the play event in the shot. The end point of the play event is not extracted because, in most American football video shots containing play events, the shot usually terminates at the end of the play. In order to extract the intensity of motion descriptor, MPEG-1 video motion vectors are used. Only the motion vectors from the P frames are analyzed, in order to speed up processing. In MPEG-7, the motion activity descriptor represents the standard deviation of the motion vector magnitudes within a frame. The intensity of motion activity descriptor and the mean of the motion vector magnitudes are used collaboratively in the algorithm to detect the starting point of the play event. An analysis of 20 video shots selected from each category was conducted to estimate the thresholds for the mean and standard deviation of the motion vectors. The following steps detail the algorithm:
Step 1: Find a P frame with a mean motion-vector magnitude of 4 or higher.
Step 2: Determine the gradient of the mean values within a window (3 or 4 adjacent frames).
Step 3: If the gradients are all positive, mark the frame as a possible starting point; else go back to Step 1.
Step 4: If the intensity of motion descriptor has a value of 2 or higher, return the frame number as the starting point.
Step 5: Otherwise, determine the gradient of the standard deviation values within the same window.
Step 6: If the gradients are all positive, return the frame number as the starting point; else go back to Step 1.
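Assuming the per-P-frame mean and standard deviation of motion-vector magnitudes are already available, the six steps might be coded roughly as below. This is a sketch: the raw standard deviation stands in for the quantized intensity descriptor, and the "window" is interpreted as the number of consecutive positive gradients, so details may differ from the authors' implementation.

```python
def detect_play_start(means, stds, window=3,
                      mean_thresh=4.0, intensity_thresh=2.0):
    """Scan P-frame motion statistics for the play start: a frame whose
    mean magnitude reaches the threshold (Step 1), followed by a window
    of rising means (Steps 2-3), confirmed either by the intensity of
    motion (Step 4) or by rising standard deviations (Steps 5-6).
    Thresholds follow the values quoted in the text."""
    n = len(means)
    for i in range(n - window):
        if means[i] < mean_thresh:                      # Step 1
            continue
        rising_mean = all(means[j + 1] > means[j]       # Step 2
                          for j in range(i, i + window))
        if not rising_mean:                             # Step 3
            continue
        if stds[i] >= intensity_thresh:                 # Step 4 (std as a
            return i                                    # descriptor proxy)
        rising_std = all(stds[j + 1] > stds[j]          # Step 5
                         for j in range(i, i + window))
        if rising_std:                                  # Step 6
            return i
    return None

# Hypothetical motion statistics: a quiet line-up, then the snap.
means = [1.0, 1.5, 2.0, 4.5, 5.0, 6.0, 7.5, 8.0]
stds  = [0.5, 0.5, 0.8, 1.0, 2.5, 3.0, 3.5, 3.0]
start = detect_play_start(means, stds)
```

On this toy sequence the mean first reaches the threshold at frame 3 and keeps rising, and the rising standard deviations confirm it as the start frame.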
2.2. Play Event Detection Algorithm Evaluation<br />
The above algorithm was tested on the American football video shot database, which consists of 200 video shots taken from 4 different games and 4 different networks. In order to measure the performance of the algorithm, we must first establish the ground truth for the starting point of the play event within each video shot. This was accomplished by having an observer manually index the frame number that best represented the start of the play event. Results were compared by computing the delta between the ground-truth frame number and the frame number estimated by the algorithm. This delta still needed to be interpreted in the actual time domain; that is, we need to evaluate whether the algorithm estimates a starting point too early, or only after a certain delay.
Since MPEG-1 video has a frame rate of 30 frames/s, building a histogram with a bin size of 30 frames gives a general idea of how far the estimated frame numbers are from the ground truth in the actual time domain. Figure 2 shows the histogram of the number of shots within each time unit; negative time units represent early detection and positive time units represent delayed detection. From Figure 2 we can see that the algorithm detects the starting points of the plays with 83% accuracy: 166 of the 200 video shots in the database had their starting points detected within ±1 second of the true starting point. The accuracy of the algorithm can be increased to 86.5% by increasing the window size from 3 frames to 4 frames, but this change has a side effect: by increasing the window size, we look for motion activity sustained over a longer period, which yields more shots with delayed detection.
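The ±1 second accuracy figure corresponds to counting detections whose frame delta falls within one 30-frame bin of zero, e.g.:

```python
import numpy as np

def within_one_second(ground_truth, estimated, fps=30):
    """Fraction of shots whose detected start frame lies within
    +/- fps frames (i.e. +/- 1 s at 30 frames/s) of the ground truth."""
    delta = np.abs(np.asarray(estimated) - np.asarray(ground_truth))
    return float(np.mean(delta <= fps))

# Hypothetical ground-truth and detected start frames for four shots;
# the third detection is 100 frames (over 3 s) late.
gt = [100, 250, 400, 900]
est = [110, 240, 500, 905]
acc = within_one_second(gt, est)
```

With 166 of 200 shots inside that window, the same computation on the real database yields the reported 0.83.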
3. INDEXING AND CLASSIFICATION<br />
One of the biggest application areas for MPEG-7 is multimedia indexing and retrieval. Since the introduction of the MPEG-7 standard, significant research effort has been put into developing applications based on its descriptors, but to date only a few applications utilize MPEG-7 descriptors for sports video indexing and retrieval. The application we are proposing is a first
Figure 2. Performance of the algorithm in actual time.
in the American football domain to utilize MPEG-7 motion and audio descriptors along with MFCC features. In the American football domain, visual or motion features play a significantly dominant role in discriminating between different types of plays, as is evident from Figure 1. Therefore we first evaluate the efficacy of using motion descriptors for an American football video indexing system, and then evaluate the change in system performance when audio descriptors and MFCC features are added.
3.1. MPEG-7 motion features<br />
According to the MPEG-7 description [6], the standard deviation of the magnitudes of the motion vectors forms the intensity of motion descriptor. The descriptor takes on values of 1 through 5, a low value meaning low intensity of motion. Experiments using 5 levels showed that most of the motion descriptors were quantized into only 2 or 3 levels; thus, to provide better motion-activity resolution, the descriptor was quantized into 12 levels. Similarly, according to the MPEG-7 description, the dominant direction descriptor is calculated by quantizing the angles of the motion vectors into 8 levels. In this work the same 8 quantization levels were used to define the dominant direction descriptor.
A 2D feature map was created by combining the two motion activity descriptors. The motivation was to create a feature set that models both the intensity and the direction of motion, thus discriminating between, say, high-intensity motion in the upward direction and high-intensity motion in the lateral direction. As can be seen from Figure 3, the feature map provides a unique representation of only 12 × 8 dimensions for both the intensity and direction of motion. In the feature map, blue corresponds to low values and red to high values.
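A rough sketch of building such a map from (hypothetical) per-frame motion vectors: the intensity level is taken as the quantized standard deviation of the magnitudes and the direction is quantized into 8 angular sectors. The quantization ranges here are assumptions for the example, not the normative MPEG-7 tables.

```python
import numpy as np

def motion_feature_map(mvx, mvy, n_int=12, n_dir=8, max_mag=16.0):
    """Build the 12 x 8 intensity-by-direction map: per frame, quantize
    the std of motion-vector magnitudes into n_int levels and each
    vector's angle into n_dir sectors, then accumulate counts."""
    fmap = np.zeros((n_int, n_dir))
    for vx, vy in zip(mvx, mvy):          # one frame's motion vectors
        vx, vy = np.asarray(vx, float), np.asarray(vy, float)
        mag = np.hypot(vx, vy)
        level = min(int(mag.std() / max_mag * n_int), n_int - 1)
        angles = np.arctan2(vy, vx) % (2 * np.pi)
        sectors = (angles / (2 * np.pi / n_dir)).astype(int) % n_dir
        for sct in sectors:
            fmap[level, sct] += 1
    return fmap

# Two hypothetical frames: mostly horizontal motion, then vertical.
mvx = [[1.0, 2.0, -1.0], [0.0, 0.0]]
mvy = [[0.0, 0.0, 0.0], [3.0, 8.0]]
fmap = motion_feature_map(mvx, mvy)
```

Each frame contributes its vectors to the row selected by its motion intensity, so plays with sustained high-intensity motion in a particular direction light up a distinct region of the map.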
Figure 3. Motion feature map.
3.2. MPEG-7 audio features<br />
The motivation behind using audio descriptors is that most sports have a certain vocabulary associated with each event, and almost all announcers will use some of this vocabulary to describe similar events. We therefore wanted a compact representation of audio characteristics that describes the general tone and pitch of the announcer; the objective is to analyze the similarity in the spoken sound between similar events. We used 3 MPEG-7 basic spectral audio features, namely Audio Spectrum Envelope (ASE), Audio Spectrum Centroid (ASC) and Audio Spectrum Flatness (ASF), to achieve this objective. The ASE descriptor represents the power spectrum of an audio signal and is calculated by taking the Fourier transform (FFT) of the audio signal, windowed with a Hamming window with 50% overlap between adjacent windows.
The ASC descriptor represents the center of gravity of the power spectrum. It is calculated by summing the frequency-weighted energy over the bins of the FFT spectrum and dividing by the total energy in the frame:

ASC(l) = \frac{\sum_{k=0}^{K-1} k\,|P(l,k)|^2}{\sum_{k=0}^{K-1} |P(l,k)|^2}, \quad (1)

where k is the frequency-bin index. The descriptor shows which frequencies dominate the spectrum.
The ASF descriptor represents the overall tonal component<br />
in the power spectrum of the audio signal. It is calculated<br />
by dividing the geometric mean of the power spectrum of the<br />
audio frame by its arithmetic mean, as shown by the equation:<br />
ASF(l) = (Π_{k=0}^{K−1} |P(l,k)|²)^{1/N} / ((1/N) Σ_{k=0}^{K−1} |P(l,k)|²), (2)<br />
where k is the frequency-bin index and N is the size of the<br />
short-time Fourier transform window.<br />
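To make the two descriptors concrete, the following sketch computes ASC and ASF per frame from an STFT power spectrum |P(l,k)|², following eqs. (1) and (2). It is an illustrative simplification, not the MPEG-7 reference implementation: the standard's log-frequency band grouping for ASF is omitted, and the function name and input layout are assumptions.<br />

```python
import numpy as np

def asc_asf(power_frames):
    """ASC and ASF per frame from an STFT power spectrum |P(l,k)|^2,
    given as a (frames x K bins) array. Simplified sketch of eqs. (1)-(2)."""
    K = power_frames.shape[1]
    k = np.arange(K)
    total = power_frames.sum(axis=1)
    # ASC: frequency-weighted energy divided by total energy (centre of gravity)
    asc = (k * power_frames).sum(axis=1) / total
    # ASF: geometric mean over arithmetic mean (near 1 = flat/noisy, near 0 = tonal)
    geo = np.exp(np.log(power_frames + 1e-12).mean(axis=1))
    asf = geo / power_frames.mean(axis=1)
    return asc, asf
```

A flat spectrum gives ASF close to 1, while a single strong tone drives it toward 0, which is why ASF is read as a tonality indicator.<br />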
Authorized licensed use limited to: <strong>Ryerson</strong> <strong>University</strong> Library. Downloaded on July 7, 2009 at 11:12 from IEEE Xplore. Restrictions apply.
All the above descriptors were quantized into 10 levels,<br />
thus providing a feature set of 30 dimensions.<br />
3.3. MFCC features<br />
Because most of the video shots contain considerable<br />
crowd noise, and we want to extract the perceived rhythm<br />
and sound of the spoken content, we needed a feature that<br />
can model human hearing and also works well under<br />
noisy conditions. MFCCs have been used extensively in<br />
speech recognition systems because they emphasize the<br />
frequencies that are most perceptible to the human ear.<br />
First the audio file is pre-processed in order to remove<br />
the silent segments. Then 13 MFCC coefficients are extracted<br />
for each segment. Adjacent segments have 50%<br />
overlap, so there is a lot of redundancy between adjacent<br />
MFCC values. In order to reduce the dimension of the<br />
matrix, the MFCC values are passed to a feature reduction<br />
stage. The MFCC features are reduced to a 12 × 64 matrix.<br />
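The pre-processing steps described above (silence removal, then 50%-overlapped segmentation) can be sketched as follows. The energy threshold and frame length are assumptions for illustration; the MFCC computation itself and the feature reduction stage are not shown.<br />

```python
import numpy as np

def remove_silence(x, frame_len=1024, thresh_ratio=0.1):
    """Drop frames whose short-time energy falls below a fraction of the
    mean frame energy (the threshold is an assumption of this sketch)."""
    frames = [x[i:i + frame_len] for i in range(0, len(x) - frame_len + 1, frame_len)]
    energies = np.array([np.sum(f ** 2) for f in frames])
    keep = energies >= thresh_ratio * energies.mean()
    return np.concatenate([f for f, k in zip(frames, keep) if k]) if keep.any() else x[:0]

def frame_overlap50(x, frame_len=1024):
    """Split a signal into frames with 50% overlap, as used for the MFCCs."""
    hop = frame_len // 2
    return np.array([x[i:i + frame_len] for i in range(0, len(x) - frame_len + 1, hop)])
```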
4. EXPERIMENTAL RESULTS<br />
In order to evaluate the efficacy of the feature set, we used<br />
a simple classification scheme, Fisher's Linear Discriminant<br />
Analysis (LDA). In a specific sense, LDA also<br />
commonly refers to techniques in which a transformation<br />
is applied in order to maximize between-class separability and<br />
minimize within-class variability. LDA works on the feature<br />
set with no prior assumptions about the nature of the<br />
data set. It computes a weight vector w which, when<br />
multiplied by the input feature vector x, generates discriminant<br />
functions gi(x). For a C-class problem we define<br />
C discriminant functions g1(x)...gC(x). The feature vector<br />
x is assigned to the class whose discriminant function<br />
yields the largest value for x.<br />
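A minimal sketch of the discriminant-function rule described above, assuming for simplicity that each g_i(x) is built from the class mean under a shared identity covariance (a simplification of Fisher's LDA; the function names are hypothetical):<br />

```python
import numpy as np

def fit_discriminants(X, y):
    """Build C linear discriminant functions g_i(x) = w_i . x + b_i.
    Here w_i is the class mean and b_i = -|w_i|^2 / 2, i.e. a nearest-mean
    rule under identity covariance -- a simplification, for illustration."""
    classes = np.unique(y)
    W = np.array([X[y == c].mean(axis=0) for c in classes])  # one row per class
    b = -0.5 * np.sum(W ** 2, axis=1)
    return classes, W, b

def classify(x, classes, W, b):
    """Assign x to the class whose discriminant function is largest."""
    return classes[np.argmax(W @ x + b)]
```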
The test database consists of 200 video shots with durations<br />
varying from 5 seconds to about 25 seconds. In the<br />
database there are 88 pass plays, 67 run plays and 45 kicking<br />
plays. A total of 8 different teams were used to create<br />
the database from 4 different networks. This variety in the<br />
database ensured that the sample space of our work was diverse<br />
and included all major broadcasters.<br />
Table 1 shows the indexing results of using MPEG-7<br />
motion and audio descriptors along with MFCC features.<br />
5. CONCLUSIONS<br />
In this paper we have proposed a system with two main<br />
components. The first component finds the starting points<br />
of play events within a video shot. The second component<br />
is responsible for indexing and classification of events in<br />
the American football domain. Both components of the<br />
system utilize MPEG-7 motion descriptors, while MPEG-7<br />
Play Events | MPEG-7 motion | MPEG-7 motion+audio | MPEG-7 motion+audio+MFCC<br />
Pass | 79.5% | 85.2% | 94.3%<br />
Run | 92.5% | 91.0% | 89.6%<br />
FG/XP | 87.5% | 87.5% | 93.8%<br />
K/P | 65.5% | 82.8% | 93.1%<br />
Overall | 82.5% | 87.0% | 92.5%<br />
Table 1. Classification Performance Summary<br />
audio and MFCC features are added to enhance the indexing<br />
capabilities of the system.<br />
Although there is no baseline to compare our results<br />
with, somewhat similar works on the indexing and<br />
retrieval of American football events [2] [3] [4] have shown<br />
indexing accuracies of 84%, 81% and 84%, respectively. In<br />
this work we obtained a classification accuracy of 82.5%<br />
using MPEG-7 motion features alone, while the above-<br />
mentioned works used multiple modalities. By using multiple<br />
modalities, our system is able to index the events into 4<br />
categories with 92.5% accuracy.<br />
6. REFERENCES<br />
[1] W.J. Christmas, B. Levienaise-Obadia, J. Kittler,<br />
K. Messer and D. Koubaroulis, “Generation of semantic<br />
cues for sports video annotation,” in Proc. of IEEE<br />
Intl. Conf. on Image Processing.<br />
[2] S. Miyauchi, A. Hirano, N. Babaguchi and T. Kitahashi,<br />
“Collaborative multimedia analysis for detecting semantical<br />
events from broadcasted sports video,” in<br />
Proc. of IEEE 16th Intl. Conf. on Pattern Recognition.<br />
[3] M. Lazarescu, S. Venkatesh, G. West and T. Caelli, “On<br />
the automated interpretation and indexing of American<br />
football,” in IEEE Intl. Conf. on Multimedia Computing<br />
and Systems.<br />
[4] N. Nitta, N. Babaguchi and T. Kitahashi, “Extracting<br />
actors, actions and events from sports video - a fundamental<br />
approach to story tracking,” in Proc. of IEEE<br />
Intl. Conf. on Pattern Recognition.<br />
[5] Z. Xiong, R. Radhakrishnan and A. Divakaran, “Generation<br />
of sports highlights using motion activity in combination<br />
with a common audio feature extraction framework,”<br />
in Proc. of IEEE Intl. Conf. on Image Processing.<br />
[6] B.S. Manjunath, P. Salembier and T. Sikora, Introduction<br />
to MPEG-7: Multimedia Content Description Interface,<br />
John Wiley and Sons, England, UK, 2002.<br />
2004 International Conference on Signal Processing &amp; Communications (SPCOM)<br />
AUDIO SIGNAL FEATURE EXTRACTION AND CLASSIFICATION USING<br />
LOCAL DISCRIMINANT BASES<br />
Karthikeyan Umapathy and Raveendra K. Rao<br />
Dept. of Electrical and Computer Engg.<br />
The University of Western Ontario<br />
London, ON, Canada N6A 5B9<br />
Email: kumapath@uwo.ca, rkrao@eng.uwo.ca<br />
ABSTRACT<br />
Automatic classification of audio signals is an interesting<br />
and challenging task. With the rapid growth of multimedia<br />
content over the Internet, intelligent content-based audio and<br />
video retrieval techniques are required to perform efficient<br />
search over vast databases. Classification schemes form the<br />
basis of such content-based retrieval systems. In this paper<br />
we propose an audio classification scheme using the Local Dis-<br />
criminant Bases (LDB) algorithm. The audio signals were<br />
decomposed using wavelet packets and the highly discrimi-<br />
natory nodes were selected using the LDB algorithm. Two<br />
different dissimilarity measures were used to select the LDB<br />
nodes and to extract features from them. The features were<br />
fed to a Linear Discriminant Analysis based classifier for<br />
a six-group (Rock, Classical, Country, Folk, Jazz and Pop)<br />
and a four-group (Rock, Classical, Country and Folk) clas-<br />
sification. Overall classification accuracies as high as 77%<br />
and 88% were achieved for the six- and four-group classifi-<br />
cations respectively using a database of 170 audio signals.<br />
1. INTRODUCTION<br />
Over the years many existing techniques [1-4] have addressed<br />
the problem of classification and content-based retrieval<br />
of audio signals. The general methodology of audio<br />
classification involves extracting discriminatory features<br />
from the audio data and feeding them to a pattern classifier.<br />
The features can be extracted either directly from the<br />
time domain or from a transformation domain depending<br />
upon the choice of signal analysis tool. Some of the audio<br />
features that have been successfully used for audio classification<br />
include mel-frequency cepstral coefficients (MFCC)<br />
[3], spectral similarity, timbral texture [3], band periodicity<br />
[2], zero crossing rate [2], entropy [5] and octaves [6].<br />
Some techniques generate a pattern from the features and use<br />
it for classification by the degree of correlation. Other<br />
techniques use the numerical values of the features with statistical<br />
classification packages.<br />
0-7803-8674-4/04/$20.00 ©2004 IEEE<br />
Sridhar Krishnan<br />
Dept. of Electrical and Computer Engg.<br />
<strong>Ryerson</strong> <strong>University</strong><br />
Toronto, ON, Canada M5B 2K3<br />
Email: krishnan@ee.ryerson.ca<br />
Audio signals are highly non-stationary in nature and<br />
the best way to analyze them is to use a joint time-frequency<br />
(TF) approach. The previous works [5,6] of the authors<br />
have demonstrated the success of the TF approach in audio clas-<br />
sification. In [5], audio features such as entropy, centroid,<br />
centroid ratio, bandwidth, silence ratio, energy ratio, and fre-<br />
quency location of minimum and maximum energy were<br />
extracted from the spectrogram of the audio signals. These<br />
features were fed to a Linear Discriminant Analysis (LDA)<br />
based classifier to perform a six-group classification. An<br />
overall classification accuracy of 93% was reported with a<br />
database of 143 audio signals. In [6], the distribution values<br />
of the TF decomposition parameter 'octave' over 3 bands of<br />
frequencies were used as the audio features and a similar six-<br />
group classification was performed with a database of 170<br />
audio signals. An overall classification accuracy of 97.6%<br />
was reported.<br />
In order to perform efficient TF analysis on the signals<br />
for classification purposes, it is essential to locate the sub-<br />
spaces on the TF plane that demonstrate high discrimination<br />
between different classes of signals. Once the target sub-<br />
spaces are identified, it is easier to extract relevant features<br />
for classification. In the proposed work we use the Local Dis-<br />
criminant Bases (LDB) algorithm [7] with wavelet packet<br />
bases to identify these target subspaces in the TF plane to<br />
classify the audio signals. The optimal choice of LDBs de-<br />
pends on the nature of the dataset and the dissimilarity mea-<br />
sures used to distinguish between classes. A combination<br />
of multiple dissimilarity measures can be used to achieve<br />
high classification accuracies. Features were extracted from<br />
the basis vectors of the LDB nodes and fed to an LDA based<br />
classifier for a six- (Rock, Classical, Country, Folk, Jazz and<br />
Pop) and four- (Rock, Classical, Country and Folk) group<br />
classification. The paper is organized as follows: Section 2<br />
covers the methodology, comprising the LDB, the LDB selection<br />
process, feature extraction and pattern classification. Re-<br />
sults and discussions are covered in Section 3, and conclu-<br />
sions in Section 4.<br />
2. METHODOLOGY<br />
2.1. Local Discriminant Bases Algorithm<br />
In the LDB [7] algorithm with wavelet packet bases, a set<br />
of training signals x_i^p for all P classes is decomposed to<br />
a full tree structure of order N. The indexes p and i represent<br />
the pth signal class and the ith signal in the training set of the<br />
pth class. We restrict our analysis to binary wavelet packet<br />
trees [8]. Let Ω_{0,0} be a vector space in R^{2^n} corresponding to<br />
the root node of the parent tree. Then at each level the vector<br />
space is split into two mutually orthogonal subspaces given<br />
by Ω_{j,k} = Ω_{j+1,2k} ⊕ Ω_{j+1,2k+1}, where j indicates the level<br />
of the tree and k represents the node index in level j, given<br />
by k = 0, 1, ..., 2^j − 1. This process repeats until level J,<br />
giving rise to 2^J mutually orthogonal subspaces. Each node<br />
k contains a set of basis vectors B_{j,k},<br />
where 2^n corresponds to the length of the signal. Then the<br />
signal x_i^p can be represented by a set of coefficients as:<br />
x_i^p = Σ_j Σ_k Σ_l a_{j,k,l} b_{j,k,l}<br />
Basically the signal x_i^p is decomposed into 2^J subspaces<br />
with a_{j,k,l} coefficients in each subspace. With the train-<br />
ing signals decomposed into wavelet packet coefficients, we<br />
need to define a dissimilarity measure (D) in the vector<br />
space so as to identify those subspaces which have a larger<br />
statistical distance between classes. This dissimilarity mea-<br />
sure is used in an iterative manner to prune the tree in such<br />
a way that a node is split only if the cumulative discrimina-<br />
tive measure of the children nodes is greater than that of the par-<br />
ent node. The resulting tree contains the most significant<br />
LDBs, from which a set of K significant LDBs is selected<br />
to construct the final tree. The testing set signals are then<br />
expanded using this tree and features are extracted from the<br />
respective basis vectors for classification.<br />
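The binary wavelet packet decomposition described above can be sketched as follows. Haar filters are used to keep the sketch short (the paper reports results with sym4), and the signal length is assumed to be a power of two:<br />

```python
import numpy as np

def split(x):
    """One Haar analysis step: scaled pairwise sums (approximation) and
    differences (detail). Orthogonal, so energy is preserved exactly."""
    x = np.asarray(x, dtype=float)
    a = (x[0::2] + x[1::2]) / np.sqrt(2.0)
    d = (x[0::2] - x[1::2]) / np.sqrt(2.0)
    return a, d

def wp_tree(x, depth):
    """Full binary wavelet packet tree: {(j, k): coefficients}, where each
    node (j, k) splits into children (j+1, 2k) and (j+1, 2k+1)."""
    nodes = {(0, 0): np.asarray(x, dtype=float)}
    for j in range(depth):
        for k in range(2 ** j):
            a, d = split(nodes[(j, k)])
            nodes[(j + 1, 2 * k)] = a       # subspace Ω_{j+1,2k}
            nodes[(j + 1, 2 * k + 1)] = d   # subspace Ω_{j+1,2k+1}
    return nodes
```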
2.2. LDB selection process<br />
In the proposed method, we use a modified LDB approach.<br />
Instead of using a single dissimilarity measure, we use two<br />
dissimilarity measures (D1 and D2) to arrive at the final tree<br />
structure. Using multiple dissimilarity measures provides<br />
more feature dimensions for classification. Especially for<br />
complex data sets like music signals, a single dissimilarity<br />
measure may not be able to capture all the characteristic information<br />
about a class. Also, instead of the selective splitting<br />
of the nodes, which basically helps in removing the redundancy<br />
in the LDB selection, we used all the nodes from the<br />
full decomposition tree. The redundancy within the final set<br />
of LDBs was later removed in the feature evaluation pro-<br />
cess.<br />
The goal is to identify those nodes (LDBs) from the full<br />
wavelet packet tree which demonstrate high discriminatory<br />
values between all the classes for a given dissimilarity mea-<br />
sure Di. If there are, say, P classes, then the dissimilarity<br />
measure was computed by taking 2 classes at a time, i.e.,<br />
PC2 combinations, where C stands for the mathematical<br />
operation of combinations. The nodes which show rela-<br />
tively higher discriminatory power compared to all the other<br />
nodes in each of the PC2 combinations were chosen as<br />
LDBs for that particular combination. The LDB nodes are<br />
then sorted by their discriminatory power and the first Q<br />
LDBs were chosen for further processing. This is repeated<br />
for T trials using different audio signals for each of the<br />
classes. All the Q × T LDBs for each of the PC2 com-<br />
binations over the T trials were analyzed for number of oc-<br />
currences. The 2 most frequently occurring LDBs for each of<br />
the PC2 combinations were chosen as the best LDBs for that<br />
particular combination of classes. So after all the trials we<br />
will have 2·PC2 LDBs, from which we choose the 10<br />
most frequently occurring LDBs over all the combinations. At the<br />
end of this selection process we will have 10 LDBs in the<br />
wavelet packet tree that demonstrate relatively high discrim-<br />
inatory behavior among all the combinations of P classes.<br />
In other words, these nodes demonstrate high statistical dis-<br />
tance between all the P classes.<br />
In this study the following values were chosen: P =<br />
6 and 4, Q = 5, and T = 10. We also tested the database<br />
with a few families of wavelets (db, coif, sym) and observed<br />
the sym4 wavelet to provide better discrimination between the<br />
classes. Hence, the results presented in this study are based<br />
on sym4 wavelet packet decompositions. As we also<br />
used two different dissimilarity measures in selecting the<br />
LDBs to enhance the classification accuracy, at the end of<br />
the LDB selection process we will have 2 × 10 LDBs using<br />
the two dissimilarity measures. These 20 LDBs can be used<br />
to construct a composite wavelet packet tree which is used<br />
to decompose the testing set and extract features, as will be<br />
explained in Section 2.3.<br />
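The occurrence-counting step of the selection process (keep the top Q nodes per trial, then retain the most frequently occurring nodes across trials and combinations) can be sketched as below; the data layout is an assumption:<br />

```python
from collections import Counter

def select_top_ldbs(trial_rankings, q=5, n_final=10):
    """Each trial yields a list of nodes ranked by discriminatory power.
    Keep the top q nodes per trial, count occurrences across all trials,
    and return the n_final most frequently occurring nodes."""
    counts = Counter()
    for ranking in trial_rankings:
        counts.update(ranking[:q])
    return [node for node, _ in counts.most_common(n_final)]
```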
2.2.1. Dissimilarity measures<br />
The first dissimilarity measure D1 is the difference in the<br />
normalized energy between the corresponding nodes of the<br />
training signals from one of the PC2 combinations of classes.<br />
This gives the difference in the energy distribution of the<br />
signals on the TF plane. Audio signals exhibit different<br />
TF energy distribution patterns based on their composition.<br />
Hence this measure is expected to approximately reveal the<br />
different energy concentration locations on the TF plane for<br />
different types of audio signals:<br />
D1(j,k) = E(1)_{j,k} − E(2)_{j,k}, (3)<br />
where E(1)_{j,k} and E(2)_{j,k} are the normalized energies of the cor-<br />
responding nodes for the PC2 combination signals. Fig. 1<br />
shows a sample LDB tree obtained using the dissimilarity<br />
measure D1.<br />
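In the spirit of eq. (3), D1 can be sketched as the per-node difference of normalized energy distributions; normalization by total signal energy and taking the absolute difference are assumptions of this sketch:<br />

```python
import numpy as np

def normalized_node_energies(node_coeffs):
    """Energy of each wavelet packet node, normalized by total energy."""
    energies = np.array([np.sum(c ** 2) for c in node_coeffs])
    return energies / energies.sum()

def d1(nodes_a, nodes_b):
    """D1 sketch: difference in normalized energy between corresponding
    nodes of two signals (one 2-class combination)."""
    return np.abs(normalized_node_energies(nodes_a)
                  - normalized_node_energies(nodes_b))
```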
The discriminant measure D2 estimates the randomness<br />
or non-stationarity of the basis vectors.<br />
It is computed as the set of variances along the segments of<br />
the basis vector coefficients. The ratio of this variance mea-<br />
sure between the signals from each of the PC2 combinations<br />
of classes indicates the amount of deviation observed in the<br />
non-stationarity between the classes. One of the important<br />
characteristics of an audio signal is its time-varying signal<br />
structure. This variability in time-varying signal structures<br />
can be approximated using the discriminant measure D2:<br />
D2(j,k) = Σ_{p=1}^{L} var(s(1)_p) / Σ_{q=1}^{L} var(s(2)_q), (4)<br />
where p and q are the indexes of the L segments obtained by<br />
segmenting the basis vectors at node (j, k) for one of the<br />
PC2 combinations of classes. Fig. 2 shows a sample LDB<br />
tree obtained using the dissimilarity measure D2.<br />
Fig. 1. A sample LDB tree obtained using the dissimilarity<br />
measure D1 and sym4 wavelet.<br />
2.3. Feature extraction<br />
An audio database consisting of 24 Rock, 35 Classical, 31<br />
Country, 21 Jazz, 34 Folk and 25 Pop signals (a total of 170<br />
signals) was used in this study. Each signal in this<br />
database was extracted from an individual music CD. All the<br />
samples were of 5 seconds length, sampled at 44.1 kHz. Af-<br />
ter the LDBs were selected as described in the previous sec-<br />
tion, a composite wavelet packet tree was constructed using<br />
all the 20 LDBs. The signals from the audio database were<br />
Fig. 2. A sample LDB tree obtained using the dissimilarity<br />
measure D2 and sym4 wavelet.<br />
decomposed using this composite wavelet packet tree. The<br />
basis vectors from each LDB node of this wavelet<br />
packet tree can be directly used as features. However, con-<br />
sidering the dimensions of the basis vectors, we extract the<br />
discriminatory values using the same dissimilarity measures<br />
(D1 and D2) on the LDB nodes and use them as features. So<br />
for each audio signal we will have 10 features using each of<br />
the dissimilarity measures. In total we will have 20 features<br />
for each signal. The combination of these 20 features was<br />
evaluated for significance in class separability. A<br />
wrapper approach was used to select the highly discrimina-<br />
tive feature set. In the wrapper approach the features are either<br />
added or removed sequentially one by one and the classi-<br />
fication accuracy is computed using the classifier. The set<br />
of features which provides the minimum classification error was<br />
chosen as the optimal feature set. The resulting set of<br />
features was fed to a pattern classifier, as will be explained in<br />
the next section.<br />
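A greedy forward variant of the wrapper approach can be sketched as follows, where eval_error is assumed to retrain the classifier on a feature subset and return its classification error:<br />

```python
def wrapper_select(features, eval_error):
    """Greedy forward wrapper sketch: repeatedly add the feature that most
    reduces the classifier's error; stop when no addition helps."""
    selected, remaining = [], list(features)
    best = float("inf")
    while remaining:
        # evaluate each candidate addition and pick the best one
        err, f = min((eval_error(selected + [f]), f) for f in remaining)
        if err >= best:
            break
        selected.append(f)
        remaining.remove(f)
        best = err
    return selected
```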
2.4. Pattern Classification<br />
The motivation for the pattern classification is to automat-<br />
ically group signals of the same characteristics using the dis-<br />
criminatory features derived as explained in the previous<br />
section. Pattern classification was carried out using an LDA<br />
based classifier. In LDA, the feature vectors derived as ex-<br />
plained above were transformed into canonical discriminant<br />
functions [9] such as<br />
f = w1·b1 + w2·b2 + ... + wn·bn + a, (5)<br />
where {w} is the set of highly discriminative features, {b}<br />
and a are the coefficients and the constant, respectively. Using<br />
the discriminant scores and the prior probability values of<br />
each group, the posterior probabilities of each sample oc-<br />
curring in each of the groups are computed. The sample is<br />
then assigned to the group with the highest posterior proba-<br />
bility.<br />
The classification accuracy was estimated using the leave-<br />
one-out method, which is known to provide a least-biased esti-<br />
mate [10]. In the leave-one-out method, one sample is excluded<br />
from the dataset and the classifier is trained with the remain-<br />
ing samples. Then the excluded signal is used as the test<br />
data and the classification accuracy is determined. This is<br />
repeated for all samples of the dataset. Since each signal<br />
is excluded from the training set in turn, the independence<br />
between the test and the training sets is maintained.<br />
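The leave-one-out protocol can be sketched as below; a nearest-centroid rule stands in for the LDA-based classifier to keep the example self-contained:<br />

```python
import numpy as np

def nearest_centroid_predict(train_X, train_y, x):
    """Assign x to the class with the closest mean feature vector
    (a stand-in for the paper's LDA-based classifier)."""
    labels = np.unique(train_y)
    centroids = np.array([train_X[train_y == c].mean(axis=0) for c in labels])
    return labels[np.argmin(np.linalg.norm(centroids - x, axis=1))]

def leave_one_out_accuracy(X, y):
    """Hold out one sample, train on the rest, test on the held-out sample,
    and repeat for every sample in the dataset."""
    correct = 0
    for i in range(len(X)):
        mask = np.arange(len(X)) != i
        if nearest_centroid_predict(X[mask], y[mask], X[i]) == y[i]:
            correct += 1
    return correct / len(X)
```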
3. RESULTS AND DISCUSSIONS<br />
All the signals from the audio database were decomposed<br />
using the LDB wavelet packet tree and features were ex-<br />
tracted as explained in Sections 2.2 and 2.3. The<br />
features were then fed to the LDA based classifier. The clas-<br />
sification accuracies were computed and verified using the<br />
leave-one-out method. Table 1 shows the classification ac-<br />
curacies achieved for the six-group classification. An overall<br />
classification accuracy of 77% using regular LDA and 65%<br />
using the leave-one-out method was achieved. From the table<br />
it can be observed that the Classical and Country classes<br />
were classified more accurately (94% and 84%), followed<br />
by Rock and Folk (79% and 79%). We observe that<br />
the Jazz and Pop classes have significant overlap with other<br />
classes, showing poor classification accuracy. Fig. 3 shows<br />
the scatter plot of the six-group classification. The cluster-<br />
ing behaviour of the classes can be observed. The Rock, Classi-<br />
cal, Country and Folk classes show distinct clusters, whereas<br />
Jazz and Pop overlap with the clusters of Classical and<br />
Country. It is hard to perceptually arrive at a clear boundary<br />
between music types. There always exists a natural overlap<br />
between similar types of music signals. However, the poor<br />
classification of Jazz and Pop in our case may be attributed<br />
to the insufficient and less discriminatory clues (features)<br />
used in this study. Fine tuning and adding more dissimilar-<br />
ity measures can improve the overall classification accuracy.<br />
As we observed significant overlap for the Jazz and<br />
Pop classes, we performed a second classification using only<br />
4 groups (124 signals), removing Jazz and Pop. This was<br />
done to assess the performance of the classifier with the re-<br />
maining 4 groups. As expected, the overall classification ac-<br />
curacy improved from 77% to 88% for the regular LDA and<br />
from 65% to 82% using the leave-one-out method, as shown in Ta-<br />
ble 2. Fig. 4 shows a clearer clustering behavior of the 4<br />
classes. The results obtained are from our initial testing of<br />
the proposed technique. The authors' previous work [6] using<br />
a different TF approach provided better classification<br />
accuracies, however with double the size of the reported<br />
Method: Original<br />
Gr | Ro | Cl | Co | Ja | Fo | Po | CA%<br />
Ro | 19 | 0 | 5 | 0 | 0 | 0 | 79.2<br />
Cl | 0 | 33 | 0 | 1 | 0 | 1 | 94.3<br />
Table 1. Six-group classification results. Method: Original<br />
- regular linear discriminant analysis; Cross-validated<br />
- linear discriminant analysis with the leave-one-out method;<br />
CA% - classification accuracy rate; Gr - Groups; Ro - Rock;<br />
Cl - Classical; Co - Country; Fo - Folk; Ja - Jazz; Po - Pop.<br />
Fig. 3. Six groups scatter plot with the first two canonical<br />
discriminant functions<br />
features in this work. Also, restricting the final set of significant<br />
LDBs to 10 out of the 2·PC2 candidates constrains the classification<br />
accuracy.<br />
4. CONCLUSIONS<br />
A novel LDB based audio classification scheme was pre-<br />
sented. High classification accuracies were achieved using<br />
the proposed methodology. Initial results suggest significant<br />
potential for LDB based audio classification. Simple dis-<br />
similarity measures like node energy and non-stationarity<br />
index performed well in identifying the discriminatory nodes<br />
Method | Gr | Ro | Cl | Co | Fo | CA%<br />
Original | Co | | | 28 | | 90.3<br />
Cross-validated | Cl | 0 | 32 | 3 | 0 | 91.4<br />
Cross-validated | Co | 1 | 4 | 26 | 0 | 83.9<br />
Table 2. Four-group classification results. Method: Original<br />
- regular linear discriminant analysis; Cross-validated<br />
- linear discriminant analysis with the leave-one-out method;<br />
CA% - classification accuracy rate; Gr - Groups; Ro - Rock;<br />
Cl - Classical; Co - Country; Fo - Folk.<br />
Fig. 4. Four groups scatter plot with the first two canonical<br />
discriminant functions<br />
between the music classes. Future work involves improving<br />
the LDB selection process, arriving at an optimal number of<br />
LDBs for a given classification problem, and including more<br />
dissimilarity measures for audio classification.<br />
5. ACKNOWLEDGEMENTS<br />
The authors thank NSERC for funding this<br />
project. The authors also acknowledge the contributions of<br />
Andre Chang, a research assistant in the Signal Analysis<br />
and Research (SAR) group at Ryerson University, Toronto,<br />
Canada.<br />
6. REFERENCES<br />
[1] H. G. Kim, N. Moreau, and T. Sikora, “Audio clas-<br />
sification based on MPEG-7 spectral basis representa-<br />
tions,” IEEE Transactions on Circuits and Systems for<br />
Video Technology, vol. 14, no. 5, pp. 716-725, May<br />
2004.<br />
[2] Lie Lu and Hong-Jiang Zhang, “Content analysis for<br />
audio classification and segmentation,” IEEE Trans-<br />
actions on Speech and Audio Processing, vol. 10, no.<br />
7, pp. 504-516, Oct 2002.<br />
[3] George Tzanetakis and Perry Cook, “Music genre<br />
classification of audio signals,” IEEE Transactions on<br />
Speech and Audio Processing, vol. 10, no. 5, pp. 293-<br />
302, July 2002.<br />
[4] G. Guo and S. Z. Li, “Content-based audio classifica-<br />
tion and retrieval by support vector machines,” IEEE<br />
Transactions on Neural Networks, vol. 14, no. 1, pp.<br />
209-215, Jan 2003.<br />
[5] S. Esmaili, S. Krishnan, and K. Raahemifar, “Con-<br />
tent based audio classification and retrieval using joint<br />
time-frequency analysis,” in IEEE International Con-<br />
ference on Acoustics, Speech and Signal Processing<br />
(ICASSP), May 2004, pp. V-665-668.<br />
[6] K. Umapathy, S. Krishnan, and S. Jimaa, “Multi-<br />
group classification of audio signals using time-<br />
frequency parameters,” IEEE Transactions on Mul-<br />
timedia, in press.<br />
[7] N. Saito and R. R. Coifman, “Local discriminant<br />
bases and their applications,” Journal of Mathemat-<br />
ical Imaging and Vision, vol. 5, no. 4, pp. 337-358,<br />
1995.<br />
[8] Stephane Mallat, A wavelet tour of signal processing,<br />
Academic Press, San Diego, CA, 1998.<br />
[9] SPSS Inc., “SPSS advanced statistics user's guide,” in<br />
User manual, SPSS Inc., Chicago, IL, 1990.<br />
[10] K. Fukunaga, Introduction to Statistical Pattern<br />
Recognition, Academic Press, Inc., San Diego, CA,<br />
1990.<br />
A NOVEL ROBUST IMAGE WATERMARKING USING A CHIRP<br />
BASED TECHNIQUE<br />
Arunan Ramalingam and Sridhar Krishnan<br />
Department of Electrical and Computer Engineering,<br />
Ryerson University, Toronto, Ontario, Canada M5B 2K3<br />
Email: {aramalin, krishnan}@ee.ryerson.ca<br />
Abstract<br />
In this study, we propose a new spread spectrum im-<br />
age watermarking algorithm that embeds linear chirps as<br />
watermark messages. The slopes of the chirps on the time-<br />
frequency (TF) plane represent watermark messages such<br />
that each slope corresponds to a different message. We<br />
extract the watermark message using a line detection al-<br />
gorithm based on the Hough-Radon transform (HRT). The<br />
HRT detects the directional elements that satisfy a paramet-<br />
ric constraint in the image of a TF plane. The proposed<br />
method not only detects the presence of the watermark, but also<br />
extracts the embedded watermark bits and ensures the mes-<br />
sage is received correctly. The robustness of the proposed<br />
scheme has been evaluated using common image processing<br />
techniques such as JPEG compression and image cropping,<br />
and we found the maximum bit error rate to be 19.03%,<br />
which is zero after postprocessing using the HRT.<br />
Keywords: Image Watermarking, Spread Spectrum, Data<br />
Hiding, Hough-Radon Transform, Chirp Modulation.<br />
1. INTRODUCTION<br />
The huge success of the Internet allows for the trans-<br />
mission, wide distribution, and access of electronic data in<br />
an effortless manner. Content providers are faced with the<br />
challenge of how to protect their electronic data. One of<br />
the possible solutions in that area is data watermark, which<br />
is added to multimedia content by embedding an imper-<br />
ceptible and statistically undetectable signature. Thereby,<br />
multimedia data creators and distributors are able to prove<br />
ownership of intellectual property rights without forbidding<br />
other individuals to copy the multimedia content itself. In<br />
this study, we propose a novel chirp based watermarking<br />
scheme [I] for images that embeds linear chirps as water-<br />
mark messages. Different chirp rates, i.e., slopes on the<br />
time-frequency (TF) plane, represent watermark messages<br />
such that each slope corresponds to a different message. The<br />
narrowband watermark messages are spread with a water-<br />
mark key (PN sequence) across a wider range of frequencies before embedding.<br />
CCECE 2004 - CCGEI 2004, Niagara Falls, May/mai 2004<br />
0-7803-8253-6/04/$17.00 © 2004 IEEE<br />
The resulting wideband noise is<br />
added to the perceptually significant regions of the origi-<br />
nal image. We use the block-based discrete cosine trans-<br />
form (DCT) scheme for inserting the watermark. As a result of image manipulations, some message bits extracted by the detector may be in error, potentially resulting in the detection of the wrong watermark message. Our motivation<br />
for the proposed image watermarking algorithm is to detect<br />
the presence of the watermark, extract the embedded wa-<br />
termark message bits and decide on the watermark message<br />
even if some bits are received in error. As chirps are represented as lines in a TF plane, a line detection algorithm such as the HRT is applied to extract the watermark messages successfully.<br />
2. WATERMARK EMBEDDING<br />
Let m be a normalized chirp function that represents the watermark message to be embedded into the original image. m takes continuous values in the interval [-1, 1], and needs to be quantized for the detection of each embedded bit. m^q is the quantized version of m, formed according to the sign of the sample values of m and taking values -1 and 1. Let m_k^q represent a watermark message bit to be embedded into the image. Each bit m_k^q is spread with a cyclically shifted version p_k of a binary PN sequence with a chip length of N, and the results are summed to generate the wideband noise vector w:<br />
w = \sum_{k=0}^{M} m_k^q p_k, (1)<br />
where M is the number of watermark message bits in m^q. There is a trade-off between the embedded data size and the robustness of the algorithm: as the PN length decreases, the algorithm can add more bits into the host image, but the detection of the hidden bits and the resistance to different attacks are degraded. The wideband noise vector w is added to the image in perceptually significant regions to ensure robustness of the watermark against attacks. The length of w, and hence the<br />
Authorized licensed use limited to: <strong>Ryerson</strong> <strong>University</strong> Library. Downloaded on July 7, 2009 at 10:55 from IEEE Xplore. Restrictions apply.<br />
number of watermark bits that can be embedded depends on the perceptual entropy of the image.<br />
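The spreading step of Eq. (1) can be sketched in a few lines; this is an illustration only, with a made-up chip length and bit pattern rather than the paper's parameters (a chip length of at least 10,000 and 176 bits).<br />

```python
import numpy as np

rng = np.random.default_rng(0)

N = 64                                # chip length (illustrative; the paper uses >= 10,000)
m_q = np.array([1, -1, 1, 1, -1])     # quantized watermark message bits, values +/-1

# Binary PN sequence of +/-1 chips acting as the watermark key.
pn = rng.choice([-1, 1], size=N)

# Each bit modulates a cyclically shifted copy of the PN sequence;
# the shifted copies are summed into the wideband noise vector w (Eq. 1).
w = np.zeros(N)
for k, bit in enumerate(m_q):
    w += bit * np.roll(pn, k)

print(w[:8])
```

Because distinct cyclic shifts of a good PN sequence are nearly orthogonal, each bit can later be recovered by correlating w against the matching shift.<br />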
To embed the watermark in the image, we can utilize models that describe the masking characteristics of the human visual system [2]. Among such models, we use one based on the just noticeable difference (JND) paradigm [3]. A set of JNDs is associated with a particular invertible transform T. Given that a multimedia signal is transformed using T, the JNDs provide an upper bound on the extent to which each of the coefficients can be perturbed without causing perceptual changes to the signal quality. The set of signal- and transform-dependent JNDs can be derived using complex analytic models or through experimentation. The JND paradigm is widely used in image compression and image watermarking applications. We use the JNDs to determine the perceptually significant regions and also to find the perceptual entropy of the image. In this work we use one such model based on the DCT.<br />
DCT Based Model<br />
We use the model proposed by Watson [4] that has been applied to JPEG coding. In this method, the original image is decomposed into non-overlapping 8×8 blocks, and the DCT is performed independently on every block of data. Let the original image pixels be represented as x_{i,j,b}, where i and j index the pixel elements in block b, and let X_{u,v,b} denote the DCT coefficient for the basis function located at position (u, v) of block b.<br />
Fig. 1. Watermark embedding scheme.<br />
A frequency threshold value is derived for each DCT basis function, resulting in an 8×8 matrix of t_{u,v}^f threshold values. These threshold values were determined for various viewing conditions by Peterson et al. [5]. The visual model we use is for a minimum viewing distance of four picture heights and a D65 monitor white point. Watson further refines this model by adding luminance sensitivity and contrast masking components. The luminance sensitivity threshold is estimated by the formula<br />
t_{u,v,b}^L = t_{u,v}^f \left( X_{0,0,b} / \bar{X}_{0,0} \right)^a, (2)<br />
where X_{0,0,b} is the DC coefficient of the DCT for block b, \bar{X}_{0,0} is the DC coefficient corresponding to the mean luminance of the display, and a is a parameter which controls the degree of luminance sensitivity. The authors in [5] suggest setting the value of a to 0.649. Given a DCT coefficient and<br />
a corresponding threshold value derived from the viewing<br />
conditions and local luminance masking, a contrast masking<br />
threshold is derived as<br />
t_{u,v,b}^C = \max\left[ t_{u,v,b}^L, |X_{u,v,b}|^{w_{u,v}} (t_{u,v,b}^L)^{1-w_{u,v}} \right], (3)<br />
where w_{u,v} is a number between zero and one, empirically derived as 0.7 by the authors in [5]. The watermark embedding scheme is based on the model proposed in [6]. The watermark encoder for the DCT scheme is described as<br />
X_{u,v,b}^* = \begin{cases} X_{u,v,b} + t_{u,v,b}^C\, w_{u,v,b}, & X_{u,v,b} > t_{u,v,b}^C \\ X_{u,v,b}, & \text{otherwise,} \end{cases} (4)<br />
where X_{u,v,b} refers to the DCT coefficients, X_{u,v,b}^* refers to the watermarked DCT coefficients, w_{u,v,b} is obtained from the wideband noise vector w, and t_{u,v,b}^C is the computed JND calculated from the visual model described in Eq. (3). Fig. 1 shows the block diagram of the described watermark encoding scheme.<br />
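A minimal sketch of Eqs. (2) and (3) for a single 8×8 block of DCT coefficients; the frequency-threshold matrix t^f and the coefficient values below are uniform placeholders, not the Peterson et al. [5] tables.<br />

```python
import numpy as np

alpha = 0.649        # luminance sensitivity exponent suggested in [5]
w_uv = 0.7           # contrast masking exponent derived empirically in [5]

rng = np.random.default_rng(1)
X = rng.uniform(-50.0, 50.0, size=(8, 8))  # DCT coefficients of one block (made up)
X[0, 0] = 900.0                            # DC coefficient of this block
X_dc_mean = 1024.0                         # DC coefficient for the display's mean luminance
t_f = np.full((8, 8), 10.0)                # placeholder frequency thresholds t^f_{u,v}

# Eq. (2): luminance-sensitivity threshold for the block.
t_L = t_f * (X[0, 0] / X_dc_mean) ** alpha

# Eq. (3): contrast-masking threshold (elementwise over the block).
t_C = np.maximum(t_L, np.abs(X) ** w_uv * t_L ** (1.0 - w_uv))

print(t_C[0, 0], t_C[3, 3])
```

Coefficients with larger magnitudes receive larger JNDs, which is why the watermark energy concentrates in perceptually significant regions.<br />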
3. WATERMARK DETECTION<br />
The received image may differ from the watermarked image due to intentional or unintentional image processing operations such as lossy compression, shifting and downsampling. Fig. 2 shows the block diagram of the described watermark decoding scheme. The detection scheme for the DCT-based watermarking can be expressed as<br />
\hat{w}_{u,v,b} = \frac{\hat{X}_{u,v,b}^* - X_{u,v,b}}{t_{u,v,b}^C}, (6)<br />
where \hat{X}_{u,v,b}^* are the coefficients of the received watermarked image, and \hat{w} is the received wideband noise vector. The received wideband noise vector can be expressed as<br />
\hat{w} = w + n, (7)<br />
where n is the distortion component resulting from hostile image manipulations and is modeled as a zero-mean random vector uncorrelated with the PN sequence. We use the watermark key, i.e., the appropriately circularly shifted PN sequence p_k, to despread \hat{w}, and integrate the resulting sequence to generate a test statistic \langle \hat{w}, p_k \rangle. The sign of the expected value of this statistic depends only on the embedded watermark bit m_k^q. Hence the watermark bits can be estimated using the decision rule<br />
\hat{m}_k^q = \mathrm{sgn}\left( \langle \hat{w}, p_k \rangle \right). (8)<br />
Fig. 4. Line detection using HRT: the Wigner-Ville (WV) distribution and the Hough space of the linear chirp received at a bit error rate of 19.03%. The prominence of the global maximum in the HRT space indicates the presence of chirps in the TFD, thereby leading to successful watermark detection.<br />
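Continuing the earlier embedding sketch, despreading and the sign decision can be illustrated as follows; the bit count, chip length and noise level are invented, and the channel is modeled as simple additive white noise rather than an actual image attack.<br />

```python
import numpy as np

rng = np.random.default_rng(2)

N = 4096                                   # chip length (illustrative)
m_q = rng.choice([-1, 1], size=8)          # embedded watermark bits (made up)
pn = rng.choice([-1, 1], size=N)           # watermark key

# Embed: wideband noise vector, as in Eq. (1).
w = sum(bit * np.roll(pn, k) for k, bit in enumerate(m_q))

# Channel: received vector w_hat = w + n, as in Eq. (7).
w_hat = w + rng.normal(0.0, 1.0, size=N)

# Detect: despread with each shifted key and take the sign of the
# test statistic <w_hat, p_k>, i.e., the decision rule of Eq. (8).
m_hat = np.array([np.sign(w_hat @ np.roll(pn, k)) for k in range(len(m_q))])

ber = float(np.mean(m_hat != m_q))
print("BER:", ber)
```

With a long chip length the correlation gain (N per bit) dwarfs both the noise and the cross-correlation between shifts, which is why the sign decision is reliable.<br />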
4. RESULTS AND DISCUSSION<br />
We evaluated the proposed scheme using eight different images of size 512×512. The sampling frequency f_s of the watermarks equals 1 kHz. Hence the initial and final frequencies, f_0 and f_1, of the linear chirps representing all watermark messages are constrained to [0, 500] Hz. We embed these chirps into the images with a chip length of N, which depends on the perceptual entropy of the image. We experimentally found that for reliable detection of the chirp under various image processing attacks, the chip length should be at least 10,000 samples. If the image can support more than 10,000 samples, then multiple chirps can be embedded and the payload can be increased. In our tests, we used a single watermark sequence having 176 message bits. To measure the robustness of the watermarking algorithm, we performed the following difficult image manipulation tests: (i) JPEG Compression, (ii) JPEG Compression and Cropping (1/4 Original), (iii) JPEG Compression and Cropping (1/16 Original). The JPEG compression is performed with different quality factors Q; a higher value of Q indicates better image quality. These tests are performed on the watermarked images to simulate image processing attacks, and the watermark message bits are extracted as described in Section 3.<br />
During all these robustness tests, we assume that the image<br />
and the PN sequence are synchronized. Figs. 5 and 6 show the bit error rate (BER) obtained for JPEG compression with quality factors 60 and 20, respectively, with a watermark message length of 176 bits. The extracted bits are localized in the TF domain using the WV distribution. Although some of the bits are received in error, the HRT is able to detect the presence of the chirp and estimate its parameters for all the simulation results reported in the study.<br />
Fig. 5. BER (in %) for JPEG compression with quality factor 60.<br />
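The HRT postprocessing can be illustrated with a bare-bones Hough accumulator applied to a synthetic TF image containing one noisy line; this is a generic straight-line Hough transform sketch, not the authors' HRT implementation, and the image below is fabricated.<br />

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic 64x64 "TF plane": a line of slope 0.5 plus scattered
# spurious points that mimic erroneously detected bits.
img = np.zeros((64, 64))
for t in range(64):
    img[int(0.5 * t), t] = 1.0
img[rng.integers(0, 64, 40), rng.integers(0, 64, 40)] = 1.0

# Vote in (rho, theta) space; the accumulator peak reveals the line.
thetas = np.deg2rad(np.arange(0, 180))
diag = int(np.ceil(np.hypot(64, 64)))
acc = np.zeros((2 * diag, len(thetas)))
ys, xs = np.nonzero(img)
for y, x in zip(ys, xs):
    rhos = np.round(x * np.cos(thetas) + y * np.sin(thetas)).astype(int) + diag
    acc[rhos, np.arange(len(thetas))] += 1

rho_i, theta_i = np.unravel_index(np.argmax(acc), acc.shape)
print("peak votes:", acc[rho_i, theta_i], "theta index (deg):", theta_i)
```

Because the collinear points all vote for the same (rho, theta) cell while the spurious points spread their votes, the peak survives a substantial bit error rate, which is the error-correcting effect exploited here.<br />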
5. CONCLUSION<br />
In this paper, we proposed a novel image watermarking algorithm that embeds linear chirps as watermark messages. The watermark message is added to the perceptually significant regions of the image to ensure robustness of the watermark to common image processing attacks. The algorithm is able to extract the watermark message even if some of the received bits are in error. A line detection algorithm based on the HRT detects the slope of the watermark message in the image of the TF plane of the chirp signal. The HRT provides error-correcting capability and can be efficiently implemented, as it operates on small images of the TF plane. Our studies confirm the robustness of the algorithm to image compression and cropping attacks. We are currently working on expanding our robustness tests and developing a complete analytical model for the algorithm.<br />
Acknowledgements<br />
We would like to acknowledge Micronet for their finan-<br />
cial support.<br />
Fig. 6. BER (in %) for JPEG compression with quality factor 20.<br />
References<br />
[1] S. Erkucuk, S. Krishnan and M. Zeytinoglu, "Robust Audio Watermarking Using a Chirp Based Technique," IEEE Intl. Conf. on Multimedia and Expo, vol. 2, pp. 513-516, 2002.<br />
[2] M. S. Sanders and E. J. McCormick, Human Factors in Engineering and Design, McGraw-Hill, New York, 7th edition, 1993.<br />
[3] N. Jayant, J. Johnston, and R. Safranek, "Signal Compression Based on Models of Human Perception," Proceedings of the IEEE, vol. 81, pp. 1385-1422, October 1993.<br />
[4] A. B. Watson, "DCT quantization matrices visually optimized for individual images," Proc. SPIE Conf. Human Vision, Visual Processing, and Digital Display, vol. 1913, pp. 202-216, February 1993.<br />
[5] H. A. Peterson, A. J. Ahumada and A. B. Watson, "Improved detection model for DCT coefficient quantization," Proc. SPIE Conf. Human Vision, Visual Processing, and Digital Display, vol. 1913, pp. 191-201, February 1993.<br />
[6] C. I. Podilchuk and W. Zeng, "Image-Adaptive Watermarking Using Visual Models," IEEE Journal on Selected Areas in Communications, vol. 16, pp. 525-539, May 1998.<br />
[7] R. M. Rangayyan and S. Krishnan, "Feature identification in the time-frequency plane by using the Hough-Radon transform," Pattern Recognition, vol. 34, pp. 1147-1158, 2001.<br />
A Novel Way of Lossless Compression of Digital Mammograms<br />
Using Grammar Codes<br />
Xiaoli Li, Sridhar Krishnan and Ngok-Wah Ma<br />
Department of Electrical and Computer Engineering<br />
<strong>Ryerson</strong> <strong>University</strong>, Toronto, ONT M5B 2K3, CANADA<br />
Phone: 416-979-5000 ext.6086 Fax: 416-979-5280<br />
Abstract: Breast cancer is the most common cancer among women in Canada. Despite slight declines in mortality rates over the past decade, one in nine Canadian women will develop breast cancer in her lifetime, and one in 25 Canadian women will die from this disease. Digital mammograms (X-rays of the breast) may allow better cancer diagnosis and can be transmitted electronically around the world. The problem is that mammograms are large images with little correlation among details. Therefore, for a physician to diagnose diseases correctly even through communication networks, achieving higher compression to save bandwidth without any data loss becomes a challenging issue. Among the traditional lossless compression algorithms, such as Huffman, Lempel-Ziv and Arithmetic coding, the Lempel-Ziv and Arithmetic source coding techniques perform better than Huffman on digital mammograms. In order to achieve better compression ratios, we investigate the newly developed grammar-based source code for medical image compression such as mammograms. In this grammar-based code, the original data (image) is first transformed into a context-free grammar, from which the original data sequence can be fully reconstructed by performing parallel and recursive substitutions; an arithmetic coding algorithm is then used to compress the context-free grammar or the corresponding sequence of parsed phrases. We tested the grammar-based coding technique on digital mammograms obtained from the Mammographic Image Analysis Society (MIAS). The results show that the newly developed grammar code performs better than the traditional lossless coding schemes. In general, the grammar-based lossless compression algorithm seems to be a promising technique for teleradiology applications.<br />
Keywords: Arithmetic coding, grammar-based codes, mammography, compression ratio.<br />
I. INTRODUCTION<br />
In this paper, we investigate a novel lossless source coding<br />
technique called the grammar code for lossless compression of<br />
mammography. Universal source coding theory aims at designing<br />
data compression algorithms, whose performance is<br />
asymptotically optimal for a class of sources.<br />
To put things into perspective, let us first review briefly,<br />
from the information-theoretic point of view, the existing<br />
universal lossless data compression algorithms. So far, the most<br />
widely used universal lossless compression algorithms are<br />
arithmetic coding algorithms, Lempel-Ziv algorithms, and their<br />
variants. Arithmetic coding algorithms and their variants are<br />
statistical model-based algorithms. To use an arithmetic coding<br />
algorithm to encode a data sequence, a statistical model is either<br />
built dynamically during the encoding process, or assumed to<br />
exist in advance. Several approaches have been proposed in the<br />
literature to build the statistical model dynamically. Typically, in<br />
all these methods, the next symbol in the data sequence is<br />
predicted by a proper context and coded by the corresponding<br />
estimated conditional probability. Arithmetic coding algorithms<br />
and their variants are universal only with respect to the class of<br />
Markov sources with order less than some designed parameter<br />
value. Note that in arithmetic coding, the original data sequence is<br />
encoded letter by letter. In contrast, no statistical model is used in<br />
Lempel-Ziv algorithms and their variants. During the encoding<br />
process, the original data sequence is parsed into non-overlapping,<br />
variable-length phrases according to some kind of string matching<br />
mechanism, and then encoded phrase by phrase. Each parsed<br />
phrase is either distinct or replicated with the number of<br />
repetitions less than or equal to the size of the source alphabet.<br />
Phrases are encoded in terms of their positions in a dictionary or<br />
database. Lempel-Ziv algorithms are universal with respect to a<br />
class of sources which is broader than the class of Markov sources<br />
of bounded order; the incremental parsing Lempel-Ziv algorithm [5] is universal for the class of stationary, ergodic sources.<br />
Other universal compression algorithms include the dynamic Huffman algorithm [6], the move-to-front coding scheme [7] [8] [9], and some two-stage compression algorithms with codebook transmission [10] [11]. These algorithms are either inferior to arithmetic coding algorithms and Lempel-Ziv algorithms, or too complicated to implement.<br />
The class of grammar-based codes is broad enough to<br />
include block codes, Lempel-Ziv types of codes, multilevel<br />
pattern matching (MPM) grammar-based codes [2], and other<br />
codes as special cases. It has been proved in [1] that if a grammar-based code transforms each data sequence into an irreducible context-free grammar, then the grammar-based code is universal for the class of stationary ergodic sources. (For the definition of grammar-based codes and irreducible context-free grammars, please see Section II.) Each irreducible context-free grammar also<br />
gives rise to a nonoverlapping, variable-length parsing of the data<br />
sequence it represents. Unlike the parsing in Lempel-Ziv<br />
algorithms, however, there is no upper bound on the number of<br />
repetitions of each parsed phrase. More repetitions of each parsed<br />
phrase imply that now there is room for arithmetic coding, which<br />
operates on phrases instead of letters. (In Lempel-Ziv algorithms,<br />
there is not much gain from applying arithmetic coding to parsed<br />
phrases since each parsed phrase is either distinct or replicated<br />
with the number of repetitions less than or equal to the size of the<br />
source alphabet.)<br />
In Section II, we review how context-free grammars are used to represent a sequence x, and refer the reader to the articles that<br />
explain how the reduction rules are used for designing grammar<br />
transforms, and how to efficiently encode grammars. In Section<br />
III, we describe how we implemented the new algorithm for real<br />
cases and the compression performances of the grammar code and<br />
other traditional lossless compression techniques for<br />
mammographic images. We also discuss the advantages and disadvantages of the new algorithm, and why it is a promising algorithm once a few problems are surmounted.<br />
II. REVIEW OF THE NEW UNIVERSAL CONTEXT-FREE LOSSLESS DATA COMPRESSION ALGORITHM BASED ON A GREEDY CONTEXT-FREE SEQUENTIAL GRAMMAR TRANSFORM<br />
The purpose of this section is to briefly review the new grammar-based code we applied, so that this paper is self-contained. For a detailed description of grammar-based codes, please refer to [3].<br />
Let A be our source alphabet with cardinality greater than or equal to 2. Let A+ be the set of all finite strings of positive length from A. |x| denotes the length of x. To avoid possible confusion, a sequence from A is sometimes called an A-sequence. Let x ∈ A+ be a sequence to be compressed.<br />
A grammar-based code has the structure shown in Fig. 1. The original data sequence x is first transformed into a context-free grammar (or simply a grammar) G from which x can be fully recovered, and then G is compressed indirectly by using a zero-order arithmetic code. Before bringing in the grammar transform, we begin by explaining how context-free grammars are used to represent sequences x in A+.<br />
Figure 1: Structure of a grammar-based code.<br />
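To make the recovery direction of Fig. 1 concrete, here is a toy grammar expanded back into the A-sequence it represents by recursive substitution; the grammar itself is invented for illustration.<br />

```python
# Toy grammar: s0 is the start variable; s1 and s2 are further G-variables.
# Each right-hand side is a string over variables and terminal symbols.
grammar = {
    "s0": ["s1", "s2", "s1", "b"],
    "s1": ["a", "s2"],
    "s2": ["a", "b"],
}

def expand(symbol, rules):
    """Recursively substitute variables until only terminals remain."""
    if symbol not in rules:
        return [symbol]        # terminal symbol from the alphabet A
    out = []
    for s in rules[symbol]:
        out.extend(expand(s, rules))
    return out

x = "".join(expand("s0", grammar))
print(x)  # -> aababaabb
```

Note how repeated substrings ("ab", "aab") are represented once and reused, which is the source of the compression gain.<br />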
Fix a countable set S = {s_0, s_1, s_2, ...} of symbols, disjoint from A. Symbols in S will be called variables; symbols in A will be called terminal symbols. For any j ≥ 1, let S(j) = {s_0, s_1, s_2, ..., s_{j-1}}. For our purpose, a context-free grammar G is a mapping from S(j) to (S(j) ∪ A)+ for some j ≥ 1. The set S(j) will be called the variable set of G and, to be specific, the elements of S(j) shall sometimes be called G-variables. For the purpose of data compression, we are interested only in grammars G for which the parallel replacement procedure terminates after finitely many steps and every G-variable s_i (i < j) represents a distinct A-sequence.<br />
III. IMPLEMENTATION<br />
As presented in Section II, the new lossless grammar-based compression code is accomplished by taking the following three steps:<br />
i) Define a size-on-demand variable set of G and ensure each G-variable is distinct from the source symbols;<br />
ii) Convert the source sequence x into an irreducible context-free grammar by applying a greedy grammar transform which adopts reduction rules in some order [3];<br />
iii) Based on the grammar transform, use one of three universal lossless data compression algorithms (the sequential algorithm, the improved sequential algorithm, or the hierarchical algorithm) to compress the irreducible grammar. All these algorithms combine the power of arithmetic coding with that of string matching.<br />
We define the size |S| of S as the number of G-variables in S. In our work, we fixed the size |S| of S, then applied the irreducible grammar transform, and finally applied the hierarchical data compression algorithm [3]. The rest of this section mainly describes how we implemented the new lossless algorithm, and presents the compression performance of our implementation on mammographic images in 3 categories, from small size (35×5) and middle size (200×150) to large size (1024×1024). Each category has 30 images.<br />
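A highly simplified stand-in for step ii) is a pair-replacement transform that repeatedly substitutes the most frequent adjacent pair with a new G-variable; the actual greedy sequential transform of [3] applies its reduction rules differently, so this is only a structural sketch.<br />

```python
from collections import Counter

def pair_replace(seq, max_vars):
    """Build a small grammar by replacing repeated adjacent pairs."""
    rules = {}
    for v in range(max_vars):
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        pair, count = pairs.most_common(1)[0]
        if count < 2:
            break                      # nothing repeats; stop reducing
        var = "s%d" % v
        rules[var] = list(pair)        # new production rule: var -> pair
        out, i = [], 0
        while i < len(seq):            # rewrite seq, left to right
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
                out.append(var)
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq = out
    return seq, rules

start, rules = pair_replace(list("abababcababc"), max_vars=8)
print(start, rules)
```

On "abababcababc" this yields the start string ['s1', 's0', 'c', 's1', 'c'] with rules s0 -> ab and s1 -> s0 s0, from which the original sequence is recoverable by substitution; the max_vars parameter plays the role of the fixed |S| discussed in this section.<br />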
To obtain a higher compression rate, we directly transformed the MIAS image text into a grammar G without converting each text into a binary stream. However, the elements of the image text are not identical in length compared to their binary forms. To recover the image successfully by decoding later on, we embedded a specific separator symbol between pixel values to distinguish every two neighbors before starting the grammar transform.<br />
As noted in [1][3], the G-variables that represent distinct production rules are distinct. The size |S| of S depends on the image size as the new irreducible grammar transform is applied. Since each production rule's left member is a unique G-variable, its right member is represented by (S(j) ∪ A)+, and string matching is used frequently by the new grammar transform, the G-variables must be available in as large a number as we need. However, the number of visible symbols in the C language, in which we simulated the grammar code, is limited: the maximum number of available symbols is 75. This does not mean that the C language cannot overcome the problem, but doing so would involve more sophisticated programming. Therefore, we adopted two schemes. Even though neither scheme matches the theory exactly, they helped us study the new lossless grammar-based code in depth and verify its feasibility. One is to allow reusing the 75 G-variables (or fewer) to encode the remaining data sequence; as a result, a complete image was represented by several parallel grammars, as shown in Figure 3. The other scheme is to use only 75 G-variables to convert an image into an irreducible grammar in a single pass.<br />
In the first scheme, we used 75 and 35 G-variables respectively to encode the 30 middle-size (200×150) mammographic images, and used 35 G-variables to encode the 30 large-size (1024×1024) mammographic images. From the study, we found that the more variables we used, the more processing time we needed to convert a source image into an irreducible grammar G. For example, using 75 G-variables consumed 15 minutes and 37 seconds on a GNU Linux workstation to encode medium-size images, while using 35 G-variables took only 2 minutes and 4 seconds on average. In medical applications, especially for mammograms, where real-time compression is not an issue, the computational time can be sacrificed to some extent; but for regular images, the computational time of the grammar code must be considered. Another observation from the study of scheme 1 is that its compression rate is better than the Huffman, Lempel-Ziv, and Arithmetic algorithms in some cases, but not significantly so, because such a scheme destroys the correlation of the source image as a whole. Figure 4 displays this conclusion. We also compared the compression rates of using 75 G-variables and 35 G-variables: they are 2.643:1 and 2.639:1, respectively. The compression gain of using 75 G-variables is very limited compared to using 35 G-variables, while taking much longer to process, as described above. Obviously, in scheme 1, using 35 G-variables is good enough for encoding source images. Therefore, to save time, we did not test the 1024×1024 mammographic images with 75 G-variables.<br />
In the second scheme, we compared the average compression rate of the grammar code over 30 small (35×5) mammographic images with the Huffman, Lempel-Ziv, and Arithmetic algorithms. Its compression rate is much greater than that of the other 3 traditional techniques, as displayed in Figure 5. While this scheme is impractical for the compression of large images, since 75 G-variables are not enough for this purpose, the result does demonstrate that if we let |S| be big enough to completely represent a large image, its compression performance will conform to the theory and to Result 2 described in Section II. However, we should be aware of the time consumption involved in processing large images using the grammar code.<br />
Figure 2: A sample of mammographic images. (a) the original image; (b) the image after Grammar decoding.<br />
Grayscale Image<br />
174 175 173 173 174<br />
177 180 182 180 175<br />
176 176 174 173 175<br />
..................................<br />
..................................<br />
Figure 3: The image represented by several grammars.<br />
Figure 4: The compression performances of techniques over 30 1024×1024 digital mammographic images.<br />
Figure 5: The compression performance of techniques over 30 35×5 digital mammographic images.<br />
For transmitting mammograms over a network, overcoming the variable-set requirement of the grammar-based code will provide high compression performance.<br />
IV. CONCLUSIONS<br />
For decades, researchers have kept looking for more effective lossless compression techniques for critical and large images, MIAS images for example, to be transmitted across the internet without any data loss. The new lossless compression grammar-based code attracted our attention and prompted us to verify whether it is a promising code. The results of our simulation show that it is promising for achieving higher compression ratios for large images than the Huffman, Lempel-Ziv, and Arithmetic algorithms.<br />
Assuming that the number of symbols available as the variables of G is infinite, we can completely compress large images according to the original design of the new universal lossless grammar-based code. But this involves more complicated processing and long compression times; these two are the main obstacles to applying the grammar-based code in practice.<br />
REFERENCES<br />
[1] J. C. Kieffer and E.-H. Yang, "Grammar based codes: A new class of universal lossless source codes," IEEE Trans. Inform. Theory, vol. 46, no. 3, May 2000.<br />
[2] J. C. Kieffer, E.-H. Yang, G. Nelson, and P. Cosman, "Universal lossless compression via multilevel pattern matching," IEEE Trans. Inform. Theory, vol. 46, pp. 1227-1245, July 2000.<br />
[3] E.-H. Yang and J. C. Kieffer, "Efficient universal lossless data compression algorithms based on a greedy sequential grammar transform - Part one: Without context models," IEEE Trans. Inform. Theory, vol. 46, pp. 755-788, May 2000.<br />
[4] N. Abramson, Information Theory and Coding. New York: McGraw-Hill, 1963.<br />
[5] J. Ziv and A. Lempel, "Compression of individual sequences via variable rate coding," IEEE Trans. Inform. Theory, vol. IT-24, pp. 530-536, 1978.<br />
[6] R. G. Gallager, "Variations on a theme by Huffman," IEEE Trans. Inform. Theory, vol. IT-24, pp. 668-674, 1978.<br />
[7] J. Bentley, D. Sleator, R. Tarjan, and V. K. Wei, "A locally adaptive data compression scheme," Commun. Assoc. Comput. Mach., vol. 29, pp. 320-330, 1986.<br />
[8] P. Elias, "Interval and recency rank source coding: Two on-line adaptive variable length schemes," IEEE Trans. Inform. Theory, vol. IT-33, pp. 1-15, 1987.<br />
[9] B. Y. Ryabko, "Data compression by means of a 'book stack'," Probl. Inform. Transm., vol. 16, no. 4, pp. 16-21, 1980.<br />
[10] D. L. Neuhoff and P. C. Shields, "Simplistic universal coding," IEEE Trans. Inform. Theory, vol. 44, pp. 778-781, Mar. 1998.<br />
[11] D. S. Ornstein and P. C. Shields, "Universal almost sure data compression," Ann. Probab., vol. 18, pp. 441-452, 1990.<br />
CONTENT BASED AUDIO CLASSIFICATION AND RETRIEVAL USING JOINT<br />
TIME-FREQUENCY ANALYSIS<br />
S. Esmaili, S. Krishnan and K. Raahemifar<br />
Multimedia Information and <strong>Signal</strong> <strong>Analysis</strong> <strong>Research</strong> (MI<strong>SAR</strong>) Laboratories<br />
Department of Electrical and Computer Engineering<br />
<strong>Ryerson</strong> <strong>University</strong>, Toronto, Ontario, Canada<br />
e-mail: (sesmaili)(krishnan)(kraahemi)@ee.ryerson.ca<br />
ABSTRACT<br />
In this paper, we present an audio classification and retrieval technique<br />
that exploits the non-stationary behavior of music signals<br />
and extracts features that characterize their spectral change over<br />
time. Audio classification provides a solution to incorrect and inefficient<br />
manual labelling of audio files on computers by allowing<br />
users to extract music files based on content similarity rather than<br />
labels. In our technique, classification is performed using time-frequency<br />
analysis and sounds are classified into six music groups<br />
consisting of rock, classical, country, folk, jazz and pop. For each 5-second<br />
music segment, the features that are extracted include entropy, centroid,<br />
centroid ratio, bandwidth, silence ratio, energy ratio, and<br />
location of minimum and maximum energy. Using a database<br />
of 143 signals, a set of 10 time-frequency features is extracted,<br />
and a classification accuracy of 93.0% using regular linear<br />
discriminant analysis, or 92.3% using the leave-one-out method, is<br />
achieved.<br />
1. INTRODUCTION<br />
With the abundance of personal computers, advances in high-speed<br />
modems operating at 100 Mbps, and GUI-based peer-to-peer (P2P)<br />
file-sharing systems that make it simple for individuals without<br />
much computer knowledge to download their favorite music, there<br />
has been an increase of digitized music available on the Internet<br />
and on personal computers. As such, there is also a rising need<br />
to manage and efficiently search the large number of multimedia<br />
databases available online, which is difficult using text searches<br />
alone. Current multimedia databases are indexed by song<br />
title or artist name, which requires manual entry, and improper<br />
indexing can result in incorrect searches. A more effective content-based<br />
retrieval system analyzes audio signals, selects and extracts<br />
dominant perceptual features, and classifies the music based<br />
on these features. Stronger features provide a higher degree of<br />
separation between classes and thereby a higher classification accuracy.<br />
The aim is to make music search engines as effective as<br />
text-based ones and this is examined further in this paper.<br />
In recent years, there have been many works on audio classification<br />
with various perceptual features and several classification<br />
algorithms. In one of the pioneering works on audio classification,<br />
later commercialized as the “Muscle Fish” project, Wold<br />
et al. [1] extracted an N-dimensional vector consisting of several<br />
acoustical features such as loudness, pitch, brightness, bandwidth<br />
Thanks to Micronet and NSERC for funding.<br />
and harmonicity from each sound. A Euclidean (Mahalanobis)<br />
distance is then calculated between the input sound feature vector<br />
and the existing models in the database. Using the nearest neighbor<br />
(NN) rule, the signal is grouped into the class with the minimal<br />
Euclidean distance.<br />
In a work similar to that of [1], Liu et al. [2] extract 13 different<br />
audio features to separate audio clips into different scene classes<br />
such as advertisement, basketball, football, news and weather. Features<br />
consist of volume distribution, pitch contour, bandwidth, frequency<br />
centroid and energy. A neural network classifier with a<br />
one-class-in-one network (OCON) structure is used and an overall<br />
classification rate of 88% is achieved. Artificial neural networks<br />
(ANN) are effective in detecting complex nonlinear relationships<br />
while requiring little formal training. However, their process is<br />
computationally expensive and more importantly, the relation between<br />
the input and output variables is defined in a black box<br />
model that has no analytical basis. In terms of audio classification<br />
this means that it is difficult to deduce which acoustical features<br />
are significant in classifying each type of sound [1].<br />
In a different technique, Lu and Hankinson [3] used a rule-based<br />
heuristic classification method to classify an audio signal<br />
into speech, music and noise. For each feature, a threshold is set<br />
to determine the segment type and the feature set includes silence<br />
ratio, centroid, harmonicity and pitch. Since the feature threshold<br />
must change for different audio inputs, this type of classifier is<br />
tedious and not ideal. A classification rate of 75% for speech, and<br />
89% for music is reported.<br />
Lu et al. [4] proposed support vector machines (SVMs) as an<br />
alternative to current classification methods. Using a kernel-based<br />
SVM increases the classification rate by separating nonlinear cases.<br />
Here, a nonlinear kernel function maps the data to a high dimensional<br />
feature space where the data is linearly separable. The authors<br />
use a combination of a rule-based classifier and a kernel<br />
based SVM to distinguish between 5 different audio classes including<br />
silence, music, background sound, pure speech and non-pure<br />
speech. Their feature set includes features similar to those<br />
reported in [1] and [5], such as MFCCs, zero-crossing rate (ZCR),<br />
short time energy (STE), sub-band powers, brightness, and bandwidth<br />
with some new features such as spectral flux (SF), band periodicity<br />
(BP), and noise-frame-ratio (NFR). An average classification<br />
accuracy of around 90% is achieved.<br />
In the majority of the previous work in this area, audio is examined<br />
in either the time or frequency domain where it is assumed<br />
that the signals are wide sense stationary. In reality, sounds are<br />
non-stationary and multi-component signals consisting of series<br />
Authorized licensed use limited to: <strong>Ryerson</strong> <strong>University</strong> Library. Downloaded on July 7, 2009 at 11:09 from IEEE Xplore. Restrictions apply.
of sinusoids with harmonically related frequencies. Our algorithm<br />
considers the short-time Fourier transform (STFT) of an audio signal<br />
to extract parameters that will be used to classify signals. Our<br />
retrieval technique is less computationally intensive than those that<br />
use ANN, SVM, or Hidden Markov Models (HMM). Also, the<br />
efficiency of features can be examined which is not feasible in<br />
ANNs. Note that while HMM can be used to examine spectral<br />
change over time, past works have shown that HMM needs to be<br />
coupled with external features such as cepstral or perceptual features<br />
to be efficient [6]. Finally, our method also offers the added<br />
improvement that it is not specific to certain audio files and can<br />
be applied without adjusting the algorithm or thresholds such as in<br />
rule-based models.<br />
Our work on content-based audio classification is presented<br />
as follows. Section 2 presents the application of time-frequency<br />
analysis to feature selection and analysis for audio classification.<br />
In Section 3 we present our classification results for the system and<br />
our conclusions are provided in Section 4.<br />
2. METHODOLOGY<br />
2.1. Short-time Fourier transform (STFT) algorithm<br />
Since speech and audio signals have spectral characteristics that<br />
vary over time, they require a non-stationary signal model such<br />
as the STFT to describe them. Ultimately, we would like to imitate<br />
the capability of the ear and provide simultaneous information<br />
about time and frequency of the music. STFT uses a sliding window<br />
to compute the Fourier transform thereby providing an estimate<br />
of the “local frequency” at a given time. The STFT of a signal<br />
x[n] is given by<br />
STFT(n, f) = ∑_{m=−∞}^{∞} x[n + m] w[m] e^{−j(2πf)m}, (1)<br />
where w[m] is the window function and the spectrogram of x is<br />
defined as SPEC(n, f) = |STFT(n, f)|². For a given signal<br />
x, SPEC(n, f)∆n∆f represents the energy in the time interval<br />
[n, n + ∆n] in the frequency band [f, f + ∆f]. In STFT analysis,<br />
we can improve the frequency resolution by decreasing the<br />
spectral width ∆f at the expense of increasing the temporal width<br />
∆n (poor time resolution). Also the shape of the window w[n]<br />
is important as a window with a sharp cutoff will introduce artificial<br />
discontinuities. Hanning windows are mainly used in audio<br />
classification techniques as they reduce spectral leakage.<br />
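The sliding-window computation described above can be sketched in a few lines of NumPy. The 1024-sample Hanning window with 50% overlap matches the settings described in Section 2.2, while the helper name `spectrogram` and the 1 kHz test tone are illustrative assumptions, not the authors' code:

```python
import numpy as np

def spectrogram(x, win_len=1024, hop=512):
    """Spectrogram SPEC(n, f) = |STFT(n, f)|^2 with a Hanning window.
    win_len=1024 with 50% overlap (hop=512) follows Section 2.2; the
    function itself is a generic sketch, not the authors' code."""
    w = np.hanning(win_len)
    n_frames = 1 + (len(x) - win_len) // hop
    frames = np.stack([x[i * hop : i * hop + win_len] * w
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1)) ** 2   # one-sided spectrum

# Example: a 1 kHz tone at fs = 44.1 kHz concentrates its energy around
# bin round(1000 / 44100 * 1024) = 23 in every frame.
fs = 44100
t = np.arange(fs) / fs
spec = spectrogram(np.sin(2 * np.pi * 1000 * t))
```

The hop of half the window length trades time resolution against computation, mirroring the ∆n/∆f trade-off noted above.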
2.2. Audio feature extraction<br />
The set of features extracted are critical as they need to be strong<br />
enough to clearly separate the classes of signals. This procedure<br />
requires perceptual features that model the human auditory system.<br />
Discriminating music from speech is less complex than discriminating<br />
between different classes of music. The former may require only a<br />
small number of features, such as zero-crossing rate or energy envelope,<br />
and since the spectral characteristics are not very similar,<br />
high accuracy rates are achieved.<br />
In this paper, we examine the similarities of 143 audio signals<br />
and classify them under six different genres. Each audio signal<br />
is 5 seconds, mono-channel, 16 bits per sample and sampled at<br />
44.1 kHz. The length of the audio samples was chosen to be 5<br />
seconds in line with the human neurological behavior that<br />
was examined by Perrot et al. in [7]. They found that human beings<br />
require at least 3 second excerpts to identify different musical<br />
genres with a 70% accuracy rate while the accuracy decreases to<br />
53% for a 250 ms excerpt.<br />
We start by transforming our audio signal into a spectrogram<br />
with a window size of 1024 samples which corresponds to about<br />
23 ms at 44.1 kHz. This window size is similar to that used in<br />
[4] and [8]. A Hanning window with 50% overlap is used and the<br />
DFT is calculated in each window. The audio features extracted<br />
from the two-dimensional time-frequency distribution (TFD) are<br />
explained below.<br />
2.2.1. Entropy<br />
The entropy of a signal is a measure of its spectral distribution<br />
and portrays the noise-like or tone-like behavior of the signal. The<br />
entropy of a signal in time frame n can be calculated as:<br />
H(n) = −∑_{f=0}^{Fm} P_f(TFD(n, f)) log2 P_f(TFD(n, f)), (2)<br />
where<br />
P_f(TFD(n, f)) = TFD(n, f) / ∑_{f=0}^{Fm} TFD(n, f). (3)<br />
Here, TFD(n, f) represents the energy of the signal at time frame<br />
n and frequency index f (it is equivalent to SPEC(n, f) defined<br />
in Section 2.1). Also, Fm refers to the maximum frequency.<br />
Consider the case where there are L frequency bins.<br />
Then the maximum entropy in time window n is log2 L, which occurs<br />
if the frequency bins are equiprobable. First, we examined<br />
the entropies of 3 different types of signals. These signals were<br />
analyzed using 128 frequency bins, implying that the maximum<br />
entropy is 7 bits. The first signal consisted of a single sine wave,<br />
at a sampling frequency of 1 kHz. In this case, the mean entropy<br />
was 1.24 bits and the standard deviation 5.636 × 10^−6. Next<br />
we considered the vowel “a” (a signal component with harmonic<br />
structure) and its entropy was calculated to be 2.84 bits with a standard<br />
deviation of 0.1. Finally, we considered white Gaussian noise<br />
and its mean entropy was 6.38 bits with a standard deviation of<br />
0.06. As we expected, the sine wave had the lowest entropy and a<br />
standard deviation of almost zero while white noise had the largest<br />
entropy (approaching maximum) with a larger standard deviation.<br />
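Equations (2)-(3) can be sketched as follows. The 256-sample non-overlapping framing, the helper names, and the toy signals are assumptions made only to reproduce the qualitative ordering reported above (a tone far below the 7-bit maximum, white noise approaching it):

```python
import numpy as np

def frame_entropy(tfd):
    """Per-frame spectral entropy in bits, following Eqs. (2)-(3):
    normalize each time frame of the TFD to a probability mass over
    frequency, then take H = -sum p*log2(p)."""
    p = tfd / tfd.sum(axis=1, keepdims=True)
    safe = np.where(p > 0, p, 1.0)   # log2(1) = 0, so zero bins drop out
    return -(p * np.log2(safe)).sum(axis=1)

def toy_tfd(x):
    """Toy 128-bin spectrogram from non-overlapping 256-sample Hanning
    frames; this framing is an illustrative assumption."""
    win = np.hanning(256)
    frames = x[: len(x) // 256 * 256].reshape(-1, 256) * win
    return np.abs(np.fft.rfft(frames, axis=1)[:, :128]) ** 2

rng = np.random.default_rng(0)
noise_H = frame_entropy(toy_tfd(rng.standard_normal(256 * 50))).mean()
tone_H = frame_entropy(toy_tfd(np.sin(2 * np.pi * 0.05 * np.arange(256 * 50)))).mean()
```

With 128 bins the maximum entropy is log2(128) = 7 bits; the white-noise entropy sits near that ceiling while the single tone stays far below it, matching the measurements reported in the text.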
From our database of music signals, we found that entropy<br />
was a dominant feature, particularly in classifying rock or folk music.<br />
As shown in Figure 1a, rock signals possessed the highest<br />
entropy followed closely by folk music while classical, country,<br />
jazz and pop had low entropies. Figure 1b shows the distribution<br />
of entropy for rock music compared to classical. As can be seen,<br />
the entropy ranges for the two types of signals are quite different.<br />
In order to determine the strength of entropy from a different perspective,<br />
a receiver operating characteristic (ROC) curve was plotted. The ROC<br />
curve is a two dimensional measure of classification performance.<br />
The area under this curve measures discrimination, or the ability<br />
of a feature to correctly classify signals. An area of 1.0 represents<br />
a perfect test, while an area of 0.5 or less shows the feature is<br />
not useful in discriminating that class. Rock, folk, jazz, classical,<br />
country and pop music had ROC areas of 0.933, 0.808, 0.644,<br />
0.337, 0.294, and 0.145 respectively. These results show that although<br />
entropy is a strong feature, further features are required to<br />
improve classification.<br />
Authorized licensed use limited to: <strong>Ryerson</strong> <strong>University</strong> Library. Downloaded on July 7, 2009 at 11:09 from IEEE Xplore. Restrictions apply.
Fig. 1. Comparison of entropy values: (a) results for different genres;<br />
(b) distribution for classical and rock.<br />
2.2.2. Energy ratio<br />
The rate of change in the spectral energy over time was measured<br />
as the mean ratio of the total energy in a frequency sub-band to that in the<br />
previous time window, E[∑_{f=f_lower}^{f_upper} TFD(n, f) / ∑_{f=f_lower}^{f_upper} TFD(n−1, f)].<br />
This was examined in three different sub-bands [0, 5 kHz], [5, 10 kHz], [10 kHz,<br />
Fm]. However, it was found empirically that the energy ratio in<br />
mid and high frequency bands did not improve the classification.<br />
This is probably because most energy activity in audio signals is<br />
in the low frequency band. Therefore, only the mean of energy in<br />
the low-band was used in our feature set.<br />
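A minimal sketch of the low-band energy-ratio feature, assuming a one-sided TFD whose columns span 0 to fs/2; the function name and the bin mapping are illustrative, not the authors' implementation:

```python
import numpy as np

def low_band_energy_ratio(tfd, fs=44100, f_hi=5000.0):
    """Mean ratio of low-band (0 to f_hi Hz) energy in frame n to that
    in frame n-1, for a one-sided TFD whose tfd.shape[1] bins span
    0..fs/2.  The bin mapping is an illustrative assumption."""
    n_bins = tfd.shape[1]
    hi = int(round(f_hi / (fs / 2) * (n_bins - 1)))
    band = tfd[:, : hi + 1].sum(axis=1)
    return float(np.mean(band[1:] / band[:-1]))

# A stationary TFD gives a ratio of 1; frame-to-frame energy doubling
# gives a ratio of 2.
```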
The frequency location with the lowest energy component was<br />
also computed. Although an estimate of the mean can be calculated<br />
from the frequency domain, it was included in our feature set<br />
as it improved the classification rate by 5%. In fact, using the mean<br />
and standard deviation of the location of minimum energy provided<br />
100% classification rates for classifying country, folk and<br />
jazz music but low classification rates for the other three genres.<br />
When examining the histogram of the location of minimum energy<br />
for our database of signals (Figure 2), the frequency spread<br />
was smaller for country (21.4-21.5 kHz), folk (21.45-21.85 kHz),<br />
jazz (21.36-21.51 kHz), and a wider range for pop (18.1-21.5 kHz),<br />
classical (15.5-21.5 kHz) and rock (20-21.6 kHz).<br />
2.2.3. Brightness<br />
The brightness of a signal, also referred to as its frequency centroid,<br />
shows the weighted midpoint of the energy distribution in a given<br />
frame. It is defined by:<br />
fi(n) = ∑_{f=0}^{Fm} f TFD(n, f) / ∑_{f=0}^{Fm} TFD(n, f). (4)<br />
The brightness feature could also be seen as the instantaneous<br />
mean frequency parameter, a typical non-stationary feature of a<br />
signal. The frequency centroid of the audio signal in the low frequency<br />
range (0-5 kHz) is also examined as most of the frequency<br />
content of audio signals is concentrated in low frequency.<br />
In addition, the mean ratio of the centroid to that of the previous window is<br />
a useful feature as it measures the spectral change over time. We<br />
found that rock, folk, pop and country music signals had the largest<br />
Fig. 2. Distribution of location of minimum energy<br />
change in centroid frequency over time while classical and jazz<br />
signals had the lowest change. This is expected as classical and<br />
jazz music generally have less activity over time compared to the<br />
other 4 genres.<br />
2.2.4. Bandwidth<br />
Bandwidth is the magnitude-weighted average of the difference<br />
between the signal’s spectral components and centroid. It can be<br />
defined as:<br />
B(n) = √( ∑_{f=0}^{Fm} (f − fi(n))² TFD(n, f) / ∑_{f=0}^{Fm} TFD(n, f) ). (5)<br />
Effectively, it shows the spectral shape and the spread of energy<br />
relative to the centroid, therefore it is also a non-stationary feature.<br />
For instance, a sine wave without noise has zero bandwidth.<br />
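Equations (4) and (5) can be sketched together; the squared deviation under a square root follows the conventional definition of spectral bandwidth, which is an assumption on our part, and the helper name is illustrative:

```python
import numpy as np

def brightness_and_bandwidth(tfd, freqs):
    """Per-frame frequency centroid (Eq. 4) and bandwidth (Eq. 5);
    the squared deviation under a square root follows the conventional
    definition of spectral bandwidth, an assumption on our part."""
    total = tfd.sum(axis=1)
    centroid = (tfd * freqs).sum(axis=1) / total
    dev2 = (freqs[None, :] - centroid[:, None]) ** 2
    bandwidth = np.sqrt((dev2 * tfd).sum(axis=1) / total)
    return centroid, bandwidth

# A pure tone (all energy in one bin) has its centroid at that bin's
# frequency and zero bandwidth, as noted in the text.
```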
2.2.5. Silence ratio<br />
Silence ratio is the proportion of silent time window frames, i.e., frames<br />
with total energy less than 0.01. This threshold is set empirically. Note that<br />
this feature could also be extracted from the time domain.<br />
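A short sketch of the silence-ratio computation; expressing it as a fraction of frames (rather than a raw count) is an assumption:

```python
import numpy as np

def silence_ratio(tfd, threshold=0.01):
    """Fraction of time frames whose total energy falls below the
    empirical threshold of 0.01 (Sec. 2.2.5).  Normalizing by the
    total number of frames is our assumption; the text counts frames."""
    return float(np.mean(tfd.sum(axis=1) < threshold))
```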
Bandwidth, brightness and silence ratio have been proven to<br />
be effective in previous audio classification papers including [1, 2]<br />
although an STFT approach showing the rate of change to previous<br />
windows has not been used.<br />
3. AUDIO CLASSIFICATION<br />
Using the above analysis, the 10 features extracted for each sample<br />
included mean and standard deviation of centroid frequency, mean<br />
centroid (low-frequency range), mean of centroid ratio to previous<br />
window, mean bandwidth, silence ratio, mean and standard deviation<br />
of the frequency location with the lowest energy, mean and<br />
standard deviation of entropy. Note that mean and variance of a<br />
feature are calculated over all time windows. Once the features<br />
are extracted for the 143 audio signals, linear discriminant<br />
analysis (LDA) is then applied using SPSS software [9], to predict<br />
group classification of cases. This type of analysis tries to<br />
Fig. 3. All-groups scatter plot with the first two canonical discriminant<br />
functions<br />
find a linear combination of the extracted features that best separates<br />
the groups of cases. To represent this linear combination, a<br />
discrimination function is formed using the extracted features as<br />
discrimination variables and can be expressed as:<br />
L = b1x1 + b2x2 + ... + b10x10 + c, (6)<br />
where b1..b10 are the coefficients, c is a constant and x1..x10 are<br />
the values of the extracted features. This technique finds the first<br />
function that separates the groups as much as possible and then<br />
finds further functions that improve the separation and are uncorrelated<br />
to previous ones. The number of functions is determined<br />
by the number of predictors or features and the number of groups<br />
available.<br />
Using Fisher’s coefficients and prior probabilities of each group,<br />
a scatterplot (Figure 3) is created showing the discriminant scores<br />
of the cases on two discriminant functions. This plot shows the<br />
separation between different cases. Songs are categorized into six<br />
groups (rock, classical, country, folk, jazz and pop) and the confusion<br />
matrix depicted in Table 1 shows the classification performance.<br />
Using the original LDA, 93.0% of all original grouped<br />
cases are correctly classified with folk music having the lowest<br />
rate. A more accurate estimate is obtained through the cross-validated<br />
method, where a portion of the cases belongs to the learning<br />
sample and the remaining cases belong to the test sample. In the leave-one-out<br />
method used, each case is classified by the functions derived<br />
by all cases except that one. This method yields a 92.3%<br />
classification rate revealing the discrimination strength of our feature<br />
set.<br />
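The leave-one-out procedure can be sketched as below. A nearest-class-centroid rule stands in for the SPSS linear discriminant functions, so this illustrates the validation scheme rather than the paper's exact classifier; all names and the toy data are assumptions:

```python
import numpy as np

def leave_one_out_accuracy(X, y):
    """Leave-one-out cross-validation: each case is classified by a model
    fit on all remaining cases.  A nearest-class-centroid rule stands in
    for the SPSS linear discriminant functions used in the paper."""
    n = len(y)
    correct = 0
    for i in range(n):
        mask = np.arange(n) != i
        Xtr, ytr = X[mask], y[mask]
        centroids = {c: Xtr[ytr == c].mean(axis=0) for c in np.unique(ytr)}
        pred = min(centroids, key=lambda c: np.linalg.norm(X[i] - centroids[c]))
        correct += int(pred == y[i])
    return correct / n
```

On well-separated classes this scheme recovers perfect accuracy, and on real feature sets it gives the slightly pessimistic but less biased estimate described above.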
4. CONCLUSIONS<br />
In this paper, we examined a technique where features used to classify<br />
music signals are derived directly from the time-frequency domain.<br />
Using six different genres for classification, we have shown<br />
that high accuracy rates can be obtained using features that reflect<br />
the non-stationary properties of audio signals and are able to depict<br />
their spectral, energy and entropy change over time. In addition<br />
to the success rate, our algorithms have low computational complexity<br />
compared to other techniques and they offer versatility as<br />
Method Type RO CL CO FO JA PO CA%<br />
1. Original RO 14 0 0 2 0 0 87.5<br />
CL 0 30 0 0 0 1 96.8<br />
CO 0 0 15 0 0 1 93.8<br />
FO 2 0 1 27 1 1 84.4<br />
JA 0 0 0 1 15 0 93.8<br />
PO 0 0 0 0 0 32 100<br />
Overall 93.0<br />
2. Cross- RO 14 0 0 2 0 0 87.5<br />
Validated CL 0 30 0 0 0 1 96.8<br />
CO 0 0 15 0 0 1 93.8<br />
FO 2 0 1 26 1 2 81.3<br />
JA 0 0 0 1 15 0 93.8<br />
PO 0 0 0 0 0 32 100<br />
Overall 92.3<br />
Table 1. Classification results. Method: Original - Linear discriminant<br />
analysis, Cross - validated - Linear discriminant analysis with<br />
leave-one-out method (RO-Rock, CL-Classical, CO-Country, FO-Folk, JA-Jazz,<br />
PO-Pop, CA% - Classification accuracy rate)<br />
they can be applied to any audio signal without alteration. Further<br />
work will include optimization of window size in the TF domain<br />
as well as examining other classification methods such as minimum<br />
classification error (MCE) to improve classification rate for<br />
a larger database of signals.<br />
5. REFERENCES<br />
[1] E. Wold, T. Blum, D. Keislar, and J. Wheaton, “Content-based<br />
classification, search, and retrieval of audio,” IEEE Multimedia,<br />
pp. 27–36, 1996.<br />
[2] Z. Liu, J. Huang, Y. Wang, and T. Chen, “Audio feature extraction<br />
and analysis for scene classification,” in IEEE Workshop<br />
on Multimedia <strong>Signal</strong> Processing, June 1997, pp. 343–<br />
348.<br />
[3] G. Lu and T. Hankinson, “A technique towards automatic audio<br />
classification and retrieval,” in Fourth International Conference<br />
on <strong>Signal</strong> Processing, Beijing, China, October 1998,<br />
pp. 1142–1145.<br />
[4] L. Lu, H. Zhang, and S. Li, “Content-based audio classification<br />
and segmentation by using support vector machines,”<br />
ACM Multimedia Systems Journal, vol. 8, no. 6, pp. 482–<br />
492, March 2003.<br />
[5] J. Foote, “Content-based retrieval of music and audio,” in<br />
Multimedia Storage and Archiving Systems II, Proc. of SPIE,<br />
1997, pp. 138–147.<br />
[6] T. Zhang and C. Kuo, “Hierarchical classification of audio<br />
data for archiving and retrieving,” in Proc. ICASSP, March<br />
1999, pp. 3001–3004.<br />
[7] D. Perrott and R. O. Gjerdingen, “Scanning the dial: An exploration<br />
of factors in the identification of musical style,” Proceedings<br />
of the 1999 Society for Music Perception and Cognition,<br />
p. 88, 1999.<br />
[8] G. Tzanetakis and P. Cook, “Music genre classification of<br />
audio signals,” IEEE Transactions on Speech and Audio Processing,<br />
vol. 10, no. 5, pp. 293–302, July 2002.<br />
[9] SPSS Inc., “SPSS advanced statistics user’s guide,” in User<br />
manual, SPSS Inc., Chicago, IL, 1990.<br />
MODIFIED LOCAL DISCRIMINANT BASES AND ITS APPLICATIONS IN SIGNAL<br />
CLASSIFICATION<br />
Karthikeyan Umapathy<br />
Dept. of Electrical and Computer Engg.,<br />
The <strong>University</strong> of Western Ontario,<br />
London, ON N6A 5B8, Canada<br />
ABSTRACT<br />
One of the major challenges in classification problems based<br />
on signal decomposition approach is to identify the right basis<br />
function and its derivatives that can provide optimal features to<br />
distinguish the classes. With the vast amount of available libraries<br />
of orthonormal bases, it is hard to select an optimal set of basis<br />
functions for a specific dataset. To address this problem, pruning<br />
algorithms based on certain selection criteria are needed. The Local<br />
Discriminant Bases (LDB) algorithm is one such algorithm, which<br />
efficiently selects a set of significant basis functions from the library<br />
of orthonormal bases based on certain defined dissimilarity<br />
measure. The selection of this dissimilarity measure is critical, as<br />
it indirectly contributes to the performance accuracy of the LDB<br />
algorithm. In this paper, we study the impact of the dissimilarity<br />
measures on the performance of the LDB algorithm with two classification<br />
examples. The two biomedical signal databases used are<br />
1. Vibroarthographic signals (VAG) - 89 signals with 51 normal<br />
and 38 abnormal, and 2. Pathological speech signals - 100 signals<br />
with 50 normal and 50 pathological. Classification accuracies<br />
of 76.4% with VAG database and 96% with pathological speech<br />
databases were obtained. This modified method of signal analysis<br />
using LDB has demonstrated its effectiveness in analyzing non-stationary<br />
signals.<br />
1. INTRODUCTION<br />
The Local Discriminant Bases (LDB) [1] algorithm has recently been<br />
used successfully in many classification problems. The optimal<br />
choice of LDBs for a given dataset is driven by the nature of<br />
the dataset and the dissimilarity measures [2] used to distinguish<br />
between classes. The choice of the dissimilarity measure for a particular<br />
dataset depends on knowledge of the data, computational<br />
complexity, and the classification accuracy requirements. For example<br />
probabilistic dissimilarity measures such as relative-entropy<br />
needs prior knowledge of the dataset distribution, whose accuracy<br />
depends on the size of data, on the other hand simple dissimilarity<br />
measures such as Euclidean distance is only suitable for numeric<br />
data sets. A combination of multiple dissimilarity measures with<br />
varying complexity can be used to achieve high classification accuracies.<br />
In this paper we analyze two biomedical signal databases using<br />
the LDB algorithm with 3 different dissimilarity measures. The<br />
LDB algorithm is based on the wavelet packet decompositions<br />
with 3 different wavelets namely Daubechies (db4), Coiflet (cf4)<br />
Thanks to NSERC for funding this research work.<br />
Sridhar Krishnan<br />
Dept. of Electrical and Computer Engg.,<br />
<strong>Ryerson</strong> <strong>University</strong>,<br />
Toronto, ON M5B 2K3, Canada<br />
and Symlet (sy4) [3]. This gives us 9 different combinations for<br />
each of the databases. A two group (class1 and class2) classification<br />
was performed for the 9 combinations. Linear discriminant<br />
analysis (LDA) based classifier was used to compute the classification<br />
accuracies. The classification accuracies were verified<br />
using the leave-one-out method [4]. The paper is organized as<br />
follows: In Section 2 on Methodology, Local Discriminant Bases<br />
algorithm, dissimilarity measures, feature extraction and pattern<br />
classification are covered. Results and discussions are covered in<br />
Section 3, and Conclusions in Section 4.<br />
2. METHODOLOGY<br />
2.1. Local Discriminant Bases Algorithm<br />
In the LDB [1] algorithm with wavelet packet bases, a set of training<br />
signals x_i^c for all C classes is decomposed into a full tree<br />
structure of order N. We restrict our analysis to binary wavelet<br />
packet trees. Let Ω0,0 be a vector space in R n corresponding to<br />
the node 0 of the parent tree. Then at each level the vector space<br />
is split into two mutually orthogonal subspaces given by Ωj,k =<br />
Ωj+1,2k ⊕ Ωj+1,2k+1 where j indicates the level of the tree and k<br />
represents the node index in level j, given by k = 0, ..., 2^j − 1.<br />
This process repeats until level J, giving rise to 2^J mutually<br />
orthogonal subspaces. Our goal is to select the set of best subspaces<br />
that provide maximum discriminant information between<br />
the classes of the signal. Each node k contains a set of basis vectors<br />
B_{j,k} = [w_{j,k,l}], l = 0, ..., 2^{no−j} − 1, where 2^{no} corresponds to the length of<br />
the signal. Then the signal x_i can be represented by a set of coefficients<br />
c as:<br />
x_i = ∑_{j,k,l} c_{j,k,l} w_{j,k,l}. (1)<br />
Basically, the signal x_i is decomposed into 2^J subspaces with<br />
c_{j,k,l} coefficients in each subspace. With the training signals decomposed<br />
into wavelet packet coefficients we need to define a dissimilarity<br />
measure (Dn) in the vector space so as to identify those<br />
subspaces, which have larger statistical distance between classes.<br />
This dissimilarity measure is used in an iterative manner to prune<br />
the tree in such a way that a node is split only if the cumulative discriminative<br />
measure of the children nodes is greater than that of the parent<br />
node. The resulting tree contains the most significant LDBs,<br />
from which a set of K significant LDBs are selected to construct<br />
the final tree. The testing set signals are then expanded using this<br />
tree and features are extracted from the respective basis vectors for<br />
classification.<br />
In the proposed method we use a similar approach with some modification. Instead of the<br />
selective splitting of nodes, which essentially removes redundancy in the LDB<br />
selection, we used all the nodes from the full decomposition tree and ranked them in<br />
decreasing order of their dissimilarity measure values between classes. The first 5<br />
nodes that exhibit high dissimilarity measure values between the classes are selected<br />
for each trial. Among these nodes, based on the frequency of occurrence over all the<br />
trials, the 5 most frequently occurring significant LDBs are selected. The redundancy<br />
within these 5 LDBs is later removed in the feature evaluation process of the LDA<br />
classifier. This is done mainly to reduce the computational complexity of the LDB<br />
algorithm implementation. The whole process is repeated for three different wavelets<br />
(db4, cf4 and sy4), and the wavelet that provides the maximum dissimilarity measures<br />
among all the tested wavelets is chosen as the best basis for expansions.<br />
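The node-ranking step described above can be sketched compactly. The following is an illustrative sketch, not the paper's implementation: it uses a Haar wavelet packet (rather than db4/cf4/sy4) to stay self-contained, and ranks every node of the full tree by the gap in class-averaged normalized node energy (a D1-style measure). All function names are hypothetical.

```python
import numpy as np

def haar_packet(x, J):
    """Full Haar wavelet-packet decomposition of x (length 2**n0, n0 >= J).
    Returns a dict mapping node (j, k) to its coefficient vector."""
    nodes = {(0, 0): np.asarray(x, dtype=float)}
    for j in range(J):
        for k in range(2 ** j):
            c = nodes[(j, k)]
            nodes[(j + 1, 2 * k)] = (c[0::2] + c[1::2]) / np.sqrt(2)      # low-pass
            nodes[(j + 1, 2 * k + 1)] = (c[0::2] - c[1::2]) / np.sqrt(2)  # high-pass
    return nodes

def rank_nodes_by_energy_gap(class1, class2, J, top=5):
    """Rank all nodes of the full tree by |E1 - E2|, the difference in
    class-averaged normalized node energy between the two classes."""
    def avg_energy(signals):
        energies = {}
        for x in signals:
            tree = haar_packet(x, J)
            total = float(np.sum(np.asarray(x, dtype=float) ** 2))
            for node, c in tree.items():
                energies.setdefault(node, []).append(np.sum(c ** 2) / total)
        return {n: np.mean(v) for n, v in energies.items()}

    e1, e2 = avg_energy(class1), avg_energy(class2)
    gaps = {n: abs(e1[n] - e2[n]) for n in e1}
    return sorted(gaps, key=gaps.get, reverse=True)[:top]
```

The top-ranked nodes returned here play the role of the 5 high-dissimilarity LDBs selected per trial in the text; repeating the ranking over trials and keeping the most frequently occurring nodes would complete the scheme.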
2.2. Databases<br />
2.2.1. Vibroarthographic (VAG) signals<br />
These are the vibration signals emitted from human knee joints during active movement<br />
of the leg. VAG signals can be used to detect early joint degeneration or knee defects<br />
that are reflected in knee movements. Extensive work [5] has been done using<br />
time-frequency approaches in classifying these signals into multiple groups. A few<br />
important characteristics of VAG signals that make them difficult to analyze are:<br />
(i) they are highly non-stationary in nature, (ii) they have varying frequency<br />
dynamics, and (iii) they are multi-component signals. The database consists of 89<br />
signals, with 51 normal and 38 abnormal signals. A normal and an abnormal VAG signal<br />
are shown in Fig. 1a.<br />
2.2.2. Pathological speech signals<br />
These are speech signals recorded from pathological and normal talkers in a sound-proof<br />
booth at the Massachusetts Eye and Ear Infirmary. The normal talkers exhibited no<br />
abnormal vocal characteristics and indicated no history of voice disorders. All signals<br />
were sampled at 25 kHz. The signals were the first sentence of the rainbow passage,<br />
'when the sunlight strikes raindrops in the air, they act like a prism and form a<br />
rainbow', as spoken by the subjects. More details about the database and the<br />
classification problem can be found in the authors' previous work [6]. The database<br />
consists of 100 signals, with 50 normal and 50 abnormal signals. A normal and a<br />
pathological speech signal are shown in Fig. 1b.<br />
[Figure: two panels of time-domain waveforms, amplitude (a.u.) versus time samples:<br />
(a) normal and abnormal VAG signals; (b) normal and pathological speech signals.]<br />
Fig. 1. An example of normal and abnormal/pathological signals from both databases.<br />
2.3. Dissimilarity measures<br />
In this study we used three different dissimilarity measures and performed a two-group<br />
(class1 and class2) classification on the databases. In general, most biomedical<br />
signals can be characterized by one or more of the following: (i) their average energy<br />
distribution pattern over frequency bands, (ii) event-based temporal structures,<br />
(iii) periodicity, and (iv) the amount of randomness. These rationales were used in<br />
arriving at the following dissimilarity measures.<br />
The first dissimilarity measure D1 is the difference in normalized energy between the<br />
corresponding nodes of the training signals from class1 and class2. This gives the<br />
difference in the energy distribution of the signals on the time-frequency plane:<br />
D1 = E^1_{j,k} - E^2_{j,k},    (2)<br />
where E^1_{j,k} and E^2_{j,k} are the normalized energies of the corresponding nodes<br />
for class1 and class2 signals.<br />
The second dissimilarity measure D2 is the correlation index between the basis vectors<br />
at corresponding nodes. This measure emphasizes those nodes that can detect differences<br />
in the temporal characteristics of the signals between class1 and class2:<br />
D2 = &lt; B_{j,k}, F_{j,k} &gt;,    (3)<br />
where B_{j,k} and F_{j,k} are the corresponding basis vectors of class1 and class2 at<br />
node (j, k).<br />
The third dissimilarity measure D3 estimates the randomness or non-stationarity of the<br />
basis vector coefficients. It is computed from the set of variances along segments of<br />
the basis vector coefficients; the ratio of this variance measure between the signals<br />
from class1 and class2 indicates the amount of deviation observed in the<br />
non-stationarity between the classes:<br />
D3 = var({var(p)}_{j,k}) / var({var(q)}_{j,k}),    (4)<br />
where p and q index the L segments obtained by segmenting the basis vectors at node<br />
(j, k) for class1 and class2, respectively.<br />
2.4. Feature extraction<br />
Once the LDB nodes for each of the three dissimilarity measures are identified using<br />
the training sets (in our study, 10 randomly selected signals from each class formed<br />
the training set) as explained in Section 2.1, all 89 VAG signals and all 100<br />
pathological speech signals were decomposed using the corresponding sets of LDB tree<br />
structures. Figs. 2 and 3 show sample LDB tree structures obtained for the VAG and<br />
pathological speech databases, respectively.<br />
The basis vectors from each of the nodes (LDBs) could be used directly as feature<br />
vectors; however, considering the dimension of the basis vectors, we instead extract<br />
features from the basis vectors of the LDBs using the dissimilarity measures (D1, D2,<br />
and D3) [1]. That is, from each of the LDB nodes of the corresponding tree structures,<br />
the normalized node energy, the correlation index and the variance measure were<br />
calculated. In short, each signal in the database yields 15 features, 5 from each<br />
dissimilarity measure. For the correlation index calculation, a randomly chosen normal<br />
signal is used as a template to correlate with the signals from the respective test<br />
databases. The above procedure was<br />
[Figure: LDB tree decomposition diagram with retained nodes from (0,0) down to (5,24)<br />
and (5,25), and the node (0,0) data plotted over 7000 time samples.]<br />
Fig. 2. A sample LDB tree decomposition for the VAG database (db4 wavelet and D3<br />
dissimilarity measure).<br />
[Figure: LDB tree decomposition diagram with retained nodes from (0,0) down to (5,6)<br />
and (5,7), and the node (0,0) data plotted over the signal duration.]<br />
Fig. 3. A sample LDB tree decomposition for the pathological speech database (cf4<br />
wavelet and D3 dissimilarity measure).<br />
repeated for all three wavelets. So, in total, for each wavelet, a set of 15 features<br />
was extracted from each signal in the test database.<br />
Figs. 4 and 5 show the feature space formed by the first two dominant features for the<br />
VAG and pathological speech databases, respectively. From these feature space plots,<br />
the discriminatory boundaries between the signals of class1 and class2 can be<br />
visualized. The extracted features were then fed to a linear discriminant based<br />
classifier, as explained in the next section.<br />
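The per-signal feature vector described in Section 2.4 can be assembled as follows. This is an illustrative sketch: the five node coefficient vectors and the template signal's corresponding nodes are placeholders, and the three per-node statistics stand in for the normalized energy, correlation index and variance measure.

```python
import numpy as np

def ldb_features(tree_nodes, template_tree_nodes):
    """Build a 15-element feature vector: for each of the 5 selected LDB
    nodes, compute (i) normalized node energy, (ii) correlation index
    against a template signal's node, (iii) variance of segment variances."""
    total = np.sum(np.concatenate(tree_nodes) ** 2)  # energy over the 5 nodes
    feats = []
    for c, t in zip(tree_nodes, template_tree_nodes):
        feats.append(np.sum(c ** 2) / total)              # energy (D1-style)
        n = min(len(c), len(t))
        feats.append(float(np.dot(c[:n], t[:n])))         # correlation (D2-style)
        segs = np.array_split(c, 4)
        feats.append(np.var([np.var(s) for s in segs]))   # variance (D3-style)
    return np.asarray(feats)
```

With 5 nodes and 3 statistics per node this yields the 15 features per signal quoted in the text, and the energy entries sum to 1 by construction.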
2.5. Pattern Classification<br />
The motivation for pattern classification is to automatically group signals with the<br />
same characteristics using the discriminatory features derived as explained in the<br />
previous section. Pattern classification was carried out with the linear discriminant<br />
analysis (LDA) technique using the SPSS software [7]. In discriminant analysis, the<br />
feature vectors derived above were transformed into canonical discriminant functions of<br />
the form<br />
f = x_1 b_1 + x_2 b_2 + ... + x_42 b_42 + a,    (5)<br />
where {x} is the set of features, and {b} and a are the coefficients<br />
[Figure: scatter plot of Feature 2 versus Feature 1 for the normal and abnormal<br />
classes.]<br />
Fig. 4. Feature space with the first two dominant features - VAG database, db4 wavelet.<br />
[Figure: scatter plot of Feature 2 versus Feature 1 for the normal and pathological<br />
classes.]<br />
Fig. 5. Feature space with the first two dominant features - pathological speech<br />
database, cf4 wavelet.<br />
and constant, respectively, estimated using Fisher's linear discriminant functions [7].<br />
Using the chi-square distances and the prior probabilities of each group,<br />
classification is performed by assigning each sample to one of the groups. The<br />
classification accuracy was estimated using the leave-one-out method, which is known to<br />
provide a least-biased estimate [4]. In the leave-one-out method, one sample is<br />
excluded from the dataset and the classifier is trained with the remaining samples. The<br />
excluded signal is then used as the test data and the classification accuracy is<br />
determined. This is repeated for all samples of the dataset. Since each signal is<br />
excluded from the training set in turn, independence between the test and training sets<br />
is maintained.<br />
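The paper performs this step in SPSS; an equivalent numpy-only sketch of a two-class Fisher discriminant with leave-one-out validation is shown below. The small ridge term added to the within-class scatter matrix is an assumption for numerical stability, not part of the paper's procedure.

```python
import numpy as np

def fisher_lda(X, y):
    """Two-class Fisher discriminant: direction w = Sw^{-1}(m1 - m0), with
    the decision threshold at the midpoint of the projected class means."""
    X0, X1 = X[y == 0], X[y == 1]
    m0, m1 = X0.mean(axis=0), X1.mean(axis=0)
    Sw = np.cov(X0, rowvar=False) + np.cov(X1, rowvar=False)
    Sw = Sw + 1e-6 * np.eye(X.shape[1])  # ridge term for invertibility
    w = np.linalg.solve(Sw, m1 - m0)
    b = -0.5 * w @ (m0 + m1)
    return w, b

def leave_one_out_accuracy(X, y):
    """Hold out each sample in turn, train on the rest, test on the held-out
    sample; the fraction of correct decisions is the LOO accuracy."""
    hits = 0
    for i in range(len(y)):
        keep = np.arange(len(y)) != i
        w, b = fisher_lda(X[keep], y[keep])
        hits += int(X[i] @ w + b > 0) == y[i]
    return hits / len(y)
```

Because each test sample never appears in its own training set, the estimate avoids the optimistic bias of resubstitution (the "Regular" columns in Tables 1-4).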
3. RESULTS AND DISCUSSIONS<br />
All the signals from both databases were decomposed using their corresponding LDB tree<br />
structures. Features were extracted as explained in Section 2.4 and fed to the LDA<br />
based classifier. Classification accuracies were computed for the 9 combinations of<br />
wavelets and dissimilarity measures, as shown in Table<br />
Wavelet LDA type D1 D2 D3<br />
db4 Regular 65 64 67<br />
Cross.V 61 57 64<br />
cf4 Regular 70 61 61<br />
Cross.V 65 57 48<br />
sy4 Regular 67 63 57<br />
Cross.V 61 60 45<br />
Table 1. Classification results for the VAG database. Regular - normal LDA; Cross.V -<br />
leave-one-out LDA. Classification accuracies are in percent (%).<br />
Wavelet LDA type D1 D2 D3<br />
db4 Regular 84 64 77<br />
Cross.V 84 60 72<br />
cf4 Regular 85 52 92<br />
Cross.V 84 37 91<br />
sy4 Regular 87 53 86<br />
Cross.V 84 32 84<br />
Table 2. Classification results for the pathological speech database. Regular - normal<br />
LDA; Cross.V - leave-one-out LDA. Classification accuracies are in percent (%).<br />
1 and Table 2 for both databases. It can be observed from Table 1 that, although there<br />
are small variations, on average all three dissimilarity measures perform equally well<br />
for the VAG database. However, Table 2 shows that for the pathological speech database<br />
the dissimilarity measures D1 and D3 provide high classification accuracies, whereas D2<br />
performs poorly. Overall, for the VAG database we observe that the db4 wavelet in<br />
combination with all three dissimilarity measures provides the highest classification<br />
accuracy. Similarly, for the pathological speech database, the cf4 wavelet in<br />
combination with D1 and D3 provides the highest classification accuracy. Using these<br />
combinations, we computed the highest possible classification accuracies for both<br />
databases, as shown in Table 3 and Table 4.<br />
For the VAG database, an overall classification accuracy of 78.7% using regular LDA and<br />
76.4% using the leave-one-out method was achieved. This is higher than the<br />
classification accuracy reported in [5]. For the pathological speech database, an<br />
overall classification accuracy of 97% using regular LDA and 96% using the<br />
leave-one-out method was achieved. This is higher than the classification accuracy<br />
reported in [6]. The above results demonstrate the performance optimization of the LDB<br />
algorithm through the right choice and combination of dissimilarity measures to achieve<br />
high classification accuracies for non-stationary signal analysis.<br />
4. CONCLUSIONS<br />
The importance of the dissimilarity measure in the performance optimization of the LDB<br />
algorithm was discussed with two classification examples. Classification accuracies<br />
were analyzed for different combinations of wavelets and dissimilarity measures.<br />
Improvement in the classification accuracies by using a combination of multiple<br />
dissimilarity measures was demonstrated. High classification accuracies were achieved<br />
for the databases under study, thus proving the success of the modified LDB in analyz-<br />
Method <strong>Group</strong>s Normal Abnormal Total<br />
Regular Normal 39 12 51<br />
Abnormal 7 31 38<br />
% Normal 76.5 23.5 100<br />
Abnormal 18.4 81.6 100<br />
Cross.V Normal 39 12 51<br />
Abnormal 9 29 38<br />
% Normal 76.5 23.5 100<br />
Abnormal 23.7 76.3 100<br />
Table 3. Highest classification accuracy achieved for the VAG database (db4 wavelet and<br />
selective combination of D1, D2 and D3). Regular - normal LDA; Cross.V - leave-one-out<br />
LDA; % = percentage of classification.<br />
Method <strong>Group</strong>s Normal Pathological Total<br />
Regular Normal 48 2 50<br />
Pathological 1 49 50<br />
% Normal 96 4 100<br />
Pathological 2 98 100<br />
Cross.V Normal 48 2 50<br />
Pathological 2 48 50<br />
% Normal 96 4 100<br />
Pathological 4 96 100<br />
Table 4. Highest classification accuracy achieved for the pathological speech database<br />
(cf4 wavelet and combined D1 and D3). Regular - normal LDA; Cross.V - leave-one-out<br />
LDA; % = percentage of classification.<br />
ing non-stationary signals. Future work involves automating the choice of dissimilarity<br />
measures based on the nature of the databases and applications.<br />
5. REFERENCES<br />
[1] N. Saito and R. R. Coifman, “Local discriminant bases and<br />
their applications,” Journal of Mathematical Imaging and Vision,<br />
vol. 5, no. 4, pp. 337–358, 1995.<br />
[2] Andrew Webb, Statistical Pattern Recognition, Wiley, West<br />
Sussex, England, 2002.<br />
[3] Stephane Mallat, A wavelet tour of signal processing, Academic<br />
press, San Diego, CA, 1998.<br />
[4] K. Fukunaga, Introduction to Statistical Pattern Recognition,<br />
Academic Press, Inc., San Diego, CA, 1990.<br />
[5] S. Krishnan, "Adaptive signal processing techniques for analysis<br />
of knee joint vibroarthrographic signals," Ph.D. dissertation,<br />
University of Calgary, June 1999.<br />
[6] K. Umapathy, S. Krishnan, V. Parsa, and D. Jamieson, "Discrimination<br />
of pathological voices using an adaptive time-frequency<br />
approach," in Proceedings of ICASSP 2002, IEEE International<br />
Conference on Acoustics, Speech and Signal Processing,<br />
Orlando, USA, May 2002, pp. IV 3853-3855.<br />
[7] SPSS Inc., “SPSS Advanced statistics user’s guide,” in User<br />
manual, SPSS Inc., Chicago, IL, 1990.<br />
RADIO OVER MULTIMODE FIBER FOR WIRELESS ACCESS<br />
Roland Yuen Xavier N. Fernando Sridhar Krishnan<br />
Ryerson University, Toronto, Ontario, Canada<br />
ryuen@ee.ryerson.ca, xavier@ieee.org, krishnan@ee.ryerson.ca<br />
Abstract<br />
A radio over fiber link is a promising technology for antenna remoting applications.<br />
Typically, the radio over fiber link employs a single mode fiber, but the signal power<br />
at the remote antenna is very small. The main reason is the large power loss in the E/O<br />
and O/E converters. However, the coupling efficiency of an E/O converter can be<br />
improved with multimode fiber (MMF), so we propose a ROF link that uses a<br />
vertical-cavity surface-emitting laser with a graded index MMF to transport optical<br />
signals. A multimode fiber has a larger core radius compared to a SMF, and a larger<br />
core radius allows more optical power to be coupled into the fiber. With simple<br />
butt-coupling techniques, the coupling efficiency can be 90%, and this simplicity leads<br />
to a reduction in the cost of the link. Normally, the MMF is used in short distance<br />
digital applications with a bandwidth-distance product of about 500 MHz·km, so it is<br />
well suited for local area picocells. Our approach is to transmit passband signals such<br />
as QPSK and FSK through the ROF link. Our simulation shows that a 900 MHz carrier can<br />
be transported through a link 1.22 km long. In this paper, we investigate the<br />
feasibility of using a MMF for antenna remoting in local area picocells and examine the<br />
tradeoff between coupling efficiency and bandwidth.<br />
Keywords: Radio over fiber; multimode fiber; remote antenna; coupling efficiency.<br />
1. INTRODUCTION<br />
A radio over fiber (ROF) link is used in remote antenna applications to distribute<br />
signals to microcell or picocell base stations (BSs). In the remote antenna<br />
application, the downlink RF signals are distributed from a central base station (CBS)<br />
to many BSs, known as radio access points (RAPs), through fibers. The uplink signals<br />
received at the RAPs are sent back to the CBS for any signal processing. A RAP is much<br />
more cost effective to deploy than a normal BS because it mostly consists of simple<br />
devices: an E/O converter, an O/E converter, and an amplifier. The cost of signal<br />
processing in a CBS is shared among<br />
CCECE 2004 - CCGEI 2004, Niagara Falls, May/mai 2004<br />
0-7803-8253-6/04/$17.00 @ 2004 IEEE<br />
many RAPs. In addition to the lower cost advantage, a smaller cell size reduces the<br />
near-far effect and relaxes the battery requirement on mobile receivers.<br />
Although the fiber is a reliable medium with low attenuation (0.5 dB/km at 1550 nm), a<br />
challenge still exists in the large loss due to E/O and O/E conversion [1]. In this<br />
paper, we propose to employ multimode fiber (MMF) to increase the coupling efficiency,<br />
which reduces the E/O conversion loss. However, MMF has limited bandwidth, largely due<br />
to modal dispersion.<br />
In this paper, we discuss two topics: the downlink architecture of the ROF link, and<br />
the tradeoff between power and bandwidth in the remote antenna application.<br />
2. THE RADIO OVER FIBER LINK<br />
Figure 1: Radio over fiber link in remote antenna application.<br />
The radio over fiber (ROF) link in the remote antenna application is illustrated in<br />
Figure 1. The central base station (CBS) and the radio access points (RAPs) are<br />
connected through two fibers, which transport the uplink and downlink signals. The RAPs<br />
act as remote antennas that receive signals from and transmit signals to mobile users,<br />
whereas the CBS collects signals from the RAPs for processing and distributes signals<br />
to all the RAPs.<br />
The downlink of the ROF can be divided into an optical channel and a wireless channel,<br />
denoted by ROF and Air respectively in Figure 2. When a signal s(t) goes through the<br />
optical channel, it is attenuated by<br />
a loss L_opt. After the optical channel, the signal is boosted by an amplifier gain,<br />
and later in the wireless channel it is further attenuated by a path loss. Noise<br />
n_opt(t) is added to the signal in the optical channel, and further noise is added in<br />
the wireless channel. Finally, the quality of the received signal r(t) is determined<br />
from the signal to noise ratio (SNR) at the mobile user.<br />
Figure 2: Downlink block diagram of radio over fiber<br />
link<br />
2.1 Optical Channel<br />
The optical channel of the ROF link that use a mul-<br />
timode fiber (MMF) is illustrated in Figure 3. It con-<br />
sists of an optical source, a fiber, and a photodetector.<br />
[Figure: laser source, multimode fiber with impulse response h_mmf(t), photodetector.]<br />
Figure 3: The optical channel.<br />
The signal s(t) from the CBS can be in any form, such as QPSK, 16-PSK, or FSK. Usually<br />
in mobile communication, the signal has a bandwidth of less than 2 MHz. The signal is<br />
directly modulated onto a laser and biased to minimize nonlinearity and clipping<br />
distortion. The signal s(t) after biasing is given as<br />
s_bias(t) = [1 + m s(t)],    (1)<br />
where m is the optical modulation index.<br />
The impulse response of a MMF can be generalized as a Gaussian response [2] with<br />
respect to optical power, and is given as<br />
h_mmf(t) = (1 / (sigma sqrt(2 pi))) exp(-(t - tau)^2 / (2 sigma^2)),    (2)<br />
where tau is the delay of the channel and sigma is the standard deviation of the<br />
impulse response, which increases linearly with the link distance. The longer the link,<br />
the more apparent the modal dispersion effect.<br />
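A unit-area Gaussian impulse response with standard deviation sigma growing linearly in distance has the closed-form magnitude response |H(f)| = exp(-(2 pi f sigma)^2 / 2). The sketch below evaluates it in electrical dB; the 20*log10 convention is an assumption (consistent with the roughly 30 dB OSNR gap at 900 MHz and 1 km discussed in Section 4), and the function name is illustrative.

```python
import numpy as np

def mmf_response_db(f_hz, d_km, sigma_ns_per_km=0.5):
    """Electrical (20*log10) magnitude response of the Gaussian MMF model
    at carrier f_hz over a link of d_km; sigma grows linearly with distance."""
    sigma = sigma_ns_per_km * 1e-9 * d_km
    return 20.0 * np.log10(np.exp(-0.5 * (2.0 * np.pi * f_hz * sigma) ** 2))
```

With sigma = 0.5 ns/km, a 900 MHz carrier over 1 km comes out near -35 dB, the same order as the dispersion penalty reported in the numerical section, and the penalty grows quadratically in both frequency and distance.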
Other than distortion from modal dispersion, there are noise sources in the optical<br />
channel. These noises are combined into a term n_opt(t). The output photocurrent is<br />
given as<br />
i(t) = (P_s / sqrt(L_opt)) [1 + m s(t)] * h_mmf(t) + n_opt(t),    (3)<br />
where P_s is the average optical power emitted from the laser diode, L_opt is the loss<br />
in the optical channel, and h_mmf(t) is the impulse response of the MMF.<br />
L_opt includes the losses from the E/O and O/E conversion, the fiber attenuation, the<br />
connectors, and the matching of the transmitter and the receiver. In [1], the<br />
electrical loss in dB is given as<br />
L_opt,dB = -20 log(R G_m) + 10 log(Z_in / Z_out) + 2(2 l_c + alpha d),    (4)<br />
and in linear form as L_opt = 10^(L_opt,dB / 10),    (5)<br />
where R is the responsivity of the photodetector in mA/mW, G_m is the modulation gain<br />
of the optical source in mW/mA, Z_in is the impedance of the laser, Z_out is the<br />
impedance of the optical receiver, l_c is the optical connector loss, alpha is the<br />
attenuation per km of the fiber, and d is the distance of the link in km. In the above<br />
expression, the E/O and O/E conversion losses, the connector loss, and the fiber<br />
attenuation are doubled because they are losses in the optical domain. The modulation<br />
gain G_m is the coupling efficiency that accounts for the Fresnel loss and the<br />
misalignment loss; it reflects the qual-<br />
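Plugging representative numbers into the dB loss expression of Eq. (4) is straightforward. The sketch below assumes perfect impedance matching as the default; argument names are illustrative.

```python
import numpy as np

def link_loss_db(resp_ma_mw, gm_mw_ma, conn_loss_db, atten_db_km, d_km,
                 z_in=50.0, z_out=50.0):
    """Electrical link loss in dB, per Eq. (4): E/O-O/E conversion loss,
    impedance term, and doubled optical losses (two connectors plus fiber)."""
    return (-20.0 * np.log10(resp_ma_mw * gm_mw_ma)
            + 10.0 * np.log10(z_in / z_out)
            + 2.0 * (2.0 * conn_loss_db + atten_db_km * d_km))
```

With the Section 4 values (responsivity 0.75 mA/mW, G_m = 0.80 mW/mA, 2 dB connector loss, 2.5 dB/km attenuation, 1 km link, matched impedances), this gives roughly 17.4 dB of electrical loss.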
2.2 Optical <strong>Signal</strong> to Noise Ratio<br />
To evaluate the performance of an optical link, the optical signal to noise ratio<br />
(OSNR) is needed. It is evaluated at the output of the optical receiver. The OSNR can<br />
be expressed as<br />
OSNR = m^2 P_s^2 &lt;s^2(t)&gt; / (L_opt &lt;n_opt^2(t)&gt;).    (6)<br />
Here n_opt(t) is the noise induced in the optical channel, and it is assumed to be<br />
additive white Gaussian noise. The noise consists of the relative intensity noise<br />
n_RIN(t) from the laser, the shot noise from the photodetector, and the thermal noise<br />
from the receiver electronics. In [1], the total noise power of the signal is given as<br />
the sum of these three contributions.    (7)<br />
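The conventional decomposition of receiver noise into RIN, shot, and thermal contributions can be sketched as follows; treating Eq. (7) as exactly this standard sum is an assumption, and the function name and default arguments (taken from the Section 4 simulation parameters) are illustrative.

```python
import numpy as np

K_B = 1.380649e-23     # Boltzmann constant, J/K
Q_E = 1.602176634e-19  # electron charge, C

def total_noise_power(i_ph_a, bw_hz, rin_db_hz=-155.0, r_load=50.0, temp_k=278.0):
    """Total noise power referred to the photocurrent (A^2), assumed to be
    the sum of RIN, shot, and thermal contributions over bandwidth bw_hz."""
    rin = 10.0 ** (rin_db_hz / 10.0) * i_ph_a ** 2 * bw_hz   # laser RIN
    shot = 2.0 * Q_E * i_ph_a * bw_hz                        # photodetector shot
    thermal = 4.0 * K_B * temp_k * bw_hz / r_load            # receiver thermal
    return rin + shot + thermal
```

All three terms scale linearly with bandwidth, so widening the noise bandwidth from 2 MHz to 10 MHz raises the noise power by a factor of 5, i.e. 10*log10(5) of about 7 dB, consistent with the roughly 7 dB OSNR drop between Figures 4 and 5.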
3. COMPARISON OF MULTIMODE<br />
FIBER AND SINGLE MODE FIBER<br />
To reduce the penalty from the E/O conversion, multimode fiber (MMF) is used. The<br />
proposed system uses the combination of a vertical-cavity surface-emitting laser with a<br />
graded index MMF. In [3], the authors found that the coupling efficiency into a graded<br />
index MMF strongly depends on the active laser diameter, the index guiding of the<br />
laser, and the transverse mode emission spectrum of the laser. The coupling efficiency<br />
also depends on the coupling technique. However, better coupling efficiency comes with<br />
a tradeoff in the bandwidth of the radio over fiber link.<br />
Physically, the MMF has a larger core diameter of 50-200 um compared to the single mode<br />
fiber (SMF) at 8-12 um. In addition, MMF also has a higher numerical aperture, in the<br />
range 0.19-0.30. The higher numerical aperture means a larger acceptance angle, which<br />
allows more optical power to be coupled into the fiber. These physical characteristics<br />
of the SMF and MMF are found in [4]. Moreover, the typical VCSEL has an active diameter<br />
of 16-20 um [3], which is smaller than the core diameter of a MMF but larger than the<br />
core diameter of a SMF. Thus, physically, a MMF can better capture the optical power<br />
emitted from a laser.<br />
It has been reported in [3] that the coupling efficiency can exceed 90%. This is<br />
achieved by butt-coupling a graded index MMF to a weakly index guided proton-implanted<br />
VCSEL. The typical coupling efficiency, however, lies in the 70%-80% range, whereas the<br />
SMF has a typical coupling efficiency in the 40%-70% range [5]. Various coupling<br />
techniques have been proposed and evaluated in terms of their coupling efficiency and<br />
fabrication complexity. They can be generalized into butt-coupling, lens coupling and<br />
pigtail coupling. Butt-coupling is usually used for MMF: the MMF is placed as close to<br />
the laser facet as possible. This results in good coupling efficiency and increases the<br />
misalignment tolerance [6]. Moreover, this technique is relatively easy to fabricate.<br />
However, it is not suitable for SMF because of the small core diameter and low<br />
numerical aperture of the SMF. In practice, more complex techniques like lens coupling<br />
and pigtail coupling are used with SMF. In the lens coupling technique, single or<br />
multiple lenses are placed between the laser facet and the optical fiber [5]. This<br />
technique improves the coupling efficiency to more than 50%. However, it is harder to<br />
fabricate a lens that is suitable for the SMF, so the pigtail coupling technique may be<br />
used instead. This technique first couples the laser to a MMF, and then couples the MMF<br />
to the SMF. Additional coupling loss is introduced by the extra coupling stage, but it<br />
is easier to fabricate [7]. From the discussion above, it is clear that MMF offers<br />
better coupling efficiency at lower cost.<br />
On the other hand, the bandwidth of the MMF is significantly less than that of the SMF.<br />
It is widely reported that in digital systems the SMF has a bandwidth-distance product<br />
in the GHz·km range, while the MMF has a bandwidth-distance product of about<br />
500 MHz·km. However, the MMF is sufficient for local picocells with short distances and<br />
bit rates in the low Mbps range. This is demonstrated in the next section.<br />
4. NUMERICAL DISCUSSION<br />
In this section, simulation of the downlink transmission from the central base station<br />
to the radio access point is discussed. A vertical-cavity surface-emitting laser and a<br />
graded index multimode fiber (MMF) are assumed for the radio over fiber link. The laser<br />
operates at 850 nm and emits 1 mW of optical power. We assume the same butt-coupling<br />
technique as in [3]. We also assume a relatively large G_m = 0.80 mW/mA. The<br />
responsivity of the optical receiver is 0.75 mA/mW. The sigma of the MMF impulse<br />
response (2) is 0.5 ns/km [2] and the delay tau is 30. The fiber attenuation is<br />
2.5 dB/km and the connector loss is 2 dB. The system is assumed to be perfectly<br />
matched, so there is no loss from matching. Noise is added according to (7) and<br />
generated for a bandwidth of 2 MHz, a relative intensity noise parameter RIN of<br />
-155 dB/Hz, a 50 ohm load resistance, and a temperature of 278 K. The optical signal to<br />
noise ratio (OSNR) of the link is calculated according to (6).<br />
Figure 4 shows four different OSNR curves as a function of the ROF link distance under<br />
various channel and signal conditions. The topmost curve is the OSNR of a SMF, and the<br />
remaining curves are the OSNR of a graded index MMF at various carrier frequencies. The<br />
dispersion of the MMF has a significant impact on the OSNR of the link. With a carrier<br />
frequency of 900 MHz at a distance of 1 km, there is about 30 dB loss in OSNR compared<br />
to a SMF, and as the link distance increases the loss grows even faster. However, the<br />
application considered is a short haul link. For a carrier frequency of 900 MHz, the<br />
ROF link can still support up to 1.22 km with an OSNR better than 10 dB. For 1200 MHz<br />
and 1500 MHz carriers, the link supports 910 m and 740 m respectively. Figure 5 shows<br />
the same OSNR curves, but with the noise bandwidth increased to 10 MHz. It shows a<br />
decrease of about 7 dB in all the OSNR curves, and the distance the link can support<br />
also decreases.<br />
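The reported reach figures scale almost exactly as the inverse of the carrier frequency (1.22 km at 900 MHz versus roughly 910 m and 740 m at 1200 and 1500 MHz), which is what the Gaussian dispersion model predicts, since the penalty depends only on the product of carrier frequency and distance. The sketch below uses a fixed dispersion-penalty budget as an illustrative stand-in for the full OSNR computation; function names are not from the paper.

```python
import numpy as np

def dispersion_penalty_db(f_hz, d_km, sigma_ns_km=0.5):
    """Electrical-dB penalty of the Gaussian MMF response at carrier f_hz:
    -20*log10(exp(-a^2/2)) with a = 2*pi*f*sigma*d."""
    a = 2.0 * np.pi * f_hz * sigma_ns_km * 1e-9 * d_km
    return 10.0 * a ** 2 / np.log(10.0)

def reach_km(f_hz, penalty_budget_db, sigma_ns_km=0.5):
    """Distance at which the dispersion penalty alone exhausts the budget,
    obtained by inverting the quadratic penalty expression."""
    a = np.sqrt(penalty_budget_db * np.log(10.0) / 10.0)
    return a / (2.0 * np.pi * f_hz * sigma_ns_km * 1e-9)
```

Because the penalty is a function of f·d only, reach scales as 1/f: the reach at 1200 MHz is 900/1200 = 0.75 of the 900 MHz reach, matching the ratio of the paper's 1.22 km and 910 m figures.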
[Figure: OSNR versus link distance curves for the SMF and for the MMF at different<br />
carrier frequencies.]<br />
Figure 4: OSNR versus distance of multimode ROF links with 2 MHz noise bandwidth.<br />
Figure 5: OSNR versus distance of multimode ROF links with 10 MHz noise bandwidth.<br />
5. CONCLUSION<br />
In this paper, we have investigated a radio over fiber (ROF) link that employs a graded<br />
index multimode fiber (MMF) with a vertical-cavity surface-emitting laser to increase<br />
the coupling efficiency. A coupling efficiency of 90% can be achieved with<br />
butt-coupling, which is relatively simple for the MMF. Since this simplicity translates<br />
into a lower system cost, it makes our proposed system attractive. However, there is a<br />
tradeoff in bandwidth. For a 900 MHz carrier, the ROF link distance is restricted to<br />
1.22 km with an optical signal to noise ratio better than 10 dB. For the remote antenna<br />
application of local picocells, this configuration of ROF is sufficient.<br />
SUB-DICTIONARY SELECTION USING LOCAL DISCRIMINANT BASES<br />
ALGORITHM FOR SIGNAL CLASSIFICATION<br />
Karthikeyan Umapathy and Anindya Das<br />
Dept. of Electrical and Computer Eng.,<br />
The <strong>University</strong> of Western Ontario,<br />
London, Ontario, CANADA.<br />
Email: kumapath@uwo.ca<br />
Abstract<br />
In signal decompositions using over-complete, redundant time-frequency (TF) dictionaries, it is often<br />
challenging to restrict the dictionary to a sub-dictionary tailored for specific applications.<br />
In the proposed technique we use an approach similar to the Local Discriminant Bases algorithm (LDB) to<br />
select optimal TF sub-dictionaries for signal classification applications. A novel time-width<br />
versus frequency band mapping was generated for each signal class. These mappings of different classes were<br />
compared using a discriminant measure to arrive at a sub-dictionary. This sub-dictionary was then used for<br />
decomposing the testing set signals, followed by feature extraction and classification. Two<br />
highly non-stationary biomedical databases, 1. Vibroarthrographic signals (89 signals, 51 normal<br />
and 38 abnormal) and 2. Pathological speech database (103 signals, 50 normal and 50 pathological),<br />
were tested. Classification accuracies as high as 74.2% and 92% were achieved respectively. Due to the<br />
sub-dictionary approach, approximately a 40% reduction in signal decomposition time was observed for the<br />
tested databases.<br />
Keywords: time-frequency, sub-dictionary, matching pursuit, local discriminant bases, discriminant measure<br />
1. INTRODUCTION<br />
Time-frequency (TF) transformations have significantly contributed to the area of automatic signal<br />
classification. TF transforms help us to understand signals better and thereby to extract<br />
strong clues or features for classification. Even though the complete TF plane contains details about the<br />
signals, in classification applications it is often a small area or pockets of areas in the<br />
TF plane that actually exhibit the differences between the classes of signals. The success of a<br />
classification application depends on how well these target areas can be identified and analyzed in the<br />
TF plane. Once the target areas are identified, it is easier to zoom into them by performing time- and<br />
frequency-localized decompositions to extract relevant features for classification.<br />
Pruning algorithms such as the Local Discriminant Bases algorithm (LDB) [1] were introduced to identify the<br />
target subspaces in the TF plane that exhibit high discrimination values between signal classes. However,<br />
most of the existing LDB literature deals only with dictionaries of orthonormal bases (wavelet packets).<br />
Considering the various advantages [2] of using redundant dictionaries for flexible signal representations,<br />
in the proposed technique we use an adaptive time-frequency transformation (ATFT) based on the<br />
matching pursuit algorithm. The nature of the ATFT based on<br />
CCECE 2004 - CCGEI 2004, Niagara Falls, May/mai 2004<br />
0-7803-8253-6/04/$17.00 © 2004 IEEE<br />
Sridhar Krishnan<br />
Dept. of Electrical and Computer Eng.,<br />
<strong>Ryerson</strong> <strong>University</strong>,<br />
Toronto, Ontario, CANADA.<br />
Email: krishnan@ee.ryerson.ca<br />
matching pursuit is different from the wavelet packet transform (unlike the wavelet/wavelet packet<br />
transform, the scale and frequency parameters are not related in ATFT), hence the LDB approach to<br />
identifying the target subspace has to be modified before it can be applied to the matching-pursuit-based<br />
ATFT decomposition.<br />
In this paper we demonstrate the process of selecting a sub-dictionary (subspace) from a redundant<br />
dictionary using an approach similar to LDB for classification applications. The selected sub-dictionaries<br />
were then used to decompose two biomedical signal databases over a localized TF plane. Features were<br />
extracted and classification was performed. The paper is organized as follows: Section II covers the<br />
methodology, consisting of ATFT, LDB, feature extraction and pattern classification. Results and<br />
conclusions are given in Section III.<br />
2. METHODOLOGY<br />
2.1. Adaptive Time-frequency Transform<br />
The signal decomposition technique used in this work is based on the matching pursuit (MP) [2] algorithm.<br />
MP is a general framework for signal decomposition. The nature of the decomposition varies according to<br />
the dictionary of basis functions used. When a dictionary of TF functions is used, MP yields an adaptive<br />
time-frequency transformation [2]. In MP any signal x(t) is decomposed into a linear combination of TF<br />
functions g_γn(t) selected from a redundant dictionary of TF functions:<br />
x(t) = Σ_n a_n g_γn(t),    (1)<br />
where<br />
g_γn(t) = (1/√s_n) g((t − p_n)/s_n) exp[j(2π f_n t + φ_n)]    (2)<br />
and a_n are the expansion coefficients. The scale factor s_n, also called the octave or time-width<br />
parameter, is used to control the width of the window function, and the parameter p_n controls the<br />
temporal placement. The parameters f_n and φ_n are the frequency and phase of the exponential function<br />
respectively. The signal x(t) is projected over a redundant dictionary of TF functions with all possible<br />
combinations of scalings, translations and modulations. The dictionary of TF functions can be suitably<br />
modified or selected based on the application at hand. In our technique, we use the Gabor dictionary<br />
(Gaussian functions), which has the best TF localization properties. At each iteration, the TF function<br />
best correlated to the local signal structures is selected from the dictionary. The remaining signal,<br />
called the residue, is further decomposed in the same way at each iteration, subdividing the signal into<br />
TF functions.<br />
Theoretically, when using a redundant dictionary, the decomposition parameters a_n, s_n, f_n, p_n and φ_n<br />
can take any values within their ranges. However, in the practical discrete implementation used in this<br />
work, s_n can vary in powers of 2 from 2^1 to 2^14, f_n can vary from 0 to Fs/2 (Fs is the sampling<br />
frequency), p_n can vary from 0 to the signal size, and φ_n can vary from 0 to 1. The possible values<br />
taken by these parameters can be restricted to construct a sub-dictionary; in Section 2.3 we will<br />
demonstrate how these parameters can be restricted to a localized area in the TF plane for classification<br />
applications.<br />
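As a rough illustration of the decomposition described above, the following sketch implements a greedy matching pursuit over a small discretized real Gabor dictionary. The atom normalization, parameter grids and toy signal are all assumptions made for the example, not the authors' implementation:

```python
import numpy as np

def gabor_atom(N, s, p, f, phi):
    """Real Gabor atom: Gaussian window of width s centred at p,
    modulated at normalized frequency f with phase phi; unit energy."""
    t = np.arange(N)
    g = np.exp(-np.pi * ((t - p) / s) ** 2) * np.cos(2 * np.pi * f * t + phi)
    return g / np.linalg.norm(g)

def matching_pursuit(x, scales, positions, freqs, n_iter=10):
    """Greedy MP: at each iteration pick the dictionary atom with the
    largest inner product with the residue, then subtract its projection."""
    residue = x.astype(float).copy()
    params = []
    for _ in range(n_iter):
        best = None
        for s in scales:
            for p in positions:
                for f in freqs:
                    g = gabor_atom(len(x), s, p, f, 0.0)
                    a = residue @ g
                    if best is None or abs(a) > abs(best[0]):
                        best = (a, s, p, f, g)
        a, s, p, f, g = best
        residue -= a * g            # peel off the best-matching structure
        params.append((a, s, p, f))
    return params, residue

# toy signal: a single Gabor-like burst that the dictionary contains exactly
N = 64
x = gabor_atom(N, 8, 32, 0.25, 0.0) * 3.0
params, residue = matching_pursuit(
    x, scales=[4, 8, 16], positions=range(0, N, 8), freqs=[0.125, 0.25], n_iter=3)
print(np.linalg.norm(residue) < 1e-6)   # the burst is captured exactly
```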
2.2. Local Discriminant Bases Algorithm<br />
In the LDB [1] algorithm (using wavelet packet bases), a set of training signals for each of the C classes<br />
is decomposed into a full tree structure of order N. We restrict our explanation to binary wavelet packet<br />
trees. Let Ω_{0,0} be a vector space in R^n corresponding to node 0 of the parent tree. Then at each level<br />
the vector space is split into two mutually orthogonal subspaces, given by<br />
Ω_{j,k} = Ω_{j+1,2k} ⊕ Ω_{j+1,2k+1}, where j indicates the level of the tree and k represents the node<br />
index in level j, given by k = 0, ..., 2^j − 1. This process repeats until level J, giving rise<br />
to 2^J mutually orthogonal subspaces. The goal is to select the set of best subspaces that provide maximum<br />
discriminant information between the classes of signals. Each node (j, k) contains a set of basis vectors<br />
B_{j,k} = {w_{j,k,l}}, l = 0, ..., 2^{n0−j} − 1, where 2^{n0} corresponds to the length of the signal.<br />
Then the signals x_i can be represented by a set of coefficients c as:<br />
x_i = Σ_{j,k,l} c_{j,k,l} w_{j,k,l} .    (3)<br />
The time index of the signals x_i has been dropped for notational convenience. Basically, the signals x_i<br />
are decomposed into 2^J subspaces with c_{j,k,l} coefficients in each subspace. The subspaces which exhibit<br />
high values of the discriminant measure D are then used to expand the testing set signals, and features are<br />
extracted for classification.<br />
Unlike the wavelet packet decomposition, in ATFT the scale and frequency parameters are not explicitly<br />
related. The LDB strategy of splitting a subspace (node) to obtain children subspaces (nodes) does not<br />
apply to ATFT. In ATFT any scale can occur in combination with any frequency (restricted only by the<br />
uncertainty principle), giving it the extreme flexibility to obtain any local TF resolution. So we have to<br />
adapt the LDB approach before it can be applied to ATFT.<br />
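The recursive split Ω_{j,k} = Ω_{j+1,2k} ⊕ Ω_{j+1,2k+1} can be illustrated with the simplest orthonormal case, a Haar wavelet packet tree. This is an illustrative sketch (the Haar filters and toy signal are assumptions), showing that each level partitions the parent space into mutually orthogonal, energy-preserving subspaces:

```python
import numpy as np

def haar_split(v):
    """One wavelet packet split: project a parent node's coefficients onto
    two mutually orthogonal child subspaces (scaled average / difference)."""
    even, odd = v[0::2], v[1::2]
    return (even + odd) / np.sqrt(2), (even - odd) / np.sqrt(2)

def packet_tree(x, levels):
    """Full binary wavelet packet tree: nodes[j][k] holds the coefficients
    of subspace (j, k); level j has 2**j nodes."""
    nodes = [[np.asarray(x, dtype=float)]]
    for j in range(levels):
        nxt = []
        for v in nodes[j]:
            lo, hi = haar_split(v)
            nxt.extend([lo, hi])
        nodes.append(nxt)
    return nodes

x = np.array([1.0, 2.0, 3.0, 4.0, 4.0, 3.0, 2.0, 1.0])
tree = packet_tree(x, levels=2)
# orthogonal transform: total energy is preserved at every level
print(all(np.isclose(sum(np.sum(v**2) for v in lvl), np.sum(x**2))
          for lvl in tree))
```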
2.3. Sub-dictionary Selection Process<br />
As we will be performing a two-group ciassikation on the<br />
given datasets, the following sub-dictionary selection process will<br />
be explained with a two-group ciassiPcation of signals (Class A<br />
and Class B). Coarse TF decompositions were performed on the<br />
training sets of both classes of signals. Coarse TF decompositions<br />
can be achieved, by controlling the step size of the decomposi-<br />
tion parameters. In the proposed technique, out of the possible<br />
114<br />
sn values (2' to 214), we force the decomposition to select only<br />
scales of sl = 2', 92 = Z6, 93 = 2" and s4 = 214. Simi-<br />
larly we group the f,, parameters into frequency bands off 1 = 0<br />
to Fs/8, f2 = Fs/8 to Fs/4, f3 = Fs/4 to 3 * Fs/8 and<br />
f4 = 3 * Fs/8 to Fs/2. As we choose to completely cover the<br />
frequency range in 4 bands without gaps, we allow the decompo-<br />
sition to choose fn from the complete range 0 to Fs/2. Later<br />
in the processing of the decomposition parameters we group them<br />
into four frequency bands. With all the training signals decom-<br />
posed using the restricted s, values, we group the decomposition<br />
parameters in combinations of (sl, s2, s3 and s4) and (f 1, f 2, f 3<br />
and f4). So in total we will be able to group them into 16 cells<br />
as shown in Fig. I. The cells in the respective time-width versus<br />
frequency band mapping are numbered as A1 to A16 and B1 to<br />
B16.<br />
Here it should be noted that the time-width axis in Fig. 1 does not correspond to time, but to scale<br />
(time-width). During the decomposition process any scale parameter can occur at any time position, so<br />
arranging the scale parameters from low to high does not mean they occur in that order in real time. Once<br />
we obtain this time-width versus frequency band mapping for the training set of signals of both classes,<br />
we average the mappings to get an averaged time-width versus frequency band mapping for each class.<br />
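The grouping of decomposition parameters into the 16 cells can be sketched as follows. The atom list and sampling frequency are hypothetical; each atom is assumed to carry an energy value, a scale exponent restricted to {2, 6, 10, 14} as in the coarse stage, and a frequency in Hz:

```python
import numpy as np

def cell_map(atoms, fs):
    """Group decomposition atoms into a 4x4 time-width versus frequency-band
    map. `atoms` is a list of (energy, scale_exponent, freq_hz) triples."""
    scale_bins = {2: 0, 6: 1, 10: 2, 14: 3}     # coarse scales 2^2..2^14
    cells = np.zeros((4, 4))
    for energy, s_exp, f in atoms:
        row = scale_bins[s_exp]
        col = min(int(f // (fs / 8)), 3)        # bands of width Fs/8 up to Fs/2
        cells[row, col] += energy
    return cells / cells.sum()                  # normalized energy per cell

# hypothetical atoms: (energy, scale exponent, frequency in Hz), fs = 8 kHz
atoms = [(4.0, 2, 500.0), (3.0, 6, 1500.0), (2.0, 10, 2500.0), (1.0, 2, 3500.0)]
m = cell_map(atoms, fs=8000.0)
print(m[0, 0], m[1, 1], m[2, 2], m[0, 3])      # 0.4 0.3 0.2 0.1
```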
In order to identify the cells which demonstrate high discriminant values between the classes, we use an<br />
approach similar to LDB. We define a discriminant measure D, which is used to compare the corresponding<br />
cells in the time-width versus frequency band mappings. Unlike LDB with orthonormal bases, where the set<br />
of basis functions and their variations are limited and fixed, in ATFT the variations can theoretically be<br />
limitless (restricted only by the uncertainty principle). In other words, the TF tiling (TF resolution) is<br />
fixed for a particular scale function of wavelets/wavelet packets, although their position in the TF plane<br />
can be altered (wavelet packets). In ATFT it is difficult to assign a fixed subspace shape or size based<br />
only on the scale parameter or the frequency parameter. Hence we choose both scale and frequency to assign<br />
a subspace on the TF plane. However, this cannot be generalized, as the combinations of scale and<br />
frequency can be limitless (restricted only by the uncertainty principle), depending on the signal<br />
structures.<br />
In the proposed technique we use the normalized cumulative energy difference between the cells as the<br />
discriminant measure D. The discriminant measure D is given by:<br />
D = | E_A − E_B |    (4)<br />
and<br />
E = ( Σ_{i=1}^{k} a_i² ) / E_T    (5)<br />
where E is the normalized cumulative energy of the TF functions in a cell, a_i is the energy coefficient<br />
of the TF function, k is the number of TF functions grouped in a cell, and E_T is the total decomposed<br />
energy of the signal.<br />
We compare these cumulative energies of the corresponding cells and compute D. We sort the cells in<br />
decreasing order of D. The cells which yield high values of D exhibit significant differences between the<br />
classes; this indicates that the target space for classification lies within these cells. Fig. 1<br />
pictorially explains the way we compare the corresponding cells using D; as an example, we have shown a<br />
possible outcome with 5 cells (cross-hatched) identified as the highly discriminative cells between<br />
classes. These 5<br />
cells are chosen as the first five cells when sorted in decreasing order of their discriminant values. Now<br />
we identify the range covering all these cells on both the time-width and frequency band axes. In the<br />
given example, as shown in Fig. 1 with dotted lines, we choose the frequency band axis ranging from 0 to<br />
3Fs/8 and the time-width ranging from s1 to s2. Once these ranges are identified, we restrict the<br />
redundant dictionary and construct a sub-dictionary with these ranges of time-widths and frequencies. The<br />
testing set signals are then decomposed using this sub-dictionary, enabling them to zoom into only the<br />
target space in the TF plane. In decomposing the testing set signals using the sub-dictionary, we allow<br />
the decomposition to proceed in fine steps of time-width and frequency. This targeted decomposition yields<br />
parameters that contain highly discriminatory information between the classes. Features are extracted from<br />
these decomposition parameters and fed to a Linear Discriminant <strong>Analysis</strong> (LDA) based classifier, as will<br />
be explained in subsequent sections.<br />
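The cell comparison and range selection steps above can be sketched as follows, assuming two averaged 4x4 class maps (the example maps are hypothetical). The minimum and maximum of the selected rows and columns then give the time-width and frequency ranges of the sub-dictionary:

```python
import numpy as np

def discriminant_cells(map_a, map_b, n_top=5):
    """Rank cells by the normalized cumulative energy difference
    D = |E_A - E_B| and return the indices of the most discriminative."""
    d = np.abs(map_a - map_b).ravel()
    order = np.argsort(d)[::-1]          # decreasing D
    return [np.unravel_index(i, map_a.shape) for i in order[:n_top]]

# hypothetical averaged class maps (rows: time-widths, cols: frequency bands)
a = np.array([[0.40, 0.10, 0.0, 0.0],
              [0.30, 0.20, 0.0, 0.0],
              [0.00, 0.00, 0.0, 0.0],
              [0.00, 0.00, 0.0, 0.0]])
b = np.array([[0.10, 0.10, 0.0, 0.0],
              [0.10, 0.60, 0.0, 0.0],
              [0.10, 0.00, 0.0, 0.0],
              [0.00, 0.00, 0.0, 0.0]])
top = discriminant_cells(a, b, n_top=3)
rows = [r for r, c in top]
cols = [c for r, c in top]
# the sub-dictionary ranges must cover all selected cells on both axes
print(min(rows), max(rows), min(cols), max(cols))   # 0 1 0 1
```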
Fig. 2. Sub-dictionary selection for VAG (a) and Pathological<br />
speech signals (b).<br />
2.4. Feature Extraction and Pattern Classification<br />
We use the following two highly non-stationary databases for<br />
Fig. 1. Sub-dictionary selection process<br />
testing with our proposed technique: 1. Vibroarthrographic (VAG) signals and 2. Pathological speech<br />
signals. Vibroarthrographic signals are the signals emitted from the human knee joints during an active<br />
movement of the leg; more details of this database can be found in [3]. The pathological speech database<br />
contains speech signals from both normal and pathological talkers; more details of this database can be<br />
found in [4].<br />
As explained in Section 2.3, we obtained sub-dictionaries for both the VAG and pathological speech<br />
databases. In performing the coarse TF decomposition on the training set, a faster version of the ATFT<br />
algorithm [5] was used with 2000 iterations. 10 randomly selected signals from each class, from both the<br />
VAG and pathological speech databases, were used as the training set. We used the first 3 highly<br />
discriminating cells in arriving at the time-width and frequency ranges. Figs. 2(a) and 2(b) show the<br />
ranges (cross-hatched) obtained for time-width and frequency for the VAG and pathological speech databases<br />
respectively. For the VAG database, based on the chosen cells, the time-width varies from 2^2 to 2^10 and<br />
the frequency varies from 0 to Fs/4. For the pathological speech database, the time-width varies from 2^6<br />
to 2^14 and the frequency varies from 0 to Fs/8. All 89 VAG signals and 100 pathological speech signals<br />
were decomposed using their corresponding sub-dictionaries with the regular ATFT algorithm. We use fine<br />
steps of time-width within the range of the sub-dictionary. The iterations were limited to 1000 for both<br />
VAG and pathological speech signals, as we are only interested in the discriminative subspace in the TF<br />
plane and do not require a complete decomposition of the signal. Due to the sub-dictionary approach, the<br />
decomposition times were reduced by approximately 40% compared to using a full-range redundant dictionary.<br />
The reduction in decomposition time depends on how small the sub-dictionary is. The decomposition<br />
parameters were analyzed and the following features were extracted. 1. Number of TF functions (F1_cn):<br />
this feature is the number of TF functions falling into each of the cells covering the same area as the<br />
highly discriminative cells that were used to construct the sub-dictionary.<br />
2. Cumulative energy of the cells (F2_cn): we compute the cumulative energy contained in each of the cells<br />
that were used to compute F1_cn. It should be noted here that, as we use fine steps of time-width in the<br />
decomposition of the testing signals, more cells cover the same range of the sub-dictionary. For example,<br />
in arriving at the sub-dictionary for the VAG signals, we identified the cells corresponding to the 3<br />
time-widths 2^2, 2^6 and 2^10. However, in decomposing the testing signals we used fine steps of<br />
time-width, which means the time-width step size is reduced from 4 to 1, so we now have 9 cells covering<br />
the same time-width range.<br />
Both of the feature vectors explained above (F1_cn and F2_cn) were evaluated for their discriminant power,<br />
and only 9 out of the total 18 features (both feature vectors put together) were selected for<br />
classification. The selected 9 features contained features from both feature vectors (F1_cn and F2_cn).<br />
The motivation for the pattern classification is to automatically group signals of the same<br />
characteristics using the derived discriminatory features. In LDA, the feature vector derived as explained<br />
above was transformed into canonical discriminant functions such as<br />
f = u1 b1 + u2 b2 + ... + u9 b9 + a,    (6)<br />
where {u} is the set of features, and {b} and a are the coefficients and a constant, respectively,<br />
estimated using Fisher's linear discriminant functions [6]. Using the chi-square distances and the prior<br />
probabilities of each group, classification is performed to assign each sample to one of the groups. The<br />
classification accuracy was estimated using the leave-one-out method, which is known to provide a<br />
least-biased estimate [7].<br />
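A minimal two-class Fisher LDA with leave-one-out accuracy estimation can be sketched as below. This is not the SPSS procedure used by the authors (no chi-square distances or priors here), just an illustration of the classifier and the validation scheme on synthetic data:

```python
import numpy as np

def fisher_lda_fit(X, y):
    """Two-class Fisher linear discriminant: w = Sw^-1 (m1 - m0),
    threshold at the midpoint of the projected class means."""
    X0, X1 = X[y == 0], X[y == 1]
    m0, m1 = X0.mean(axis=0), X1.mean(axis=0)
    Sw = np.cov(X0, rowvar=False) * (len(X0) - 1) + \
         np.cov(X1, rowvar=False) * (len(X1) - 1)
    w = np.linalg.solve(Sw + 1e-6 * np.eye(X.shape[1]), m1 - m0)
    thresh = 0.5 * (X0 @ w).mean() + 0.5 * (X1 @ w).mean()
    return w, thresh

def loo_accuracy(X, y):
    """Leave-one-out estimate: refit on n-1 samples, test the held-out one."""
    hits = 0
    for i in range(len(y)):
        mask = np.arange(len(y)) != i
        w, t = fisher_lda_fit(X[mask], y[mask])
        hits += int((X[i] @ w > t) == (y[i] == 1))
    return hits / len(y)

# synthetic, well-separated two-class data
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (30, 2)), rng.normal(3, 1, (30, 2))])
y = np.array([0] * 30 + [1] * 30)
acc = loo_accuracy(X, y)
print(acc > 0.85)
```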
Table 1. Classification accuracy achieved for the VAG database. Regular = normal LDA, Cross.V =<br />
leave-one-out LDA, % = percentage of classification.<br />
Method | Groups | Normal | Abnormal | Total<br />
Regular | Normal | 35 | 16 | 51<br />
 | Abnormal | 9 | 29 | 38<br />
% | Normal | 68.6 | 31.4 | 100<br />
 | Abnormal | 23.7 | 76.3 | 100<br />
Table 2. Classification accuracy achieved for the pathological speech database. Regular = normal LDA,<br />
Cross.V = leave-one-out LDA, % = percentage of classification.<br />
Method | <strong>Group</strong>s | Normal | Pathological | Total<br />
Regular | Normal | 48 | 2 | 50<br />
 | Pathological | 6 | 44 | 50<br />
% | Normal | 96 | 4 | 100<br />
 | Pathological | 12 | 88 | 100<br />
Cross.V | Normal | 48 | 2 | 50<br />
 | Pathological | 6 | 44 | 50<br />
% | Normal | 96 | 4 | 100<br />
 | Pathological | 12 | 88 | 100<br />
3. RESULTS AND CONCLUSIONS<br />
This paper describes a novel way of constructing a target-specific sub-dictionary from a redundant<br />
dictionary for classification applications. High classification accuracies were achieved with<br />
approximately a 40% reduction in decomposition time. Two highly non-stationary biomedical databases were<br />
used to demonstrate the performance of the proposed technique.<br />
Features were extracted as explained in Section 2.4 for all 89 VAG signals and 100 pathological speech<br />
signals. They were fed to an LDA-based classifier using the SPSS software. Classification was performed<br />
and the results are given in Tables 1 and 2. For the VAG database, an overall classification accuracy of<br />
74.2% using regular LDA and 70% using the leave-one-out method was achieved; this is higher than the<br />
classification accuracy reported in [3]. For the pathological speech database, an overall classification<br />
accuracy of 92% using regular LDA and 92% using the leave-one-out method was achieved; this is higher than<br />
the classification accuracy reported in [4]. However, the classification accuracies for both databases are<br />
not higher than the authors' recent work with LDB-based classification (74.2% vs 78.6% and 92% vs 97%).<br />
The results obtained with the proposed technique can be justified considering the following facts: (1) the<br />
novelty involved in the sub-dictionary construction; (2) at the time of writing, only Gaussian basis<br />
functions had been tested with the databases; (3) reduced decomposition times; (4) simple features. Future<br />
work involves refining the proposed technique to include more bases and optimizing the targeted<br />
decompositions to yield higher classification accuracies than those reported.<br />
Acknowledgements<br />
The authors thankfully acknowledge NSERC for funding this project. The authors also acknowledge the<br />
LastWave software package group.<br />
References<br />
[1] N. Saito and R. R. Coifman, "Local discriminant bases and their applications," Journal of<br />
Mathematical Imaging and Vision, vol. 5, no. 4, pp. 337-358, 1995.<br />
[2] S. G. Mallat and Z. Zhang, "Matching pursuits with time-frequency dictionaries," IEEE Trans. <strong>Signal</strong><br />
Processing, vol. 41, no. 12, pp. 3397-3415, 1993.<br />
[3] S. Krishnan, "Adaptive signal processing techniques for analysis of knee joint vibroarthrographic<br />
signals," Ph.D. dissertation, <strong>University</strong> of Calgary, June 1999.<br />
[4] K. Umapathy, S. Krishnan, V. Parsa, and D. Jamieson, "Discrimination of pathological voices using an<br />
adaptive time-frequency approach," in Proceedings of ICASSP 2002, IEEE International Conference on<br />
Acoustics, Speech and Signal Processing, Orlando, USA, May 2002, pp. IV-3853-3855.<br />
[5] R. Gribonval, "Fast matching pursuit with a multiscale dictionary of Gaussian chirps," IEEE<br />
Transactions on <strong>Signal</strong> Processing, vol. 49, no. 5, May 2001.<br />
[6] SPSS Inc., "SPSS Advanced Statistics User's Guide," user manual, SPSS Inc., Chicago, IL, 1990.<br />
[7] K. Fukunaga, Introduction to Statistical Pattern Recognition, Academic Press, Inc., San Diego, CA,<br />
1990.<br />
Proceedings of the 25th Annual International Conference of the IEEE EMBS<br />
Cancun, Mexico, September 17-21, 2003<br />
Ultrasound Backscatter <strong>Signal</strong> Characterization and Classification Using<br />
Autoregressive Modeling and Machine Learning Algorithms<br />
Noushin R. Farnoud¹, Michael Kolios¹,²<br />
Co-author: Sridhar Krishnan¹<br />
¹Department of Electrical Engineering, <strong>Ryerson</strong> <strong>University</strong>, Toronto, Canada<br />
²Department of Math, Physics and Computer Science, <strong>Ryerson</strong> <strong>University</strong>, Toronto, Canada<br />
Abstract- This research explores the possibility of monitoring apoptosis and classifying clusters of<br />
apoptotic cells based on the changes in ultrasound backscatter signals from the tissues. The backscatter<br />
from normal and apoptotic cells, acquired using a high-frequency ultrasound instrument, is modeled using<br />
an autoregressive (AR) modeling technique. The proper model order is calculated by tracking the error<br />
criteria in the reconstruction of the original signal. The AR model coefficients, which are assumed to<br />
contain the main statistical features of the signal, are passed as the input to linear and nonlinear<br />
machine classifiers (Fisher Linear Discriminant, Conditional Gaussian Classifier, Naive Bayes Classifier<br />
and Neural Networks with nonlinear activation functions). In addition, an adaptive signal segmentation<br />
method (Least Squares Lattice Filter) is used to divide the data from layers of different cell types into<br />
stationary parts ready for modeling and classification.<br />
Keywords-Apoptosis, Ultrasound Backscatter<br />
Keywords-Apoptosis, Ultrasound Backscatter<br />
I. INTRODUCTION<br />
High frequency ultrasound (US) has been shown to detect the structural changes cells and tissues undergo<br />
during cell death. <strong>Research</strong> has shown that the ultrasound backscatter signals from apoptotic¹ acute<br />
myeloid leukemia (AML) cells differ in intensity and frequency spectrum as a result of the changes in<br />
size, spatial distribution and acoustic impedance of the scattering sources within the cell [1] (Fig. 1).<br />
Therefore, we assume that pulse-echo data from different cell types contain distinguishable statistical<br />
regularities. In this work we attempt to classify normal and apoptotic cancerous cells by tracking the<br />
statistics of the ultrasound backscatter signals from tissues, using the autoregressive (AR) method for<br />
time-series modeling of the ultrasound signals.<br />
II. METHODOLOGY<br />
A. Autoregressive (AR) Modeling of US signals<br />
Biomedical signals contain large quantities of data. Moreover, these data usually contain redundancies<br />
which make processing and analyzing them more difficult. In such situations, signal modeling may help to<br />
take out the<br />
¹ Apoptosis is a genetically determined destruction of cells from within, due to activation of a stimulus<br />
or removal of a suppressing agent or stimulus.<br />
0-7803-7789-3/03/$17.00 © 2003 IEEE<br />
Fig. 1. a) H&E² stains of normal cells; b) H&E stains of apoptotic cells.<br />
irrelevant information carried by the signal and simplify classification and segmentation by using a<br />
reduced number of model parameters. Autoregressive (AR) modeling is widely used for speech and biomedical<br />
signal processing [2-4]. This model is linear and has been successfully used for high-resolution spectral<br />
estimation [5]. An AR model is defined by the difference equation:<br />
x(n) = − Σ_{k=1}^{p} a_k x(n − k) + e(n)    (1)<br />
where x(n) is a wide-sense stationary³ AR process, {a_k} are the AR coefficients, e(n) is white Gaussian<br />
noise, and p is the model order, which determines the error criterion. In Section C, we present a way to<br />
estimate this error and reduce it by choosing the proper model order (p).<br />
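The AR coefficients of Eq. (1) can be estimated, for example, from the Yule-Walker equations. This is a generic sketch on a synthetic AR(2) process with known coefficients, not the estimator used in the paper:

```python
import numpy as np

def yule_walker(x, p):
    """Estimate AR(p) coefficients a_k from the sample autocorrelation,
    using the convention x(n) = -sum_k a_k x(n-k) + e(n)."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    N = len(x)
    r = np.array([np.dot(x[:N - k], x[k:]) / N for k in range(p + 1)])
    R = np.array([[r[abs(i - j)] for j in range(p)] for i in range(p)])
    a = np.linalg.solve(R, -r[1:])        # Yule-Walker normal equations
    noise_var = r[0] + np.dot(a, r[1:])   # prediction-error variance
    return a, noise_var

# synthetic AR(2) process with known coefficients a = [-1.5, 0.7] (stable poles)
rng = np.random.default_rng(1)
true_a = np.array([-1.5, 0.7])
x = np.zeros(5000)
for n in range(2, 5000):
    x[n] = -true_a[0] * x[n - 1] - true_a[1] * x[n - 2] + rng.normal()
a_hat, v = yule_walker(x, p=2)
print(np.allclose(a_hat, true_a, atol=0.1))
```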
B. Data Acquisition<br />
AML cells were grown in suspension and exposed to the<br />
chemotherapeutic cisplatin to induce apoptosis. Pellets were<br />
made by swing bucket centrifugation. Details on the<br />
biological procedure can be found elsewhere (Czemote et al.<br />
1996)[6]. A 20MHz f2.35 or 40 MHz f2 transducer (Visual<br />
Sonics4) was used to image the pellets of normal and<br />
apoptotic cells. RF backscatter data was digitized at<br />
SOOMHz and stored for later analysis. In one experiment,<br />
layers of normal and apoptotic cells were created to emulate<br />
a clinical situation.<br />
C. Choosing the proper Model Order<br />
The modeling order @) controls the error associated<br />
with the AR signal approximation. This parameter<br />
² Hematoxylin and Eosin.<br />
³ A stochastic process is called wide-sense stationary (WSS) if its mean is constant and its<br />
autocorrelation depends only on the time difference.<br />
⁴ www.visualsonics.com<br />
determines the number of previous samples used to model the original signal. A small model order ignores<br />
the main statistical properties of the original signal, while a large model order results in modeling the<br />
noise associated with the data, and over-fitting⁵ occurs. A very common method for estimating the proper<br />
model order is the Akaike Information Criterion (AIC) [7], although applying this method would be very<br />
difficult in our work due to the nature of US signals. Instead, we used the following parameters, based on<br />
the statistics of the reconstructed signal and its frequency content with different model orders, to<br />
determine the best modeling order.<br />
a) Ensemble Reconstruction Error<br />
The error (2) measures the total difference between the original and reconstructed signals in the<br />
frequency domain using the AR modeling technique:<br />
x̂(n) = − Σ_{k=1}^{p} a_k x(n − k),<br />
err = Σ_{n=1}^{N} ( f(n) − f̂(n) )²    (2)<br />
where x̂(n) is the approximated signal based on AR modeling with order p, N is the total number of samples<br />
within an individual RF line, and f and f̂ represent the FFTs of the original and estimated signals<br />
respectively.<br />
b) Model Noise (error) Variance<br />
The AR process is the output of an all-pole filter<br />
invoked by a white noise e(@. This noise, which is also<br />
our modeling error, can be viewed as the output of the<br />
prediction error filter A(z), as shown in Fig. 2, where<br />
x(n) is the original signal and A(z) is the transfer<br />
function of AR modeling.<br />
Fig. 2. Block diagram of AR process<br />
(Model error)<br />
Therefore we expect that, after estimating the AR<br />
coefficients of our model, if we apply a filter as shown<br />
in Fig. 2 with the estimated AR coefficients in A(z), the<br />
filter output e(n) would be white Gaussian noise. We<br />
can verify this by estimating the variance of the output<br />
of such a filter and its auto-correlation (which, when normalized,<br />
equals one at zero lag and remains close to zero otherwise).<br />
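This whiteness check can be sketched as follows — a minimal Python/NumPy illustration with a synthetic AR(2) signal standing in for an RF line, not the paper's Matlab code:<br />

```python
import numpy as np

# Minimal sketch (not the paper's exact procedure): fit an AR(p) model by
# least squares, run the prediction-error filter A(z), and check that the
# residual e(n) looks like white noise via its normalized autocorrelation.
def ar_fit(x, p):
    # Regression matrix whose k-th column holds x(n-k), k = 1..p.
    X = np.column_stack([x[p - k:len(x) - k] for k in range(1, p + 1)])
    # Model convention: x(n) = -sum_k a_k x(n-k) + e(n).
    a, *_ = np.linalg.lstsq(X, -x[p:], rcond=None)
    return a

def whiteness_check(x, a):
    p = len(a)
    X = np.column_stack([x[p - k:len(x) - k] for k in range(1, p + 1)])
    e = x[p:] + X @ a                       # prediction-error filter output
    r = np.correlate(e, e, mode="full")
    r = r[r.size // 2:] / r[r.size // 2]    # normalized: r[0] == 1
    return e, r

# Synthetic stand-in for an RF line: an AR(2) process driven by white noise.
rng = np.random.default_rng(0)
x = np.zeros(2000)
for n in range(2, 2000):
    x[n] = 0.75 * x[n - 1] - 0.5 * x[n - 2] + rng.standard_normal()

a = ar_fit(x, 2)
e, r = whiteness_check(x, a)
print(a)        # close to the true coefficients [-0.75, 0.5]
print(r[:4])    # near [1, 0, 0, 0] for a white residual
```

The near-delta shape of the normalized autocorrelation is exactly the criterion used in the text to accept a model order.<br />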
D. <strong>Signal</strong> Segmentation<br />
The classification methods we discussed were based on<br />
US backscatter from pure apoptotic and normal cell pellets.<br />
⁵ Over-fitting occurs when the model does well on training data but poorly on test data.<br />
In patient imaging, the data are acquired from tissues which<br />
contain different layers, or layers with different mixtures of<br />
normal and apoptotic cells. The probabilistic behavior of the<br />
backscattered US signal from these cells makes the signal<br />
non-stationary⁶. This non-stationarity is important from the<br />
point of view of AR modeling, as this method is applicable<br />
only if the signal is stationary⁷. Therefore we must use signal<br />
segmentation algorithms to break the signal acquired from<br />
tissues into stationary segments and classify each segment<br />
separately. Segmentation algorithms can be classified<br />
into fixed⁸ [8] and adaptive [2, 9-11]. Adaptive segmentation<br />
algorithms rely on tracking the statistical changes in the<br />
signal (such as mean and variance) to set a breaking<br />
boundary. We used this method for US signals due to its<br />
accuracy, modularity and ease of testing [2].<br />
E. Adaptive Signal Segmentation: Recursive Least-Squares<br />
Lattice Filter (RLSL)<br />
In adaptive segmentation, the segment length changes<br />
dynamically according to the statistical changes in the<br />
signal. The main idea of using the RLSL filter is to achieve<br />
fast convergence by using forward and backward filters. The<br />
parameter which expresses the statistical change in the<br />
signal is called the convergence factor, γ_m(n). The convergence<br />
factor provides the connecting link between the different sets of<br />
a priori and a posteriori estimation errors in this algorithm and<br />
is defined by<br />
<br />
γ_m(n) = γ_{m−1}(n) − γ_{m−1}²(n) |b_{m−1}(n)|² / B_{m−1}(n)<br />
<br />
where m is the order of the lattice filter, γ_m(n) is the<br />
convergence factor at time sample n in the mth stage of the<br />
lattice, and b_{m−1}(n) and B_{m−1}(n) are the backward prediction<br />
error and its power at this stage [2].<br />
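A much simplified stand-in for this idea can be sketched as follows — it tracks a variance-ratio statistic between adjacent windows instead of the RLSL convergence factor, and the window size and threshold are illustrative assumptions, not values from the paper:<br />

```python
import numpy as np

# Simplified adaptive segmentation sketch: compare the variance of two
# adjacent sliding windows and flag a segment boundary when the ratio
# jumps (a crude stand-in for the RLSL convergence factor dropping
# below a threshold).
def segment_boundaries(x, win=200, threshold=4.0):
    bounds, n = [], win
    while n + win <= len(x):
        v1 = np.var(x[n - win:n])
        v2 = np.var(x[n:n + win])
        ratio = max(v1, v2) / max(min(v1, v2), 1e-12)
        if ratio > threshold:
            bounds.append(n)
            n += win          # skip ahead so one change is flagged once
        else:
            n += win // 4
    return bounds

# Three-layer synthetic signal (quiet / loud / quiet), emulating the
# different backscatter statistics of normal and apoptotic layers.
rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(0, 1.0, 800),
                    rng.normal(0, 3.0, 700),
                    rng.normal(0, 1.0, 500)])
print(segment_boundaries(x))   # boundaries near samples 800 and 1500
```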
IV. RESULTS<br />
a) Model Order Determination for Autoregressive (AR)<br />
Modeling of US signals<br />
Using the error criteria explained in Section C, we<br />
calculated the error between the frequency spectra of the<br />
reconstructed and original US signals, averaged over 30<br />
normal and apoptotic sample RF lines respectively (Fig. 3).<br />
Matlab (version 6.5) was used for all the calculations. Also,<br />
as explained in Section D, we found the variance of the<br />
⁶ The statistics of a non-stationary process vary with respect to<br />
translation along the time axis.<br />
⁷ We have determined that US signals from normal and apoptotic cells<br />
are quasi-stationary.<br />
⁸ Fixed segmentation algorithms are widely used for speech signal<br />
processing.<br />
estimated noise generated as the output of a filter with the<br />
estimated AR coefficients in its transfer function and the<br />
original signal as its input. The result of averaging the<br />
variance of this noise over 30 samples is shown in Fig. 4.<br />
These graphs indicate that model order 15 (p = 15) is a good<br />
choice of AR modeling order for high frequency US<br />
backscatter signals, as we do not see much improvement in<br />
the ensemble error (the ratio of error between model orders 15<br />
and 40 is 2.6, in comparison to 2.9×10⁵ between model orders 1<br />
and 15). Furthermore, the variance of the estimated model<br />
noise does not change dramatically after this model order.<br />
To verify this result, we modeled a US backscatter signal<br />
with order 15, reconstructed this signal with the estimated<br />
AR coefficients and found the auto-correlation of the model<br />
error⁹ (noise). As depicted in Fig. 5, this auto-correlation<br />
indicates the similarity of the estimated error to white noise.<br />
Therefore we used AR modeling with order 15 for US<br />
backscatter signals in the rest of this paper.<br />
Fig. 3: Average ensemble error between the FFTs of the estimated<br />
and original US signals (30 samples of normal and apoptotic signals).<br />
Fig. 4: Average variance of the estimated model noise based on the<br />
estimated AR coefficients (30 samples).<br />
⁹ This error was assumed to be the absolute difference between the original<br />
and reconstructed signals.<br />
Fig. 5: Auto-correlation of the estimated model error (noise)<br />
b) Ultrasound <strong>Signal</strong> Classification<br />
Algorithm | Normal Accuracy | Apoptotic Accuracy<br />
Conditional Gaussian Classifier¹⁰ | 40% | 46%<br />
Naive Bayes Classifier | 60% | 71%<br />
Fisher's Linear Discriminant | 98% | 64%<br />
Neural Network with sigmoid activation¹¹ | 93.8% | 99%<br />
Neural Network with tanh activation | 95.5% | 99%<br />
This result shows the ability of Neural Networks with non-<br />
linear activation functions (in both hidden and output layers)<br />
to classify US signals from normal and apoptotic cells. We<br />
are still investigating the advantages and disadvantages of<br />
each approach.<br />
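For illustration, a minimal Gaussian naive Bayes classifier with equal priors, of the kind compared above, can be sketched on synthetic two-class features; the data here are assumed stand-ins for AR-derived features, not real ultrasound measurements:<br />

```python
import numpy as np

# Minimal Gaussian naive Bayes with equal priors (p = 0.5) on synthetic
# two-dimensional features for a "normal" and an "apoptotic" class.
rng = np.random.default_rng(3)
normal = rng.normal([0.0, 0.0], 0.5, size=(200, 2))      # class 0
apoptotic = rng.normal([1.5, 1.0], 0.5, size=(200, 2))   # class 1
X = np.vstack([normal, apoptotic])
y = np.array([0] * 200 + [1] * 200)

train = np.arange(0, 400, 2)   # even rows train, odd rows test
test = np.arange(1, 400, 2)

mu = np.array([X[train][y[train] == c].mean(axis=0) for c in (0, 1)])
var = np.array([X[train][y[train] == c].var(axis=0) for c in (0, 1)])

def predict(v):
    # Per-class Gaussian log-likelihood, summed over independent features;
    # equal priors drop out of the argmax.
    ll = -0.5 * np.sum(np.log(2 * np.pi * var) + (v - mu) ** 2 / var, axis=1)
    return int(np.argmax(ll))

acc = np.mean([predict(X[i]) == y[i] for i in test])
print(f"test accuracy: {acc:.2f}")
```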
c) Ultrasound <strong>Signal</strong> Segmentation<br />
Fig. 5 shows the RLSL algorithm applied to a three-layer<br />
Normal-Apoptotic-Normal cell pellet, with the apoptotic<br />
layer located between samples 800 and 1500. As long as<br />
the input data is stationary, the convergence factor<br />
remains in the same range, but when it drops below a<br />
threshold¹², it indicates a sudden change in the statistical<br />
properties of the signal, which is set as the segment<br />
boundary.<br />
¹⁰ The priors for each class were set equally (p = 0.5).<br />
¹¹ The network was trained using 50000 iterations.<br />
Fig. 5. (a): Original signal from a 3-layer Normal-Apoptotic-Normal cell<br />
pellet. (b): Convergence factor as a parameter to detect the layer boundaries<br />
(stationarity changes).<br />
These figures indicate that the RLSL algorithm can detect<br />
sudden changes in the signal due to the different statistical<br />
properties of normal and apoptotic layers, and therefore can<br />
adaptively find their corresponding boundary in a US<br />
backscatter signal. While in Fig. 5(a) the difference is<br />
evident, in clinical situations it is anticipated that a small<br />
percentage of apoptotic cells would be surrounded by<br />
normal cells.<br />
V. CONCLUSION<br />
The best model order when using the AR technique for US<br />
signals was found to be p = 15. The accuracy of different<br />
classifiers has been studied, and it was found that non-linear<br />
neural networks were the most successful in classification.<br />
Because the actual clinical data from patients include US<br />
backscatter from layers and mixtures of cells, a method for<br />
¹² The threshold in this work is set by visual inspection (however, in<br />
the future it will be extracted from the signal based on its statistical<br />
properties).<br />
differentiating these layers was presented, which enables<br />
AR modeling to be applicable to US signals.<br />
ACKNOWLEDGMENT<br />
We thank Dr. Michael Sherar and the Ontario Cancer<br />
Institute of the Princess Margaret Hospital for their support,<br />
Anoja Giles for helping us with the biological work, and Dr.<br />
Gregory Czarnota for his scientific input. Noushin<br />
R. Farnoud would also like to thank Dr. Sam Roweis of the<br />
Computer Science Department of the <strong>University</strong> of Toronto<br />
for his help with the machine learning algorithms.<br />
REFERENCES<br />
[1] MC. Kolios, GJ. Czarnota, M. Lee, JW. Hunt, MD. Sherar,<br />
Ultrasonic spectral parameter characterization of apoptosis,<br />
Ultrasound Med. & Biol., 28(5): 589-597, May 2002.<br />
[2] S. Krishnan, Adaptive Filtering, Modeling, and Classification of Knee<br />
Joint Vibroarthrographic <strong>Signal</strong>s, Master's Thesis, <strong>University</strong> of<br />
Calgary, Alberta, Canada, 1996.<br />
[3] F. Towfiq, C.W. Barnes, E.J. Pisa, Tissue classification based on<br />
autoregressive models for ultrasound pulse echo data, ACTA<br />
Electronica, 26: 95-110, 1984.<br />
[4] A. Nair, BD. Kuban, N. Obuchowski, DG. Vince, Assessing spectral<br />
algorithms to predict atherosclerotic plaque composition with<br />
normalized and raw intravascular ultrasound, Ultrasound in Med. &<br />
Biol., 27(10): 1319-1331, 2001.<br />
[5] M. Akay, Time Frequency and Wavelets in Biomedical <strong>Signal</strong><br />
Processing, Piscataway, NJ: IEEE Press, 1998: 123-135.<br />
[6] GJ. Czarnota, MC. Kolios, J. Abraham, M. Portnoy, FP.<br />
Ottensmeyer, JW. Hunt, MD. Sherar, Ultrasound imaging of<br />
apoptosis: high-resolution non-invasive monitoring of programmed<br />
cell death in vitro, in situ and in vivo, Br. J. Cancer, 81(3): 520-527,<br />
Oct. 1999.<br />
[7] Y. Sakamoto, M. Ishiguro, G. Kitagawa, Akaike Information Criterion<br />
Statistics, D. Reidel Publishing Company, KTK Scientific Publishers,<br />
Tokyo, ISBN 90-277-2253-6, November 1986.<br />
[8] J.D. Markel, A.H. Gray, Jr., Linear Prediction of Speech, Springer-<br />
Verlag, New York, NY, 1976.<br />
[9] D. Michael, J. Houchin, Automatic EEG analysis: A segmentation<br />
procedure based on the autocorrelation function, Electroenceph.<br />
Clinical Neurophysiology, 46: 232-235, 1979.<br />
[10] G. Bodenstein, H.M. Praetorius, Feature extraction from the<br />
electroencephalogram by adaptive segmentation, Proceedings of the<br />
IEEE, 65(5): 642-652, May 1977.<br />
[11] H.M. Praetorius, G. Bodenstein, O.D. Creutzfeldt, Adaptive<br />
segmentation of EEG records: A new approach to automatic EEG<br />
analysis, Electroenceph. Clinical Neurophysiology, vol. 42,<br />
pp. 84-91, 1977.<br />
[12] T. Mitchell, Machine Learning, McGraw Hill, 1997.<br />
[13] C.D. Nugent, J.A. Lopez, A.E. Smith, Prediction models in design of<br />
neural network based ECG classifiers, BMC Medical Informatics<br />
and Decision Making, 2001.<br />
[14] S. Chakrabarti, N. Bindal, Robust radar target classifier using<br />
artificial neural networks, IEEE Transactions on Neural Networks,<br />
6(3), May 1995.<br />
[15] D. Docampo, Intelligent Methods in <strong>Signal</strong> Processing and<br />
Communications, Birkhauser, Boston, 1997.<br />
[16] D.M. Skapura, Building Neural Networks: Algorithms, Applications,<br />
and Programming Techniques, ACM Press, 1998.<br />
[17] J.A. Freeman, D.M. Skapura, Building Neural Networks, ACM Press,<br />
1998.
ROBUST AUDIO WATERMARKING USING A CHIRP BASED TECHNIQUE<br />
Serhat Erküçük, Sridhar Krishnan and Mehmet Zeytinoglu<br />
Department of Electrical and Computer Engineering<br />
<strong>Ryerson</strong> <strong>University</strong>, Toronto, ON M5B 2K3, Canada<br />
e-mail: (serkucuk)(krishnan)(mzeytin)@ee.ryerson.ca<br />
ABSTRACT<br />
In this study, we propose a new spread spectrum audio wa-<br />
termarking algorithm that embeds linear chirps as water-<br />
mark messages. Different chirp rates, i.e., slopes on the<br />
time-frequency (TF) plane, represent watermark messages<br />
such that each slope corresponds to a different message.<br />
We extract the watermark message using a line detection al-<br />
gorithm based on the Hough-Radon transform (HRT). The<br />
HRT detects the directional elements that satisfy a paramet-<br />
ric constraint in the image of a TF plane. The proposed<br />
method not only detects the presence of the watermark, but also<br />
extracts the embedded watermark bits and ensures the mes-<br />
sage is received correctly. The results show that the HRT de-<br />
tects the embedded watermark message even after common<br />
signal processing operations such as MPEG audio coding,<br />
resampling, lowpass filtering and amplitude re-scaling.<br />
1. INTRODUCTION<br />
In recent years, the digital format has become the standard<br />
for the representation of multimedia content. Today’s tech-<br />
nology allows the copying and redistribution of multime-<br />
dia content over the Internet at a very low or no cost. This<br />
has become a serious threat for multimedia content owners.<br />
Therefore, there is significant interest to protect copyright<br />
ownership of multimedia content (audio, image, and video).<br />
Watermarking is the process of embedding additional data<br />
into the host signal for copyright ownership. The embed-<br />
ded data characterizes the owner of the data and should be<br />
extracted to prove the ownership. Besides copyright protec-<br />
tion, watermarking may be used for data monitoring, finger-<br />
printing, and observing content manipulations. All water-<br />
marking techniques should satisfy a set of requirements [1].<br />
In particular, the embedded watermark should be: (i) imper-<br />
ceptible, (ii) undetectable to prevent unauthorized removal,<br />
(iii) resistant to all signal manipulations, and (iv) extractable<br />
to prove ownership. Before the proposed technique is made<br />
public, all the above requirements should be met.<br />
This work was supported by NSERC and Micronet.<br />
The watermarking literature describes two classes of wa-<br />
termarking schemes. The first class of techniques, called<br />
one-bit watermarks [2], only detects the presence of the<br />
watermark rather than extracting it [3, 4, 5]. The second<br />
class of techniques detects and extracts the embedded wa-<br />
termark message [6, 7, 81. If b bits represent the embedded<br />
watermark message, the detector has the task of correctly<br />
identifying the watermark message from an alphabet of 2 *<br />
possible watermark messages. As a result of signal manipu-<br />
lations some message bits extracted by the detector may be<br />
in error potentially resulting in the detection of the wrong<br />
watermark message. Our motivation for the proposed au-<br />
dio watermarking algorithm is to detect the presence of the<br />
watermark, extract the embedded watermark message bits<br />
and decide on the watermark message even if some bits are<br />
received in error. To achieve this goal, we embed linear<br />
chirps as watermark messages. Different chirp rates, i.e.,<br />
slopes on the TF plane, represent watermark messages such<br />
that each slope corresponds to a different message. The nar-<br />
rowband watermark messages are spread with a watermark<br />
key (binary PN sequence) across a wider range of frequen-<br />
cies before embedding. The resulting wideband noise is<br />
perceptually shaped and added to the original signal. The<br />
original and watermarked signals exhibit no perceptual dif-<br />
ferences. At the receiver a line detection algorithm based<br />
on the Hough-Radon transform (HRT) detects the slope of<br />
the extracted chirp in the image of the TF plane, even at<br />
discontinuities corresponding to bit errors.<br />
2. WATERMARK EMBEDDING<br />
Let x = [x(0) x(1) ...]^T be the audio signal, which we<br />
first divide into N-sample long blocks. We use the notation<br />
x_k = [x(kN) ... x((k+1)N − 1)]^T to represent the samples<br />
for the kth audio block. Let m be a normalized chirp<br />
function that represents the watermark message to be embedded<br />
into the original signal. m takes continuous values<br />
in the interval [−1, 1], and needs to be quantized for the<br />
detection of each embedded bit. m_q is the quantized version<br />
of m, formed according to the sign of the sample values of<br />
m, taking values −1 and 1. Let m_l represent a watermark<br />
0-7803-7965-9/03/$17.00 © 2003 IEEE    II-513    ICME 2003<br />
message bit to be embedded into the kth audio block. We<br />
embed one watermark bit into each block. Each bit is spread<br />
with a binary PN sequence p_k with a chip length of N to<br />
generate the wideband noise vector w_k.<br />
We need to perceptually shape w_k before adding it to each<br />
block. To make w_k imperceptible, the amplitude level of the<br />
wideband noise should be attenuated to 0.5 percent of the<br />
dynamic range of the host signal [9]. Let w_k' = [w'(kN) ...<br />
w'((k+1)N − 1)]^T be the signal-dependent wideband noise<br />
generated from w_k such that<br />
<br />
w_k'(n) = α w_k(n) |x_k(n)|,    (1)<br />
<br />
where α is the embedding strength. The high frequency<br />
band of the wideband noise sequence w_k' will not be robust<br />
to compression and lowpass filtering. Therefore, we generate<br />
the frequency-limited noise w_k'' by lowpass filtering<br />
w_k' with a cut-off frequency of 10 percent (i.e. 2.2 kHz) of<br />
the maximum audio frequency, which represents the part of the<br />
signal spectrum with significant energy content. After limiting<br />
the maximum frequency of the wideband noise to 2.2<br />
kHz, we found that α = 0.3 (independent of the audio signal)<br />
is sufficient to achieve imperceptibility. This value of<br />
α is different from what is used in [9] because we embed a<br />
frequency-limited noise rather than a wideband noise. The<br />
resulting watermarked audio signal block y_k equals<br />
<br />
y_k = x_k + w_k''.    (2)<br />
<br />
We repeat the process for each block until we embed all the<br />
bits in m_q. The resulting watermarked signal y is perceptually<br />
the same as the original signal x. Figure 1 provides an<br />
overview of the proposed watermark embedding scheme.<br />
Fig. 1. Watermark embedding scheme.<br />
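The embedding path of Fig. 1 for a single block, together with ideal-condition detection, can be sketched as follows; the toy host signal and the crude moving-average lowpass are assumptions standing in for real audio and the paper's filter:<br />

```python
import numpy as np

# Sketch of one block of the embedding scheme: spread the message bit with
# the PN key, shape by the host amplitude as in Eq. (1), band-limit the
# noise, and add it to the host as in Eq. (2).
rng = np.random.default_rng(4)
N = 10000                                   # chip length / block size
alpha = 0.3                                 # embedding strength
m_l = 1                                     # watermark bit for this block
x_k = rng.normal(0.0, 0.2, N)               # toy host audio block
p_k = rng.choice([-1.0, 1.0], N)            # binary PN sequence (the key)

w_k = m_l * p_k                             # spread wideband noise
w1 = alpha * w_k * np.abs(x_k)              # Eq. (1): amplitude shaping
w2 = np.convolve(w1, np.ones(5) / 5, mode="same")   # crude lowpass (assumed)
y_k = x_k + w2                              # Eq. (2): watermarked block

# Receiver under ideal conditions: despread with the key and take the sign.
stat = float(np.dot(y_k, p_k))
m_hat = 1 if stat >= 0 else -1
print(m_hat)
```

With the PN key, the despread statistic is dominated by the embedded bit; without it, the added noise is just low-level shaped noise.<br />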
3. WATERMARK DETECTION<br />
Under ideal signal conditions the received signal will be<br />
identical to the transmitted sequence yk. In Section 4, we<br />
will relax this condition and investigate the proposed wa-<br />
termarking scheme under the assumption that the received<br />
signal is different than yk as a result of various signal pro-<br />
cessing operations. Assuming ideal signal conditions and<br />
perfect synchronization of the signal and the PN sequence,<br />
we first lowpass filter the received signal y_k to 2.2 kHz. Let<br />
y;’ represent the output of the lowpass filter in the receiver.<br />
Since w;‘ is band-limited to 2.2 kHz, we can express yp as:<br />
y;’ = x; + w;.<br />
(3)<br />
where x_k'' is the audio signal component at the output of<br />
the lowpass filter. We then use the watermark key, i.e. the<br />
PN sequence p_k, to despread y_k'' and integrate the resulting<br />
sequence to recover m_l, the embedded message bit for<br />
that block. Let ⟨y_k'', p_k⟩ be the output of this integration<br />
operation, where ⟨·⟩ represents the inner product operation.<br />
Fig. 2. Watermark bit detection scheme.<br />
Under the assumption that x_k is a zero-mean sequence which is<br />
statistically independent of p_k, we can approximate the<br />
expected value of ⟨y_k'', p_k⟩ by the expression<br />
<br />
E[⟨y_k'', p_k⟩] ≈ αβ m_l Σ_{n=0}^{N−1} |x_k(n)|,    (4)<br />
<br />
where β is a positive constant resulting from the filtering<br />
operations. Therefore, the extracted message bit m_l', which<br />
estimates m_l, can be based on the test statistic ⟨y_k'', p_k⟩<br />
such that<br />
<br />
m_l' = 1 if ⟨y_k'', p_k⟩ ≥ 0, and m_l' = −1 otherwise.    (5)<br />
To achieve improved watermark extraction performance, we<br />
postprocess the extracted message bits using a time-frequency<br />
distribution (TFD). After all message bits are extracted, we<br />
construct the TFD (spectrogram) of the elements in m_q'.<br />
The TFD of a chirp watermark message is a line with variable<br />
slope. A line detection algorithm based on the HRT<br />
then detects the presence of the line and estimates its parameters.<br />
This second stage, consisting of the TFD and HRT, functions<br />
as an error-correcting technique and significantly increases<br />
the robustness of the proposed watermarking scheme.<br />
Fig. 3. Detection of the watermark message.<br />
The HRT is an efficient tool to detect energy-varying directional<br />
chirps [10]. It treats the TFD as an image, where each<br />
pixel value corresponds to the energy present at a particular<br />
time and frequency [10]. The Radon transform (RT) computes<br />
the projections at different angles of an image (TFD)<br />
or two-dimensional data distribution f(u, v), measured as<br />
line integrals along ray paths [11]:<br />
<br />
R(p, θ) = ∫∫ f(u, v) δ(p − u cos θ − v sin θ) du dv,    (6)<br />
where θ is the angle of the ray path of integration, p is the<br />
distance of the ray path from the center of the image and<br />
δ is the Dirac delta function. Equation (6) represents the<br />
integration of f(u, v) along the line p = u cos θ + v sin θ.<br />
The Hough transform (HT) is a pattern-recognition method<br />
Fig. 4. Line detection using HRT.<br />
that calculates the number of image points satisfying a parametric<br />
constraint [12]. The HT can be applied only to binary<br />
images. However, TFDs can be gray-level images with<br />
intensity levels corresponding to energy values. The HRT<br />
method is the combination of the HT and RT. It has the advantage<br />
over the HT that it can be applied to gray-level images<br />
to detect energy-varying chirp components [10]. The HRT<br />
method can also detect lines with discontinuities. This characteristic<br />
allows the extraction of the watermark message<br />
even if some of the watermark bits are incorrectly detected.<br />
Once the points on the two-dimensional data distribution<br />
f(u, v) (in this case the probability density function<br />
of the TFD) that satisfy directional parametric constraints<br />
are found, the presence of the chirp, i.e. the watermark, is<br />
decided. If a watermark is present, the slope of the chirp determines<br />
the watermark message.<br />
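A brute-force stand-in for this line search can be sketched as follows — it sums gray-level energy along candidate lines of a small synthetic TFD image, which illustrates the idea but is not the optimized HRT of [10]; all sizes and the line parameters are assumed:<br />

```python
import numpy as np

# Toy Hough-Radon-style search: treat a small TF "image" as gray-level
# data, sum pixel energy along candidate lines, and keep the
# (slope, offset) pair with the largest accumulated energy.
def best_line(img, n_angles=60):
    rows, cols = img.shape
    u = np.arange(cols)
    best = (-np.inf, None, None)
    for theta in np.linspace(0.1, np.pi - 0.1, n_angles):  # skip verticals
        slope = np.tan(theta - np.pi / 2)                  # dv/du of the line
        for v0 in range(rows):
            v = np.round(v0 + slope * u).astype(int)
            ok = (v >= 0) & (v < rows)
            energy = img[v[ok], u[ok]].sum()
            if energy > best[0]:
                best = (energy, slope, v0)
    return best

# Synthetic TFD: faint noisy background plus a chirp "line" v = 5 + 0.5u
# with a gap (emulating bit errors / discontinuities).
rng = np.random.default_rng(5)
img = rng.uniform(0, 0.2, (40, 60))
for u in range(60):
    if not 25 <= u <= 30:                  # discontinuity in the line
        v = int(round(5 + 0.5 * u))
        if v < 40:
            img[v, u] = 1.0
energy, slope, v0 = best_line(img)
print(slope, v0)
```

Despite the gap, the highest-energy candidate recovers the slope, which is the property that gives the scheme its error-correcting behavior.<br />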
4. ROBUSTNESS TESTS AND DISCUSSIONS<br />
We evaluated the proposed scheme using 5 different audio<br />
signals (S_1, ..., S_5) sampled at 44.1 kHz. Due to the limited<br />
resolution of the spectrogram, watermark messages are<br />
modulated as linear chirp functions with initial and final frequencies<br />
in one of 17 frequency bands of 30 Hz bandwidth<br />
in the [0-510] Hz range. This approach allowed us to<br />
use a message alphabet with 289 possible watermark messages.<br />
We embedded these messages into audio signals of<br />
40 second duration for a chip length of 10000, and into audio<br />
signals of 20 second duration for a chip length of 5000.<br />
Hence, each audio signal contains 176 embedded message<br />
bits. To measure the robustness of the system, we performed<br />
the following tests: (i) T_1: MP3 compression with bit rate<br />
128 kbps, (ii) T_2: MP3 compression with bit rate 80 kbps,<br />
(iii) T_3: lowpass filtering to 4 kHz, (iv) T_4: resampling at<br />
different sampling rates (22.05 kHz and 11.025 kHz), and<br />
(v) T_5: amplitude scaling. We use the notation T_0 to refer<br />
to watermark detection without signal manipulation. Therefore,<br />
the results corresponding to T_0 serve as a reference.<br />
After the watermark embedded signal y goes through a<br />
signal manipulation process, the message bits are extracted<br />
using the detection scheme described in Section 3. During<br />
all the robustness tests, we assumed that the audio signal and<br />
the PN sequence are synchronized. Tables 1 and 2 show the<br />
bit error rate (BER) results expressed as a percentage of the<br />
total number of message bits (176) for the two chip lengths<br />
and for each signal manipulation operation. Extracted bits<br />
Audio | Robustness Test<br />
Sample | T0/T4/T5 | T1 | T2 | T3<br />
… | 1.14 | 1.14 | 1.14 | 3.42<br />
… | 0.57 | 0.57 | 0.57 | 3.42<br />
… | 0.00 | 0.00 | 0.00 | 1.70<br />
Table 1. BER (in percentage) for N = 10000.<br />
Audio | Robustness Test<br />
Sample | T0/T4/T5 | T1 | T2 | T3<br />
… | … | 4.55 | 5.11 | 11.36<br />
S_4 | 3.98 | 3.98 | 3.98 | 10.80<br />
S_5 | 3.98 | 3.98 | 4.55 | 9.66<br />
Table 2. BER (in percentage) for N = 5000.<br />
are localized on the TF plane using a spectrogram generated<br />
by a fixed-window-length short-time Fourier transform. Although<br />
some bits are received in error (even in the case of no<br />
signal manipulation), the HRT correctly detected the slope<br />
of the chirp functions in the image of the TF plane and successfully<br />
extracted the embedded watermark messages, thus<br />
providing error-correction capability. In the simulations reported<br />
in this study we detected all the embedded watermark<br />
messages correctly. Figure 5 shows the TFDs of the<br />
message bits embedded with chip length 5000 and extracted<br />
after various signal manipulations for the audio signal S_5 of<br />
20 second duration.<br />
The definition of the watermark message as a linear chirp<br />
function limits the data payload. We can increase the data<br />
payload by using any of the following techniques: (1) embedding<br />
watermark messages in shorter signal segments;<br />
(2) selecting the initial and final frequencies for the watermark<br />
messages from a wider frequency band rather than the<br />
current [0-510] Hz band; (3) narrowing the 30 Hz decision<br />
bands using higher TF resolution. We can improve TF<br />
Fig. 5. TFDs of the embedded and extracted bits<br />
resolution by using an adaptive TF representation of the signal,<br />
based on longer windows for slowly varying functions and<br />
shorter windows for quickly varying functions. However,<br />
any of the above techniques can potentially degrade the detectability<br />
of the watermark. We are currently investigating<br />
the potential of these techniques for increasing the data payload<br />
and their impact on the performance of the proposed<br />
watermark detection scheme.<br />
To test the robustness of the HRT with respect to large<br />
discontinuities, we corrupted the signal by adding white<br />
Gaussian noise of 5 second duration starting at the 15 sec-<br />
ond mark of a 40 second long audio sample. During this<br />
interval the watermark bit detection scheme incorrectly de-<br />
tected a significant number of bits (BER = 50%). Yet, the<br />
HRT successfully detected the slope of the linear chirp at<br />
the discontinuity and extracted the message.<br />
The initial robustness tests for additive noise, multiple<br />
watermarks, and multiple attacks resulted in small BERs to<br />
be further evaluated.<br />
5. CONCLUSIONS<br />
In this paper, we proposed a new audio watermarking al-<br />
gorithm that extracts the watermark message even if some<br />
of the message bits are extracted in error. A line detection<br />
algorithm based on the HRT detects the slope of the water-<br />
mark message in the image of the TF plane of the signal.<br />
The HRT provides error correcting capability and can be<br />
efficiently implemented as it operates on small images of<br />
the TF plane. Initial studies confirm that the proposed al-<br />
gorithm achieves robustness with respect to common signal<br />
manipulations. The current implementation yields a modest<br />
data payload. However, the use of higher resolution TFDs<br />
124<br />
promises to increase the data payload while retaining all the<br />
desirable characteristics of the proposed watermarking al-<br />
gorithm. We are currently working on the synchronization<br />
problem and the expansion of the robustness tests.<br />
6. REFERENCES<br />
[1] M. Arnold, "Audio watermarking: Features, applications<br />
and algorithms," IEEE Intl. Conf. on Multimedia<br />
and Expo, vol. 2, pp. 1013-1016, 2000.<br />
[2] I.J. Cox, M.L. Miller and J.A. Bloom, Digital Watermarking,<br />
San Diego, Academic Press, 2002.<br />
[3] S. Lee and Y. Ho, "Digital audio watermarking in the<br />
cepstrum domain," IEEE Trans. on Consumer Electronics,<br />
vol. 46, no. 3, pp. 744-750, August 2000.<br />
[4] P. Bassia, I. Pitas and N. Nikolaidis, "Robust audio<br />
watermarking in the time domain," IEEE Trans. on<br />
Multimedia, vol. 3, no. 2, June 2001.<br />
[5] D. Kirovski and H. Malvar, "Spread-spectrum audio<br />
watermarking: Requirements, applications, and limitations,"<br />
IEEE Fourth Workshop on Multimedia Signal<br />
Processing, pp. 219-224, October 2001.<br />
[6] M.D. Swanson, B. Zhu and A.H. Tewfik, "Current<br />
state of the art, challenges and future directions for<br />
audio watermarking," IEEE Intl. Conf. on Multimedia<br />
Computing and Systems, vol. 1, pp. 19-24, 1999.<br />
[7] W.N. Lie and L.C. Chang, "Robust high quality time-domain<br />
audio watermarking subject to psychoacoustic<br />
masking," IEEE Intl. Symp. on Circuits and Systems,<br />
vol. 2, pp. 45-48, 2001.<br />
[8] J.W. Seok and J.W. Hong, "Audio watermarking for<br />
copyright protection of digital audio data," Electronics<br />
Letters, vol. 37, no. 1, pp. 60-61, Jan. 2001.<br />
[9] W. Bender, D. Gruhl, N. Morimoto and A. Lu, "Techniques<br />
for data hiding," IBM Systems Journal, vol. 35,<br />
nos. 3 & 4, pp. 313-336, 1996.<br />
[10] R.M. Rangayyan and S. Krishnan, "Feature identification<br />
in the time-frequency plane by using the Hough-Radon<br />
transform," Pattern Recognition, vol. 34,<br />
pp. 1147-1158, 2001.<br />
[11] G.T. Herman, Image Reconstruction from Projections:<br />
The Fundamentals of Computerized Tomography,<br />
New York, Academic Press, 1980.<br />
[12] R.O. Duda and P.E. Hart, "Use of the Hough transform to<br />
detect lines and curves in pictures," Communications<br />
of the ACM, 15(1): 11-15, January 1972.<br />
TIME-FREQUENCY FILTERING OF INTERFERENCES IN SPREAD SPECTRUM<br />
COMMUNICATIONS<br />
ABSTRACT<br />
A novel technique to excise single and multi-component<br />
chirp-like interferences in direct sequence spread spectrum<br />
communications is proposed. The received signal is decom-<br />
posed into its time-frequency (TF) functions using an adap-<br />
tive signal decomposition algorithm and the TF functions<br />
are mapped onto the TF plane. The TF plane is optimized<br />
and treated as an image, and the interference represented<br />
in the TF plane is detected using the Hough-Radon trans-<br />
form (HRT). Simulation results with synthetic models have<br />
shown successful performance for the excision of linear and<br />
non-linear chirp interferences. The method has shown im-<br />
munity to both white noise and multiple interferences even<br />
under very low SNR conditions of -10 dB.<br />
Keywords: interference excision, spread-spectrum com-<br />
munications, adaptive signal decomposition, Hough-Radon<br />
transform, time-frequency filtering<br />
Serhat Erküçük and Sridhar Krishnan<br />
Department of Electrical and Computer Engineering<br />
Ryerson University, Toronto, ON M5B 2K3, Canada<br />
e-mail: {serkucuk, krishnan}@ee.ryerson.ca<br />
1. INTRODUCTION<br />
In spread spectrum communications, the message signal is<br />
modulated and spread over a wider bandwidth with a pseudo-<br />
noise (PN) code also known at the receiver, and is transmit-<br />
ted over the channel. The bandwidth increase of the trans-<br />
mitted signal yields a processing gain, defined as the ratio of<br />
the bandwidth of the transmitted signal to the bandwidth of<br />
the message signal. Although the processing gain provides<br />
a high degree of interference suppression, there is a trade-<br />
off between increasing the processing gain and the available<br />
frequency spectrum. In case of high interference to signal<br />
ratio (ISR), the spread spectrum system with a limited pro-<br />
cessing gain may not be able to suppress the interference.<br />
Therefore, excising the interference prior to despreading the<br />
received signal is necessary to increase the performance of<br />
the system [1].<br />
In this study, we will evaluate the proposed interference<br />
excision algorithm using the direct sequence spread spec-<br />
trum (DSSS) system, one of the most widely used spread<br />
spectrum techniques [1]. In DSSS, $m_k$, the $k$th bit of the<br />
This work was supported by NSERC and Micronet.<br />
0-7803-7946-2/03/$17.00 © 2003 IEEE<br />
message signal m(t), is multiplied with a PN code p(t),<br />
where each message bit occurs every $T_m$ seconds and the<br />
PN code bits every $T_p$ seconds. The processing gain, i.e.,<br />
the length of the PN code, is therefore $L = T_m/T_p$, where<br />
$T_m \gg T_p$. During the transmission of the modulated signal,<br />
additive white Gaussian noise $n(t)$ and interference $i(t)$<br />
are added to the signal in the channel, and the following signal<br />
is received:<br />
$r(t) = m_k\, p(t) + n(t) + i(t). \qquad (1)$<br />
At the receiver, the received signal r(t) is synchronized and<br />
correlated with the same PN code p(t) and the estimate of<br />
the estimate of the message bit $\hat{m}_k$ is made:<br />
$\hat{m}_k = \langle r(t), p(t)\rangle = m_k \langle p(t), p(t)\rangle + \langle n(t), p(t)\rangle + \langle i(t), p(t)\rangle. \qquad (2)$<br />
As seen in the above equation, despreading of the received<br />
signal recovers the message signal, while spreading the noise<br />
and the interference. The decision is made on the polarity<br />
of $\hat{m}_k$. If the ratio of the interference power to the signal<br />
power is so large that the processing gain cannot suppress the<br />
interference, then the estimate of the message bit $\hat{m}_k$ may<br />
be wrong. Therefore the interference should be suppressed<br />
prior to despreading.<br />
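The spreading/despreading chain described above can be sketched as follows. This is an illustrative simulation, not the paper's setup: the chirp interference model, amplitudes and bit counts are assumptions chosen only to show that despreading suppresses a weak interference but fails at high ISR.<br />

```python
import numpy as np

rng = np.random.default_rng(0)
L = 128                                  # processing gain: chips per message bit
bits = rng.choice([-1, 1], size=200)     # message bits m_k
pn = rng.choice([-1, 1], size=L)         # PN code p(t), known at the receiver

def despread_ber(interference_amplitude):
    """Spread the bits, add white noise and a linear-chirp interference,
    despread with the PN code, and return the bit error rate."""
    tx = np.repeat(bits, L) * np.tile(pn, bits.size)      # m_k * p(t)
    t = np.arange(tx.size)
    f0, alpha = 0.01, 0.44 / tx.size                      # chirp sweeps the band
    i_t = interference_amplitude * np.cos(2 * np.pi * (f0 * t + 0.5 * alpha * t**2))
    rx = tx + rng.normal(0.0, 1.0, tx.size) + i_t         # r(t) = m_k p + n + i
    # Correlate each bit interval with the PN code; decide on the polarity.
    m_hat = np.sign(rx.reshape(-1, L) @ pn)
    return float(np.mean(m_hat != bits))

ber_weak = despread_ber(2.0)     # low ISR: the processing gain absorbs it
ber_strong = despread_ber(40.0)  # high ISR: despreading alone is not enough
```

With a high-power interference the correlator output is dominated by the spread interference term, which is the motivation for excising the interference before despreading.<br />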
Some excision techniques such as adaptive notch filtering,<br />
decision-directed adaptive filtering, and analog to digital<br />
conversion techniques are commonly used to suppress<br />
narrowband interferences in DSSS [1]. However, if the interference<br />
has a narrowband instantaneous bandwidth in a<br />
wideband frequency range such as chirps, time-frequency<br />
(TF) methods perform well to localize the interference [2].<br />
There have been several techniques proposed to suppress<br />
the interference using time-frequency distributions (TFDs)<br />
of the signal [3,4,5]. TFDs localize any interference both<br />
in time and frequency domain [2], and are ideally suited<br />
for interference excision. The commonly used TFDs suffer<br />
from a trade-off between TF resolution and cross-terms<br />
suppression.<br />
In this paper, we focus on a new excision technique<br />
based on constructing a positive TFD [6, 7, 8] of the received<br />
spread spectrum signal using an adaptive signal de-<br />
composition technique [9]. The block diagram of the pro-<br />
posed interference excision algorithm is shown in Figure<br />
1. By decomposing a signal into components, the inter-<br />
action between components can be avoided, and the TFD<br />
constructed by combining the TFDs of the individual com-<br />
ponents would be free of cross-terms. Also, by using Gaus-<br />
sian functions as bases for decomposition, a high TF reso-<br />
lution of interference signals can be achieved. By treating<br />
the TF plane as an image, the interference patterns can be<br />
detected by using the image analysis technique of Hough-<br />
Radon transform (HRT). Curves with mathematical equa-<br />
tions can be easily detected by transforming the shapes in<br />
the TF image into the Hough domain (also known as para-<br />
metric domain), and searching for dominant peaks (maxi-<br />
mum values). The co-ordinates of the dominant peaks pro-<br />
vide the parameters of the shape. Interferences are recon-<br />
structed by suitably thresholding the corresponding energy<br />
values in the TF plane and subtracted from the received<br />
spread spectrum signal.<br />
The paper is organized as follows: In Section 2, con-<br />
struction of the TF image is explained. The HRT theory for<br />
detection of chirps in the TF image is explained in Section<br />
3. In Section 4, the performance of the proposed system is<br />
evaluated in terms of ISR, bit error rate (BER) and average<br />
chip error rate. The paper is concluded in Section 5.<br />
Figure 1: Interference excision algorithm.<br />
2. CONSTRUCTION OF TIME-FREQUENCY<br />
IMAGE<br />
The adaptive signal decomposition algorithm we use to de-<br />
compose the signal into its TF functions is the matching<br />
pursuit (MP) algorithm [9]. In MP, the received signal r(t)<br />
is decomposed into a linear combination of TF functions<br />
$g_{\gamma_n}(t)$ selected from an overcomplete dictionary of TF func-<br />
tions. The signal $r(t)$ can be represented as<br />
$r(t) = \sum_{n=0}^{\infty} a_n\, g_{\gamma_n}(t), \qquad (3)$<br />
where<br />
$g_{\gamma_n}(t) = \frac{1}{\sqrt{s_n}}\, g\!\left(\frac{t - p_n}{s_n}\right) \exp\left[\,j(2\pi f_n t + \phi_n)\right], \qquad (4)$<br />
and $a_n$ are the expansion coefficients. The scale factor $s_n$<br />
controls the width of the window function and $p_n$ is the temporal<br />
placement coefficient. The parameters $f_n$ and $\phi_n$ rep-<br />
resent the frequency and the phase of the exponential func-<br />
tion, respectively. The signal r(t) is projected onto an over-<br />
complete dictionary of TF functions with all possible win-<br />
dow sizes, frequencies and temporal placements. At each<br />
iteration, the best-correlated function is selected from the<br />
dictionary, and the remainder of the signal, called the<br />
residue, is further decomposed using the same iteration pro-<br />
cedure. For our application, we use the Gabor dictionary<br />
consisting of Gaussian functions. Gaussian functions sat-<br />
isfy the minimum time-bandwidth product and represent the<br />
signal on the TF plane with an optimal time-frequency res-<br />
olution [2]. After M iterations, the signal r(t) can be repre-<br />
sented as,<br />
$r(t) = \sum_{n=0}^{M-1} \langle R^n r,\, g_{\gamma_n}\rangle\, g_{\gamma_n}(t) + R^M r, \qquad (5)$<br />
where $R^n r$ represents the residue of the signal $r(t)$ after $n$<br />
iterations. The first term in Eqn. 5 represents the first M<br />
Gaussian functions matching the signal best (we will refer<br />
to the first term as $r'(t)$) and the second term (referred to<br />
as $r''(t)$) represents the residue of the signal $r(t)$. In order<br />
for the signal to be fully decomposed, the iteration process<br />
should continue until all the energy of the residue signal<br />
is consumed. In this study, we are interested in modeling<br />
the interferences with power higher than the power of the<br />
transmitted signal. The unmodeled part of the interference<br />
is suppressed by the processing gain. Therefore, to reduce<br />
the computational load, we stop our iterations when the power<br />
of the residue $r''(t)$ becomes less than the expected power<br />
of the interference-free signal. After the signal<br />
decomposition is achieved, the TFD $W(t,\omega)$ may be con-<br />
structed by taking the Wigner-Ville distribution (WVD) [2]<br />
of the Gaussian functions represented in $r'(t)$:<br />
$W(t,\omega) = \sum_{n=0}^{M-1} |\langle R^n r,\, g_{\gamma_n}\rangle|^2\, W_{g_{\gamma_n}}(t,\omega)$<br />
$\qquad + \sum_{n=0}^{M-1} \sum_{m=0,\, m\neq n}^{M-1} \langle R^n r,\, g_{\gamma_n}\rangle\, \langle R^m r,\, g_{\gamma_m}\rangle^{*}\, W[g_{\gamma_n}, g_{\gamma_m}](t,\omega), \qquad (6)$<br />
where $W_{g_{\gamma_n}}(t,\omega)$ is the WVD of the Gaussian window<br />
function. The double sum corresponds to the cross-terms of<br />
the WVD and should be rejected in order to obtain a cross-<br />
term-free energy distribution of $r'(t)$ in the TF plane [9].<br />
Therefore, the resulting TFD is given by the first term of<br />
$W(t,\omega)$, which we denote by $W'(t,\omega)$. $W'(t,\omega)$ is a<br />
positive and cross-term-free distribution, but it does not sat-<br />
isfy the marginal properties<br />
$\int W'(t,\omega)\, d\omega \neq |r'(t)|^2 \quad \text{and} \quad \int W'(t,\omega)\, dt \neq |R'(\omega)|^2, \qquad (7)$<br />
which are required of a proper TFD for feature identification<br />
applications.<br />
The TFD $W'(t,\omega)$ may be modified to satisfy the<br />
marginal requirements and still preserve its important properties.<br />
The cross-entropy minimization method can be used<br />
to optimize the TFD [8]. The resulting TFD is a true probability<br />
density function and it can be used for feature identification<br />
[6]. Let us denote the optimized TFD by $W''(t,\omega)$.<br />
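A minimal matching-pursuit sketch of the decomposition described in this section (assuming numpy). The dictionary grid, atom count and fixed iteration budget are illustrative assumptions: the paper stops instead when the residue power drops below the expected interference-free signal power, and its complex atoms carry an explicit phase term that is folded into real cosine atoms here.<br />

```python
import numpy as np

def gabor_atom(N, s, p, f):
    """Unit-energy real Gabor atom: Gaussian window of scale s,
    centred at sample p, modulated to normalized frequency f."""
    t = np.arange(N)
    g = np.exp(-np.pi * ((t - p) / s) ** 2) * np.cos(2 * np.pi * f * t)
    return g / np.linalg.norm(g)

def matching_pursuit(x, n_iter):
    """Greedy MP: at each iteration subtract the best-correlated atom."""
    N = x.size
    dictionary = [gabor_atom(N, s, p, f)
                  for s in (8, 16, 32, 64)              # window scales s_n
                  for p in range(0, N, 8)               # temporal placements p_n
                  for f in np.linspace(0.05, 0.45, 9)]  # frequencies f_n
    residue, atoms = x.astype(float), []
    for _ in range(n_iter):
        coeffs = np.array([residue @ g for g in dictionary])
        k = int(np.argmax(np.abs(coeffs)))              # best-matching atom
        atoms.append((coeffs[k], k))
        residue = residue - coeffs[k] * dictionary[k]   # next residue R^{n+1}r
    return atoms, residue

# A linear chirp (the interference model) is captured by a few atoms.
t = np.arange(256)
x = np.cos(2 * np.pi * (0.05 * t + 0.0005 * t**2))
atoms, residue = matching_pursuit(x, n_iter=30)
```

Each iteration removes the energy of the selected atom from the residue, so the residue energy decreases monotonically, which is what makes the early-stopping rule described above workable.<br />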
3. INTERFERENCE DETECTION: HOUGH AND<br />
RADON TRANSFORM<br />
The combined Hough and Radon transform (HRT) is an efficient<br />
tool to detect energy-varying directional chirps [10].<br />
In the HRT, the TFD is treated as an image, where each pixel<br />
value corresponds to the energy present at a particular time<br />
and frequency. For convenience, we will refer to the graylevel<br />
image of the optimized TFD W"(t, tu) as f (U, u).<br />
The Radon transform (RT) computes the projections of<br />
different angles of an image or two-dimensional data dis-<br />
tribution $f(u,v)$, measured as line integrals along ray paths.<br />
The RT can be expressed as<br />
$R(\rho,\theta) = \iint f(u,v)\, \delta\big(\rho - (u\cos\theta + v\sin\theta)\big)\, du\, dv, \qquad (8)$<br />
where $\theta$ is the angle of the ray path of integration, $\rho$ is the<br />
distance of the ray path from the center of the image and<br />
$\delta$ is the Dirac delta function. The equation represents inte-<br />
gration of $f(u,v)$ along the line $\rho = u\cos\theta + v\sin\theta$.<br />
The Hough transform (HT) is a pattern-recognition method<br />
that calculates the number of image points that satisfy a<br />
parametric constraint (quadratic interferences are modeled<br />
as second-order equations as in [10]). The HT can be applied<br />
to binary images only. The advantage of the combined<br />
HRT over HT is that it can be applied to gray-level images<br />
where we can detect energy varying chirp components as<br />
well. Once the points on the two-dimensional data distribution<br />
f (U, U) that satisfy directional parametric constraints<br />
are found, we transform the parameters to the TF domain<br />
and threshold the energy values of the TF functions corresponding<br />
to the directional interference on the TF plane. As<br />
illustrated in Figure 1, the estimate of the interference $\hat{i}(t)$ is<br />
reconstructed and subtracted from the received spread spectrum<br />
signal.<br />
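The gray-level Radon accumulation of Eqn. 8 can be sketched on a synthetic TF image as follows (assuming numpy). The image size, binning and the straight-ridge test pattern are illustrative; the full HRT additionally searches quadratic parametric constraints, which this sketch omits.<br />

```python
import numpy as np

def radon(img, n_theta=90):
    """Gray-level Radon transform: for each angle theta, accumulate pixel
    energies into bins of rho = u*cos(theta) + v*sin(theta), the signed
    distance of the ray from the image centre.  Peaks in the (theta, rho)
    plane mark dominant straight lines in the image."""
    H, W = img.shape
    u, v = np.meshgrid(np.arange(W) - W / 2, np.arange(H) - H / 2)
    thetas = np.linspace(0.0, np.pi, n_theta, endpoint=False)
    n_rho = int(np.ceil(np.hypot(H, W)))
    out = np.zeros((n_theta, n_rho))
    for i, th in enumerate(thetas):
        rho = u * np.cos(th) + v * np.sin(th)
        bins = np.clip((rho + n_rho / 2).astype(int), 0, n_rho - 1)
        np.add.at(out[i], bins.ravel(), img.ravel())   # discrete line integrals
    return out, thetas

# Synthetic TF image: one linear chirp, i.e. a straight diagonal ridge.
img = np.zeros((64, 64))
for t in range(64):
    img[t, t] = 1.0            # frequency increasing linearly with time
R, thetas = radon(img)
peak_theta = np.degrees(thetas[np.argmax(R) // R.shape[1]])
```

The coordinates of the dominant peak give the line parameters, which can then be mapped back to the TF plane to select the atoms belonging to the interference.<br />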
4. EXPERIMENTAL RESULTS<br />
In our simulations, we used 128 chips per message bit for<br />
spreading the message signal and assumed the channel to be<br />
non-dispersive. We considered synthetic linear, quadratic,<br />
and multiple (linear and quadratic) chirps as the interference<br />
sources. We initially evaluated the bit error rates (BERs) re-<br />
sulting from the presence of a constant amplitude linear and<br />
quadratic chirps that sweep the entire frequency band of the<br />
spread spectrum signal, for different ISRs in the range of<br />
[0,50] dB. We assumed the SNR to be 10 dB for each case.<br />
When the ISR was below 10 dB, the system was able to de-<br />
spread the interference so that no bit errors occurred at the<br />
receiver. For ISRs in the range of [10, 50] dB, we suppressed<br />
the single and multiple interferences using the proposed ex-<br />
cision algorithm before despreading. Multiple interferences<br />
included a linear and a quadratic chirp in the same TF do-<br />
main. We recorded no bit errors after the excision of single<br />
and multiple interferences. We repeated the same process<br />
for different SNR values in the range of [-10,10] dB and<br />
also recorded no bit errors.<br />
One of the main reasons for this is an accurate TF repre-<br />
sentation of interferences in the adaptive TF plane and suc-<br />
cessful detection and filtering by the HRT. A similar observa-<br />
tion was made by Bultan et al. in [11], where they repre-<br />
sent the linear interferences with good TF localization us-<br />
ing adaptive chirplet decomposition. However, they do not<br />
report any results on the excision of quadratic and multiple<br />
interferences. Other TFD based methods reported bit errors<br />
for similar excision conditions [3,4, 51. Since interferences<br />
with different power levels were successfully removed from<br />
the signal, resulting in zero BER, we evaluated our system<br />
by calculating the percentage of chips received in error for<br />
various SNR values. Figures 2 and 3 show the simulation<br />
results for the ISR values 40 and 5 dB, respectively. The<br />
first ISR value is chosen as 40 dB because the system gives<br />
around 50% BER (the case when the system cannot reject<br />
any part of the interference) when there is no excision.<br />
Figure 2: Probability of chips in error for ISR=40 dB.<br />
The second ISR value is chosen as 5 dB, where the sys-<br />
tem can reject the interference without pre-processing prior<br />
to despreading. In some of the systems proposed, exci-<br />
sion of the interference with low power degrades the per-<br />
formance of the system [3], whereas our system substan-<br />
tially improves the chip error rate. For illustration purposes,<br />
TFDs of (i) the SS signal with a single interference (ISR = 5<br />
dB), (ii) the detected interference, and (iii) the interference<br />
suppressed SS signal are shown in Figure 4.<br />
The experimental results show that the proposed tech-<br />
nique can be successfully used for excision of single and<br />
multiple-component chirp-like interferences using adaptive<br />
TFDs and the HRT, whereas Bultan et al. [11] focus only on<br />
suppression of linear chirps and Amin uses different kernels<br />
for different interferences [3].<br />
Figure 3: Probability of chips in error for ISR=5 dB.<br />
Figure 4: TFDs of (i) the SS signal with a linear interference<br />
(chirp), (ii) the estimate of the interference, (iii) the interference-<br />
filtered SS signal.<br />
5. CONCLUSIONS<br />
A new technique is introduced for the excision of frequency-<br />
modulated interferences in spread spectrum communications.<br />
The localization of the interference is provided by an adap-<br />
tive signal decomposition algorithm using Gaussian func-<br />
tions as bases and rejecting the cross WVDs of the Gaussian<br />
functions. Therefore, single and multiple time-varying FM<br />
interferences are represented with good resolution on the<br />
TF plane. The interference is then detected using a line de-<br />
tection algorithm, the HRT. The estimated interference is sub-<br />
tracted from the signal prior to despreading. The simu-<br />
lation results for the proposed algorithm showed no bit errors<br />
after suppressing the interference for different ISR values,<br />
even under very low SNR conditions of -10 dB. The perfor-<br />
mance of the system is evaluated by calculating the received<br />
chips in error before and after interference suppression. The<br />
proposed technique improves the performance of the system<br />
by reducing the number of chips received in error after exci-<br />
sion of the interference in both cases, when the ISR is low or<br />
high. This technique can be used for any kind of chirp-like<br />
interference suppression with high accuracy.<br />
6. REFERENCES<br />
[1] J.D. Laster and J.H. Reed, “Interference rejection in<br />
digital wireless communications,” IEEE Signal Pro-<br />
cessing Mag., pp. 37-62, May 1997<br />
[2] L. Cohen, “Time-frequency distributions - A review,”<br />
Proc. IEEE, vol. 77, pp. 941-981, 1989<br />
[3] M.G. Amin, “Interference mitigation in spread spec-<br />
trum communication systems using time-frequency<br />
distributions,” IEEE Trans. <strong>Signal</strong> Processing, vol. 45,<br />
no. 1, pp. 90-101, Jan 1997<br />
[4] S. Barbarossa and A. Scaglione, “Adaptive time-<br />
varying cancellation of wideband interferences in<br />
spread-spectrum communications based on time-<br />
frequency distributions,” IEEE Trans. Signal Process-<br />
ing, vol. 47, no. 4, pp. 957-965, Apr. 1999<br />
[5] X. Ouyang and M.G. Amin, “Short-time Fourier trans-<br />
form receiver for nonstationary interference excision<br />
in direct sequence spread spectrum communications,”<br />
IEEE Trans. Signal Processing, vol. 49, no. 4, pp. 851-<br />
863, Apr. 2001<br />
[6] S. Krishnan, “Adaptive <strong>Signal</strong> Processing Techniques<br />
for <strong>Analysis</strong> of Knee Joint Vibroarthrographic Sig-<br />
nals,” Ph.D. Thesis, University of Calgary, June 1999<br />
[7] L. Cohen and T. Posch, “Positive time-frequency dis-<br />
tribution functions,” IEEE Trans. Acoust. Speech Sig-<br />
nal Processing, vol. ASSP-33, no. 1, pp. 31-38, 1985<br />
[8] P.J. Loughlin, J.W. Pitton and L.E. Atlas, “Construc-<br />
tion of positive time-frequency distributions,” IEEE<br />
Trans. <strong>Signal</strong> Proc., vol. 42, no. 10, pp. 2697-2705,<br />
Oct 1994<br />
[9] S.G. Mallat and Z. Zhang, “Matching pursuit with<br />
time-frequency dictionaries,” IEEE Trans. on <strong>Signal</strong><br />
Proc., 41(12): 3397-3415, 1993<br />
[10] R.M. Rangayyan and S. Krishnan, “Feature identifica-<br />
tion in the time-frequency plane by using the Hough-<br />
Radon transform,” Pattern Recognition, vol. 34, pp.<br />
1147-1158, 2001<br />
[11] A. Bultan and A.N. Akansu, “A novel time-frequency<br />
exciser in spread spectrum communications for chirp-<br />
like interference,” Proc. ICASSP-1998, pp. 3265-<br />
3268, 1998<br />
A GENERAL PERCEPTUAL TOOL FOR EVALUATION OF AUDIO CODECS<br />
Karthikeyan Umapathy<br />
Dept. of Electrical and Computer Eng.,<br />
The University of Western Ontario,<br />
London, Ontario, CANADA.<br />
Email: kumapath@uwo.ca<br />
Abstract<br />
Subjective evaluation forms an important part of any re-<br />
search work where the feedback and perception of the general<br />
public or a trained set of specialists are mandatory. Many<br />
audio and video coding techniques have emerged to tackle<br />
the bandwidth problems imposed by the Internet with data<br />
compression schemes of either lossless or perceptually<br />
lossless quality. In order to evaluate the performance of<br />
these techniques, a Mean Opinion Score (MOS) test has to<br />
be performed with a wide variety of subjects. In this paper we<br />
present a MOS tool developed to evaluate audio codecs<br />
in both controlled and uncontrolled listening environments.<br />
The technique is based on the International Telecommuni-<br />
cation Union - Radiocommunication sector (ITU-R) stan-<br />
dard. This novel approach of performing distributed listen-<br />
ing tests in uncontrolled environments will help researchers<br />
to collect substantial feedback and perform statistical anal-<br />
ysis of an audio codec's performance in an efficient manner,<br />
particularly for Internet-driven applications. Results of per-<br />
ceptual evaluation of 8 sample audio files of different va-<br />
rieties with an adaptive time-frequency transform (ATFT)<br />
audio codec indicate the ease, independence, and the ef-<br />
fectiveness of performing MOS studies with the proposed<br />
technique.<br />
Keywords: mean opinion score (MOS), subjective evalua-<br />
tion, multimedia, audio coding, listening experiments.<br />
1. INTRODUCTION<br />
Subjective evaluation of audio quality is needed to assess<br />
the performance of audio codecs. Even though there are<br />
objective measures such as signal to noise ratio (SNR), total<br />
harmonic distortion (THD), and noise-to-mask ratio [1],<br />
they would not give a true evaluation of the audio codec,<br />
particularly if they use lossy schemes as in the case of many<br />
existing well-known audio codecs. This is due to the fact<br />
that, for example, in a perceptual coder SNR is lost even though<br />
the audio quality is claimed to be perceptually distortion-<br />
less. In this case the SNR measure may not give the correct<br />
performance evaluation of the coder.<br />
Thanks to Micronet and NSERC organizations for funding this project.<br />
CCECE 2003 - CCGEI 2003, Montreal, May/mai 2003<br />
0-7803-7781-8/03/$17.00 © 2003 IEEE<br />
Sridhar Krishnan and Garabet Sinanian<br />
Dept. of Electrical and Computer Eng.,<br />
Ryerson University,<br />
Toronto, Ontario, CANADA.<br />
Email: {krishnan, gsinania}@ee.ryerson.ca<br />
The proposed technique uses the subjective evaluation<br />
method recommended by the International Telecommuni-<br />
cation Union - Radiocommunication sector (ITU-R) stan-<br />
dards. It is called a “double blind triple stimulus with hid-<br />
den reference” [1]. In this method, listeners are provided<br />
with three choices A, B and C for each sample under test. A<br />
is the reference/original signal; B and C are assigned to be<br />
either the reference/original signal or the compressed signal<br />
under test. The selection of reference or compressed signal<br />
for B and C is completely randomized. Figure 1 explains<br />
the choices A, B and C. For each sample audio signal, sub-<br />
jects listen to the reference signal A, and compare B and C<br />
with A. After each comparison of B with A, and C with A,<br />
they grade the quality of the B and C signals with respect<br />
to A in 5 levels as shown in Table 1. Both the listener and<br />
the test administrator are made unaware of the assignments<br />
that B and C can take; hence the method is double-blind, and<br />
since three stimuli are provided it is called the double-blind<br />
triple stimulus method.<br />
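The randomized assignment just described can be sketched as follows. This is illustrative Python, not the tool itself: the file names and function names are hypothetical, and the actual tool collected grades through a GUI.<br />

```python
import random

def make_trial(reference, coded, rng=random):
    """One double-blind triple-stimulus trial: A is always the hidden
    reference; B and C receive the reference and the coded signal in a
    random order unknown to both listener and test administrator."""
    pair = [("reference", reference), ("coded", coded)]
    rng.shuffle(pair)
    return {"A": reference, "B": pair[0], "C": pair[1]}

def coded_grade(trial, grade_b, grade_c):
    """Keep the 1-5 grade given to whichever of B/C was the coded signal."""
    return grade_b if trial["B"][0] == "coded" else grade_c

trial = make_trial("original.wav", "compressed.wav", random.Random(7))
```

Because the listener's grade is matched to the hidden assignment only after the fact, neither party can bias the rating toward the compressed signal.<br />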
Fig. 1. Block diagram explaining MOS choices A, B, and<br />
C.<br />
Audio Quality | Level of Distortion<br />
Excellent | Imperceptible<br />
Good | Just perceptible but not annoying<br />
Fair | Perceptible and slightly annoying<br />
Poor | Annoying but not objectionable<br />
Unsatisfactory | Very annoying and objectionable<br />
Table 1. Description of the ratings used in the Mean Opin-<br />
ion Score.<br />
In this paper, we propose a subjective evaluation scheme<br />
in line with the above-explained “double blind triple stim-<br />
ulus with hidden reference" for an adaptive time-frequency<br />
transform (ATFT) based audio codec. The paper is orga-<br />
nized as follows. In Section 2, under methodology a brief<br />
introduction to the ATFT coder and measurement procedure<br />
are discussed. Results and discussions are covered in Sec-<br />
tion 3.<br />
2. METHODOLOGY<br />
2.1. Adaptive Time-frequency Transform<br />
(ATFT) codec<br />
The ATFT audio codec is based on the matching pursuit<br />
(MP) [2] algorithm, where any signal $x(t)$ is decomposed<br />
into a linear combination of TF functions $g_{\gamma_n}(t)$ selected<br />
from a redundant dictionary of TF functions:<br />
$x(t) = \sum_{n=0}^{\infty} a_n\, g_{\gamma_n}(t), \qquad (1)$<br />
where<br />
$g_{\gamma_n}(t) = \frac{1}{\sqrt{s_n}}\, g\!\left(\frac{t - p_n}{s_n}\right) \exp\left[\,j(2\pi f_n t + \phi_n)\right], \qquad (2)$<br />
and $a_n$ are the expansion coefficients. The scale factor $s_n$, also<br />
called the octave parameter, is used to control the width<br />
of the window function, and the parameter $p_n$ controls the<br />
temporal placement. The parameters $f_n$ and $\phi_n$ are the fre-<br />
quency and phase of the exponential function, respectively.<br />
The signal $x(t)$ is projected over a redundant dictionary<br />
of TF functions with all possible combinations of scaling,<br />
translations and modulations. The dictionary of TF func-<br />
tions can either suitably be modified or selected based on<br />
the application in hand. In our technique, we are using the<br />
Gabor dictionary (Gaussian functions), which has the best<br />
TF localization properties [3]. At each iteration, the TF func-<br />
tion best correlated with the local signal structures is se-<br />
lected from the dictionary. The remaining signal, called the<br />
residue, is further decomposed in the same way at each itera-<br />
tion, subdividing the signal into TF functions. After $M$ iterations,<br />
the signal $x(t)$ can be expressed as<br />
$x(t) = \sum_{n=0}^{M-1} \langle R^n x,\, g_{\gamma_n}\rangle\, g_{\gamma_n}(t) + R^M x(t), \qquad (3)$<br />
where the first part of $x(t)$ is the TF functions decomposed<br />
up to $M$ iterations, and the second part is the residue, which<br />
will be decomposed in the subsequent iterations. This pro-<br />
cess is repeated till the total energy of the signal is decom-<br />
posed. The decomposition parameters ($s_n$, $p_n$, $f_n$, $\phi_n$ and<br />
$a_n$) are further processed by applying energy thresholding<br />
and perceptual filtering followed by quantization to obtain<br />
a compact representation of the audio signal. More details<br />
on the ATFT audio coding technique can be found in some of<br />
our earlier works [4, 5]. The overall block diagram of the<br />
ATFT codec is shown in Figure 4. Two versions (standard<br />
and fast) of the MP algorithm based ATFT codec were evalu-<br />
ated using the proposed subjective evaluation technique. A<br />
separate subjective evaluation of the perceptual model and<br />
quantization stage of the ATFT codec is also included.<br />
2.2. Measurement procedure<br />
Evaluation of any audio coding technique involves perform-<br />
ing subjective evaluation of the compressed audio quality.<br />
The standard procedure to obtain quantitative and qualita-<br />
tive data about a coder's performance is by performing<br />
mean opinion score (MOS) studies.<br />
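Aggregating the recorded grades into mean opinion scores is then a simple average per sample. The ratings below are made-up placeholders, not the study's data; they only illustrate the computation behind the MOS tables.<br />

```python
from statistics import mean

# ratings[sample] -> one 1-5 grade per listener (illustrative values only).
ratings = {
    "Harp": [5, 4, 5, 4, 5],
    "Acdc": [3, 4, 3, 3, 4],
}

# Mean opinion score per sample, rounded to one decimal place.
mos = {sample: round(mean(grades), 1) for sample, grades in ratings.items()}
```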
2.2.1. MOS in controlled environment. The experimental<br />
setup consists of a Pentium III PC with Windows 2000.<br />
Eight sample stereo signals were played through a Creative<br />
Sound Blaster card to a professional high-quality headset<br />
(Sennheiser) with a fixed volume output. A graphical user<br />
interface (GUI) was designed as shown in Figure 2. Three<br />
stimuli A, B, and C are provided as explained in Section<br />
I. Listeners are allowed to do the tests by themselves under<br />
the supervision of the research team. Ratings are automat-<br />
ically recorded in a report file as the listener proceeds with<br />
all the 8 stereo samples. Each time the listener is allowed to<br />
advance to the next sample only after he/she grades the cur-<br />
rent sample. Twenty listeners (randomly selected) with con-<br />
sent agreement participated in the MOS studies. Once the<br />
testing was finished for all the subjects, the average MOS<br />
scores were computed for each sample. Table 2 shows the<br />
average MOS values obtained for the 8 signals. Figure 3<br />
shows the distribution of the MOS scores for each of the<br />
eight sample signals. It can be observed from the Table 2<br />
that classical-like audio signals such as Harp, Harpsichord,<br />
Piano, and Tubularbell received high MOS scores compared<br />
to the rock-like (Acdc, Deflep) and signals with voice seg-<br />
ments (Enya, Visit).<br />
Table 2. Average MOS for 20 listeners.<br />
Fig. 2. Snapshot of the GUI used for MOS studies<br />
Fig. 3. Distribution of the MOS scores for 20 listeners<br />
In order to evaluate the performance of the developed<br />
perceptual model and the quantization stage of the ATFT<br />
codec, a second MOS study was conducted with 5 listeners.<br />
The procedure was repeated but the choices A, B and C were<br />
assigned as shown in Figure 4. The output of the TF decom-<br />
position (TF modeling stage) forms the input to the percep-<br />
tual filtering module; hence the reference A was assigned to<br />
the reconstructed signal at the output of the TF modeling<br />
stage. Similarly choice B was assigned to the reconstructed<br />
signal at the output of the perceptual filtering module and C<br />
to the reconstructed signal at the output of the quantization<br />
stage. Listeners were asked to rate the choices B (percep-<br />
tual filtering output) and C (quantization stage output) with<br />
the reference A (TF modeling output) on a scale of 1 to 5 as<br />
explained in Section 1.<br />
The results were averaged for the five listeners and given<br />
in Table 3. From Table 3, it can be observed that, on average,<br />
MOS scores of 4.6 and 4.3 were achieved for the perceptual<br />
filtering stage and the quantization stage respectively. The<br />
MOS scores indicate that the perceptual filtering technique<br />
proposed in the ATFT codec performs exceedingly well<br />
with the eight sample signals, and the noise introduced in the<br />
process of quantization affects the output quality minimally.<br />
Sample PFO QO<br />
Deflep<br />
Enya 4.2<br />
Harp 4.2<br />
Harpsichord<br />
Piano<br />
Visit<br />
Average<br />
Table 3. Average MOS for PFO and QO. PFO - Perceptu-<br />
ally filtered output, QO - Quantization output.<br />
The whole ATFT audio coding process was also evaluated<br />
with a faster version of the MP algorithm [6]. The faster<br />
version of the MP technique is based on selecting a set of<br />
best-correlated TF functions at each iteration, as opposed to<br />
one function selected at each iteration as done in the stan-<br />
dard MP. MOS were obtained by testing with 9 subjects and<br />
the results are given in Table 4.<br />
Sample         Average MOS<br />
Acdc           3.8<br />
Deflep         3.2<br />
Enya           3.7<br />
Harpsichord<br />
Tubularbell    3.8<br />
Visit          2.9<br />
Table 4. Average MOS of 9 listeners for the ATFT codec<br />
with the faster algorithm.<br />
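The greedy atom selection that distinguishes the standard and fast MP variants can be sketched as follows. This is an illustrative numpy sketch only: the toy dictionary, the iteration count and the `n_select` parameter are assumptions, not the codec's actual configuration; `n_select > 1` mimics the idea in [6] of keeping several best-correlated functions per iteration.

```python
import numpy as np

def matching_pursuit(x, D, n_iter=10, n_select=1):
    """Greedy matching pursuit over a dictionary D (unit-norm atoms as columns).

    n_select=1 gives standard MP (one atom per iteration); n_select > 1 sketches
    the faster variant that keeps the few best-correlated atoms per iteration.
    """
    residual = x.astype(float).copy()
    coeffs = np.zeros(D.shape[1])
    for _ in range(n_iter):
        corr = D.T @ residual                        # correlation with every atom
        best = np.argsort(np.abs(corr))[-n_select:]  # best-correlated atom(s)
        for m in best:
            c = D[:, m] @ residual                   # projection coefficient
            coeffs[m] += c
            residual -= c * D[:, m]                  # subtract the contribution
    return coeffs, residual

# Toy usage: with an orthonormal dictionary (identity), MP recovers the
# signal exactly and the residual goes to zero.
x = np.array([3.0, 0.0, -2.0, 0.0])
D = np.eye(4)
coeffs, residual = matching_pursuit(x, D, n_iter=4)
```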
2.2.2. MOS in uncontrolled environment<br />
As most audio compression formats are aimed at use over the Internet,<br />
it is essential to evaluate the audio codec performance in an<br />
uncontrolled environment using the Internet. The distributed<br />
MOS will give the true performance rating of the audio<br />
quality in terms of the acceptance level in an average Internet<br />
listener environment. Many variables are involved in this<br />
MOS approach, such as the quality of the audio hardware,<br />
the audio playback software and the playback volume. However,<br />
the average MOS results will justify the suitability of a<br />
media format over the Internet, as it is tested in a more flexible<br />
environment with a variety of Internet listeners.<br />
A web-based MOS as shown in Figure 5 was developed<br />
using standard web design tools. Similar to the<br />
standard MOS procedure, a consent form is displayed<br />
when the listener visits the main web page. After accepting<br />
the conditions, the web page is redirected to the actual<br />
MOS testing page. Listeners are provided with three stimuli<br />
A, B and C as explained in Section 1.<br />
Authorized licensed use limited to: <strong>Ryerson</strong> <strong>University</strong> Library. Downloaded on July 7, 2009 at 10:40 from IEEE Xplore. Restrictions apply.<br />
Fig. 5. Snapshot of the GUI used for web-based MOS studies.<br />
The interactive web page contains a form to receive the name of the listener<br />
and his or her rating of each sample. A validation is performed<br />
such that only after entering the name and rating the audio<br />
sample can the listener navigate to the next sample.<br />
When the listeners select the stimuli, the audio samples<br />
are played using the media players associated with their<br />
respective web browsers. At this point in time, all our testing<br />
on the web uses the *.au format; this means the<br />
ATFT audio codec output is converted to the *.au format<br />
for the MOS purposes. The standard *.wav and *.au formats<br />
at a 44.1 kHz sampling rate with 16-bit resolution are<br />
considered nearly lossless and a gold standard for CD-quality<br />
music. Hence, converting the ATFT codec output to<br />
one of these formats should not affect the quality or the MOS<br />
obtained. Once all the samples are rated, the results are<br />
appended to a data file on the main server along with the username.<br />
Scripts written in Perl handle the processing of data and<br />
the redirection of web pages.<br />
3. RESULTS AND CONCLUSIONS<br />
The MOS study on the ATFT codec was performed on eight<br />
stereo sample signals in the following modes: 1. Controlled<br />
(a. with the standard ATFT algorithm, b. with the fast ATFT<br />
algorithm, and c. evaluation of the perceptual and quantization<br />
stages with the standard ATFT algorithm); 2. Uncontrolled<br />
(web-based MOS).<br />
The significance of the proposed MOS study in the evaluation<br />
of audio coders can be seen from Tables 2, 3 and 4, and Figure 3.<br />
A broad and clear understanding of the output<br />
audio quality of the codec can be obtained with respect<br />
to (1) the types of signals on which the codec performs best or worst,<br />
(2) the speed of the algorithm versus output quality, and (3)<br />
block-based evaluation of the individual parts of the coder.<br />
Detailed subjective testing using the web-based MOS will be<br />
performed to obtain statistically significant results in evaluating<br />
coder performance.<br />
The advantages offered by web-based MOS studies, such as<br />
the ease of recruiting subjects with diverse music backgrounds,<br />
effectiveness in data/feedback collection, machine and environmental<br />
flexibility, and the ubiquitous availability of personal computers,<br />
will make it an attractive tool for evaluating<br />
the performance of next-generation media compression<br />
techniques over the Internet.<br />
References<br />
[1] Thomas Ryden, "Using listening tests to assess audio codecs,"<br />
in Collected Papers on Digital Audio Bit-Rate Reduction,<br />
AES, 1996, pp. 115-125.<br />
[2] Stephane Mallat, A Wavelet Tour of Signal Processing, Academic<br />
Press, San Diego, CA, 1998.<br />
[3] L. Cohen, "Time-frequency distributions - a review," Proceedings<br />
of the IEEE, vol. 77, no. 7, pp. 941-981, 1989.<br />
[4] Karthikeyan Umapathy and Sridhar Krishnan, "Joint time-frequency<br />
coding of audio signals," in 5th WSES International<br />
Multiconference on CSCC (Circuits, Systems, Communications<br />
and Computers), Crete, Greece, July 2001, pp. 32-36.<br />
[5] Karthikeyan Umapathy and Sridhar Krishnan, "Low bit-rate<br />
coding of wideband audio signals," in Proceedings of the IASTED<br />
International Conference - SPPRA (Signal Processing, Pattern<br />
Recognition and Applications), Rhodes, Greece, July 2001,<br />
pp. 101-105.<br />
[6] R. Gribonval, "Fast matching pursuit with a multiscale dictionary<br />
of Gaussian chirps," IEEE Transactions on Signal Processing,<br />
vol. 49, no. 5, May 2001.<br />
Non-Stationary Noise Cancellation in Infrared Wireless Receivers<br />
Sridhar Krishnan, Xavier Fernando and Hongbo Sun<br />
Department of Electrical and Computer Engineering, <strong>Ryerson</strong> <strong>University</strong>,<br />
Toronto, Ontario, Canada<br />
(krishnan)(fernando)(hsun)@ee.ryerson.ca<br />
Abstract<br />
Infrared is attracting much attention for indoor<br />
wireless access due to its enormous bandwidth,<br />
inherent privacy and low cost. Intensity-modulated,<br />
directly detected infrared schemes do not experience<br />
multipath fading. However, ambient noise due to<br />
artificial lighting has been the major concern in<br />
indoor infrared wireless systems. Conventionally,<br />
static or low-frequency noise due to conventional light<br />
sources is removed using optical high-pass filters.<br />
Nonetheless, interference from fluorescent lights<br />
equipped with electronic ballasts has periodic<br />
interference components up to 1 MHz and cannot be<br />
filtered easily. In this paper, soft DSP filters are<br />
proposed to cancel the harmonics, ambient noise, and<br />
uncorrelated signal structures. Non-stationary noise is<br />
cancelled with an adaptive denoising filter, and a<br />
comb filter cancels interference from the electronic<br />
ballasts. Adaptive soft filters have the advantage that<br />
they can be easily updated and can track the variations in<br />
noise characteristics. Simulation results show a<br />
promising improvement in noise cancellation even<br />
under very low and varied SNR and noise source<br />
conditions.<br />
Keywords: infrared wireless, denoising, non-stationary<br />
signals, comb filters, adaptive filters, wavelet-packets.<br />
1. INTRODUCTION<br />
Wireless communications has entered into a new<br />
phase. With each added application the demand for<br />
real-time, wideband wireless services increases. The<br />
overcrowded radio spectrum is simply unable to cope<br />
with all the demand. Infrared signal, on the other hand,<br />
is a promising new medium for wireless applications,<br />
especially at indoors. Considering the fact that the<br />
need for wideband multimedia type access is much<br />
high at indoors than at outdoors, infrared is an<br />
excellent choice. It has abundant untapped bandwidth<br />
that is freely available. Optical wireless techniques<br />
enjoy increased focus worldwide. The Wi-Fi (IEEE<br />
802.1 1) standard specifies infrared as a physical layer<br />
option.<br />
Optical energy is inherently confined within a room<br />
cavity, resulting in inherent privacy. Therefore, the<br />
same infrared wavelength can be used in adjacent<br />
rooms, allowing device and frequency reusability.<br />
CCECE 2003 - CCGEI 2003, Montréal, May/mai 2003<br />
0-7803-7781-8/03/$17.00 © 2003 IEEE<br />
Furthermore, with Intensity Modulated Directly<br />
Detected (IM/DD) optical schemes, there is no<br />
multipath fading. Fading may degrade the signal<br />
strength by up to 30 dB in comparable radio systems.<br />
However, ambient noise due to artificial lighting has<br />
been the major concern in indoor infrared wireless<br />
systems [1]. This background light induces a white<br />
Gaussian shot noise that is 20 to 40 dB stronger than the<br />
signal-induced shot noise. Furthermore, modern<br />
fluorescent lights with electronic ballasts generate<br />
switching noise up to 1 MHz, which introduces a much<br />
more serious impairment. At times, the receiver thermal<br />
noise becomes dominant. The time-varying wireless<br />
channel determines the weights of the other noise sources. As<br />
a result, infrared wireless receivers experience a high<br />
level of non-stationary noise.<br />
The objective of this paper is to develop signal<br />
processing algorithms to combat the high-power non-stationary<br />
noise. Adaptive filters based on the least<br />
mean squares (LMS) algorithm and wavelet-packet based filters<br />
effectively cancel the noise in a non-stationary<br />
environment. A higher-order comb filter cancels the<br />
periodic noise from the electronic ballast.<br />
2. NOISE AT INFRARED RECEIVERS<br />
Infrared receivers face the challenge of a variety<br />
of noise sources, the details of which are shown in<br />
Fig. 1. There will be thermal noise from the electronic<br />
devices. This can be modeled as white Gaussian noise,<br />
and is relatively easy to tackle.<br />
Indoor infrared transmission systems are affected by<br />
interference induced by natural and artificial ambient<br />
light. The noise is directly proportional to the amount<br />
of light incident on the photo-detector, and is therefore a<br />
function of the average optical power. The shot noise is<br />
due to the mean received infrared power and the ambient<br />
light. However, the ambient-light-induced shot<br />
noise typically has a power 20 to 40 dB greater<br />
than the signal shot noise [2]. Therefore, the signal-induced<br />
shot noise can be neglected. The ambient-induced<br />
shot noise can be considered Gaussian and<br />
nearly white.<br />
Natural ambient light noise is caused by sunlight. It<br />
can be considered steady, with slow intensity variations<br />
in time. Artificial ambient light comes from several<br />
light sources: incandescent lamps, fluorescent lamps<br />
driven by conventional ballasts, and fluorescent lamps<br />
geared by electronic ballasts. The use of fixed optical<br />
filters reduces out-of-band ambient light noise. The<br />
steady background irradiance produced by natural and<br />
artificial light sources is usually characterized by the<br />
direct current it induces at the receiver photodiode, which is<br />
directly proportional to that irradiance. This current is<br />
referred to as the background noise current.<br />
The interfering signal produced by incandescent lamps<br />
is an almost perfect sinusoid with a frequency of 100<br />
Hz. In addition to the 100 Hz sinusoid, only the first<br />
harmonics (up to 2 kHz) carry a significant amount of<br />
energy, and for frequencies higher than 800 Hz, all<br />
components are 60 dB below the fundamental. So<br />
an electrical high-pass filter can eliminate this<br />
interference without much signal degradation.<br />
Fluorescent lamps equipped with conventional<br />
ballasts, driven at a power-line frequency of 50 or 60 Hz,<br />
induce interference at harmonics up to 50 kHz. This<br />
can also be eliminated by a careful choice of modulation<br />
scheme, to ensure there are no low-frequency<br />
components, and through electrical high-pass filtering.<br />
For fluorescent lamps equipped with electronic<br />
ballasts, the ballast modulation frequency itself is 37.5<br />
kHz. Therefore, interference harmonics extending up<br />
to 1 MHz are introduced. These cannot be easily<br />
filtered. In the case of interference overlapping the signal<br />
spectrum, sophisticated digital signal processing<br />
algorithms need to be developed, and this is the focus<br />
of this paper.<br />
3. METHODOLOGY<br />
The spectrum produced by an electronic-ballast-driven<br />
lamp consists of low and high frequency regions. The<br />
low frequency region resembles the spectrum of a<br />
conventional fluorescent lamp, while the high<br />
frequency region is attributable to the electronic<br />
ballast. A deterministic expression that models the<br />
interfering signal at the output of the photodiode is<br />
given in [2], where R is the photodiode responsivity<br />
(A/W), Pm is the mean optical power of the interfering<br />
signal, K1 = 5.9, K2 = 2.1, {b}, {a} and {d} are<br />
constants, and the frequency corresponding to the lamp<br />
type is fh = 37.5 kHz.<br />
(Noise sources summarized in the figure: noise from conventional ballasts,<br />
noise from electronic ballasts, thermal white noise, and shot noise due to the sun.)<br />
Fig. 1 Block diagram of the noise removal technique at<br />
infrared wireless receivers.<br />
The even harmonics of 37.5 kHz correspond to<br />
75 kHz, 150 kHz, 225 kHz, 300 kHz, 375 kHz, 450<br />
kHz, 525 kHz, 600 kHz, 675 kHz, 750 kHz, and 825<br />
kHz.<br />
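A synthetic interference of this kind is easy to generate for simulation. The sketch below is illustrative only, not the exact model of [2]: it places tones at the even harmonics of the 37.5 kHz switching frequency, with an assumed 1/k amplitude decay, and the simulation sampling rate fs is an assumption.

```python
import numpy as np

fs = 5e6                                # assumed simulation sampling rate, Hz
t = np.arange(10000) / fs               # 2 ms of signal
f_ballast = 37.5e3                      # electronic ballast switching frequency
harmonics = np.arange(2, 24, 2) * f_ballast   # 75 kHz, 150 kHz, ..., 825 kHz

# Sum of cosines at the even harmonics, with assumed 1/k decaying amplitudes.
interference = np.zeros_like(t)
for k, f in enumerate(harmonics, start=1):
    interference += np.cos(2 * np.pi * f * t) / k
```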
If the low frequency and high frequency ambient<br />
signal model matches the practical case, we can use<br />
this model together with an adaptive noise cancellation method<br />
to eliminate this noise, because the noise is<br />
uncorrelated with our desired signal. The adaptive noise<br />
cancellation method can also eliminate white Gaussian<br />
noise (the thermal noise model). Adaptive noise<br />
cancellation improves the SNR only if the reference signal is<br />
correlated with the noise contained in the primary signal<br />
but uncorrelated with the desired signal. The limitation of<br />
adaptive filtering is that if the reference signal is<br />
completely uncorrelated with both the signal and noise<br />
components of the primary signal, the adaptive noise<br />
canceller has no effect on the primary signal, and the<br />
output signal-to-noise ratio remains unchanged. We<br />
used the well-known LMS algorithm [3] to remove<br />
the thermal noise from the signal. The ease of<br />
implementation of the LMS algorithm is achieved at<br />
the expense of convergence rate. To accelerate the<br />
convergence rate of the LMS algorithm, the step size<br />
parameter was selected as a function of the eigenvalues<br />
of the autocorrelation matrix of the input signal<br />
(which depends on the instantaneous energy of the<br />
signal).<br />
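The adaptive noise cancellation described above can be sketched as follows. This is a minimal numpy sketch, not the paper's implementation: the step size normalized by the instantaneous tap-input energy is one NLMS-style reading of the energy-dependent rule the text describes, and the 3-tap channel, filter order and `mu0` are illustrative assumptions.

```python
import numpy as np

def lms_cancel(primary, reference, order=12, mu0=0.1, eps=1e-8):
    """Adaptive noise canceller: predict the noise in `primary` from the
    correlated `reference`; the step size is normalized by the instantaneous
    tap-input energy (an NLMS-style rule, assumed here)."""
    w = np.zeros(order)
    cleaned = np.zeros(len(primary))
    for n in range(order - 1, len(primary)):
        u = reference[n - order + 1 : n + 1][::-1]  # current and past taps
        e = primary[n] - w @ u                      # error = cleaned sample
        w += (mu0 / (eps + u @ u)) * e * u          # energy-normalized update
        cleaned[n] = e
    return cleaned

# Usage: a tone corrupted by noise that reaches the primary input through a
# short hypothetical 3-tap channel; the raw noise serves as the reference.
rng = np.random.default_rng(0)
t = np.arange(4000) / 1000.0
signal = np.sin(2 * np.pi * 5 * t)
noise_ref = rng.standard_normal(len(t))
channel = np.array([0.8, -0.4, 0.2])
primary = signal + np.convolve(noise_ref, channel)[: len(t)]
cleaned = lms_cancel(primary, noise_ref)
```

After convergence, `cleaned` tracks the sinusoid while the correlated noise is largely removed.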
In cases where the reference channel is not available, or<br />
where the reference signal is uncorrelated with the noise in<br />
the primary channel, signal decomposition<br />
techniques could be a better alternative. In cases where the<br />
signal and noise spectra overlap, fixed or<br />
adaptive filtering of the noise may not be the best<br />
approach. In such a situation, noise filtering using<br />
mathematical decomposition techniques might be the<br />
best alternative; such methods are commonly<br />
known as de-noising techniques. The de-noising<br />
approach has been successfully applied to data such<br />
as knee sounds [4] and ultrasound signals [5]. The<br />
problem of enhancing signals degraded by<br />
uncorrelated additive noise, when only the noisy signal<br />
is available, has received much attention in<br />
recent years [6-10]. The main problem arises when the<br />
de-noising filters cannot distinguish between noise and<br />
noise-like but important signal components, and remove<br />
both, thereby decreasing the intelligibility of the signal.<br />
Among the mathematical transformation techniques, a<br />
time-frequency (TF) decomposition technique might<br />
be a suitable choice, since it exploits the simultaneous<br />
overlap in the time and frequency domains, and filters the<br />
noise accordingly.<br />
The complexity of the structures present at an infrared<br />
wireless receiver requires the development of adaptive,<br />
low-level representations in order to provide a<br />
meaningful analysis. In Fourier analysis, the sine and cosine<br />
basis functions are not suitable for capturing subtle<br />
signal changes because of their inability to<br />
localize time information. A better signal<br />
representation, using basis functions that can capture<br />
both temporal and spectral information, would be more<br />
useful. Signal representations such as wavelets and<br />
wavelet packets can provide this information. The<br />
signal decomposition techniques considered in<br />
this paper are wavelet-packets, and they are briefly<br />
described in the subsequent sections.<br />
Using wavelets, any signal can be decomposed into<br />
components with good time and scale properties.<br />
Wavelets have the advantage of expressing a signal<br />
with fewer coefficients. The design of the basis functions<br />
must be optimized so that the number of non-zero<br />
coefficients is minimal; the input signal is<br />
approximated by projecting it over M basis functions<br />
selected adaptively. It is represented as follows:<br />
$x(t) = \sum_{m \in I_M} \langle x, g_m \rangle \, g_m$<br />
where $x(t)$ is the signal to be decomposed, and $\langle \cdot,\cdot \rangle$<br />
denotes the inner product between the signal and the<br />
basis functions. The basis functions are obtained by<br />
shifting and scaling a prototype<br />
function called the mother wavelet, given by:<br />
$\psi_{s,u}(t) = \frac{1}{\sqrt{s}}\, \psi\!\left(\frac{t-u}{s}\right)$<br />
where s is the scale parameter and u is the translation<br />
parameter. Wavelet analysis uses long time intervals for<br />
detailed low-frequency analysis and short time<br />
intervals for high-frequency information. This offers<br />
good frequency resolution at low frequencies and good<br />
time resolution at high frequencies.<br />
The main difference between wavelet and<br />
wavelet packet analysis is that the latter allows an<br />
adjustable frequency resolution through filter bank<br />
decomposition. It splits the whole spectrum into two<br />
equal bands at successive levels, obtaining a general tree<br />
structure called the wavelet packet expansion.<br />
The basis functions are generated with an<br />
algorithm that uses quadrature mirror filter (QMF)<br />
banks, and they divide the spectrum as a tree with multiple<br />
branches. Wavelet packets allow searching for the<br />
optimum decomposition of the binary tree by looking for<br />
the branch with the best entropy criterion for the input<br />
signal [7]. Once the wavelet or wavelet packet<br />
decomposition of the signal is achieved, the next step<br />
is thresholding the resulting coefficients. This can be<br />
done in two ways: hard and soft thresholding.<br />
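The QMF splitting that grows the wavelet-packet tree can be illustrated with the Haar filter pair. This is a minimal numpy sketch under stated assumptions: the Haar basis and the toy signal are illustrative choices, not the filters actually used in the paper.

```python
import numpy as np

def haar_split(x):
    """One level of a Haar QMF split into lowpass (approximation) and
    highpass (detail) half-bands. Applying it recursively to BOTH outputs
    grows the full wavelet-packet binary tree described in the text."""
    x = np.asarray(x, dtype=float)
    approx = (x[0::2] + x[1::2]) / np.sqrt(2)   # lowpass filter + downsample
    detail = (x[0::2] - x[1::2]) / np.sqrt(2)   # highpass filter + downsample
    return approx, detail

# Usage: split a toy signal and reconstruct it perfectly from the two bands.
x = np.array([4.0, 2.0, 5.0, 5.0])
a, d = haar_split(x)
rec = np.empty_like(x)
rec[0::2] = (a + d) / np.sqrt(2)   # inverse of the orthonormal Haar pair
rec[1::2] = (a - d) / np.sqrt(2)
```

The split is orthonormal, so signal energy is preserved across the two half-bands and reconstruction is exact.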
If $w_n$ denote the wavelet/wavelet packet<br />
coefficients, then hard thresholding [7] is applied as:<br />
$\hat{w}_n = \begin{cases} w_n, & |w_n| > T \\ 0, & |w_n| \le T \end{cases}$<br />
where T is the selected threshold.<br />
To avoid the effect of certain de-noising filters<br />
that remove the sharp features of the signal along with<br />
important components, soft thresholding discards<br />
terms with a small or insignificant contribution to the<br />
information.<br />
Soft thresholding is performed as:<br />
$\hat{w}_n = \mathrm{sgn}(w_n)\, \max\!\left(|w_n| - T,\; 0\right)$<br />
Different methods are used for selecting the<br />
best threshold T and for rescaling the coefficients<br />
according to the noise level.<br />
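The two thresholding rules can be stated compactly in code. This is an illustrative sketch: the threshold value T = 1.0 and the test coefficients are arbitrary; in practice T is chosen from the noise level, e.g. the universal threshold of Donoho [7].

```python
import numpy as np

def hard_threshold(w, T):
    # keep coefficients whose magnitude exceeds T, zero the rest
    return np.where(np.abs(w) > T, w, 0.0)

def soft_threshold(w, T):
    # shrink every coefficient toward zero by T, zeroing the small ones
    return np.sign(w) * np.maximum(np.abs(w) - T, 0.0)

# One common choice (Donoho's universal threshold): T = sigma * sqrt(2 ln N),
# where sigma is the noise level and N the number of coefficients.
w = np.array([-3.0, -0.5, 0.2, 1.5])
hard = hard_threshold(w, 1.0)   # → [-3.0, 0.0, 0.0, 1.5]
soft = soft_threshold(w, 1.0)   # → [-2.0, 0.0, 0.0, 0.5]
```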
4. RESULTS AND CONCLUSIONS<br />
As described in Section 3, three noise removal<br />
experiments were performed: (1) removal of high-frequency<br />
periodic interference due to fluorescent<br />
lamps equipped with electronic ballasts, (2) removal of<br />
ambient noise of a random nature with adaptive<br />
algorithms such as the LMS, and (3) automatic<br />
denoising of uncorrelated structures in a received<br />
signal using the wavelet-packet technique.<br />
Removal of High Frequency Periodic Interference:<br />
Periodic interference in the signal due to electronic<br />
ballasts has even harmonics extending up to 1 MHz.<br />
In Fig. 2 a synthetic signal with periodic interference is<br />
shown. As the periodic interference appears as<br />
spectral peaks in the signal's spectrum, a series of<br />
notch filters has to be designed to attenuate these<br />
spectral peaks. Since spectral attenuation can be achieved<br />
by placing zeros on or close to the unit<br />
circle in the z-plane at the exact frequency points, a<br />
linear-phase finite impulse response (FIR) filter was<br />
determined to be the best option. Detailed analysis of<br />
the right filter type revealed that a 26th order FIR filter<br />
could totally suppress the harmonic interference<br />
present in the signal caused by the electronic ballasts.<br />
The magnitude and phase response of the filter is<br />
shown in Fig. 3. It can easily be seen that the filter<br />
has a linear phase response, and the magnitude response<br />
clearly exhibits the comb filter characteristics. The<br />
output of the filter is shown in Fig. 4, and it is evident<br />
that FIR comb filtering satisfactorily reduces the<br />
harmonics due to electronic ballast interference.<br />
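A comb filter of this kind can be sketched as H(z) = (1 − z⁻²⁶)/2, which places 26 zeros evenly around the unit circle. This is an illustrative sketch, not the paper's actual design: the sampling rate is an assumption chosen so that the notches land exactly on multiples of 37.5 kHz, and the paper does not give its filter coefficients.

```python
import numpy as np

# 26th-order FIR comb: h = [0.5, 0, ..., 0, -0.5], i.e. H(z) = (1 - z^-26)/2.
# Its zeros sit at the 26 roots of unity, notching DC and k*fs/26.
order = 26
b = np.zeros(order + 1)
b[0], b[-1] = 0.5, -0.5

# Assumed sampling rate so that fs/26 = 37.5 kHz (the ballast fundamental).
fs = 26 * 37.5e3                          # 975 kHz
t = np.arange(2048) / fs
tone_notch = np.sin(2 * np.pi * 75e3 * t)     # 2nd harmonic: sits on a notch
tone_pass = np.sin(2 * np.pi * 56.25e3 * t)   # midway between notches: passes

y_notch = np.convolve(tone_notch, b, mode="valid")   # suppressed to ~0
y_pass = np.convolve(tone_pass, b, mode="valid")     # passed at ~unity gain
```

Because the impulse response is antisymmetric, the filter has (generalized) linear phase, matching the comb characteristics described in the text.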
Removal of Ambient Noise:<br />
In this study, an infrared wireless system was modeled<br />
with a signal of interest, and noise was added at<br />
different power levels. The desired response in the<br />
training stage of the filter was assumed to be a delayed<br />
version of the clean signal free of ambient noise. A<br />
12th order transversal filter was trained with the LMS<br />
adaptive filter algorithm. The step size parameter in<br />
the LMS, which governs the convergence rate of the<br />
algorithm, was selected in an adaptive manner, such<br />
a way that the step size is inversely proportional to the<br />
instantaneous energy of the signal. It was found that<br />
the step size selected in this manner provides an<br />
optimal convergence suitable for an infrared wireless<br />
communications environment. Fig. 5 shows the<br />
original and the denoised signal obtained using this<br />
approach. It can be seen in panel B that a clean<br />
signal is obtained, but the convergence of the LMS has<br />
caused some transient-like disturbances in the filtered<br />
output. The transient-like disturbance was minimized<br />
by selecting the step size based on the instantaneous<br />
energy of the signal of interest.<br />
Removal of Uncorrelated Signal/Noise Structures:<br />
In the case of noise spectra overlapping with the signal<br />
spectrum, and where the reference channel is not<br />
available or the reference signal is uncorrelated with<br />
the noise in the primary channel, signal<br />
decomposition techniques could be a better alternative.<br />
As explained in Section 3, wavelet-packet techniques<br />
are promising tools for the removal of structures that are<br />
not correlated with the signal of interest. The wavelet-packet<br />
method picks the best basis functions using the entropy<br />
optimization criterion. In this study, Coiflet,<br />
Daubechies, Haar and Symmlet wavelets were tried<br />
with soft thresholding criteria, and among<br />
them Daubechies (db6) seemed to outperform the other<br />
commonly used wavelets in terms of removing<br />
structures that are not relevant in an infrared wireless<br />
receiver system. Fig. 6 shows the original and the<br />
denoised signals, and it can clearly be seen that the<br />
wavelet-packet method has performed extremely well in<br />
removing irrelevant components from the signal of<br />
interest.<br />
Fig. 2: Fluorescent lamp geared by electronic ballasts.<br />
Fig. 3: Magnitude and phase response of FIR Comb<br />
filter of order 26.
Fig. 4: Output of the Comb filter.<br />
Fig. 5: Original ambient noise and LMS<br />
filtered output.<br />
Fig. 6: Original and wavelet-packet denoised signals.<br />
References<br />
[1] R. Narasimhan, M. D. Audeh, J. M. Kahn, Effect of<br />
electronic-ballast fluorescent lighting on wireless<br />
infrared links, IEE Proc.-Optoelectron., Vol. 143, No.<br />
6, December 1996.<br />
[2] Moreira A. J. C., Valadas R. T., and de Oliveira<br />
Duarte A. M., Optical interference produced by<br />
artificial light, Wireless Networks, Vol. 3, 1997, pp.<br />
131-140.<br />
[3] Haykin, S., Adaptive Filter Theory, Prentice Hall,<br />
New Jersey, 2002.<br />
[4] S. Krishnan and R. Rangayyan, Automatic de-noising<br />
of knee joint vibration signals using adaptive<br />
time-frequency representations, Medical and<br />
Biological Engineering and Computing, Vol. 38, No. 1,<br />
pp. 2-8, January 2000.<br />
[5] S. Johnston, A. Diaz and S. Doctor, De-noising of<br />
ultrasonic signals backscattered from coarse-grained<br />
materials: wavelet processing and<br />
maximum-entropy reconstruction, 67th Annual<br />
Meeting of the Southeastern Section of the American<br />
Physical Society.<br />
[6] X. Xie, J. Kuang, A noise canceller for mobile<br />
communications utilizing time-frequency analysis,<br />
Fourth Asia-Pacific Conference on Optoelectronics and<br />
Communications, Vol. 1, pp. 504-507, October 1999.<br />
[7] D. Donoho, Nonlinear wavelet methods for<br />
recovery of signals, densities, and spectra from<br />
indirect and noisy data, Proceedings of Symposia in<br />
Applied Mathematics, Vol. 47, pp. 173-205, 1993.<br />
[8] M. Bahoura and J. Rouat, Wavelet speech<br />
enhancement based on the Teager energy operator,<br />
IEEE Signal Processing Letters, Vol. 8, No. 1, January 2001.<br />
[9] N. Virag, Single channel speech enhancement<br />
based on masking properties of the human<br />
auditory system, IEEE<br />
Transactions on Speech and Audio Processing, Vol. 7,<br />
Issue 2, March 1999.<br />
[10] L. Arslan, A. McCree, V. Viswanathan, New<br />
methods for adaptive noise suppression, Proceedings<br />
of the International Conference on Acoustics, Speech<br />
and Signal Processing, Vol. 1, pp. 812-815, Detroit,<br />
USA, May 1995.<br />
Adaptive denoising at Infrared wireless receivers<br />
Xavier N. Fernando, Sridhar Krishnan, Hongbo Sun and Kamyar Kazemi-Moud<br />
Department of Electrical and Computer Engineering, <strong>Ryerson</strong> <strong>University</strong><br />
Toronto,ON, M5B 2K3, Canada<br />
(fernando@ee.ryerson.ca)<br />
ABSTRACT<br />
This paper proposes an innovative approach for noise cancellation at infrared (IR) wireless receivers. Ambient noise due<br />
to artificial lighting and the sun has been a major concern in infrared systems. The background-induced shot noise<br />
typically has a power 20 to 40 dB greater than the signal-induced shot noise, and it varies with time. Due to these<br />
changing conditions, infrared wireless receivers experience a high level of non-stationary noise. The objective of the work<br />
presented in this paper is to develop digital signal processing algorithms for the infrared wireless system to combat high-power<br />
non-stationary noise. The noisy signal is decomposed, using a joint time and frequency representation such as<br />
wavelets and wavelet packets, into transform-domain coefficients, and the lower-order coefficients are removed by<br />
applying a threshold. The denoised version is obtained by reconstructing the signal with the remaining coefficients. In this<br />
paper, we evaluate different wavelet methods for denoising at an infrared wireless receiver. Simulation results indicate<br />
that if the noise is uncorrelated with the signal and the channel model is unavailable, the wavelet denoising method with<br />
different wavelet analyzing functions improves the signal-to-noise ratio (SNR) from 4 dB to 7.8 dB.<br />
Keywords: optical wireless, infrared, receiver, noise, wavelet transform, denoising<br />
1. INTRODUCTION<br />
Emerging technologies such as mobile portable computing and multimedia terminals in living and work environments<br />
are the main forces driving companies, scientists and researchers to progress in the challenging field of wireless local<br />
area networks (WLAN). The need for higher speed and wider bandwidth in data communication networks is gradually<br />
driving the replacement of the electrical transmission medium with optical media. Wireless infrared LANs are an important part of indoor transmission<br />
systems and enable high bit-rate data transfer over short distances [1]. Infrared systems occupy no radio frequency<br />
(RF) spectrum and can be used where electromagnetic interference is critical. The infrared spectral region offers a<br />
large, virtually unlimited bandwidth that is unregulated worldwide. Since infrared communications are confined to<br />
rooms, there is no interference between communication systems operating in different rooms, which results in secure<br />
communications. In contrast to RF transmission systems, the light is reflected diffusely off the wall surfaces of the rooms,<br />
and channel estimation is a non-trivial subject for infrared systems. A non-directed wireless optical<br />
communication system can be either line-of-sight (LOS) or diffuse. A LOS system is designed under the assumption that<br />
the LOS path between transmitter and receiver is unobstructed. A diffuse system is defined as one which does not rely<br />
upon the LOS path, but instead relies on reflections from a large diffusive reflector such as the ceiling. In both cases, an<br />
optical signal in transit from transmitter to receiver undergoes temporal dispersion due to reflections from walls and<br />
other reflectors; the intersymbol interference (ISI) that results is an impediment to communication at high speeds. Single<br />
diffuse infrared links can operate with bit rates as high as 100 Mb/s [2]. Since it is possible to operate at least one<br />
infrared link in every room of a building without interference, the potential capacity of an infrared-based network is<br />
extremely high. The propagation characteristics of diffuse infrared signals resemble those of radio signals. The measured<br />
received power at different positions using a photodetector much smaller than the light wavelength will result in<br />
multipath fading like fluctuations in received power. In the real diffuse infrared systems, however, the detector size is<br />
much larger than the wavelength, so that the multipath fading like power fluctuations are averaged out effectively. While<br />
multipath propagation does not lead to fading, it causes temporal dispersion. The tail caused by higher order taps of the<br />
indoor channel impulse response induces ISI in high bit-rate communications.<br />
Infrared Technology and Applications XXIX, Bjørn F. Andresen, Gabor F. Fulop, Editors,<br />
Proceedings of SPIE Vol. 5074 (2003) © 2003 SPIE · 0277-786X/03/$15.00<br />
Indoor infrared transmission suffers from a number of impairments, the most important being shot noise from<br />
ambient light and a restricted symbol rate due to multipath dispersion. Noise severely affects the performance of wireless<br />
infrared networks. Background illumination has two distinct effects on the performance of optical receivers: one is the<br />
shot noise at the photodetector caused by the steady, invariant irradiance from undesired light sources; the other is the<br />
interference generated by high-frequency components of some light sources. Typically, natural and artificial ambient<br />
light contribute to high levels of shot noise in a photodetector, which degrades the performance of the transmission<br />
system. For data rates up to 10 Mbps, the major degrading factor of infrared communication systems is the shot noise<br />
induced in the receiver by ambient light. Unfortunately, ambient light sources (sunlight and artificial light) also radiate in<br />
the same spectral wavelengths used by infrared transducers. Thus shot noise presents a strong spatial and temporal<br />
dependence. Several advanced techniques for the design of non-directed wireless infrared communication systems have<br />
already been proposed in order to minimize these signal-to-noise ratio (SNR) fluctuations. Ambient light levels<br />
determine, to a significant degree, the optical power required for reliable transmission. The shot noise induced by<br />
ambient light may vary over several decades during a day in a typical indoor environment.<br />
The interfering signal from fluorescent light is periodic and deterministic. The spectrum of fluorescent lights<br />
driven by electronic ballasts may extend up to frequencies around 1 MHz, and this interference causes serious<br />
degradation at infrared receivers even after high-pass electrical filtering [3-5].<br />
The objective of the work presented in this paper is to develop a digital signal processing algorithm at the<br />
infrared wireless receiver to combat uncorrelated noise without a reference channel model. In Section 2, we introduce<br />
and classify the different noise sources at infrared receivers [3]. Section 3 focuses on the definition of the wavelet<br />
transform and the analyzing functions, which are used in Section 4 to introduce a new methodology for noise<br />
cancellation. The new wavelet-based denoising technique and the results of wavelet denoising are discussed in Section 4.<br />
Conclusions are provided in Section 5.<br />
2. NOISE AT THE RECEIVERS<br />
Noise in infrared optical receivers is a critical parameter of performance analysis. Different sources of noise<br />
contribute to the overall performance of the wireless network link. Thermal noise of the photodetector is dominant under<br />
weak, steady background illumination. Thermal noise depends critically on the front-end design of the receiver (e.g. the<br />
preamplifier). Shot noise is induced by the quantum nature of photons randomly arriving at the photodetector and is<br />
proportional to the average received optical power. Background light, both natural and artificial, may come from several<br />
sources: the sun, incandescent lamps, and fluorescent lamps with conventional or electronic ballasts. The slow variations<br />
in the intensity of sunlight make the Sun a strong source of shot noise. On a sunny day, the spectrum of natural sunlight<br />
is spread over the entire responsivity curve of the PIN photodetector, resulting in a steady background noise current on<br />
the order of a milliampere, stronger than that in a well artificially illuminated room. Shot noise is larger under directional<br />
lamps and near windows exposed to sunlight. Furthermore, it can vary drastically during a normal day with the position<br />
of the sun and with the indoor lighting conditions. Due to the temporal variation and directional nature of both signal and<br />
noise, the SNR at the receiver can vary significantly.<br />
Artificial light sources contribute both shot noise and interference at the infrared receiver. Incandescent-lamp<br />
interference is periodic with a frequency of 100 Hz, and its spectrum has frequency components up to 2 kHz. Harmonics<br />
at frequencies higher than 800 Hz do not carry a significant amount of energy and are 60 dB below the fundamental<br />
harmonic.<br />
In the case of incandescent lamps, the amplitude of the interference is one tenth of the current generated by the<br />
slow variations of intensity. Researchers have already extracted an experimental interference model for typical<br />
incandescent lamps [3].<br />
Fluorescent lamps equipped with conventional ballasts, driven at the power-line frequency of 50 or 60 Hz,<br />
induce interference at harmonics up to 20 kHz. This interference is periodic with a frequency of 50 Hz, and its harmonics<br />
are 50 dB below the 100 Hz component for frequencies higher than 5 kHz. The interference amplitude in this case is 2 to<br />
6 times lower than the shot noise current. An interference model for fluorescent lamps driven by conventional ballasts<br />
has also been extracted experimentally [3].<br />
Fluorescent lamps with electronic ballasts have higher power efficiencies and use the same concept as<br />
switching power supplies. Interference generated by fluorescent lamps with electronic ballasts has a lower amplitude<br />
than that of other types of ambient light, but its spectrum is very broad, with frequency harmonics up to 1 MHz. The<br />
spectrum produced by an electronic-ballast-driven lamp consists of low- and high-frequency regions. The low-frequency<br />
region resembles the spectrum of a conventional fluorescent lamp, while the high-frequency region is attributable to the<br />
electronic ballast. These two components of the spectrum have been modeled using the same experimental approach as<br />
the other noise sources [3].<br />
In these model equations the relative amplitude and phase of the harmonics can be easily identified. For each<br />
class of lamp, all the average parameters of the interference model can be easily identified [3]. Several schemes have<br />
been proposed in order to reduce the power penalty induced by ambient artificial light sources in an indoor infrared<br />
wireless system [5-7].<br />
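As an illustration, such a harmonic interference model can be sketched as a sum of harmonics of the lamp's fundamental frequency. The coefficient values, sample rate and function names below are hypothetical placeholders for illustration only, not the measured parameters of [3]:<br />

```python
import numpy as np

def lamp_interference(t, f0, amps, phases):
    """Periodic lamp interference as a sum of harmonics of f0.

    amps[k] and phases[k] are the relative amplitude and phase of
    harmonic (k+1)*f0; the real values are tabulated in [3]."""
    return sum(a * np.cos(2 * np.pi * (k + 1) * f0 * t + p)
               for k, (a, p) in enumerate(zip(amps, phases)))

# 50 ms of a hypothetical 100 Hz incandescent-lamp interference
fs = 100_000                      # sample rate in Hz (illustrative)
t = np.arange(0, 0.05, 1.0 / fs)
i_n = lamp_interference(t, 100.0, amps=[1.0, 0.3, 0.1],
                        phases=[0.0, 0.5, 1.0])
```

Because the model is a sum of harmonics of f0, the synthesized interference is exactly periodic with period 1/f0.<br />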
3. WAVELET TRANSFORM<br />
Wavelets are functions that satisfy certain mathematical requirements and are used in representing data or other<br />
functions. The wavelet analysis procedure is to adopt a wavelet prototype function, called an "analyzing wavelet". The<br />
wavelet transform has become a powerful tool for signal analysis and is widely used in many applications, including<br />
signal detection and denoising.<br />
The complexity of the structures present at infrared wireless receivers requires the development of an adaptive,<br />
low-level representation in order to provide a meaningful analysis of the system. In Fourier analysis, the basis functions<br />
are sines and cosines, which are not suitable for capturing the subtle changes of the received signals at infrared receivers<br />
because of their inability to localize temporal information.<br />
The wavelet transformation is a time-frequency decomposition technique. By choosing smooth multi-resolution<br />
wavelet analyzing functions that use long time intervals to capture the low-frequency information of the desired signal<br />
and short time intervals for the high-frequency information, one obtains a joint temporal and spectral representation of<br />
that signal. Temporal analysis is performed with a contracted, high-frequency version of the prototype wavelet, while<br />
frequency analysis is performed with a dilated, low-frequency version of the prototype wavelet. Because the original<br />
signal or function can be represented in terms of a wavelet expansion (using a linear combination of the coefficients and<br />
the wavelet basis functions), data operations can be performed using just the corresponding wavelet coefficients. With<br />
the wavelet transformation, any signal can be decomposed into components with good time and scale properties.<br />
Wavelets have the advantage of expressing a signal with fewer coefficients [9].<br />
The basis functions are obtained by shifting and dilating the “analyzing wavelet”. The design of the basis<br />
functions must be optimized so that the number of non-zero coefficients is minimal, and the input signal is approximated<br />
by projecting it over adaptively selected basis functions. In wavelet-based denoising, the noisy signal is decomposed into<br />
transform-domain coefficients, and the lower-order coefficients are removed by applying a threshold. If we assume that<br />
Ψ(t) is the analyzing wavelet function, then the continuous multi-resolution wavelet frame<br />
transform, F[m,n], of a signal f(t) is defined as<br />
F[m,n] = ⟨Ψ_{m,n}(t), f(t)⟩ = ∫_{−∞}^{+∞} Ψ_{m,n}(t) · f(t) dt<br />
The inverse wavelet transform is defined as<br />
f(t) = ∑_{m∈ℑ} ∑_{n∈ℑ} F[m,n] · Ψ_{m,n}(t)<br />
where m and n belong to ℑ, the set of integers indexing each wavelet basis function Ψ_{m,n}(t) in the two-dimensional<br />
wavelet space.<br />
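As a minimal concrete sketch, one analysis/synthesis level of the Haar wavelet (the simplest of the orthogonal wavelets evaluated later) can be written as follows; the function names are ours:<br />

```python
import numpy as np

def haar_analysis(x):
    # one level of Haar decomposition: scaled pairwise sums give the
    # approximation (low-pass) band, pairwise differences the detail band
    x = np.asarray(x, dtype=float)
    approx = (x[0::2] + x[1::2]) / np.sqrt(2.0)
    detail = (x[0::2] - x[1::2]) / np.sqrt(2.0)
    return approx, detail

def haar_synthesis(approx, detail):
    # inverse transform: perfect reconstruction of the even/odd samples
    x = np.empty(2 * len(approx))
    x[0::2] = (approx + detail) / np.sqrt(2.0)
    x[1::2] = (approx - detail) / np.sqrt(2.0)
    return x
```

Because the transform is orthogonal, haar_synthesis(*haar_analysis(x)) reproduces x up to floating-point error, which is what allows denoising to operate entirely on the coefficients.<br />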
The main difference between wavelet and wavelet packet analysis is that the latter allows an adjustable<br />
resolution of frequencies through filter bank decomposition. Filter banks split the whole spectrum into two equal bands<br />
at successive frequency levels, producing a general tree structure called the wavelet packet expansion. Wavelet packet<br />
analysis allows searching for the optimum decomposition of the tree by looking for the branches that best satisfy an<br />
entropy criterion on the input signal [7].<br />
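One such split decision can be sketched as follows, assuming a Haar filter bank and a Shannon-type entropy cost, a common choice for the best-basis search; the function names and this specific cost are our assumptions rather than the exact criterion of [7]:<br />

```python
import numpy as np

def haar_split(band):
    # one filter-bank split: the band is divided into equal
    # low-pass and high-pass halves (Haar analysis filters)
    band = np.asarray(band, dtype=float)
    low = (band[0::2] + band[1::2]) / np.sqrt(2.0)
    high = (band[0::2] - band[1::2]) / np.sqrt(2.0)
    return low, high

def shannon_cost(c, eps=1e-12):
    # Shannon-type entropy of the normalized coefficient energies;
    # lower cost means energy concentrated in fewer coefficients
    p = c ** 2 / (np.sum(c ** 2) + eps)
    return float(-np.sum(p * np.log(p + eps)))

def keep_or_split(band):
    # split the node only if its two children are cheaper in total
    low, high = haar_split(band)
    children = shannon_cost(low) + shannon_cost(high)
    return "split" if children < shannon_cost(band) else "keep"
```

For a constant (purely low-frequency) band, the split concentrates all the energy in the low-pass child, so the entropy drops and the node is split.<br />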
Researchers in related engineering and applied mathematics areas have developed many different wavelet<br />
transform systems, each with specific properties. The difference between these wavelet transforms lies mainly in their<br />
analyzing functions and the way they are constructed. There are two major classes of wavelet transform systems: one<br />
consists of orthogonal wavelets and the other of biorthogonal wavelets. Other wavelet transform systems, not included in<br />
these two main categories, generally have limited applications [8].<br />
4. NOISE CANCELLATION METHOD<br />
In order to cancel the effect of uncorrelated Gaussian noise in the indoor infrared wireless channel, we apply the<br />
wavelet transform to the signal in the electrical domain. Figure 1 shows the schematic diagram of the wireless infrared<br />
link and the receiver with the wavelet-transform denoising block.<br />
Figure 1: Schematic of the wavelet-based denoising wireless infrared link<br />
In this system, the high-pass electrical filter reduces the interference induced by incandescent light and by<br />
fluorescent light with conventional ballasts. The comb filter block cancels the high-frequency interference from<br />
fluorescent lamps driven by electronic ballasts [11]. In the wavelet denoising block, the received signal is transformed<br />
using a pre-defined analyzing function. Once the wavelet decomposition of the signal is obtained, the next step is<br />
thresholding. The thresholder block removes the coefficients of the signal whose absolute value is smaller than a<br />
predefined threshold. Different methods can be used to determine the threshold level that yields a performance<br />
improvement, in addition to rescaling the coefficients to the noise level. If w_m denotes the wavelet coefficients of the<br />
decomposed signal and A the threshold level, then hard thresholding can be described mathematically as:<br />
ŵ_m = w_m if |w_m| ≥ A, and ŵ_m = 0 if |w_m| < A.<br />
Hard thresholding, like certain filters, can remove the sharp features of a signal. To avoid this, soft thresholding discards<br />
the coefficients with a small, insignificant contribution to the information and shrinks the remaining ones:<br />
ŵ_m = Sgn(w_m) · (|w_m| − A) if |w_m| ≥ A, and ŵ_m = 0 if |w_m| < A,<br />
where Sgn(·) is the signum function.<br />
The remaining wavelet coefficients produce the denoised signal, which is then demodulated and decoded. The<br />
aim of this denoising block is to alleviate the shot noise generated by incandescent light and the thermal noise from the<br />
receiver electronics. For the simulation, the denoising algorithm is applied to a 10 kHz pulse train that passes through an<br />
infrared channel contributing additive Gaussian noise at an SNR of 4 dB. The data signal with additive white Gaussian<br />
noise and its spectrum are shown in Figures 2-a and 2-b, respectively.<br />
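The two thresholding rules can be written directly in code; this is a minimal numpy sketch with our own function names, where w and A follow the notation of the text:<br />

```python
import numpy as np

def hard_threshold(w, A):
    # keep coefficients whose magnitude reaches the threshold, zero the rest
    return np.where(np.abs(w) >= A, w, 0.0)

def soft_threshold(w, A):
    # additionally shrink the surviving coefficients toward zero by A
    return np.sign(w) * np.maximum(np.abs(w) - A, 0.0)

w = np.array([0.2, -1.5, 3.0])
hard = hard_threshold(w, 1.0)   # small coefficient removed, others kept
soft = soft_threshold(w, 1.0)   # survivors also shrunk by A
```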
Figure 2: (a) The received signal passed over an additive white Gaussian noise channel; (b) the spectrum of the received signal.<br />
The simulations were carried out using six different wavelet analyzing functions, and the resulting SNR<br />
improvements are summarized in Table 1. The “SNR improvement” is defined as the SNR after denoising minus the<br />
SNR before denoising. The orthogonal wavelet transforms used in the simulation were the Haar, Daubechies, Coiflet,<br />
Symlet and discrete Meyer wavelet transforms.<br />
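The “SNR improvement” figure of merit can be computed as follows; this is a sketch with our own function names, measuring both SNRs against the known clean signal:<br />

```python
import numpy as np

def snr_db(clean, estimate):
    # SNR in dB of `estimate` with respect to the known clean signal
    noise = estimate - clean
    return 10.0 * np.log10(np.sum(clean ** 2) / np.sum(noise ** 2))

def snr_improvement(clean, noisy, denoised):
    # SNR after denoising minus SNR before denoising
    return snr_db(clean, denoised) - snr_db(clean, noisy)
```

Halving the residual noise amplitude raises the SNR by 20·log10(2) ≈ 6 dB, so the 3 to 4 dB improvements in Table 1 correspond to roughly halving the noise power.<br />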
Figure 3 shows the original received noisy signal (above) and denoised version of the same signal after<br />
applying discrete Meyer’s wavelet transform (below). SNR improvement of the denoised signal in this case is 3.8 dB. In<br />
the thresholding block the wavelet coefficients obtained from signal decomposition that are lower than the threshold<br />
level are discarded. Figure 4 shows the original Gaussian noise of the channel (above) and the temporal representation of<br />
the discarded coefficients (below).<br />
Waveform    SNR improvement (dB)<br />
'Haar'      2.3279<br />
'db'        3.4801<br />
'sym'       3.4522<br />
'coif'      3.5583<br />
'bior'      3.7485<br />
'dmey'      3.8281<br />
Table 1: SNR improvement (in dB) of the wavelet denoising method using different analyzing functions<br />
Figure 3: The original noisy 10 kHz pulse train (above) and the denoised version using the discrete Meyer transform (below)<br />
Figure 4: The original Gaussian noise (above) and the reconstruction of wavelet coefficients discarded by thresholder (below)<br />
Figure 5: The original noisy 10 kHz pulse train (above) and the denoised version using the Haar transform (below)<br />
Figure 6: The original Gaussian noise (above) and the reconstruction of wavelet coefficients discarded by thresholder (below)<br />
Figure 5 shows the denoised version of the received signal using the Haar wavelet transform (below), and the<br />
signal reconstructed from the coefficients discarded in the thresholder is shown in Figure 6 (below). Using the Haar<br />
wavelet transform, an SNR improvement of 2.3 dB was achieved.<br />
The Haar analyzing wavelet has sharp edges compared to the smoother Meyer mother wavelet; this results in a<br />
loss of signal information at those sharp edges and therefore a lower SNR improvement. Overall, the wavelet denoising<br />
method with any of the analyzing functions yields an SNR improvement of approximately 3 to 4 dB, i.e., the denoised<br />
signal is roughly twice as strong relative to the noise. This improvement can be achieved for noise that is uncorrelated<br />
with the information signal and when a reference channel for the noise is not available.<br />
5. CONCLUSIONS<br />
The different noise contributions at infrared wireless receivers have been reviewed, and a new denoising method<br />
for uncorrelated noise in wireless infrared receivers was introduced using the wavelet transform. In this method, the<br />
denoised version is obtained by reconstructing the signal from the coefficients that remain after passing through a<br />
thresholder. We evaluated the Coiflet, Daubechies, Haar, Symlet, biorthogonal and Meyer wavelet analyzing functions<br />
for denoising at an infrared wireless receiver. Overall, using the wavelet transform with any of the analyzing functions in<br />
the simulation resulted in an SNR improvement of approximately 3 to 4 dB for an input SNR of 4 dB. If the power<br />
density function of the noise, which is uncorrelated with the information signal, is known while the reference channel<br />
model is unknown, the use of self-defined adaptive wavelet analyzing functions can improve the SNR of a received<br />
signal whose spectrum overlaps with that of the noise.<br />
A comparison of the SNR improvement for different wavelet analyzing functions has been presented. The<br />
results indicate that smoother wavelet analyzing functions preserve more signal information and hence yield a higher<br />
SNR improvement. However, since the overall SNR improvement of the wavelet decomposition method is between 3<br />
and 4 dB for the different wavelets, we suggest using wavelets that can be implemented more easily on digital signal<br />
processor (DSP) chips and computed efficiently, in order to satisfy the speed constraints of the electronics used in the<br />
lightwave system.<br />
6. REFERENCES<br />
1. F. R. Gfeller and U. Bapst, “Wireless in-house data communication via diffuse infrared radiation,” Proceedings of the<br />
IEEE, vol. 67, pp. 1474-1486, November 1979.<br />
2. M. D. Audeh and J. M. Kahn, “Performance simulation of baseband OOK modulation for wireless infrared LANs at<br />
100 Mb/s,” Proc. IEEE International Conference on Selected Topics in Wireless Communications, pp. 271-274, June<br />
1992.<br />
3. A. J. C. Moreira, R. T. Valadas and A. M. de Oliveira Duarte, “Characterisation and modelling of artificial light<br />
interference in optical wireless communication systems,” Proc. IEEE PIMRC'95, vol. 1, pp. 326-331, September 1995.<br />
4. T. O'Farrell and M. Kiatweerasakul, “Performance of a spread spectrum infrared transmission system under ambient<br />
light interference,” Proc. Ninth IEEE International Symposium on Personal, Indoor and Mobile Radio Communications,<br />
vol. 2, pp. 703-707, September 1998.<br />
5. A. J. C. Moreira, R. T. Valadas and A. M. de Oliveira Duarte, “Reducing the effects of artificial light interference in<br />
wireless infrared transmission systems,” IEE Colloquium on Optical Free Space Communication Links, pp. 5/1-5/10,<br />
February 1996.<br />
6. A. J. C. Moreira, R. T. Valadas and A. M. de Oliveira Duarte, “Optical interference produced by artificial light,”<br />
Wireless Networks, vol. 3, no. 2, pp. 131-140, 1997.<br />
7. C. J. Georgopoulos, “Suppressing background-light interference in an in-house infrared communication system by<br />
optical filtering,” International Journal of Optoelectronics, vol. 3, no. 3, 1988.<br />
8. D. Donoho, “Nonlinear wavelet methods for recovery of signals, densities and spectra from indirect and noisy data,”<br />
Proceedings of Symposia in Applied Mathematics, vol. 00, pp. 173-205, 1993.<br />
9. G. Strang and T. Nguyen, Wavelets and Filter Banks, Wellesley-Cambridge Press, Wellesley, Massachusetts, 1996.<br />
10. R. Narasimhan, M. D. Audeh and J. M. Kahn, “Effect of electronic-ballast fluorescent lighting on wireless infrared<br />
links,” Proc. IEEE International Conference on Communications (ICC'96), vol. 2, pp. 1213-1219, June 1996.<br />
11. S. Krishnan, X. Fernando and H. Sun, “Non-stationary interference cancellation in infrared wireless receivers,” Proc.<br />
IEEE Canadian Conference on Electrical and Computer Engineering, Montreal, Quebec, May 2003 (in press).<br />
Authorized licensed use limited to: Ryerson University Library. Downloaded on July 7, 2009 at 10:46 from IEEE Xplore. Restrictions apply.<br />
Fixed Block-based Lossless Compression of Digital Mammograms<br />
Marwan Y. Al-Saiegh and Sridhar Krishnan<br />
Dept. of Electrical and Computer Engineering,<br />
Ryerson Polytechnic University, Toronto, ON M5B 2K3, CANADA.<br />
Email: (malsaie@ee.ryerson.ca) (krishnan@ee.ryerson.ca)<br />
Abstract: Breast cancer is a leading cause of death among women in Canada. Computer-aided diagnosis of<br />
mammograms (X-ray films of breast tissue) is a non-invasive and inexpensive way of diagnosing breast cancer. The<br />
objective of this project is to investigate image compression schemes for the faithful transmission and reproduction of<br />
digital mammography data over a communication link. A fixed block-based (FBB) near-lossless compression scheme<br />
for mammograms has been developed which runs in conjunction with traditional compression schemes such as Huffman<br />
coding and LZW (Lempel-Ziv Welch) coding. The algorithm codes blocks of pixels within the image that contain the<br />
same intensity value (the odds of having blocks of the same pixel values in a mammography image are very high), thus<br />
reducing the size of the image substantially while encoding the image at the same time. The proposed compression<br />
scheme was applied to 44 mammograms (22 benign and 22 malignant) and provided a compression ratio of 1.7:1. When<br />
Huffman coding and LZW coding were used in conjunction with the FBB compression scheme, the compression ratio<br />
was 3.81:1 for Huffman and 5:1 for LZW coding. The proposed FBB lossless compression technique seems to be<br />
promising for teleradiology applications.<br />
1 Introduction<br />
Breast cancer is one of the leading causes of death among women in the world. In the U.S. alone, in 2000, more than<br />
40,000 women died of breast cancer. Early diagnosis is therefore extremely important to reduce the mortality rate [1].<br />
American Cancer Society guidelines for women aged 40-50 advocate screening every 1-2 years, with the frequency<br />
based on the patient's risk factors. This procedure would result in some 20 million mammograms per year. Archiving<br />
and retaining these data for at least three years will be expensive and difficult, requiring sophisticated data compression<br />
techniques [2]. Screening of mammograms in rural clinics is also a growing concern: because of the scarcity of<br />
radiologists, a subject may have to wait for weeks to get her diagnosis result. The delay in producing the results is<br />
mainly due to the infrequent visits of radiologists to rural clinics and the non-availability of an efficient communication<br />
link through which a mammographic image could be faithfully transmitted to a city clinic. Teleradiology of digital<br />
mammograms could significantly alleviate this problem, and may facilitate early diagnosis and reduce the incidence of<br />
this killer disease. The above facts warrant an efficient data compression scheme for mammograms.<br />
Figure 1: Block diagram of the FBB technique using Huffman and Lempel-Ziv Welch coding<br />
Physicians or radiologists are reluctant to consider a technique that would discard even a small amount of<br />
information from a mammographic image. By exploiting the redundancy or correlation information of pixels in an<br />
image, a data compression technique can be designed to efficiently compress an image. Current compression techniques<br />
are based upon transform coding techniques such as the discrete wavelet transform and discrete cosine transform [3].<br />
Although transform coding techniques have claimed a compression ratio of 10:1, they are lossy compression schemes<br />
and need extensive receiver operating characteristic (ROC) studies of compressed images. The proposed near-lossless<br />
compression scheme is shown in block diagram form in Fig. 1. The proposed technique is “minimally” lossy and does<br />
not require any ROC studies to evaluate compressed images. The paper is organized as follows: section 2 covers the<br />
fixed block-based (FBB) compression scheme, Huffman coding and LZW coding are briefly covered in section 3,<br />
section 4 covers results and discussion, and finally the paper is concluded in section 5.<br />
Figure 2: Block diagram of the FBB compression<br />
2 Fixed block-based (FBB) compression scheme<br />
The fixed block-based (FBB) compression scheme takes advantage of pixel correlation while scanning the<br />
image from left to right. It is known that adjacent pixels in a mammographic image are highly correlated, and adjacent<br />
pixels can be combined to reduce this redundancy; that is what the proposed FBB algorithm is based upon. The FBB<br />
compression scheme reads the pixels one at a time and stores them in a two-dimensional array (i.e., 448 x 448). The<br />
histogram of the mammogram is used to identify pixel values that do not appear in the image. This procedure is essential<br />
because it introduces two redundant pixel values that are used as keys throughout the algorithm to avoid overlapping and<br />
to represent blocks of zeros in the output file.<br />
2.1 Algorithm of the compression scheme<br />
The proposed FBB compression scheme is shown in block<br />
diagram form in Fig. 2. The steps involved are<br />
1. If the difference between the current pixel (i.e., x[i][j]) and each of the surrounding pixels x[i][j+1], x[i+1][j+1], and x[i+1][j] is 0, 1, or -1, then go to step 3. Otherwise go to step 2.<br />
2. Write the current pixel (i.e., x[i][j]) to the output file, move the sliding window to the next column of the two-dimensional array, and go back to step 1.<br />
3. If the current pixel is not zero, then write (-1)*(current pixel) to the output file. If the current pixel is 0, then write the second 'key' pixel to the output file. Go to step 4.<br />
4. Replace the block of pixels in the two-dimensional array by the first 'key' pixel to avoid overlapping when the algorithm is implemented, and go to step 5.<br />
5. Force the sliding window to skip one column, and go back to step 1 (i.e., instead of sliding the window from column two, start from column three).<br />
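The block test in steps 1-3 can be sketched as follows. This is a minimal illustration assuming the 2×2 window implied by the three-neighbour test, with caller-supplied 'key' gray levels (`key1`, `key2`) absent from the image; it is not the authors' implementation, and edge handling is simplified:

```python
import numpy as np

def fbb_compress(img, key1, key2):
    """One FBB pass over a 2-D image: a near-flat 2x2 block collapses
    to a single code (-value, or key2 for an all-zero block); other
    pixels are written through. key1 marks absorbed blocks so later
    windows do not overlap them; key1/key2 must not occur in the image."""
    img = img.copy()
    h, w = img.shape
    out = []
    for i in range(h - 1):
        j = 0
        while j < w - 1:
            cur = int(img[i, j])
            nbrs = (int(img[i, j + 1]), int(img[i + 1, j + 1]), int(img[i + 1, j]))
            if all(abs(cur - n) <= 1 for n in nbrs):     # step 1: near-flat block
                out.append(key2 if cur == 0 else -cur)   # step 3: one code per block
                img[i:i + 2, j:j + 2] = key1             # step 4: mark to avoid overlap
                j += 2                                   # step 5: skip a column
            else:
                out.append(cur)                          # step 2: literal pixel
                j += 1
    return out
```

Here a flat block of value v is emitted as -v, so the sign distinguishes block codes from literal pixels; 0 has no sign, which is exactly why the second 'key' stands in for zero blocks, as the paper argues below.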
It is important to realize that the sliding-window approach can be improved by making the window size bigger to absorb more pixels when needed, but the tradeoff is choosing more 'key' pixels, one for every window size, to distinguish between the different window sizes. Since 0 cannot be positive or negative, a 'key' pixel is needed for the 0 pixel. Thus, whenever the sliding-window algorithm finds a block of 0 pixels, the second 'key' pixel is written to the output file.<br />
Figure 3: Block diagram of the FBB decompression<br />
2.2 Algorithm of the decompression scheme<br />
The FBB decompression algorithm is shown in block diagram form in Fig. 3. While performing decompression, the compressed file is read from the standard input and stored in a one-dimensional array. A temporary two-dimensional array is used to reconstruct the image. The temporary array is initialized with the first 'key' value. The algorithm is as follows:<br />
1. If the current pixel in the one-dimensional array (i.e., x[i]) is negative, then write the current pixel value as a block (i.e., a 4×4 matrix) in positive form (i.e., x[i]*(-1)) to the temporary two-dimensional array and go to step 4.<br />
2. If the current pixel value is the same as the second 'key' value, then write a block of 0's (i.e., a 4×4 matrix) to the temporary two-dimensional array and go to step 4.<br />
3. If the current pixel in the one-dimensional array is positive, then write the current pixel value to the temporary array and go to step 5.<br />
4. Skip the following column in the temporary two-dimensional array to account for the block of pixels, and go to step 5.<br />
5. Increment the index value in the one-dimensional array by one and go back to step 1.<br />
Once the decompression algorithm is completed, the temporary two-dimensional array is written to an output file (i.e., a decompressed version of the original file).<br />
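A matching inverse of those codes can be sketched as below, simplified to emit a flat pixel list instead of rebuilding the two-dimensional array. The block size is a parameter (the paper writes 4×4 blocks; 2×2 keeps the toy small), and the column bookkeeping of steps 4-5 is omitted:

```python
def fbb_decompress(stream, key2, block=2):
    """Invert the FBB codes: a negative code expands to a flat block
    of its absolute value, the second 'key' expands to a zero block,
    and any other code is a single literal pixel."""
    pixels = []
    for code in stream:
        if code == key2:
            pixels.extend([0] * (block * block))       # zero block (step 2)
        elif code < 0:
            pixels.extend([-code] * (block * block))   # flat block (step 1)
        else:
            pixels.append(code)                        # literal pixel (step 3)
    return pixels
```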
3 Huffman coding and Lempel-Ziv-Welch coding<br />
After performing FBB compression of the mammogram image, it is further compressed using popular lossless compression schemes such as Huffman coding and LZW coding.<br />
3.1 Huffman coding<br />
Huffman codes belong to a family of codes with variable codeword lengths, which means that the individual symbols that make up a message are represented (encoded) with bit sequences of distinct lengths [4]. This helps to decrease the amount of redundancy in message data. Decreasing the redundancy in data by Huffman codes is based on the fact that distinct symbols have distinct probabilities of incidence. This makes it possible to assign code words that remove the redundancy: symbols with higher probabilities of incidence are coded with shorter code words, while symbols with lower probabilities are coded with longer code words.<br />
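The principle, short codes for probable symbols, can be illustrated with a textbook Huffman table builder (a sketch using the standard two-least-probable merge, not the coder used in the paper):

```python
import heapq
from collections import Counter

def huffman_codes(data):
    """Build a Huffman code table: symbols with higher probability of
    incidence get shorter code words. Returns {symbol: bit string}."""
    freq = Counter(data)
    if len(freq) == 1:                       # degenerate one-symbol input
        return {next(iter(freq)): "0"}
    # heap entries: (frequency, tie-break id, {symbol: code-so-far})
    heap = [(f, i, {sym: ""}) for i, (sym, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    uid = len(heap)
    while len(heap) > 1:
        f1, _, low = heapq.heappop(heap)     # two least probable subtrees
        f2, _, high = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in low.items()}    # prefix one side 0
        merged.update({s: "1" + c for s, c in high.items()})  # other side 1
        heapq.heappush(heap, (f1 + f2, uid, merged))
        uid += 1
    return heap[0][2]
```

For the input "aaaabbc" the frequent symbol 'a' receives a one-bit code while 'b' and 'c' receive two-bit codes.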
3.2 Lempel-Ziv-Welch coding<br />
The LZW algorithm relies on re-occurrence of byte se-<br />
quences (strings) in its input [5]. It maintains a table<br />
mapping input strings to their associated output codes.<br />
The goal of LZW compression is to replace repeating in-<br />
put strings with n-bit codes. This is done by generating a<br />
string table on the fly, which is a mapping between pixel<br />
values and compression codes. This string table is built<br />
by the encoder as it processes the data, and due to the<br />
encoding method the decoder can reconstruct the string<br />
table as it processes the compressed data. This differs<br />
from other compression algorithms, such as Huffman cod-<br />
ing, where the lookup table needs to be included with the<br />
compressed data.<br />
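The on-the-fly table growth described above can be sketched as follows; for brevity the initial table holds only the symbols present in the input rather than all 2^n pixel values:

```python
def lzw_encode(data):
    """Textbook LZW encoder sketch: grow the string table while
    scanning, emitting one code per longest already-known string."""
    table = {ch: i for i, ch in enumerate(sorted(set(data)))}
    current, out = "", []
    for ch in data:
        if current + ch in table:
            current += ch                      # extend the known prefix
        else:
            out.append(table[current])         # emit longest known string
            table[current + ch] = len(table)   # learn the unseen string
            current = ch
    if current:
        out.append(table[current])
    return out
```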
LZW works based on the fact that many groupings of<br />
pixels are common in images: it goes through the image<br />
data and tries to encode as large a grouping of pixels as<br />
possible with an encoding from the string table, placing<br />
unrecognized groupings into the string table so they can<br />
be compressed on later occurrences. For an image with n-<br />
bit pixel values, it uses compression codes that are n + 1<br />
bits or larger. While a smaller compression code helps gain<br />
larger amounts of compression, the size of the compression<br />
code limits the size of the string table.<br />
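The decoder-side table reconstruction noted earlier can be sketched likewise; `codes` below is hardcoded as the LZW encoding of "abababab" over the two-symbol alphabet "ab":

```python
def lzw_decode(codes, alphabet):
    """Rebuild the LZW string table while decoding, so no table is
    transmitted with the data (unlike Huffman coding). `alphabet`
    lists the initial single-symbol strings in code order."""
    table = list(alphabet)
    prev = table[codes[0]]
    out = [prev]
    for code in codes[1:]:
        # a code may name a string the decoder is one step behind on
        entry = table[code] if code < len(table) else prev + prev[0]
        out.append(entry)
        table.append(prev + entry[0])   # learn prev + first char of entry
        prev = entry
    return "".join(out)

decoded = lzw_decode([0, 1, 2, 4, 1], "ab")
```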
4 Results and discussion<br />
The proposed FBB compression scheme was tested on the MiniMammographic database of 44 images from the Mammographic Image Analysis Society (MIAS). The MIAS is an organisation of UK research groups interested in the understanding of mammograms. Films taken from the UK National Breast Screening Programme have been digitized to a 50 micron pixel edge with a Joyce-Loebl scanning microdensitometer, a device linear in the optical density range 0 to 3.2 and representing each pixel with an 8-bit word. The database also includes radiologists' 'truth' markings on the locations of any abnormalities that may be present.<br />
The total number of benign and malignant images in the database is 22 and 22, respectively. A benign mammographic image is shown in Fig. 4, and its compressed version is shown in Fig. 5. Perceptually there is no difference in image quality between the original image and its compressed version. Fig. 6 illustrates a malignant image; the compressed image is shown in Fig. 7. Also in this case there is no perceptible difference between the original and the compressed images.<br />
Table 1 shows the advantage of using FBB in conjunction with Huffman coding and LZW coding. The mean compression ratio of benign and malignant images using the FBB scheme alone was approximately 1.7:1. The mean<br />
Figure 4: Benign mammogram before FBB compression<br />
Figure 5: Same benign image in Fig. 4 after FBB compression<br />
Figure 6: Malignant mammogram before FBB compression<br />
Figure 7: Same malignant image in Fig. 6 after FBB compression<br />
Table 1: Compression ratios of benign and malignant images. Legend: CR = compression ratio.<br />
compression ratio of benign and malignant images using FBB with Huffman coding was approximately 3.81:1. The mean compression ratio of benign and malignant images using FBB with LZW coding was approximately 5:1.<br />
Figure 8: Bar graph for different compression schemes ap-<br />
plied to benign images<br />
The two bar graphs illustrate the usefulness of combin-<br />
ing FBB with other standard compression schemes such<br />
as Huffman coding and LZW coding. The first bar graph<br />
in Fig. 8 is for benign images, and the second bar graph in<br />
Fig. 9 is for malignant images. The y axis in the bar graph<br />
denotes the mean file size of malignant or benign images<br />
in bytes, while the x axis denotes the scheme applied on<br />
these images (e.g. Huffman with FBB).<br />
Figure 9: Bar graph for different compression schemes applied to malignant images<br />
5 Conclusion<br />
In this paper a novel method of compressing mammographic images is proposed. The scheme is based upon FBB scanning of the pixels of an image. The FBB codes blocks of pixels within the image that contain the same value (the odds of having blocks of the same pixel values in a mammographic image are very high), thus reducing the size of the image substantially while encoding the image at the same time. The FBB compression scheme alone provided a compression ratio of 1.7:1. When Huffman coding and LZW coding were used in conjunction with the FBB compression scheme, the compression ratios were 3.8:1 for Huffman coding and 5:1 for LZW coding. The proposed FBB near-lossless compression technique seems promising for teleradiology applications. Future work involves investigation of the compression scheme for transmission of mammography data over internet protocol.<br />
Acknowledgment<br />
We would like to acknowledge the Mammographic Image<br />
<strong>Analysis</strong> Society (MIAS) for granting us permission to use<br />
their database. We would also like to acknowledge <strong>Ryerson</strong><br />
<strong>University</strong> and NSERC for providing financial support.<br />
References<br />
[1] S. A. Feig. Decreased breast cancer mortality through mammographic screening: results of clinical trials. Radiology, 167:659-665, 1988.<br />
[2] H. A. Frazer. Computerized diagnosis comes to mammography. Diagnostic Imaging, pages 91-95, June 1991.<br />
[3] Z. Yang, M. Kallergi, R. A. DeVore, B. Lucier, W. Qian, R. A. Clark, and L. P. Clarke. Effect of wavelet bases on compressing digital mammograms. IEEE Engineering in Medicine and Biology Magazine, 14(5):570-577, Sep/Oct 1995.<br />
[4] D. A. Huffman. A method for the construction of minimum redundancy codes. Proc. IRE, 40:1098-1101, 1952.<br />
[5] J. Ziv and A. Lempel. Compression of individual sequences via variable-rate coding. IEEE Trans. Information Theory, IT-24:530-536, 1978.<br />
Instantaneous mean frequency estimation using adaptive time-frequency distributions<br />
Sridhar Krishnan<br />
Dept. of Electrical and Computer Engineering,<br />
Ryerson Polytechnic University, Toronto, ON M5B 2K3, CANADA.<br />
Email: krishnan@ee.ryerson.ca<br />
Abstract: Analysis of non-stationary signals is a challenging task. True non-stationary signal analysis involves monitoring the frequency changes of the signal over time (i.e., monitoring the instantaneous frequency (IF) changes). The IF of a signal is traditionally obtained by taking the first derivative of the phase of the signal with respect to time. This poses some difficulties because the derivative of the phase of the signal may take negative values, thus misleading the interpretation of instantaneous frequency. In this paper, a novel approach to extract the IF of a signal from its adaptive time-frequency distribution is proposed. The adaptive time-frequency distribution of a signal is obtained by decomposing the signal into components with good time-frequency localization and by combining the Wigner distribution of the components. The adaptive time-frequency distribution thus obtained is free of cross-terms and is a positive time-frequency distribution with good time and frequency localization. The IF may be obtained as the first central moment of the adaptive time-frequency distribution. The proposed method of IF estimation is very powerful for applications with low SNR. The proposed technique was tested with synthetic signals of known IF dynamics, and the method successfully extracted the IF of the signals.<br />
Keywords: instantaneous frequency, non-stationary signals, positive time-frequency distributions, matching pursuit, average frequency.<br />
1 Introduction<br />
The instantaneous frequency (IF) of a signal is a param-<br />
eter of practical importance in situations such as seismic,<br />
radar, sonar, communications, and biomedical applica-<br />
tions. In all these applications the IF describes some<br />
physical phenomenon associated with them. Like most<br />
other signal processing concepts, the IF of the signal was<br />
originally used in describing FM modulation in communi-<br />
cations. In a typical radar application, the IF aids in the<br />
detection, tracking, and imaging of targets whose radial<br />
velocities change with time. When the radial velocity is<br />
not constant, the radar’s Doppler induced frequency has<br />
a nonstationary spectrum, which can be tracked by IF es-<br />
timation techniques. Also, in biomedical signal analysis,<br />
IF is used in studying the electroencephalogram (EEG)<br />
signals to monitor key neural activities of the brain.<br />
The importance of the IF concept arises from the fact<br />
that in most applications a signal processing engineer<br />
is confronted with the task of processing signals whose<br />
spectral characteristics (in particular the frequency of the<br />
spectral peaks) are varying with time. These signals are often referred to as non-stationary signals. A chirp signal is a simple example of a non-stationary signal, in which the frequency of the sinusoid changes linearly with time. It is theoretically difficult to describe the IF of a signal since most signals are multicomponent, and it is difficult to define a unique parameter for each time instant. Also, since frequency is usually defined as the number of oscillations or vibrations undergone in a unit time period, the association of the words "instantaneous" and "frequency" is still controversial.<br />
Several authors have tried to define the IF of a signal.<br />
In this paper the IF is defined by using an adaptive time-frequency distribution (TFD). The paper is organized as<br />
follows: a brief review on the topic of IF is discussed<br />
in Section 2. The proposed technique of adaptive TFD is<br />
described in Section 3. Results with synthetic signals and<br />
real world signals are discussed in Section 4. The paper<br />
is concluded in Section 5.<br />
2 Review<br />
The classical definition of the IF of a signal [1] is<br />
$$f_i(t) = \frac{1}{2\pi}\, \frac{d\phi(t)}{dt}, \quad (1)$$<br />
where $\phi(t)$ is the phase of the analytic form of the signal. Ville formulated a joint TFD of the signal energy called the Wigner-Ville distribution (WVD), and defined the IF as the first central moment (average frequency) of the WVD,<br />
$$f_i(t) = \frac{\int f\, W(t,f)\, df}{\int W(t,f)\, df}. \quad (2)$$<br />
Most Cohen's class TFDs derived from the WVD yield the IF by a correct first-moment calculation, but this is often computationally expensive and is adversely affected by noise.<br />
Most TFDs such as WVD provide high signal energy<br />
concentration in time and frequency, therefore it is tempt-<br />
ing to try to use it to measure the spread of frequencies<br />
with time. Unfortunately, the spread of the IF of the<br />
WVD is only positive for certain types of signals. Even<br />
when the spread is positive some negative distribution<br />
values may appear in the calculation, and thus its useful-<br />
ness is questionable. From the literature it appears that<br />
still there are many unresolved issues regarding the IF of<br />
the signal (A detailed review on the fundamentals of IF<br />
is available in [2]). It has been shown that the usual way<br />
of interpreting the IF as the average frequency at each<br />
time brings out unexpected results with Cohen’s class of<br />
bilinear TFDs. If the IF is interpreted as the average fre-<br />
quency then the IF need not be a frequency that appears<br />
in the spectrum of the signal. If the IF is interpreted as<br />
the derivative of the phase, then the IF can extend be-<br />
yond the spectral range of the signal. It has been recently<br />
reported that the estimation of IF of a signal using a posi-<br />
tive TFD brings out meaningful interpretation about the<br />
IF of the signal [3]. The motivation behind this paper<br />
is in adaptively constructing a positive TFD suitable for<br />
estimating the IF of a signal.<br />
3 Adaptive Time-Frequency Distributions<br />
The purpose of this paper is to explore the best available<br />
TFD for estimating the IF of a signal. For simple appli-<br />
cations, Cohen’s class TFDs or model-based TFDs may<br />
be applied. It is widely accepted that, in the case of complex signals with multiple frequency components, there is no definite TFD that will satisfy all the criteria and still<br />
give optimal performance. The purpose of this section is<br />
to construct TFDs according to the application in hand,<br />
i.e., to tailor the TFD according to the properties of the<br />
signal being analyzed. It is appropriate to call such TFDs<br />
as adaptive TFDs. In the present work, the concept of<br />
adaptive TFDs is based on signal decomposition.<br />
In practice, no TFD may satisfy all the requirements.<br />
In the method proposed in this section, by using con-<br />
straints, the TFDs are modified to satisfy certain specified<br />
criteria. It is assumed that the given signal is somehow<br />
decomposed into components of a specified mathematical<br />
representation. By knowing the components of a signal,<br />
the interaction between them can be established and used<br />
to remove or prevent cross-terms. This avoids the main drawback associated with Cohen's class TFDs; numerous efforts have been directed to develop kernels to overcome the cross-term problem [4, 5, 6].<br />
The key to successful design of adaptive TFDs lies in<br />
the selection of the decomposition algorithm. The com-<br />
ponents obtained from a decomposition algorithm depend<br />
largely on the type of basis functions used. For example,<br />
the basis function of the Fourier transform decomposes<br />
the signal into tonal (sinusoidal) components, and the<br />
basis function of the wavelet transform decomposes the<br />
signal into components with good time and scale prop-<br />
erties. For TF representation, it will be beneficial if the<br />
signal is decomposed using basis functions with good TF<br />
properties. The components obtained by decomposing<br />
a signal using basis functions with good TF properties<br />
may be termed TF atoms. An algorithm that can decompose a signal into TF atoms is the matching pursuit (MP) algorithm described in the next section.<br />
3.1 Matching Pursuit<br />
The MP algorithm decomposes the given signal using ba-<br />
sis functions that have excellent TF properties. The MP<br />
algorithm selects the decomposition vectors depending<br />
upon the signal properties. The vectors are selected from a family of waveforms called a dictionary. The signal x(t) is projected onto a dictionary of TF atoms obtained by scaling, translating, and modulating a window function g(t):<br />
$$x(t) = \sum_{n=0}^{\infty} a_n\, g_{\gamma_n}(t), \quad (3)$$<br />
where<br />
$$g_{\gamma_n}(t) = \frac{1}{\sqrt{s_n}}\, g\!\left(\frac{t - p_n}{s_n}\right) \exp[j(2\pi f_n t + \phi_n)], \quad (4)$$<br />
and $a_n$ are the expansion coefficients. The scale factor $s_n$ is used to control the width of the window function, and the parameter $p_n$ controls temporal placement. The factor $1/\sqrt{s_n}$ normalizes the norm of $g_{\gamma_n}(t)$ to 1. The parameters $f_n$ and $\phi_n$ are the frequency and phase of the exponential function, respectively. $\gamma_n$ represents the set of parameters $(s_n, p_n, f_n, \phi_n)$.<br />
In the present work, the window is a Gaussian function, i.e., $g(t) = 2^{1/4} \exp(-\pi t^2)$; the TF atoms are then called Gabor atoms, and they provide the optimal TF resolution in the TF plane.<br />
In practice, the algorithm works as follows. The signal is iteratively projected onto a Gabor function dictionary. The first projection decomposes the signal into two parts:<br />
$$x(t) = \langle x, g_{\gamma_0}\rangle\, g_{\gamma_0}(t) + R^1 x(t), \quad (5)$$<br />
where $\langle x, g_{\gamma_0}\rangle$ denotes the inner product (projection) of $x(t)$ with the first TF atom $g_{\gamma_0}(t)$. The term $R^1 x(t)$ is the residue after approximating $x(t)$ in the direction of $g_{\gamma_0}(t)$. This process is continued by projecting the residue onto the subsequent functions in the dictionary, and after $M$ iterations<br />
$$x(t) = \sum_{n=0}^{M-1} \langle R^n x, g_{\gamma_n}\rangle\, g_{\gamma_n}(t) + R^M x(t), \quad (6)$$<br />
with $R^0 x(t) = x(t)$. There are two ways of stopping the iterative process: one is to use a pre-specified limiting number $M$ of TF atoms, and the other is to check the energy of the residue $R^M x(t)$. A very high value of $M$ and a zero value for the residual energy will decompose the signal completely at the expense of increased computational complexity.<br />
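The greedy iteration of Eqs. 5 and 6 can be sketched numerically with a toy real Gabor dictionary; the scales, positions, and frequencies below are illustrative choices, not the paper's dictionary:

```python
import numpy as np

def gabor_atom(n, s, p, f):
    """Unit-norm real Gabor atom: Gaussian window of scale s centred
    at sample p, modulated to frequency f (cycles/sample)."""
    t = np.arange(n)
    g = np.exp(-np.pi * ((t - p) / s) ** 2) * np.cos(2 * np.pi * f * t)
    return g / np.linalg.norm(g)

def matching_pursuit(x, dictionary, m):
    """Greedy MP: pick the atom with the largest projection onto the
    current residue, subtract its contribution, repeat m times."""
    residue = x.astype(float).copy()
    picks = []
    for _ in range(m):
        projections = dictionary @ residue        # inner products <R^n x, g>
        best = int(np.argmax(np.abs(projections)))
        coeff = projections[best]
        residue -= coeff * dictionary[best]       # R^{n+1}x = R^n x - <R^n x,g> g
        picks.append((best, coeff))
    return picks, residue

n = 64
params = [(s, p, f) for s in (8, 16)
          for p in range(0, n, 8)
          for f in (0.1, 0.2, 0.3)]
D = np.array([gabor_atom(n, s, p, f) for s, p, f in params])
x = 3.0 * D[5] + 0.5 * D[20]      # signal built from two dictionary atoms
picks, residue = matching_pursuit(x, D, 2)
```

Each pick stores an (atom index, expansion coefficient) pair, and since each subtracted atom has unit norm the residue energy shrinks at every step, which is the stopping criterion mentioned above.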
3.2 Matching Pursuit TFD<br />
A signal decomposition-based TFD may be obtained by taking the WVD of the TF atoms in Eq. 6, and is given as [7]:<br />
$$W x(t,\omega) = \sum_{n=0}^{M-1} |\langle R^n x, g_{\gamma_n}\rangle|^2\, W_{g_{\gamma_n}}(t,\omega) + \sum_{n=0}^{M-1} \sum_{\substack{m=0 \\ m \neq n}}^{M-1} \langle R^n x, g_{\gamma_n}\rangle \langle R^m x, g_{\gamma_m}\rangle^{*}\, W_{[g_{\gamma_n}, g_{\gamma_m}]}(t,\omega), \quad (7)$$<br />
where $W_{g_{\gamma_n}}(t,\omega)$ is the WVD of the Gaussian window function. The double sum corresponds to the cross-terms of the WVD, indicated by $W_{[g_{\gamma_n}, g_{\gamma_m}]}(t,\omega)$, and should be rejected in order to obtain a cross-term-free energy distribution of $x(t)$ in the TF plane. Thus only the first term is retained, and the resulting TFD is given by<br />
$$W'(t,\omega) = \sum_{n=0}^{M-1} |\langle R^n x, g_{\gamma_n}\rangle|^2\, W_{g_{\gamma_n}}(t,\omega). \quad (8)$$<br />
This cross-term-free TFD, also known as matching pur-<br />
suit TFD (MPTFD), has very good readability and is ap-<br />
propriate for analysis of nonstationary, multicomponent<br />
signals. The extraction of coherent structures makes MP<br />
an attractive tool for TF representation of signals with<br />
unknown SNR.<br />
3.3 Minimum Cross-Entropy Optimization<br />
of the MPTFD<br />
One of the drawbacks of the MPTFD is that it does not satisfy the marginal properties. If a TFD is positive and satisfies the marginals, it may be considered to be a proper TFD for the extraction of time-varying frequency parameters such as IF. This is because positivity coupled with correct marginals ensures that the TFD is a true probability density function, and the parameters extracted are meaningful [8]. The MPTFD may be modified to satisfy the marginal requirements and still preserve its other important characteristics. One way to optimize the MPTFD is by using the cross-entropy minimization method [9, 10]. Cross-entropy minimization is a general method of inference about an unknown probability density when there exists a prior estimate of the density and new information in the form of constraints on expected values is available. If the optimized MPTFD or OMP TFD (an unknown probability density function) is denoted by $M(t,\omega)$, then it should satisfy the marginals<br />
$$\int M(t,\omega)\, d\omega = |x(t)|^2 = m(t) \quad (9)$$<br />
and<br />
$$\int M(t,\omega)\, dt = |X(\omega)|^2 = m(\omega). \quad (10)$$<br />
Eqs. 9 and 10 may be treated as constraint equations (new information) for optimization. Now, $M(t,\omega)$ may be obtained from $W'(t,\omega)$ (a prior estimate of the density) by minimizing the cross-entropy between them, given by<br />
$$H(M, W') = \iint M(t,\omega)\, \log \frac{M(t,\omega)}{W'(t,\omega)}\, dt\, d\omega. \quad (11)$$<br />
As we are interested only in the marginals, the OMP TFD may be written as [10]:<br />
$$M(t,\omega) = W'(t,\omega)\, \exp\{-(\alpha_0(t) + \beta_0(\omega))\}, \quad (12)$$<br />
where the $\alpha$'s and $\beta$'s are Lagrange multipliers which may be determined using the constraint equations. An iterative algorithm to obtain the Lagrange multipliers and solve for $M(t,\omega)$ is presented next.<br />
At the first iteration, we define<br />
$$M^1(t,\omega) = W'(t,\omega)\, \exp(-\alpha_0(t)). \quad (13)$$<br />
As the marginals are to be satisfied, the time marginal constraint has to be imposed in order to solve for $\alpha_0(t)$. By imposing the time marginal constraint given by Eq. 9 on Eq. 13, we obtain<br />
$$\exp(-\alpha_0(t)) = \frac{m(t)}{m'(t)}, \quad (14)$$<br />
where $m(t)$ is the desired time marginal and $m'(t)$ is the time marginal estimated from $W'(t,\omega)$. Now, Eq. 13 can be rewritten as<br />
$$M^1(t,\omega) = W'(t,\omega)\, \frac{m(t)}{m'(t)}. \quad (15)$$<br />
At this point, $M^1(t,\omega)$ is a modified MPTFD with the desired time marginal; however, it need not necessarily have the desired frequency marginal $m(\omega)$. In order to obtain the desired frequency marginal, the following equation has to be solved:<br />
$$M^2(t,\omega) = M^1(t,\omega)\, \exp(-\beta_0(\omega)). \quad (16)$$<br />
Note that the TFD obtained after the first iteration, $M^1(t,\omega)$, is used as the incoming estimate in Eq. 16. By imposing the frequency marginal constraint given by Eq. 10 on Eq. 16, we obtain<br />
$$\exp(-\beta_0(\omega)) = \frac{m(\omega)}{m'(\omega)}, \quad (17)$$<br />
where $m(\omega)$ is the desired frequency marginal, and $m'(\omega)$ is the frequency marginal estimated from $M^1(t,\omega)$. Now, Eq. 17 can be rewritten as<br />
$$M^2(t,\omega) = M^1(t,\omega)\, \frac{m(\omega)}{m'(\omega)}. \quad (18)$$<br />
By incorporating the desired frequency marginal constraint, the $M^2(t,\omega)$ TFD may be altered and need not necessarily give the desired time marginal. Successive iterations overcome this problem and modify the estimated TFD to get closer to $M(t,\omega)$. This follows from the fact that the cross-entropy between the desired TFD and the estimated TFD decreases with the number of iterations [10].<br />
As the iterative procedure is started with a positive distribution $W'(t,\omega)$, the TFD at the $n$th iteration, $M^n(t,\omega)$, is guaranteed to be a positive distribution. Such a class of distributions belongs to the Cohen-Posch class of positive distributions [8]. The OMP TFDs may also be taken to be adaptive TFDs because they are constructed on the basis of the properties of the signal being analyzed.<br />
A method for constructing a positive distribution using the spectrogram as a priori knowledge was developed by Loughlin et al. [11]. The major drawback of using the spectrogram as a priori knowledge is the loss of TF resolution; this effect may be minimized by taking multiple spectrograms with different sizes of analysis windows as initial estimates of the desired distribution. The method proposed in this section starts with the MPTFD, overcomes the problem of using multiple spectrograms as initial estimates, and produces a high-resolution TFD tailored to the signal properties. The OMP TFD may be used to derive higher moments by estimating the higher-order Lagrange multipliers. Such measures are not necessary in the present work, and are beyond the scope of this paper.<br />
The IF of a signal can be computed as the first moment of $\mathrm{TFD}(t,\omega)$ along each time slice, given by<br />
$$\mathrm{IF}(t) = \frac{\sum_{\omega} \omega\, \mathrm{TFD}(t,\omega)}{\sum_{\omega} \mathrm{TFD}(t,\omega)}. \quad (19)$$<br />
IF characterizes the frequency dynamics of the signal.<br />
4 Results<br />
The proposed method of extracting the IF of a signal was applied to a synthetic signal with known IF, and a real-world example of a knee joint sound signal.<br />
4.1 Synthetic Signal<br />
The synthetic signal "syn1" is composed of nonoverlapping chirp, transient, and sinusoidal FM components, and is shown in Fig. 1. The frequency behavior of the signal is shown in Fig. 2. "syn1" is an example of a monocomponent signal with linear and nonlinear frequency dynamics. To simulate noisy signal conditions, "syn1" was corrupted by adding random noise to an SNR of 10 dB.<br />
Figure 1: Monocomponent, nonstationary, synthetic signal "syn1" consisting of a chirp, an impulse, and a sinusoidal FM component (SNR = 10 dB).<br />
The MP method has given a clear picture of the IF representation: the three simulated components are perfectly localized in the TFD shown in Fig. 3. This is because the OMP TFD provides an adaptive representation of signal components, and because each high-energy component is analyzed by the TF representation independently of its bandwidth and duration. The good localization of transients produced by MP is due to the good TF localization properties of the basis functions, whereas with other techniques such as Fourier and wavelets, the transient information gets diluted across the whole basis, and the collection of basis functions is not as large as that in the MP dictionary.<br />
4.2 Real World Example<br />
The proposed technique was applied to real-world signals, viz. the knee joint sound signals. Due to the differences in<br />
Figure 2: Ideal TFD depicting the frequency laws of the signal "syn1" in Fig. 1.<br />
Figure 3: OMP TFD of the signal "syn1" in Fig. 1.<br />
the cartilage surface between normal and abnormal knees, sound signals with different IFs are produced [12]. Fig. 4 shows the knee sound signal of a normal subject. The IF of the same signal is shown in Fig. 5. Automatic classification of the sound signals using IF as a feature for pattern classification has produced good results in screening abnormal knees from normal knees.<br />
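The per-slice first moment of Eq. 19 can be sketched on a spectrogram standing in for the OMP TFD; the chirp, sampling rate, and STFT parameters below are arbitrary illustrative choices:

```python
import numpy as np

def if_from_tfd(tfd, freqs):
    """Eq. 19: IF(t) as the first moment of a positive TFD over
    frequency, one value per time slice. tfd has shape (freq, time)."""
    return (freqs[:, None] * tfd).sum(axis=0) / tfd.sum(axis=0)

# stand-in positive TFD: squared-magnitude STFT of a linear chirp
fs = 1000.0
t = np.arange(2048) / fs
x = np.cos(2 * np.pi * (50 * t + 50 * t ** 2))   # IF sweeps 50 -> ~250 Hz

win, hop = 128, 64
frames = np.array([x[i:i + win] * np.hanning(win)
                   for i in range(0, len(x) - win, hop)])
tfd = (np.abs(np.fft.rfft(frames, axis=1)) ** 2).T   # (freq, time)
freqs = np.fft.rfftfreq(win, d=1 / fs)

ifreq = if_from_tfd(tfd, freqs)   # rising IF track of the chirp
```

The estimate rises roughly linearly across frames, matching the chirp's frequency law; on a cross-term-free, positive TFD such as the OMP TFD the same moment yields a physically meaningful IF track.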
Figure 4: Knee sound signal of a normal subject.<br />
Figure 5: IF estimated from the OMP TFD of the normal<br />
knee sound signal in Fig. 4.<br />
5 Conclusion<br />
A novel method of extracting the IF of a signal is proposed in this paper. The extraction of the IF is based on constructing an adaptive TFD and extracting the IF as the first central moment of each time slice. The method was tested on synthetic signals with known IF, and the results were found to be satisfactory even for low SNR cases.
Acknowledgment

We would like to acknowledge Micronet and NSERC for providing financial support.

References

[1] J. Carson and T. Fry. Variable frequency electric circuit theory with application to the theory of frequency modulation. Bell System Technical Journal, 16:513-540, 1937.

[2] B. Boashash. Estimating and interpreting the instantaneous frequency of a signal - Part 1: Fundamentals. Proc. IEEE, 80(4):519-538, April 1992.

[3] P. J. Loughlin. Comments on the interpretation of instantaneous frequency. IEEE Signal Processing Letters, 4(5):123-125, May 1997.

[4] H. I. Choi and W. J. Williams. Improved time-frequency representation of multicomponent signals using exponential kernels. IEEE Trans. Acoustics, Speech, and Signal Processing, 37(6):862-871, 1989.

[5] Z. Guo, L. G. Durand, and H. C. Lee. The time-frequency distributions of nonstationary signals based on a Bessel kernel. IEEE Trans. Signal Processing, 42:1700-1707, 1994.

[6] R. G. Baraniuk and D. L. Jones. Signal-dependent time-frequency representation: optimal kernel design. IEEE Trans. Signal Processing, 41:1589-1602, 1993.

[7] S. G. Mallat and Z. Zhang. Matching pursuit with time-frequency dictionaries. IEEE Trans. Signal Processing, 41(12):3397-3415, 1993.

[8] L. Cohen and T. Posch. Positive time-frequency distribution functions. IEEE Trans. Acoustics, Speech, and Signal Processing, 33:31-38, 1985.

[9] J. Shore and R. Johnson. Axiomatic derivation of the principle of maximum entropy and the principle of minimum cross-entropy. IEEE Trans. Information Theory, 26(1):26-37, 1980.

[10] J. Shore and R. Johnson. Properties of cross-entropy minimization. IEEE Trans. Information Theory, 27(4):472-482, 1981.

[11] P. Loughlin, J. Pitton, and L. Atlas. Construction of positive time-frequency distributions. IEEE Trans. Signal Processing, 42:2697-2705, 1994.

[12] S. Krishnan, R.M. Rangayyan, G.D. Bell, and C.B. Frank. Adaptive time-frequency analysis of knee joint vibroarthrographic signals for non-invasive screening of articular cartilage pathology. IEEE Transactions on Biomedical Engineering, in press, 2000.
Proceedings of the 22nd Annual EMBS International Conference, July 23-28, 2000, Chicago IL.
Sonification of Knee-joint Vibration <strong>Signal</strong>s<br />
Sridhar Krishnan1, Rangaraj M. Rangayyan2,3,
G. Douglas Bell2,3,4, and Cyril B. Frank2,3,4

1Dept. of Electrical and Computer Engineering, Ryerson Polytechnic University,
Toronto, Ontario, M5B 2K3, CANADA. (Email: krishnan@ee.ryerson.ca)
2Dept. of Electrical and Computer Engineering, 3Dept. of Surgery, 4Sport Medicine Centre,
University of Calgary, Calgary, Alberta, T2N 1N4, CANADA. (Email: ranga@enel.ucalgary.ca)
Abstract: Sounds generated due to rubbing of<br />
knee-joint surfaces may be a potential tool for non-<br />
invasively assessing articular cartilage degenera-<br />
tion. In this paper, an attempt is made to per-<br />
form computer-assisted auscultation of knee joints<br />
by auditory display (AD) of the vibration sig-<br />
nals (also known as vibroarthrographic or VAG<br />
signals) emitted during active movement of the<br />
leg. The AD technique is based on a sonifica-<br />
tion algorithm, in which the instantaneous mean<br />
frequency and envelope of the VAG signals were<br />
used in improving the potential diagnostic qual-<br />
ity of VAG signals. Auditory classification experi-<br />
ments were performed by two orthopedic surgeons<br />
with a database of 37 VAG signals that includes<br />
19 normal and 18 abnormal cases. Sensitivities of<br />
31% and 83% were obtained with direct playback<br />
and the sonification method, respectively.<br />
1 Introduction<br />
Auscultation, the method of examining functions and con-<br />
ditions of physiological systems by listening to the sounds<br />
they produce, is one of the ancient modes of diagnosis.<br />
The first use of vibration or acoustic emission as a diag-<br />
nostic aid for bone and joint disease is found in Laennec's<br />
treatise on mediate auscultation, as cited by Mollan et al. [1]. Laennec was able to diagnose fractures by auscultating crepitus caused by the moving broken ends of bone. As quoted by Mollan et al. [1], Heuter, in 1885,
used a myodermato-osteophone (a type of stethoscope) to<br />
localize loose bodies in human knee joints. In 1929, Wal-<br />
ters reported on auscultation of 1600 joints and detected<br />
certain sounds before symptoms become apparent [2]; he<br />
suggested that the sounds might be an early sign of arthri-<br />
tis.<br />
0-7803-6465-1/00/$10.00 ©2000 IEEE

After 1933, most of the works reported on knee-joint sounds have been on objective analysis of the sound or vibration signals (also known as vibroarthrographic or VAG signals) for non-invasive diagnosis of knee-joint pathology [3, 4, 5, 6, 7]. Although auscultation of knee joints using
stethoscopes is occasionally practised by clinicians, there<br />
is no published evidence of their diagnostic value. Also,<br />
no study has been reported on computer-aided ausculta-<br />
tion of knee-joint sounds. This paper proposes methods for<br />
computer-aided auscultation of knee-joint sounds based on<br />
an auditory display (AD) technique.<br />
2 Data Acquisition<br />
Each subject sat on a rigid table in a relaxed position<br />
with the leg being tested freely suspended in air. The<br />
VAG signal was detected at the mid-patella position of<br />
the knee by using vibration sensors (accelerometers) as the<br />
subject swung the leg over an approximate angle range of<br />
135° → 0° → 135° in 4 s. Informed consent was obtained
from every subject. The experimental protocol has been<br />
approved by the Conjoint Health <strong>Research</strong> Ethics Board<br />
of the <strong>University</strong> of Calgary.<br />
The VAG signal was prefiltered (10 Hz to 1 kHz) and<br />
amplified before digitizing at a sampling rate of 2 kHz.<br />
The details of data acquisition may be found in Krish-<br />
nan et al. [7]. The database consists of 37 signals (19<br />
normal and 18 abnormal). The abnormal signals were col-<br />
lected from symptomatic patients scheduled to undergo<br />
arthroscopy, and there was no restriction imposed on the<br />
type of pathology.<br />
3 Sonification<br />
AD may be defined as aural representation of a stream<br />
of data. The field of AD is emerging, and has recently<br />
drawn attention in the areas of geophysics, biomedical en-<br />
gineering, speech signal analysis, image analysis, aids for<br />
the handicapped, and computer graphics [8]. AD has to<br />
be performed in such a manner as to take advantage of<br />
the psychoacoustics of the human ear. The AD technique<br />
considered in the present work is a sonification technique.<br />
In sonification, features extracted from the data are used<br />
to control a sound synthesizer. The sound signal gener-<br />
ated does not bear a direct relationship to the data being<br />
analyzed. A simple example of a sonification technique is<br />
mapping of parameters derived from a data stream to AD<br />
parameters such as pitch and loudness.<br />
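Such a parameter mapping can be sketched as follows. This is an illustrative mapping of our own devising (not the paper's algorithm): each data sample controls the pitch of a short tone segment, and its magnitude controls the loudness.

```python
import numpy as np

def sonify_pitch_loudness(data, fs=8000, seg_dur=0.05,
                          f_lo=200.0, f_hi=2000.0):
    """Map each sample of a data stream to a short tone: the normalized
    sample value sets the tone's pitch and its loudness."""
    data = np.asarray(data, dtype=float)
    lo, hi = data.min(), data.max()
    norm = (data - lo) / (hi - lo + 1e-12)          # scale values to 0..1
    n = int(fs * seg_dur)                           # samples per tone
    t = np.arange(n) / fs
    segments = []
    for v in norm:
        freq = f_lo + v * (f_hi - f_lo)             # value -> pitch
        amp = 0.2 + 0.8 * v                         # value -> loudness
        segments.append(amp * np.sin(2 * np.pi * freq * t))
    return np.concatenate(segments)

# 40 data points become 40 consecutive tones of 50 ms each
audio = sonify_pitch_loudness(np.sin(np.linspace(0, 6, 40)))
```

The resulting array could be written to a WAV file or played through any audio device; the frequency range and segment duration here are arbitrary choices.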
4 Motivation for AD of VAG<br />
Prior to graphical recording and analysis of VAG sig-<br />
nals, auscultation of knee joints was the only noninva-<br />
sive method available to distinguish normal joints from<br />
degenerative joints. Significant success has been claimed<br />
by several researchers using the auscultation technique [1].
However, classification of knee joints by auscultation is a<br />
highly subjective technique. Further, a significant portion<br />
of VAG signal energy lies below the threshold of auditory<br />
perception of the human ear in terms of frequency and/or<br />
intensity. The situation may be ameliorated by developing<br />
AD methods for computer-aided auscultation of knee-joint<br />
vibrations. The main motivating factors in applying AD<br />
techniques to VAG are:<br />
• It has been established through objective signal analysis of VAG that sounds generated by abnormal knees are distinctive and different from those produced by normal knees [3, 4, 5, 6, 9]. Sounds of diagnostic value may be made prominent by applying suitable AD techniques to VAG.
• AD of VAG obtained using vibration sensors may facilitate relatively noise-free and localized auditory analysis when compared to direct auditory analysis of the acoustic sensor data.
The work described in this paper hypothesizes that<br />
auditory analysis of VAG data may aid an orthopedic sur-<br />
geon in making diagnostic inferences. In the next section,<br />
a sonification technique for AD of VAG data is developed.<br />
This study is the first attempt to listen to knee sounds<br />
detected by vibration sensors.<br />
5 Sonification of VAG <strong>Signal</strong>s<br />
In an effort to facilitate AD of only the important charac-<br />
teristics of VAG signals, a sonification algorithm is pro-<br />
posed. The sonification algorithm involves amplitude<br />
modulation (AM) and frequency modulation (FM). The<br />
instantaneous mean frequency (IMF) FP(t) is an impor-<br />
tant parameter in characterizing multicomponent nonsta-<br />
tionary signals such as VAG [7]. The IMF of a signal could<br />
be extracted from a positive time-frequency distribution<br />
(TFD) of the signal [7]. The FM part of the sonified signal<br />
is obtained by frequency modulating a sinusoidal waveform<br />
with the IMF of the signal. The auditory characteristics
of the FM part alone will be tonal, which could quickly<br />
Figure 1: An abnormal VAG signal of a patient with chondromalacia patella.
Figure 2: Spectrogram of the VAG signal in Fig. 1.<br />
cause boredom and fatigue. To obviate this problem, an AM part a(t) is obtained as the absolute value of the analytic version of the VAG signal. The AM part provides an envelope to the signal and contributes to the frequency deviation (bandwidth) about the IMF.
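The AM-FM construction described above can be sketched as follows. The pure-NumPy envelope computation and the illustrative IMF are our assumptions, not details from the paper:

```python
import numpy as np

def analytic_envelope(x):
    """Magnitude of the analytic signal, computed in the frequency domain
    (standard FFT-based Hilbert construction)."""
    n = len(x)
    spec = np.fft.fft(x)
    h = np.zeros(n)
    h[0] = 1.0
    if n % 2 == 0:
        h[n // 2] = 1.0
        h[1:n // 2] = 2.0
    else:
        h[1:(n + 1) // 2] = 2.0
    return np.abs(np.fft.ifft(spec * h))

def sonify_am_fm(vag, imf_hz, fs=2000, time_scale=2):
    """AM part: the signal envelope a(t); FM part: a sinusoid whose phase
    is the running integral of the IMF, so no phase unwrapping is needed.
    time_scale expands the playback duration (the paper uses 2)."""
    env = np.repeat(analytic_envelope(np.asarray(vag, float)), time_scale)
    imf = np.repeat(np.asarray(imf_hz, float), time_scale)
    phase = 2.0 * np.pi * np.cumsum(imf) / fs   # integrate the IMF
    return env * np.sin(phase)

fs, n = 2000, 1000
vag = np.random.randn(n)                # stand-in for a denoised VAG signal
imf = np.linspace(100.0, 300.0, n)      # an illustrative rising IMF, in Hz
sonified = sonify_am_fm(vag, imf, fs=fs)
```

Because the phase is accumulated by a cumulative sum, a noisy IMF estimate only perturbs the phase increments, matching the insensitivity argument made later in the paper.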
For the sake of illustration, plots of an abnormal VAG<br />
signal with chondromalacia patella (a type of cartilage<br />
pathology), and the processed versions of the signal are<br />
presented. Fig. 1 shows the original VAG; the spectro-<br />
gram (a joint time-frequency representation) of the signal<br />
is shown in Fig. 2. The related entities of the sonified<br />
versions of the signal are shown in Figs. 3 to 5. The en-<br />
velope and the IMF of the signal are shown in Figs. 3 and<br />
4, respectively. The spectrogram shown in Fig. 5 clearly<br />
illustrates the envelope-IMF behavior of the sonified signal
with a time-scale factor of two.<br />
The advantages of the IMF-based sonification method<br />
Authorized licensed use limited to: <strong>Ryerson</strong> <strong>University</strong> Library. Downloaded on July 6, 2009 at 16:59 from IEEE Xplore. Restrictions apply.
Figure 3: Envelope of the VAG signal in Fig. 1.<br />
Figure 4: IMF of the VAG signal in Fig. 1 estimated using<br />
its positive TFD.<br />
Figure 5: Spectrogram of the IMF-based sonified version<br />
of the VAG signal in Fig. 1. A time-scale factor of 2 was<br />
used. Note that the figure window has been divided into<br />
two parts to show the time-scale expansion.<br />
are:<br />
• It helps in auditory analysis of a multicomponent nonstationary signal in terms of its main features such as FP(t) and a(t).

• FP(t) takes high values for transients and noise. However, by making use of the envelope (intensity) information, noise can be made less audible as compared to transients.

• Integration of FP(t) ensures a continuous phase, and the method does not require any phase unwrapping.

• Integration of FP(t) makes the method insensitive to noisy FP(t) estimates.

The IMF-based method has the following disadvantages:

• In the case of a noisy signal, FP(t) will have an almost uniform waveform, and does not provide much information unless the envelope can contribute some information. In the present study, this problem is overcome by processing denoised versions of the VAG signals [10].

• It is obvious that the method may not be applicable to information-rich signals such as speech: the formant structure of voiced speech cannot be represented by the relatively simple IMF.
6 Experiments and Results<br />
Auditory analysis of VAG signals was performed by two<br />
orthopedic surgeons (GDB and CBF) with significant ex-<br />
perience in knee-joint auscultation and arthroscopy. The<br />
experiment was conducted in two stages: In the first stage,
familiarization and training were provided through the re-<br />
sults of application of the IMF-based sonification methods<br />
to a speech signal and four VAG signals (two normals and<br />
two abnormals). In the second stage, the methods were<br />
tested with the database of 37 VAG signals.<br />
From the initial evaluation (first stage), GDB selected<br />
the two-times time-scaled IMF-based sonification method<br />
for the test (second) stage. The purpose of the classifi-<br />
cation experiment in the test stage was to determine the<br />
diagnostic improvement provided by the processed sounds<br />
when compared to direct playback. The test stage in-<br />
cluded auditory classification experiments performed with<br />
the same database three times: twice by GDB with a time<br />
gap of 45 days between the repeat experiments and once<br />
by CBF. The direct playback of VAG signals provided a<br />
sensitivity of 31% and a specificity of 74%, whereas aural analysis of the sonified signals provided a sensitivity of
83% and a specificity of 32%.<br />
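Sensitivity and specificity follow directly from the confusion counts. The counts below are back-calculated from the reported percentages (15 of the 18 abnormal cases detected, 6 of the 19 normal cases correctly rejected) and are illustrative only:

```python
def sensitivity_specificity(tp, fn, tn, fp):
    """Sensitivity = TP/(TP+FN); specificity = TN/(TN+FP)."""
    return tp / (tp + fn), tn / (tn + fp)

# Sonified playback: 15/18 abnormals detected, 6/19 normals rejected
sens, spec = sensitivity_specificity(tp=15, fn=3, tn=6, fp=13)
# sens ~= 0.83, spec ~= 0.32, matching the reported 83% and 32%
```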
The results suggest that computer-aided auscultation<br />
of VAG signals may be a potential tool for improved diag-<br />
nosis of knee-joint cartilage pathology. The specificity and<br />
sensitivity may be increased with more auditory training.<br />
Acknowledgements<br />
We gratefully acknowledge support from the Alberta Heritage Foundation for Medical Research and the Faculty of Engineering, Ryerson Polytechnic University.
References<br />
[1] R. A. B. Mollan, G. C. McCullagh, and R. I. Wilson. A critical appraisal of auscultation of human joints. Clinical Orthopaedics and Related Research, 170:231-237, 1982.

[2] C. F. Walters. The value of joint auscultation. Lancet, 1:920-921, 1929.

[3] M. L. Chu, I. A. Gradisar, and R. Mostardi. A noninvasive electroacoustical evaluation technique of cartilage damage in pathological knee joints. Medical and Biological Engineering and Computing, 16:437-442, 1978.

[4] Y. Nagata. Joint-sounds in gonoarthrosis - clinical application of phonoarthrography for the knees. Journal of UOEH, 10(1):47-58, 1988.

[5] N. P. Reddy, B. M. Rothschild, M. Mandal, V. Gupta, and S. Suryanarayanan. Noninvasive acceleration measurements to characterize knee arthritis and chondromalacia. Annals of Biomedical Engineering, 23:78-84, 1995.

[6] R. M. Rangayyan, S. Krishnan, G. D. Bell, C. B. Frank, and K. O. Ladly. Parametric representation and screening of knee joint vibroarthrographic signals. IEEE Trans. Biomedical Engineering, 44(11):1068-1074, Nov. 1997.

[7] S. Krishnan. Adaptive signal processing techniques for analysis of knee joint vibroarthrographic signals. Ph.D. dissertation, University of Calgary, Calgary, AB, Canada, June 1999.

[8] G. Kramer. An introduction to auditory display. In G. Kramer, editor, Auditory Display: Sonification, Audification, and Auditory Interfaces, pages 1-78. Addison Wesley, Reading, MA, 1994.

[9] S. Krishnan, R.M. Rangayyan, G.D. Bell, and C.B. Frank. Adaptive time-frequency analysis of knee joint vibroarthrographic signals for non-invasive screening of articular cartilage pathology. IEEE Transactions on Biomedical Engineering, in press, 2000.

[10] S. Krishnan and R.M. Rangayyan. Automatic denoising of knee joint vibration signals using adaptive time-frequency representations. Medical and Biological Engineering and Computing, in press, 2000.
Proceedings of the 1999 IEEE Canadian Conference on Electrical and Computer Engineering,
Shaw Conference Center, Edmonton, Alberta, Canada, May 9-12, 1999
Denoising Knee Joint Vibration <strong>Signal</strong>s Using Adaptive<br />
Time-Frequency Representations<br />
Sridhar Krishnan and Rangaraj M. Rangayyan<br />
Dept. of Electrical and Computer Engineering, University of Calgary,
2500 University Drive NW, Calgary, Alberta T2N 1N4, CANADA.
Email: {krishnan, ranga}@enel.ucalgary.ca
Abstract - A novel denoising method for improv-
ing the signal-to-noise ratio (SNR) of knee joint vibra-<br />
tion signals (also known as vibroarthrographic or VAG<br />
signals) is proposed. The denoising methods consid-<br />
ered are based on signal decomposition techniques such<br />
as wavelets, wavelet packets, and the matching pur-<br />
suit method. Performance evaluation with synthetic<br />
signals simulated with characteristics expected of VAG<br />
signals indicated good denoising results with the match-<br />
ing pursuit method. Nonstationary signal features ex-<br />
tracted and identified from time-frequency distributions<br />
of denoised VAG signals have shown good potential in<br />
screening for articular cartilage pathology.<br />
Keywords: denoising, time-frequency distributions,<br />
matching pursuit, knee joint sounds, vibroarthrography.<br />
I. INTRODUCTION<br />
Vibration signals sensed using an accelerometer at<br />
the mid-patella position of the knee joint during normal<br />
leg movement could be used to develop a non-invasive<br />
tool for monitoring and screening of articular cartilage<br />
pathology. The knee joint vibration signals are referred<br />
to as vibroarthrographic or VAG signals.<br />
VAG signals have the following important charac-<br />
teristics:<br />
• They are nonstationary and multicomponent in nature.

• Although the accelerometer placed at the mid-patella position has excellent immunity to background noise, random noise is expected to combine with the VAG signal during leg movement and data acquisition.

• There is no underlying model available as yet for VAG signal generation from which the signal-to-noise ratio (SNR) could be determined a priori.
In order to analyze VAG signals and to extract discriminant features, nonstationary and multicomponent signal analysis tools such as time-frequency distributions (TFDs) could be used. TFDs give the signal energy distribution at different time instants and frequencies. The features extracted from a TFD will contain the combined time-frequency (TF) dynamics of the given signal as opposed to features along either the
0-7803-5579-2/99/$10.00 © 1999 IEEE
time or the frequency axis alone as provided by con-<br />
ventional techniques. However, TFD features may be<br />
biased due to the presence of random noise. Because<br />
of random behavior and wide-frequency range, a noise<br />
process is localizable neither in time nor in frequency,<br />
and appears all over the TF plane.<br />
Filtering of noise from VAG signals may help in<br />
extracting and identifying significant TF features use-<br />
ful in screening applications. In circumstances where<br />
the SNR of a signal is not known a priori, optimal lin-<br />
ear filtering techniques such as Wiener filtering may<br />
not be the best solution. In such cases, approaches<br />
based on signal decomposition using orthogonal or<br />
non-orthogonal bases may be an interesting alterna-<br />
tive. This paper is a first attempt to automatically<br />
denoise VAG signals using signal decomposition. The<br />
commonly used denoising methods such as wavelets and
wavelet packets are compared with an adaptive TF de-<br />
composition method such as matching pursuit with a<br />
Gabor dictionary.<br />
In Section II the methodology is described. Section III presents the results and discussion on the performance of the denoising methods studied with synthetic
and real VAG signals. The paper is concluded in Sec-<br />
tion IV with a brief summary.<br />
II. METHODS
The Wiener filter is an optimal filter for removing<br />
Gaussian random noise provided the noise statistics are<br />
available a priori. In real-world situations, signals ac-<br />
quired from an unknown system may have an unknown<br />
SNR. In cases where the SNR is not known a priori, sig-<br />
nal decomposition using an appropriate basis may help<br />
in extracting the coherent structures of a signal with<br />
respect to the basis dictionary. In the following sec-<br />
tions, methods for linear and nonlinear approximation<br />
or decomposition of signals are briefly described.<br />
A. Linear Approximation<br />
In linear approximation, the given signal is pro-<br />
jected over M orthogonal basis vectors that are chosen<br />
a priori. Linear approximation of a discrete signal x(n)
may be written as

x_M(n) = Σ_{m=0}^{M−1} ⟨x, g_m⟩ g_m,   (1)

where ⟨x, g_m⟩ denotes the inner product of x(n) with the orthogonal basis vectors g_m that are selected a priori. It has been shown that an optimal linear approximation is provided by the Karhunen-Loève basis [1]. The approximation may be improved by choosing the M orthogonal basis vectors depending on the properties of the given signal rather than selecting them beforehand. The selection of signal-adaptive basis functions leads to the concept of nonlinear decomposition.
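Eq. (1) amounts to keeping a fixed set of M inner products with an orthonormal basis. A minimal NumPy sketch (the names and the random test basis are our own illustration):

```python
import numpy as np

def linear_approximation(x, basis, M):
    """Project x onto the first M vectors of a fixed orthonormal basis,
    as in Eq. (1): x_M = sum_{m<M} <x, g_m> g_m.

    basis : 2-D array whose rows g_m are orthonormal
    """
    coeffs = basis[:M] @ x       # inner products <x, g_m>
    return basis[:M].T @ coeffs  # reconstruction from the M kept terms

# Keeping all N vectors of a full orthonormal basis reproduces the signal
N = 8
basis = np.linalg.qr(np.random.randn(N, N))[0].T  # random orthonormal rows
x = np.random.randn(N)
x_full = linear_approximation(x, basis, N)        # equals x
```

With M < N the same call gives a fixed-subspace approximation, which is exactly what the nonlinear methods below improve upon by choosing the vectors adaptively.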
B. Nonlinear Approximation

In the case of nonlinear approximation, the given signal is approximated with M vectors selected adaptively. The nonlinear decomposition of a signal x(n) may be written as

x_M(n) = Σ_{m ∈ I_M} ⟨x, g_m⟩ g_m,   (2)

where I_M denotes a group of basis functions from a dictionary that provides the first M inner product values ⟨x, g_m⟩ arranged in decreasing order. The M vectors in I_M are the basis vectors that correlate best with x(n), and may be interpreted as the main features of x(n). One such possible approximation is the wavelet transform, where the basis vectors are obtained by dilating and translating a prototype function (also known as a wavelet), given by

ψ_{s,u}(t) = (1/√s) ψ((t − u)/s),   (3)

where s denotes the dilation parameter and u is the translation parameter. Nonlinear decomposition based
on wavelets outperforms linear decomposition because<br />
the former is equivalent to the construction of an irreg-<br />
ular sampling grid adapted to the local sharpness of the<br />
signal variations. Efficient denoising may be performed<br />
using wavelets by approximating the signal with a small<br />
number of non-zero wavelet coefficients; thresholding of<br />
the wavelet coefficients may be hard or soft [2].<br />
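The two thresholding rules can be sketched as follows; this is a generic illustration of hard and soft thresholding, not the specific threshold values used in the paper:

```python
import numpy as np

def soft_threshold(coeffs, thr):
    """Soft thresholding: shrink coefficients toward zero by thr;
    coefficients whose magnitude is below thr are discarded."""
    return np.sign(coeffs) * np.maximum(np.abs(coeffs) - thr, 0.0)

def hard_threshold(coeffs, thr):
    """Hard thresholding: keep large coefficients unchanged, zero the rest."""
    return np.where(np.abs(coeffs) >= thr, coeffs, 0.0)

c = np.array([-3.0, -0.4, 0.1, 0.9, 2.5])
soft = soft_threshold(c, 1.0)   # -> [-2.0, 0.0, 0.0, 0.0, 1.5]
hard = hard_threshold(c, 1.0)   # -> [-3.0, 0.0, 0.0, 0.0, 2.5]
```

Denoising then consists of thresholding the wavelet (or WP) coefficients and inverting the transform on what remains.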
To further optimize nonlinear signal approxima-<br />
tion, one could adaptively choose the basis depending<br />
upon the given signal. This approach of selecting the<br />
“best” basis among a dictionary of bases by minimiz-<br />
ing a cost function or entropy is known as the method<br />
of wavelet packets (WP) [3]. The WP approach uses
a large family of orthogonal bases that include differ-<br />
ent types of local TF functions (also known as TF<br />
atoms). The bases are computed using a quadrature<br />
mirror filter-bank algorithm. WP decomposes the sig-<br />
nal into TF atoms that are adapted to the TF structures<br />
present in the signal. A denoised version of a signal may<br />
be obtained by soft thresholding or hard thresholding<br />
the WP coefficients.<br />
Another way to optimize a TF decomposition is<br />
by using non-orthogonal basis functions. An example<br />
of such a decomposition is the matching pursuit (MP)<br />
algorithm [4]. In this case, the non-orthogonal basis<br />
functions are Gaussian functions with good time and<br />
frequency localization characteristics. In MP, the signal<br />
is first projected onto the dictionary, and the Gabor<br />
TF atom with the highest correlation with the signal<br />
is selected. The residue of the signal is then projected<br />
onto the dictionary, and the component with the highest<br />
correlation is selected. The decay parameter, denoted by λ(m) and defined as

λ(m) = ‖R^m x‖² / ‖x‖²,   (4)

may be used as the stopping criterion of the decomposition process. In Eq. (4), ‖R^m x‖² denotes the residual energy at the mth iteration. The decomposition is continued until the decay parameter does not reduce any further. At this stage, the selected components represent coherent structures and the residue represents incoherent structures in the signal with respect to the dictionary. The residue may be assumed to be due to random noise, since it does not show any TF localization.
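A minimal matching pursuit over a dictionary of unit-norm atoms, stopped when the residual-energy decay flattens, might look like the sketch below. The tolerance, iteration cap, and the orthonormal test dictionary are our assumptions for illustration:

```python
import numpy as np

def matching_pursuit(x, dictionary, min_improvement=1e-3, max_iter=50):
    """Greedy MP over a (possibly non-orthogonal) dictionary whose rows
    are unit-norm atoms. Iteration stops when the drop in residual energy
    levels off, mirroring the decay-parameter criterion in the text.
    Returns the coherent approximation and the final residue."""
    residue = np.asarray(x, float).copy()
    approx = np.zeros_like(residue)
    prev_energy = residue @ residue
    for _ in range(max_iter):
        corr = dictionary @ residue          # correlate residue with atoms
        k = np.argmax(np.abs(corr))          # best-matching atom
        approx += corr[k] * dictionary[k]
        residue -= corr[k] * dictionary[k]
        energy = residue @ residue
        if prev_energy - energy < min_improvement * (x @ x):
            break                            # decay has flattened out
        prev_energy = energy
    return approx, residue

# A signal built from two atoms of an orthonormal dictionary is recovered
dictionary = np.eye(16)
x = 3.0 * dictionary[2] + 1.5 * dictionary[7]
approx, residue = matching_pursuit(x, dictionary)
```

In the paper's setting the dictionary rows would be discretized Gabor atoms rather than the identity matrix used here for the check.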
III. RESULTS AND DISCUSSION
Before denoising VAG signals for feature extrac-<br />
tion, the best available denoising method was selected<br />
on the basis of performance with synthetic signals sim-<br />
ulated with characteristics similar to those expected of<br />
VAG signals. The synthetic signal used for illustra-<br />
tion in the present paper includes a linear frequency-<br />
modulated (FM) component, a nonlinear FM component, and a transient. The synthetic signal possesses the multicomponent and nonstationary characteristics typical of VAG signals. The reason to use FM components in synthetic signals in the present study is that dominant pole analysis of VAG signals has indicated time-varying frequency characteristics [5]. Transients
may depict joint clicks produced during movement of<br />
the knee. Random noise at different levels was added<br />
to the synthetic signal to simulate good and poor signal<br />
recording conditions.<br />
To evaluate the performance of the denoising methods chosen for the present study, the normalized root mean squared (NRMS) error measure was used. NRMS is given by

NRMS = √( Σ_{n=1}^{N} [s(n) − d(n)]² / Σ_{n=1}^{N} s²(n) ),
Fig. 1. Multicomponent, nonstationary, synthetic signal composed of a linear FM component, a nonlinear FM component, and a transient.
Fig. 2. The synthetic signal in Fig. 1 with noise added (SNR =<br />
0 dB).<br />
where s(n) is the original signal without noise, d(n) is<br />
the denoised signal, and N is the number of samples<br />
in the signal. A small NRMS measure indicates good<br />
denoising performance.<br />
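The NRMS measure is direct to compute; normalizing by the energy of the clean signal s(n) is our reading of the definition above:

```python
import numpy as np

def nrms_error(s, d):
    """Normalized RMS error between the original signal s(n) and the
    denoised signal d(n), normalized by the energy of s(n)."""
    s = np.asarray(s, float)
    d = np.asarray(d, float)
    return np.sqrt(np.sum((s - d) ** 2) / np.sum(s ** 2))

s = np.array([1.0, -2.0, 3.0])
err_perfect = nrms_error(s, s)             # 0.0 for perfect denoising
err_zero = nrms_error(s, np.zeros(3))      # 1.0 when everything is removed
```

A value of 0 means perfect recovery, while discarding the whole signal gives 1, so smaller values indicate better denoising, as stated in the text.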
A. Results with Synthetic <strong>Signal</strong>s<br />
The denoising methods were applied to the syn-<br />
thetic signal with two levels of Gaussian random noise<br />
added. The noise levels were such that the resulting<br />
signals had an SNR of 10 dB and 0 dB. The symmlet 4<br />
wavelet [1] was used for wavelet-based denoising. A soft
Fig. 3. Wavelet-based denoised version of the noisy signal in Fig. 2.

Fig. 4. Wavelet packet-based denoised version of the noisy signal in Fig. 2.

Fig. 5. Matching pursuit-based denoised version of the noisy signal in Fig. 2.
Fig. 6. Comparison of the NRMS error values of the denoised<br />
versions of the synthetic signal with SNR = 10 dB.<br />
threshold was applied to the wavelet coefficients; coefficients that did not pass the soft threshold test were discarded. In the case of the WP method, the "best" basis was selected on the basis of the Schur concavity cost function [3], and the denoised version was obtained by soft thresholding the WP coefficients. Gaussian functions were used for the MP method. Gaussian functions
provide the optimal TF resolution and satisfy the equal-<br />
ity criteria of the uncertainty principle. The threshold<br />
for denoising was based on the decay parameter as given<br />
by Eq. 4.<br />
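The soft-threshold rule applied to the wavelet and WP coefficients above can be stated compactly. This is a minimal NumPy sketch of the generic soft-thresholding operation (coefficients below the threshold are discarded, the rest shrunk toward zero); the function name and threshold value are illustrative, not the paper's.<br />

```python
import numpy as np

def soft_threshold(coeffs, t):
    """Soft-threshold transform coefficients: magnitudes below t are set
    to zero, and the surviving coefficients are shrunk toward zero by t."""
    c = np.asarray(coeffs, dtype=float)
    return np.sign(c) * np.maximum(np.abs(c) - t, 0.0)
```

In a wavelet denoiser, this rule is applied to the detail coefficients before the inverse transform reconstructs the signal.<br />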
Fig. 1 shows the original synthetic signal, and<br />
Fig. 2 shows the signal with noise added to an SNR<br />
of 0 dB. The denoised versions of the signals using the<br />
wavelets method, the WP method, and the MP method<br />
are shown in Figs. 3, 4, and 5 respectively. Visual com-<br />
parison indicates that the MP-denoised result has pre-<br />
served most of the important characteristics, especially<br />
the transient component.<br />
Fig. 6 shows a bar graph comparing the NRMS<br />
values of the results of the three denoising methods ap-<br />
plied to the synthetic signal with SNR of 10 dB. From<br />
Fig. 6 it is evident that adaptive denoising using<br />
MP provides good denoising for a moderately high SNR<br />
case. The case of the low SNR of 0 dB was simulated<br />
to depict very poor signal recording conditions (not ex-<br />
pected in VAG signals). From the bar chart in Fig. 7 we<br />
can deduce that the MP technique has again provided<br />
the best denoising result (lowest NRMS value) of the<br />
three methods studied.<br />
It is worthwhile to mention that the denoising re-<br />
sults with wavelets and WP are highly dependent on<br />
the selection of the threshold value for the coefficients.<br />
In the case of MP, the decay parameter is a more ob-<br />
jective measure. Fig. 8 shows the reduction of the de-<br />
cay parameter with the number of TF atoms used for<br />
Fig. 7. Comparison of the NRMS error values of the denoised<br />
versions of the synthetic signal with SNR = 0 dB.<br />
Fig. 8. Plot of the decay parameter versus the number of TF<br />
atoms for the synthetic signal with SNR = 10 dB and SNR<br />
= 0 dB.<br />
the synthetic signal with SNR = 10 dB and SNR = 0<br />
dB. It is clearly evident that, in denoising the signal<br />
with SNR = 0 dB, the MP method has been able to<br />
extract fewer coherent structures as compared to the<br />
10 dB case. This result indicates that the higher level<br />
of noise has destroyed some of the low-energy coherent<br />
structures in the 0 dB version of the signal.<br />
The WP method may give better results if the<br />
threshold is selected in an optimal manner. The per-<br />
formance of the WP method for denoising cannot be<br />
appreciated much in the present application, since for<br />
highly nonstationary signals such as the synthetic sig-<br />
nals shown, the WP method produces a mismatch be-<br />
tween the “best” orthogonal basis and many local signal<br />
components. On the contrary, MP is a “greedy” algo-<br />
rithm that locally optimizes the choice of the wavelet<br />
packet function for the signal residue at each stage. The<br />
Fig. 9. Abnormal VAG signal of a subject with cartilage pathol-<br />
ogy.<br />
Fig. 10. MP-denoised version of the VAG signal in Fig. 9.<br />
Fig. 11. Difference between the original VAG signal in Fig. 9 and<br />
the MP-denoised version in Fig. 10.<br />
Fig. 12. TFD of the abnormal VAG signal in Fig. 9 computed<br />
using the spectrogram.<br />
Fig. 13. TFD of the MP-denoised VAG signal in Fig. 10 com-<br />
puted using the spectrogram.<br />
good optimization property of MP is achieved at the ex-<br />
pense of increased computational load as a result of the<br />
greedy approach. Also, in the case of a multicompo-<br />
nent signal where different types of energy structures<br />
are located at different times but in the same frequency<br />
interval, there is no WP basis that is well adapted to all<br />
of them. WP-based decomposition using an orthogonal<br />
basis lacks translation invariance and is thus difficult<br />
to use for pattern recognition. MP is a translation-<br />
invariant method if a translation-invariant dictionary<br />
such as a Gabor dictionary is used. Based on these ob-<br />
servations, the MP technique was selected for denoising<br />
VAG signals.<br />
B. Results with VAG signals<br />
The MP technique was applied to 90 VAG signals<br />
(51 normal and 39 abnormal). TFDs were constructed<br />
Authorized licensed use limited to: <strong>Ryerson</strong> <strong>University</strong> Library. Downloaded on July 6, 2009 at 16:27 from IEEE Xplore. Restrictions apply.
using the denoised signals. As an illustration, an abnormal<br />
VAG signal of a subject with cartilage pathology<br />
is shown in Fig. 9, its MP-denoised version is shown in<br />
Fig. 10, and the difference between the original and the<br />
denoised versions of the abnormal VAG signal is shown<br />
in Fig. 11. From Fig. 11, we can observe that a signif-<br />
icant amount of random noise has been removed from<br />
the original signal by the MP-denoising method.<br />
The TFD computed using the spectrogram of the<br />
original signal in Fig. 9 is shown in Fig. 12. The TFD of<br />
the MP-denoised version of the same VAG signal com-<br />
puted using the spectrogram is shown in Fig. 13. The<br />
spectrograms of the original and the denoised VAG sig-<br />
nals were computed using the same short-time Fourier<br />
transform parameters. Tonal and FM components are<br />
clearly seen in the TFD of the denoised VAG signal,<br />
thus facilitating enhanced feature identification. In-<br />
stantaneous features based on energy and frequency pa-<br />
rameters were computed as marginal values of the TFDs<br />
[6] of the MP-denoised VAG signals; pattern analysis of<br />
the features indicated screening accuracy of up to 70%.<br />
IV. CONCLUSION<br />
A novel approach to denoise VAG signals for en-<br />
hanced feature extraction and identification was pro-<br />
posed. The denoising methods considered were based<br />
on nonlinear decomposition of signals. The MP method<br />
of denoising is more promising for application to non-<br />
stationary signals such as VAG than the commonly-<br />
used wavelet-based denoising and WP-based denoising<br />
techniques. The wavelet techniques are best adapted<br />
to global signal properties, whereas the MP method is<br />
based on local optimization. Nonstationary signal fea-<br />
tures extracted from the TFDs of MP-denoised VAG<br />
signals have shown good potential for screening normal<br />
knees from abnormal knees.<br />
Acknowledgements: We gratefully acknowledge support<br />
from the Alberta Heritage Foundation for Medical Re-<br />
search and the Natural Sciences and Engineering Re-<br />
search Council of Canada.<br />
REFERENCES<br />
[1] S. Mallat. A wavelet tour of signal processing. Academic<br />
Press, San Diego, CA., 1998.<br />
[2] D. Donoho. Unconditional bases are optimal bases for data<br />
compression and for statistical estimation. Journal of Appl.<br />
and Comput. Harmonic <strong>Analysis</strong>, 1:100-115, 1993.<br />
[3] M.V. Wickerhauser. Adapted wavelet analysis from theory to<br />
software. IEEE press, Piscataway, NJ., 1994.<br />
[4] S.G. Mallat and Z. Zhang. Matching pursuit with time-frequency<br />
dictionaries. IEEE Trans. on <strong>Signal</strong> Processing,<br />
41(12):3397-3415, 1993.<br />
[5] R.M. Rangayyan, S. Krishnan, G.D. Bell, C.B. Frank, and<br />
K.O. Ladly. Parametric representation and screening of knee<br />
joint vibroarthrographic signals. IEEE Trans. on Biomedical<br />
Engineering, 44(11):1068-1074, Nov. 1997.<br />
[6] S. Krishnan, R.M. Rangayyan, G.D. Bell, C.B. Frank, and<br />
K.O. Ladly. Time-frequency signal feature extraction and<br />
screening of knee joint vibroarthrographic signals using the<br />
matching pursuit method. CD-ROM Proceedings, 19th Annual<br />
International Conference of the IEEE Engineering in<br />
Medicine and Biology Society, Chicago, IL, October 1997.<br />
Comparative <strong>Analysis</strong> of the Performance of the Time-Frequency<br />
Distributions with Knee Joint Vibroarthrographic <strong>Signal</strong>s<br />
Rangaraj M. Rangayyan and Sridhar Krishnan<br />
Dept. of Electrical and Computer Engineering, <strong>University</strong> of Calgary,<br />
2500 <strong>University</strong> Drive NW, Calgary, Alberta T2N 1N4, CANADA.<br />
Email: (ranga)(krishnan)@enel.ucalgary.ca<br />
Abstract - Vibroarthrographic (VAG) signals emitted<br />
by human knee joints can be used to develop a non-invasive<br />
diagnostic tool to detect articular cartilage degeneration.<br />
VAG signals are nonstationary and multicomponent in nature;<br />
time-frequency distributions (TFDs) provide powerful<br />
means to analyze such signals. The objective of this paper is<br />
to determine the TFD suitable for identification and extraction<br />
of VAG signal features of clinical significance. The TFDs<br />
considered are: autoregressive (AR) model-based TFD; the<br />
reassigned, smoothed, pseudo-Wigner-Ville (RSPWV) distribution;<br />
and a TFD based on signal decomposition using<br />
the matching pursuit (MP) algorithm. As the true TFD of a<br />
VAG signal is not known, the results of the TFDs were compared<br />
based on the expected characteristics using synthetic<br />
signals. The MP TFD shows good potential in analyzing<br />
multicomponent signals with low signal-to-noise ratio when<br />
compared to the AR model-based TFD and the RSPWV<br />
method. The TFD techniques were also tested on VAG signals<br />
with additional information provided by auscultation<br />
and arthroscopy. The results indicate that the MP TFD is<br />
the best available TFD to analyze VAG signals.<br />
I. INTRODUCTION<br />
Vibroarthrography (VAG), the recording of human knee<br />
joint vibration/acoustic signals during active movement of<br />
the leg, can be used as a non-invasive diagnostic tool to<br />
detect articular cartilage degeneration. The currently used<br />
“gold standard” for assessment of cartilage surface degeneration<br />
is arthroscopy, where the cartilage surface is inspected<br />
and palpated with a probe. The disadvantage with<br />
arthroscopy is that it cannot be applied to patients whose<br />
knees are in a highly degenerated state due to osteoarthritis,<br />
ligamentous instability, meniscectomy, or patellectomy.<br />
The drawbacks with arthroscopy and the limitations of<br />
imaging techniques have motivated researchers to look for<br />
tools such as VAG. In our work, the VAG signal is detected<br />
at the mid-patella position on the surface of the knee as the<br />
leg is swung over the angle range of 135° → 0° → 135° in a<br />
time period of 4 s. The signals are filtered to the range 10 Hz<br />
to 1 kHz and amplified before sampling at a rate of 2 kHz.<br />
The cartilage surfaces of a normal knee are smooth and<br />
slippery. Vibrations generated due to friction between articulating<br />
surfaces of degenerated cartilage are expected to<br />
be different in amplitude and frequency from those of normal<br />
knees. The important characteristics of VAG signals are<br />
listed below.<br />
0-7803-5073-1/98/’$10.00 01998 IEEE 273<br />
186<br />
The VAG signal is expected to be a multicomponent<br />
signal due to the possibility that during movement of<br />
the knee, the rubbing of the femoral condyle on the<br />
patella surface provides multiple sources of vibration,<br />
and also due to the possibility that the signal from a<br />
single source can propagate through different channels<br />
of tissue to the mid-patella position, thus giving rise to<br />
multiple energy components at different frequencies for<br />
a given time.<br />
VAG signals are nonstationary due to the fact that the<br />
quality of joint surfaces coming into contact may not be<br />
the same from one angular position (point of time) to<br />
another during articulation of the joint.<br />
Due to the differences in cartilage structures in normal<br />
and abnormal knees, VAG signals with different<br />
frequency law components may be generated. Identification<br />
of such frequency dynamics may help in classification<br />
of normal and abnormal knees.<br />
Our previous approaches tackled the nonstationarity of VAG<br />
signals by adaptively segmenting the signals into stationary<br />
components; each segment was parametrically represented<br />
using a separate set of autoregressive coefficients,<br />
dominant poles, or cepstral coefficients. Dominant poles<br />
(poles corresponding to dominant spectral peaks in the signal)<br />
of each segment have provided good discriminant information<br />
for classifying signals into normal and abnormal<br />
groups [1], validating the assumption that the frequency dynamics<br />
of normal VAG signals differ from those of abnormal<br />
signals. A major drawback of the segmentation-based technique<br />
lies in associating the clinical information obtained<br />
during arthroscopy or auscultation with the segments of a<br />
signal. This problem can be overcome by using nonstationary<br />
signal analysis tools such as time-frequency distributions<br />
(TFDs). TFDs reveal frequency and temporal information<br />
simultaneously, and are particularly attractive for analysis<br />
of multicomponent signals, depiction of frequency laws, and<br />
noise suppression. The purpose of this work is to identify<br />
the best available TFD for objective identification and extraction<br />
of TF structures in VAG signals.<br />
II. TIME-FREQUENCY DISTRIBUTIONS<br />
The right TFD would be one that matches the characteristics<br />
of the signal being analyzed. The TFDs considered for VAG<br />
signals are: 1) model-based TFD, 2) Cohen's class TFDs,<br />
and 3) TFD based on decomposition of signals.<br />
A. Autoregressive Model-based TFD<br />
In the model-based TFD, the autoregressive (AR) model<br />
coefficients of the signal segments are used in estimating the<br />
power spectral density of each segment. In our work, the<br />
model coefficients were estimated using the Burg method.<br />
Fixed segment length was used for synthetic signals, and<br />
adaptive segment length was used for real VAG signals.<br />
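The per-segment AR spectrum described above can be sketched as follows. The paper uses the Burg method; for brevity, this illustration estimates the AR coefficients by solving the Yule-Walker equations instead (a simpler stand-in that yields the same all-pole PSD formula). The helper name ar_psd and the grid sizes are ours.<br />

```python
import numpy as np

def ar_psd(x, order, n_freq=256):
    """AR power spectrum of one signal segment via the Yule-Walker
    equations: P(w) = sigma^2 / |1 + sum_k a_k e^{-jwk}|^2."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    # Biased autocorrelation estimates r[0..order].
    r = np.array([np.dot(x[:len(x) - k], x[k:]) / len(x)
                  for k in range(order + 1)])
    # Solve the Toeplitz system R a = -r[1:] for coefficients a_1..a_p.
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    a = np.linalg.solve(R, -r[1:])
    sigma2 = r[0] + np.dot(a, r[1:])          # prediction-error variance
    w = np.linspace(0.0, np.pi, n_freq)
    # Denominator A(e^{jw}) of the all-pole model on the frequency grid.
    A = 1.0 + np.exp(-1j * np.outer(w, np.arange(1, order + 1))) @ a
    return w, sigma2 / np.abs(A) ** 2
```

Stacking the per-segment spectra side by side yields the model-based TFD.<br />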
B. Reassigned Smoothed Pseudo Wigner- Ville Distribution<br />
The Wigner-Ville distribution (WVD) is the most pop-<br />
ular TFD of Cohen's class. The main drawback with the<br />
WVD is that, in the case of multicomponent signals, cross-<br />
terms are generated in the TFD. Cross-terms can be min-<br />
imized by using two-dimensional low-pass filtering in the<br />
ambiguity domain, and the smoothed version of the WVD<br />
can be obtained. In this paper, the most commonly used<br />
smoothed version of the WVD, namely the smoothed pseudo<br />
Wigner-Ville distribution (SPWVD), is considered. The SP-<br />
WVD reduces cross-terms significantly. The extent of reduc-<br />
tion in cross-terms depends upon the type of signal being an-<br />
alyzed. In our applications with synthetic and VAG signals,<br />
the smoothing windows used are Gaussian functions.<br />
The smoothing windows suppress cross-terms in the<br />
WVD but smear localized components, leading to less ac-<br />
curate TF localization of signal components as compared to<br />
the WVD. Recently, a reassignment method has been pro-<br />
posed by Auger and Flandrin [2] to improve TF localization<br />
in smoothed TFDs such as SPWVDs.<br />
In the reassignment method, the window is moved from<br />
the geometric center (t, ω) to the energy center (t̂, ω̂) of the<br />
TFD. The reassigned SPWVD (RSPWVD) improves the TF<br />
localization of smeared components and provides good read-<br />
ability in the TFD.<br />
C. Matching Pursuit<br />
The TFD generated by the matching pursuit (MP)<br />
method is based on signal decomposition. The MP algo-<br />
rithm selects the decomposition vectors depending upon the<br />
signal properties. The vectors are selected from a family of<br />
waveforms called a dictionary. The signal x(t) is projected<br />
onto a dictionary of Gabor atoms obtained by scaling, translating,<br />
and modulating a Gaussian window function g(t):<br />
x(t) = Σ_{n=0}^{M−1} a_n g_{γn}(t),<br />
g_{γn}(t) = (K_{γn} / √s_n) g((t − p_n)/s_n) exp[j(2π f_n t + φ_n)],<br />
where a_n are the expansion coefficients. The scale factor s_n<br />
is used to control the width of the window function, and<br />
the parameter p_n controls temporal placement. K_{γn} is a<br />
normalizing factor which keeps the norm of g_{γn}(t) equal to 1.<br />
The parameters f_n and φ_n are the frequency and phase of the<br />
exponential function, respectively. In our application, the<br />
window is a Gaussian function, i.e., g(t) = 2^{1/4} exp(−πt²);<br />
the TF atoms are then called Gabor functions.<br />
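A discrete Gabor atom of this kind can be generated directly. This is a sketch under the stated definitions: a Gaussian window g(t) = 2^(1/4) exp(−πt²) scaled by s, centred at p, and modulated to frequency f (in cycles per sample); renormalizing to unit norm plays the role of K_γ. The function name gabor_atom is ours.<br />

```python
import numpy as np

def gabor_atom(n_samples, s, p, f, phi=0.0):
    """Unit-norm discrete Gabor atom: scaled, translated, and modulated
    Gaussian window, as used in the matching pursuit dictionary."""
    t = np.arange(n_samples)
    window = 2.0 ** 0.25 * np.exp(-np.pi * ((t - p) / s) ** 2)
    atom = window * np.exp(1j * (2.0 * np.pi * f * t + phi))
    return atom / np.linalg.norm(atom)

# Example: a 64-sample atom of scale 8, centred at sample 32, frequency 0.25.
g = gabor_atom(64, s=8, p=32, f=0.25)
```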
In practice, the algorithm works as follows: The signal<br />
is iteratively projected onto a Gabor function dictionary.<br />
The first projection decomposes the signal into two parts:<br />
x(t) = ⟨x, g_{γ0}⟩ g_{γ0}(t) + R¹x(t),<br />
where ⟨x, g_{γ0}⟩ denotes the inner product (projection) of x(t)<br />
with the first TF atom g_{γ0}(t). The term R¹x(t) is the residue<br />
after approximating x(t) in the direction of g_{γ0}(t). This process<br />
is continued by projecting the residue onto the subsequent<br />
functions in the dictionary, and after M iterations<br />
x(t) = Σ_{n=0}^{M−1} ⟨Rⁿx, g_{γn}⟩ g_{γn}(t) + R^M x(t),<br />
with R⁰x(t) = x(t). There are two ways of stopping the iterative<br />
process: one is to use a pre-specified limiting number<br />
M of the TF atoms, and the other is to check the energy of<br />
the residue R^M x(t). A very high value of M and a zero value<br />
for the residual energy will decompose the signal completely<br />
at the expense of increased computational complexity.<br />
In this work, M was limited to 1000 atoms and the resid-<br />
ual energy limit was set to be 0.5% of the total energy. For<br />
VAG signals, the maximum octave length given by log₂ N<br />
(where N is the number of samples) was set to 11 due to the<br />
nonstationary nature of the signal. Also, in MP analysis,<br />
only coherent structures [3] of the signals can be extracted;<br />
the residual components that do not have a high correlation<br />
with the vectors in the dictionary are rejected.<br />
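The greedy iteration just described can be sketched over a small explicit dictionary. This is a minimal illustration, not the paper's implementation: the dictionary is assumed to be a matrix of unit-norm real atoms (rows), and both stopping rules from the text (atom limit and residual-energy fraction, 0.5% in the paper) are included.<br />

```python
import numpy as np

def matching_pursuit(x, dictionary, max_atoms=1000, energy_frac=0.005):
    """Greedy MP: project the residue onto every unit-norm atom, keep the
    best match, subtract it, and repeat until max_atoms is reached or the
    residual energy drops below energy_frac of the signal energy."""
    residue = np.asarray(x, dtype=float).copy()
    e0 = np.dot(residue, residue)
    picks = []                                 # (atom index, coefficient)
    for _ in range(max_atoms):
        coeffs = dictionary @ residue          # inner products <R^n x, g>
        k = int(np.argmax(np.abs(coeffs)))
        picks.append((k, coeffs[k]))
        residue -= coeffs[k] * dictionary[k]   # R^{n+1}x = R^n x - <R^n x, g> g
        if np.dot(residue, residue) <= energy_frac * e0:
            break
    return picks, residue
```

Rejecting atoms whose coefficients are not significantly above the noise floor corresponds to keeping only the coherent structures.<br />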
The Wigner distribution of x(t) based on the TF atoms<br />
is given as [3]:<br />
Wx(t, ω) = Σ_{n=0}^{M−1} |⟨Rⁿx, g_{γn}⟩|² Wg_{γn}(t, ω) + Σ_{n≠m} ⟨Rⁿx, g_{γn}⟩ ⟨R^m x, g_{γm}⟩* W[g_{γn}, g_{γm}](t, ω),<br />
where Wg_{γn}(t, ω) is the Wigner transform of the Gaussian<br />
window function. The double sum corresponds to<br />
the cross-terms of the Wigner distribution indicated by<br />
W[g_{γn}, g_{γm}](t, ω), and should be rejected in order to obtain<br />
a cross-term-free energy distribution of x(t) in the TF plane.<br />
Thus only the first term is computed, and the resulting TFD<br />
is given by<br />
W′(t, ω) = Σ_{n=0}^{M−1} |⟨Rⁿx, g_{γn}⟩|² Wg_{γn}(t, ω).<br />
The cross-term-free TFD W′(t, ω) has very good readability<br />
and is appropriate for multicomponent signal analysis. The<br />
extraction of coherent structures makes MP an attractive<br />
tool for TF representation of signals with unknown SNR.<br />
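Because the Wigner transform of a Gaussian atom is itself a two-dimensional Gaussian in the TF plane, the cross-term-free TFD can be rendered as a weighted sum of Gaussian blobs. The sketch below assumes that closed form (up to normalization); the atom tuple layout and function name are ours.<br />

```python
import numpy as np

def mp_tfd(atoms, n_time=128, n_freq=128):
    """Cross-term-free TFD W'(t, w): sum of the Wigner distributions of
    Gaussian atoms weighted by squared expansion coefficients. Each atom
    is (energy, scale s, position p, frequency xi in rad/sample)."""
    t = np.arange(n_time)[None, :]
    w = np.linspace(0.0, np.pi, n_freq)[:, None]
    tfd = np.zeros((n_freq, n_time))
    for energy, s, p, xi in atoms:
        # Gaussian-blob form of the Wigner transform of a Gabor atom.
        blob = (np.exp(-2.0 * np.pi * ((t - p) / s) ** 2)
                * np.exp(-(s ** 2) * (w - xi) ** 2 / (2.0 * np.pi)))
        tfd += energy * blob
    return tfd

# One atom centred at sample 64 with frequency pi/2:
W = mp_tfd([(1.0, 16.0, 64.0, np.pi / 2)])
```

The resulting distribution is positive everywhere, matching the readability properties noted in the text.<br />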
Fig. 1. (a) VAG signal of a normal subject. Grinding sound was heard<br />
during auscultation at an angle range of 50° → 0° (3400 to 4000<br />
samples) during extension of the knee. au: acceleration units.<br />
Fig. 2. TFDs of the signal in Figure 1. (a) AR model-based TFD. X’s<br />
denote dominant poles. (b) RSPWV distribution. (c) MP TFD.<br />
III. RESULTS<br />
A. Results with Synthetic <strong>Signal</strong>s<br />
Before applying the TFDs to real VAG signals, the<br />
TFDs were evaluated with synthetic signals. For exam-<br />
ple, one of the synthetic signals “syn” was composed with<br />
overlapping chirp, impulse, and sinusoidal FM components.<br />
The signal “syn” is a nonstationary signal since its spectrum<br />
varies with time. Transients such as an impulse represent<br />
clicks heard during knee movement. Chirp and sinusoidal<br />
FM components are examples of linear and nonlinear fre-<br />
quency dynamics; their physiological relevance needs to be<br />
studied. As VAG classification experiments based on dom-<br />
inant poles (adaptive pole or spectral peak tracking) have<br />
provided very good accuracy [1], we believe it is worthwhile<br />
to study VAG signals in terms of their frequency dynamics<br />
with improved time tracking.<br />
To simulate noisy signal conditions the synthetic signals<br />
were corrupted by adding Gaussian noise to an SNR of 10<br />
dB, and to simulate worse signal recording conditions the<br />
synthetic signals were corrupted by adding Gaussian noise<br />
to an SNR of 0 dB.<br />
The results obtained with synthetic signals are pre-<br />
sented below in a summarized version for the sake of brevity.<br />
The synthetic signals were segmented into fixed segments,<br />
and each segment was AR modeled using the Burg lattice<br />
method; a model order of 15, determined empirically, was<br />
used for each segment. The advantage of this method as<br />
compared with other segment-based methods such as the<br />
spectrogram is that the model presumes that the signal out-<br />
side the segment is nonzero as opposed to the spectrogram,<br />
where the signal outside the window is assumed to be zero.<br />
The TFD was free of cross-terms with reasonable TF local-<br />
ization, and the TFD did not include the impulse component;<br />
this is because a transient of very short duration cannot be<br />
modeled by prediction inherent in the AR model. In the<br />
case of low SNR of 0 dB, AR modeling failed to give good<br />
spectral estimates.<br />
Although cross-terms were suppressed in the RSP-<br />
WVDs, the TFDs generated by RSPWVD had negative val-<br />
ues, and may not be suitable for feature extraction as an<br />
accurate estimate of the mean frequency or spread for each<br />
time instant cannot be reliably obtained. The method of<br />
reassignment improved the localization of the components<br />
significantly, but the problem of negative distribution values<br />
exists. In the case of the lower SNR of 0 dB, it was hard to<br />
distinguish the components of interest from cross-terms.<br />
The MP method gave a clear picture of the TF rep-<br />
resentation; the three simulated components were perfectly<br />
localized in the TFDs. This is because the MP TFD pro-<br />
vides adaptive representation of signal components, and due<br />
to the possibility that each high-energy component is ana-<br />
lyzed by the TF representation independent of its bandwidth<br />
and duration. The poor localization of transients by other<br />
techniques such as Fourier and wavelets is due to the fact<br />
that the transient information gets diluted across the whole<br />
basis and the collection of basis functions is not as large as<br />
compared to that in the MP dictionary.<br />
In the lower SNR, the MP TFD was better than those<br />
obtained using the other techniques. The MP TFD could be<br />
made more readable by extracting only the coherent struc-<br />
tures of the signal. The MP technique has the facility to<br />
include automatic denoising of the signal, which is useful in<br />
situations where the SNR of the signal is not known.<br />
B. Results with VAG signals<br />
The TFD methods were tested on ten VAG signals.<br />
For computing the AR model-based TFDs, the signals were<br />
adaptively segmented into quasi-stationary segments using<br />
the recursive least-squares lattice (RLSL) algorithm [4]. The<br />
segments were AR modeled using the Burg-lattice algorithm<br />
and the model order used was 40 [4]. For the sake of illustra-<br />
tion, the VAG signal (“vag1”) of a normal subject is shown<br />
in Figure 1(a). Grinding sound was heard during auscul-<br />
tation at an angle range of 50° → 0° (approximately in the<br />
range of 2500 to 4000 samples) for this subject. From the AR<br />
model-based TFD of the signal “vag1” in Figure 2(a), we can<br />
observe that the TF representation is cross-term-free. The<br />
grinding sound is shown as a high-frequency activity. The<br />
localization of the component corresponding to the grinding<br />
sound is coarse, and the precise angle (or time) at which the<br />
sound was heard cannot be readily determined. Because of<br />
the coarse estimation of components, the AR model-based<br />
TFD may not be appropriate for instantaneous parameter<br />
extraction. The dominant poles, indicated by the ‘X’<br />
marks superimposed on the AR model-based TFD in Figure<br />
2(a), indicate the dominant spectral peaks in the signal. As<br />
the dominant poles are selected on a segment-by-segment<br />
basis, they are also not suitable for instanta-<br />
neous parameter tracking.<br />
Figure 2(b) shows the TFD obtained using the RSPWV<br />
method. The TFD is obviously not readable except for the<br />
component corresponding to the “grinding” sound. Further,<br />
the TFD has negative values due to cross-terms. The neg-<br />
ative values may mislead parameter calculation, and hence<br />
the RSPWVD may not be appropriate for feature extraction.<br />
The MP TFD is shown in Figure 2(c). The TFD was<br />
constructed using the coherent structures of the signal only,<br />
and the number of TF atoms was 441. The TFD has clearly<br />
represented the “grinding” sound with very good localiza-<br />
tion. The TFD obtained is a positive distribution and is free<br />
of cross-terms, and is suitable for feature extraction.<br />
C. Classification Experiments<br />
A database of 90 VAG signals was compiled, including<br />
51 normal and 39 abnormal signals. Although the MP TFD<br />
does not satisfy true marginal properties, time-varying pa-<br />
rameters with discriminant information can be computed as<br />
marginal values of an MP TFD. The time-varying parame-<br />
ters that were extracted from the MP TFD were:<br />
1. Energy Parameter: the mean of W′(t, ω) along each time<br />
slice, which gives a measure of energy variation with time.<br />
2. Energy Spread Parameter: the standard deviation of<br />
W′(t, ω) along each time slice.<br />
3. Frequency Parameter: the first moment along each time<br />
slice as given by the expression<br />
IMF(t) = [Σ_ω ω W′(t, ω)] / [Σ_ω W′(t, ω)].<br />
4. Frequency Spread Parameter: the second central moment<br />
along each time slice as given by<br />
IMFS(t) = √{ [Σ_ω (ω − IMF(t))² W′(t, ω)] / [Σ_ω W′(t, ω)] }. (7)<br />
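The four time-varying parameters can be computed column by column from a positive TFD array. This is a minimal sketch assuming a TFD with frequency along axis 0 and time along axis 1; the helper name tf_features is ours.<br />

```python
import numpy as np

def tf_features(tfd, freqs):
    """Marginal time-varying parameters of a positive TFD W'(t, w):
    energy (mean per time slice), energy spread (std. dev. per slice),
    instantaneous mean frequency IMF(t), and frequency spread IMFS(t)."""
    energy = tfd.mean(axis=0)
    energy_spread = tfd.std(axis=0)
    total = tfd.sum(axis=0)
    imf = (freqs[:, None] * tfd).sum(axis=0) / total          # first moment
    imfs = np.sqrt(((freqs[:, None] - imf) ** 2 * tfd).sum(axis=0) / total)
    return energy, energy_spread, imf, imfs
```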
The mean and standard deviation of the parameters over<br />
the duration of each signal were computed, and each VAG<br />
signal was represented by a set of eight features. Statistical<br />
pattern classification based on stepwise logistic regression<br />
analysis [5] of the features of the 90 VAG signals as nor-<br />
mal/abnormal was achieved with an accuracy of 74.4%. The<br />
frequency parameter significantly contributed towards accu-<br />
rate classification of VAG signals. This gives motivation to<br />
search for linear and nonlinear frequency components in the<br />
TF plane of VAG signals.<br />
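The eight-feature representation described above (mean and standard deviation of each of the four time-varying parameters over the signal duration) is simple to assemble; a sketch, with an illustrative helper name:<br />

```python
import numpy as np

def vag_feature_vector(energy, energy_spread, imf, imfs):
    """Represent one VAG signal by eight features: the mean and the
    standard deviation, over the signal duration, of each of the four
    time-varying parameters extracted from its MP TFD."""
    params = (energy, energy_spread, imf, imfs)
    return np.array([f(p) for p in params for f in (np.mean, np.std)])
```

These eight-feature vectors are what a classifier such as stepwise logistic regression would take as input.<br />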
IV. CONCLUSION<br />
Segmentation-based analysis of VAG signals has limita-<br />
tions in correlating the estimated angle range of pathology as<br />
observed during arthroscopy with the segments of the signal.<br />
The problem of segmentation can be avoided by using non-<br />
stationary signal analysis tools such as TFDs. It is difficult<br />
to interpret VAG TFDs, and even harder to train clinicians<br />
in interpreting TFDs. Therefore, TFDs should be selected so<br />
as to facilitate objective feature identification and extraction.<br />
<strong>Analysis</strong> of the performances of different TFDs shows that<br />
the MP TFD is the most suitable TFD for VAG signal anal-<br />
ysis. Preliminary results with 90 VAG signals suggest that<br />
the parameters extracted from the MP-based TFD provide<br />
good discriminant information. Compared with our previ-<br />
ous methods, the proposed method does not need any joint<br />
angle and clinical information, and shows good potential for<br />
noninvasive diagnosis of articular cartilage pathology.<br />
Acknowledgements: We gratefully acknowledge support from<br />
the Alberta Heritage Foundation for Medical <strong>Research</strong>.<br />
REFERENCES<br />
[1] R.M. Rangayyan, S. Krishnan, G.D. Bell, C.B. Frank, and K.O.<br />
Ladly. Parametric representation and screening of knee joint vi-<br />
broarthrographic signals. IEEE Transactions on Biomedical Engi-<br />
neering, 44(11):1068-1074, Nov. 1997.<br />
[2] F. Auger and P. Flandrin. Improving the readability of time-<br />
frequency and time-scale representations by the reassignment<br />
method. IEEE Transactions on <strong>Signal</strong> Processing, 43(5):1068-<br />
1089, May 1995.<br />
[3] S.G. Mallat and Z. Zhang. Matching pursuit with time-frequency<br />
dictionaries. IEEE Transactions on <strong>Signal</strong> Processing, 41(12):3397-<br />
3415, 1993.<br />
[4] S. Krishnan, R.M. Rangayyan, G.D. Bell, C.B. Frank, and K.O.<br />
Ladly. Adaptive filtering, modelling, and classification of knee joint<br />
vibroarthrographic signals for non-invasive diagnosis of articular<br />
cartilage pathology. Medical and Biological Engineering and Com-<br />
puting, 35:677-684, Nov. 1997.<br />
[5] A.P. Afifi and S.P. Azen. Statistical <strong>Analysis</strong>: A Computer Oriented<br />
Approach. Academic Press, Inc., New York, NY., 2nd edition, 1979.
DETECTION OF NONLINEAR FREQUENCY-MODULATED<br />
COMPONENTS IN THE TIME-FREQUENCY PLANE USING AN<br />
ARRAY OF ACCUMULATORS<br />
Sridhar Krishnan and Rangaraj M. Rangayyan<br />
Dept. of Electrical and Computer Engineering, <strong>University</strong> of Calgary,<br />
2500 <strong>University</strong> Drive NW, Calgary, Alberta T2N 1N4, CANADA.<br />
Email: (krishnan) (ranga)@enel.ucalgary.ca<br />
Abstract - We propose a novel approach to detect<br />
nonlinear frequency-modulated (FM) components such<br />
as sinusoidal and hyperbolic FM components in multi-<br />
component, nonstationary signals in the time-frequency<br />
(TF) plane. The approach, based upon the use of an<br />
array of accumulators, can be used to detect nonlinear<br />
FM components with varying energy in low signal-to-<br />
noise ratio environments.<br />
I. INTRODUCTION<br />
Instantaneous frequency (IF) is an important pa-<br />
rameter in characterizing the nonstationary behavior<br />
of a signal. IF could be frequency modulated (FM)<br />
as a linear component (e.g. chirp) or as a nonlinear<br />
component (e.g. quadratic FM) with time. Detection<br />
of linear and nonlinear FM components in a nonsta-<br />
tionary signal has been studied extensively by using<br />
time-frequency (TF) representations [1], [2] and poly-<br />
nomial phase transforms (PPT) [3]. In PPT, the FM<br />
component is detected by estimating the phase coeffi-<br />
cients of the given complex signal. The disadvantages<br />
with PPT are that it can only be applied to signals<br />
whose amplitude variations are slower than their phase<br />
variations, and that reliable estimates of the phase co-<br />
efficients are not guaranteed under low signal-to-noise<br />
ratio (SNR) conditions. Barbarossa and Lemoine es-<br />
timated nonlinear FM parameters by using the reas-<br />
signed, smoothed, pseudo-Wigner-Ville representation<br />
and the Hough transform. Although the method is at-<br />
tractive, accurate estimation of FM parameters is not<br />
possible in the presence of cross-terms.<br />
In our work, the nonlinear frequency parameters of<br />
a signal are estimated via its TF representation. The<br />
TF representation is treated as an image, where each<br />
pixel corresponds to the energy present at a particular<br />
time and frequency.<br />
II. TIME-FREQUENCY DISTRIBUTIONS<br />
The main conditions under which a TF distribution<br />
(TFD) can be treated as an image are:<br />
The TFD should be positive.<br />
The TFD should satisfy the marginal properties.<br />
Cross-terms should be negligible in order to avoid<br />
false search.<br />
0-7803-5073-1/98/$10.00 ©1998 IEEE<br />
The widely used Cohen’s class TFDs do not satisfy the<br />
above requirements as the kernel used is functionally<br />
independent of the signal. TFDs based on linear combinations<br />
of the Wigner distributions of TF atoms, as<br />
given by a decomposition algorithm such as matching<br />
pursuit [4], are positive distributions and are cross-term<br />
free; however, they do not satisfy the marginal properties.<br />
TFDs that are positive and satisfy the marginals<br />
do exist, and one can obtain an infinite number of them<br />
for any signal. Such TFDs are nonlinear functions of the<br />
signal; the kernels for these TFDs are generally signal-dependent,<br />
and are known as Cohen-Posch TFDs [5].<br />
Accordingly, while the Cohen-Posch TFDs can, in theory,<br />
be obtained from Cohen’s general formulation, the<br />
signal-dependence of the kernel, coupled with its possible<br />
unbounded nature, calls for alternative formulations<br />
for practical implementation of the Cohen-Posch<br />
TFDs. One formulation which is particularly tractable<br />
and readily demonstrates the positivity and marginal<br />
conditions is:<br />
P(t,ω) = P(t) P(ω) Q(u(t), v(ω)), (1)<br />
where P(t) = |s(t)|² and P(ω) = |S(ω)|² are the<br />
marginal densities, with s(t) being the given time-<br />
domain signal and S(ω) the Fourier transform of the<br />
signal, and Q(u, v) is any positive function of the vari-<br />
ables (u, v) over 0 ≤ (u, v) ≤ 1, normalized to one:<br />
∫ Q(u, v) du = ∫ Q(u, v) dv = 1. (2)<br />
In Eq. (1), we have<br />
u = u(t) = ∫_{−∞}^{t} P(t′) dt′, v = v(ω) = ∫_{−∞}^{ω} P(ω′) dω′. (3)<br />
It is obvious that the density P(t,ω) is positive, and<br />
straightforward to show that the marginals are satisfied.<br />
Authorized licensed use limited to: <strong>Ryerson</strong> <strong>University</strong> Library. Downloaded on July 6, 2009 at 16:21 from IEEE Xplore. Restrictions apply.
In addition to being positive and yielding the cor-<br />
invariant and scale-invariant. An algorithm to effi-<br />
ciently compute the Cohen-Posch TFDs has been pro-<br />
posed by Loughlin et al. [6]. The algorithm is based<br />
on minimum cross-entropy (MCE) optimization of the<br />
density functions.<br />
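The construction of Eqs. (1)-(3) can be exercised numerically. The sketch below is our own illustration, not code from the paper; it uses the simplest admissible coupling function, Q(u, v) = 1, for which Eq. (1) reduces to the product of the marginals (the MCE algorithm instead iterates toward a signal-dependent Q). The function name and the unitary-DFT convention are our choices:

```python
import numpy as np

def cohen_posch_uniform(s):
    """Positive TFD with correct marginals, per Eq. (1), using the
    simplest admissible coupling Q(u, v) = 1 (illustrative only; the
    MCE method iterates toward a signal-dependent Q)."""
    s = np.asarray(s, dtype=complex)
    S = np.fft.fft(s) / np.sqrt(len(s))    # unitary DFT preserves energy
    Pt = np.abs(s) ** 2                    # time marginal P(t)
    Pw = np.abs(S) ** 2                    # frequency marginal P(w)
    E = Pt.sum()                           # total energy (equals Pw.sum())
    # With Q = 1, Eq. (1) reduces to the product of the marginals:
    return np.outer(Pt, Pw) / E

# Positivity and both marginal properties hold by construction:
rng = np.random.default_rng(0)
s = rng.standard_normal(64)
P = cohen_posch_uniform(s)
assert (P >= 0).all()
assert np.allclose(P.sum(axis=1), np.abs(s) ** 2)                    # time marginal
assert np.allclose(P.sum(axis=0), np.abs(np.fft.fft(s)) ** 2 / 64)   # frequency marginal
```

With Q ≡ 1 all time-frequency coupling is lost; the point of the MCE optimization is to choose a signal-dependent Q so that the distribution also resembles a sensible prior TFD.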
III. TFD ANALYSIS USING AN ARRAY OF<br />
ACCUMULATORS<br />
The approach of the present work is to apply the<br />
Hough transform based on an array of accumulators<br />
to positive TFDs obtained using the MCE method to<br />
detect energy-varying nonlinear FM components such<br />
as sinusoidal and hyperbolic FM components with un-<br />
known parameters. Sinusoidal and hyperbolic FM sig-<br />
nals are common in synthetic aperture radar, multi-<br />
path communication channels, helicopter recognition,<br />
and sonar.<br />
The detection algorithm is based upon the use of<br />
an array of accumulators; the dimensionality of the ar-<br />
ray depends upon the number of parameters to be es-<br />
timated. The TFD is treated as an image with gray<br />
values corresponding to the normalized energy values<br />
of the components.<br />
Let us first consider the procedure for detecting a<br />
sinusoidal FM component in the TF plane. In practice,<br />
a sinusoidal FM component may occur at any location<br />
in the TF plane, and hence a generalized expression of<br />
a sine wave is considered: f_k = A + m sin(2π f₀ k + φ),<br />
where A is the frequency shift in the TF plane, f_k is the<br />
frequency at time k, f₀ is the number of cycles of the<br />
sinusoidal FM, φ is the phase shift, and m is the am-<br />
plitude. In practice, a sinusoidal FM component may<br />
not have a constant amplitude. In order to minimize<br />
the effect of amplitude variations, an attempt is made<br />
to make the waveform continuous by using edge-linking<br />
techniques based upon a gradient method (i.e., by ap-<br />
plying an image processing algorithm to the TFD).<br />
The algorithm works as follows:<br />
1. Each parameter is bounded by a minimum and<br />
a maximum value. For each point (f_k, k) in the<br />
TF plane carrying a nonzero value, we let the pa-<br />
rameters m, f₀, and φ equal each of the allowed<br />
(quantized or binned) values and solve for the<br />
corresponding A using the equation A = f_k −<br />
m sin(2π f₀ k + φ). The parameter A is rounded<br />
to the nearest allowed quantized value.<br />
2. If the choice of the parameters results in a nonzero<br />
value for A, we increment the corresponding cell of<br />
the four-dimensional array (initialized to zero) and<br />
add the pixel energy to the cell. It is obvious<br />
that sinusoidal FM components will correspond to<br />
high-intensity hypersurfaces in the Hough param-<br />
eter domain.<br />
3. A threshold is then applied to the total energy<br />
value and the number of points. This facilitates the<br />
detection of energy-varying sinusoidal FM compo-<br />
nents of significant duration.<br />
4. The mean and standard deviation of the array in-<br />
dices are computed using those accumulator cells<br />
whose values have passed the threshold test. A<br />
high value for standard deviation of the parame-<br />
ters indicates the presence of multiple sinusoidal<br />
FM components in the TF plane.<br />
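As a concrete illustration of steps 1 and 2, the following sketch (our own, with hypothetical function and variable names) fills the four-dimensional accumulator for the sinusoidal FM model f_k = A + m sin(2πf₀k + φ). It accumulates both a hit count and the pixel energy per cell; the thresholding and statistics of steps 3 and 4 are left to the caller:

```python
import numpy as np

def sinusoidal_fm_hough(tfd, m_bins, f0_bins, phi_bins, a_bins):
    """Accumulator-array (Hough-style) search for a sinusoidal FM
    component f_k = A + m*sin(2*pi*f0*k + phi) in a TFD image.
    tfd[k, f] holds the (normalized) energy at time k and frequency
    bin f. Returns per-cell hit counts and accumulated energy."""
    a_bins = np.asarray(a_bins, dtype=float)
    shape = (len(m_bins), len(f0_bins), len(phi_bins), len(a_bins))
    counts = np.zeros(shape)
    energy = np.zeros(shape)
    ks, fs = np.nonzero(tfd)                        # pixels carrying energy
    for k, f in zip(ks, fs):
        for i, m in enumerate(m_bins):
            for j, f0 in enumerate(f0_bins):
                for p, phi in enumerate(phi_bins):
                    # Step 1: solve for A given the other parameters,
                    # then round A to the nearest allowed bin.
                    a = f - m * np.sin(2 * np.pi * f0 * k + phi)
                    q = int(np.argmin(np.abs(a_bins - a)))
                    # Step 2: increment count and energy of the cell.
                    counts[i, j, p, q] += 1
                    energy[i, j, p, q] += tfd[k, f]
    return counts, energy
```

A sinusoidal FM component then shows up as a cell with both a high count and high accumulated energy; thresholding both quantities, as in step 3, suppresses short or weak spurious alignments.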
For the detection of hyperbolic FM, the parametric<br />
equation f_k = A + b/k is considered, where b is a con-<br />
stant related to the time shift. The two parameters A<br />
and b can provide a generalized representation of any<br />
hyperbolic FM phenomenon. The estimation procedure<br />
is similar to that for sinusoidal FM; however, the dimen-<br />
sion of the array is two (A and b) instead of four.<br />
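A corresponding two-parameter sketch for the hyperbolic case follows. The fractional term of the parametric equation is illegible in the scan; the sketch assumes the standard hyperbolic law f_k = A + b/k, which matches the description of b as a constant related to the time shift. All names are ours:

```python
import numpy as np

def hyperbolic_fm_hough(tfd, a_bins, b_bins):
    """Two-parameter accumulator for a hyperbolic FM component,
    assumed here to follow f_k = A + b / k (k > 0)."""
    a_bins = np.asarray(a_bins, dtype=float)
    acc = np.zeros((len(a_bins), len(b_bins)))
    ks, fs = np.nonzero(tfd)
    for k, f in zip(ks, fs):
        if k == 0:
            continue                       # the model is undefined at k = 0
        for j, b in enumerate(b_bins):
            a = f - b / k                  # solve for A given (f_k, k, b)
            i = int(np.argmin(np.abs(a_bins - a)))
            acc[i, j] += tfd[k, f]
    return acc
```

The peak of the accumulator then estimates (A, b) jointly, as in the A = 23, b = 507 example reported in Section IV.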
IV. RESULTS<br />
The proposed method was tested on synthetic sig-<br />
nals containing sine and hyperbolic FM components.<br />
The TFDs were obtained using the MCE algorithm with<br />
narrowband and wideband spectrograms as the a pri-<br />
ori estimate. The TFDs were of high TF resolution and<br />
free of cross-terms.<br />
Figure 1(a) shows a nonstationary signal with a<br />
sinusoidal FM component with SNR = 0 dB. The TFD<br />
obtained using the MCE method is shown in Figure<br />
1(b). The Hough parameter domain (which cannot be<br />
displayed due to its dimensionality of four) indicated<br />
the presence of a sinusoidal FM component, and the<br />
estimation of the parameters -4, fk, fo, 4, and m of<br />
the sinusoidal FM was accurate. The threshold for the<br />
accumulator cells was set to be 3% of the total energy<br />
value of the signal.<br />
Figure 2(a) shows a nonstationary signal with a hy-<br />
perbolic FM component embedded in white noise at an<br />
SNR of 0 dB. Figure 2(b) shows the TFD of the signal<br />
computed using the MCE method. The Hough param-<br />
eter space is shown in Figure 2(c) with the co-ordinates<br />
corresponding to A and b. The threshold was set to 9%<br />
of the total energy value of the signal. From the Hough<br />
parameter space we can infer that the highest-intensity<br />
point, corresponding to A = 23 and b = 507, gives the<br />
parameters of the hyperbolic FM.<br />
A multicomponent nonstationary signal consisting<br />
of a sinusoidal FM component and a hyperbolic FM<br />
component corrupted by random noise to an SNR of<br />
0 dB is shown in Figure 3(a). The MCE-based TFD<br />
is shown in Figure 3(b). The sinusoidal FM detection<br />
Fig. 1. (a) Nonstationary signal with a sinusoidal FM component<br />
at an SNR of 0 dB. (b) TFD of the signal in (a) computed<br />
using the MCE method.<br />
method successfully indicated the presence of a sinu-<br />
soidal FM component, and the hyperbolic FM detec-<br />
tion method indicated the presence of a hyperbolic FM<br />
component. The Hough parameter space of hyperbolic<br />
FM detection is shown in Figure 3(c).<br />
V. CONCLUSION<br />
The proposed method successfully detected non-<br />
linear FM components, and the parameters estimated<br />
were accurate within ±10% of their actual values even<br />
at a low SNR of -5 dB. The nonlinear FM components<br />
were not detected at an SNR of -10 dB. Better esti-<br />
mates of the parameters under low-SNR conditions can<br />
be achieved by increasing the number of quantization<br />
levels for each parameter and by denoising the signals<br />
before computing their TFDs. A difficulty with the pro-<br />
posed method lies in the selection of a suitable thresh-<br />
old (lower thresholds have to be selected for low SNR).<br />
Fig. 2. (a) Nonstationary signal with a hyperbolic FM compo-<br />
nent at an SNR of 0 dB. (b) TFD of the signal in (a) computed<br />
using the MCE method. (c) Hough parameter space.<br />
The performance of the method needs to be evaluated<br />
in comparison with the existing methods under differ-<br />
ent SNR conditions. The method can be extended to<br />
detect any pattern (signature) in the TFD provided the<br />
pattern can be expressed in a parametric form.<br />
The proposed method may find application in the<br />
detection of the presence of nonlinear FM components<br />
in biomedical signals such as knee joint sound signals,<br />
and facilitate screening of normal and abnormal signals.<br />
Acknowledgements: We gratefully acknowledge support<br />
[3] S. Peleg and B. Friedlander. Multicomponent signal analysis<br />
using the polynomial-phase transform. IEEE Transactions<br />
on Aerospace and Electronic Systems, 32(1):378-387, January<br />
1996.<br />
[4] S.G. Mallat and Z. Zhang. Matching pursuit with time-<br />
frequency dictionaries. IEEE Transactions on <strong>Signal</strong> Processing,<br />
41(12):3397-3415, 1993.<br />
[5] L. Cohen and T. Posch. Positive time-frequency distribution<br />
functions. IEEE Transactions on Acoustics, Speech, and <strong>Signal</strong><br />
Processing, 33:31-38, 1985.<br />
[6] P. Loughlin, J. Pitton, and L. Atlas. Construction of positive<br />
time-frequency distributions. IEEE Transactions on <strong>Signal</strong><br />
Processing, 42:2697-2705, 1994.<br />
Fig. 3. (a) Nonstationary signal with a sinusoidal FM component<br />
and a hyperbolic FM component at an SNR of 0 dB. (b) TFD<br />
of the signal in (a) computed using the MCE method. (c)<br />
Hough parameter space of hyperbolic FM detection.<br />
from the Natural Sciences and Engineering <strong>Research</strong><br />
Council of Canada (NSERC) and the Alberta Heritage<br />
Foundation for Medical <strong>Research</strong> (AHFMR).<br />
REFERENCES<br />
[1] S. Barbarossa. <strong>Analysis</strong> of multicomponent LFM signals by<br />
combined Wigner-Hough transform. IEEE Transactions on<br />
<strong>Signal</strong> Processing, 43(6):1511-1515, June 1995.<br />
[2] S. Barbarossa and O. Lemoine. <strong>Analysis</strong> of nonlinear FM sig-<br />
nals by pattern recognition of their time-frequency represen-<br />
tation. IEEE <strong>Signal</strong> Processing Letters, 3(4):112-115, April<br />
1996.<br />
Proceedings - 19th International Conference - IEEE/EMBS Oct. 30 - Nov. 2, 1997 Chicago, IL, USA<br />
TIME-FREQUENCY SIGNAL FEATURE EXTRACTION AND<br />
SCREENING OF KNEE JOINT VIBROARTHROGRAPHIC<br />
SIGNALS USING THE MATCHING PURSUIT METHOD<br />
Sridhar Krishnan¹, Rangaraj M. Rangayyan¹,², G. Douglas Bell¹,²,³, Cyril B. Frank<br />
¹Dept. of Electrical and Computer Engineering, ²Dept. of Surgery, ³Sport Medicine Centre<br />
The <strong>University</strong> of Calgary, Alberta, T2N 1N4, CANADA. (Email: ranga@enel.ucalgary.ca)<br />
Abstract - Nonstationary features of knee joint vi-<br />
broarthrographic (VAG) signals were extracted from<br />
their time-frequency distributions (TFDs) obtained us-<br />
ing the matching pursuit method. Features computed<br />
as marginal calculations of the TFDs were instantaneous<br />
energy, instantaneous energy spread, instantaneous mean<br />
frequency, and instantaneous mean frequency spread.<br />
The features carry information about the combined time-<br />
frequency dynamics of the signals. The mean and stan-<br />
dard deviation of the features were also computed, and<br />
each VAG signal was represented by a set of just eight<br />
parameters. The method was tested on 37 VAG signals<br />
(19 normal and 18 abnormal) with no restriction on the<br />
type of articular cartilage pathology. Discriminant analy-<br />
sis of the parameters showed an accuracy of 89.5% at the<br />
training stage and 77.8% at the test stage. Compared<br />
to our previous methods, the proposed method does not<br />
need any joint angle and clinical information, and shows<br />
good potential for noninvasive diagnosis and monitoring<br />
of articular cartilage pathology.<br />
Keywords: Vibroarthrography, Knee sounds, Time-<br />
frequency analysis, Articular cartilage, Matching pursuit.<br />
I. INTRODUCTION<br />
Knee joint vibration or sound signals, also known as<br />
vibroarthrographic (VAG) signals, emitted during active<br />
movement of the leg are expected to be associated with<br />
pathological conditions of the articular cartilage. VAG<br />
signal analysis could lead to a clinical tool for diagno-<br />
sis and monitoring of true articular cartilage pathology<br />
such as chondromalacia of the patella. A variety of VAG<br />
signal analysis techniques have been proposed in the lit-<br />
erature [1], [2], [3], [4], [5]. All of the previous methods<br />
used standard signal processing techniques based on the<br />
Fourier transform or autoregressive modeling, by assum-<br />
ing the signal to be either stationary or by segmenting<br />
the signal into quasi-stationary parts.<br />
In the present work, the nonstationarity of VAG sig-<br />
nals is taken into consideration, which arises due to the<br />
fact that different joint surfaces come in contact during<br />
movement, and the nature and quality of the joint sur-<br />
faces coming in contact may not be the same from one posi-<br />
tion to the next. Hence both intra- and inter-subject vari-<br />
(0-7803-4262-3/97/$10.00 © 1997 IEEE)<br />
ability of signal characteristics are expected. Although<br />
our previous approaches [4], [5] addressed nonstationar-<br />
ity to some extent by using robust adaptive segmentation<br />
algorithms, there was a difficulty in labeling individual<br />
segments as normal or abnormal. This is because an<br />
accurate estimation of the joint angle corresponding to<br />
pathology as observed during arthroscopy could not be<br />
achieved. The problem could be completely avoided by<br />
using nonstationary signal analysis tools such as time-<br />
frequency (TF) and wavelet transforms. The objective<br />
of our current work is to extract and identify relevant<br />
features in the TF plane which could discriminate abnor-<br />
mal knees from normal knees based solely on VAG signal<br />
features.<br />
II. METHODS<br />
A. Data Acquisition<br />
Each subject sat on a rigid table in a relaxed position<br />
with his/her leg freely suspended in air. The VAG signal<br />
was recorded at the mid-patella position of the knee as<br />
the subject swung his/her leg over an approximate an-<br />
gle range of 135° → 0° → 135° in 4 s. The signal was<br />
prefiltered and amplified before digitizing at a sampling<br />
rate of 2 kHz. A database of 37 signals was prepared, in-<br />
cluding 18 signals of symptomatic patients scheduled to<br />
undergo arthroscopy. There was no restriction imposed<br />
on the type of pathology, and the abnormal signals in-<br />
cluded chondromalacia of different grades at the patella,<br />
meniscal tear, tibial chondromalacia, and anterior cruci-<br />
ate ligament injuries.<br />
B. Time-Frequency <strong>Analysis</strong><br />
Features of VAG signals were extracted from their<br />
time-frequency distributions (TFDs) obtained using the<br />
matching pursuit (MP) method [6]. In MP analysis,<br />
the given signal is decomposed into a linear expansion<br />
of waveforms, known as TF atoms, selected from a re-<br />
dundant dictionary of functions. The TF atoms in the<br />
dictionary are generated by scaling, translating, and fre-<br />
quency modulating a normalized window function g_γ(t).<br />
194<br />
Authorized licensed use limited to: <strong>Ryerson</strong> <strong>University</strong> Library. Downloaded on July 6, 2009 at 16:11 from IEEE Xplore. Restrictions apply.
Proceedings - 19th International Conference - IEEE/EMBS Oct. 30 - Nov. 2, 1997 Chicago, IL. USA<br />
The MP method represents a signal x(t) as:<br />
x(t) = Σ_{n=0}^{M−1} a_n g_{γn}(t), (1)<br />
where<br />
g_{γn}(t) = (1/√s_n) g((t − p_n)/s_n) exp[j(2π f_n t + φ_n)], (2)<br />
and a_n are the expansion coefficients. The scale factor s_n<br />
is used to control the width of the envelope of g_{γn}(t), and<br />
the parameter p_n controls the temporal placement. 1/√s_n<br />
is a normalizing factor, which keeps the norm of g_{γn}(t)<br />
equal to 1. The parameters f_n and φ_n are the frequency<br />
and phase of the exponential function, respectively. In<br />
our application, the envelope function is a Gaussian func-<br />
tion, i.e., g(t) = 2^{1/4} exp(−πt²); the TF atoms are then<br />
called Gabor functions.<br />
In practice, the algorithm works as follows: First,<br />
the signal is projected on to a Gabor function dictionary.<br />
The projection decomposes the signal into two parts:<br />
x(t) = ⟨x, g_{γ0}⟩ g_{γ0}(t) + R¹x(t), (3)<br />
where ⟨x, g_{γ0}⟩ denotes the inner product (projection) of<br />
x(t) with the first TF atom g_{γ0}(t). The second term<br />
R¹x(t) is the residual vector after approximating x(t)<br />
in the direction of g_{γ0}(t). This process is continued by<br />
projecting the residue on to the dictionary, and after M<br />
iterations<br />
x(t) = Σ_{n=0}^{M−1} ⟨Rⁿx, g_{γn}⟩ g_{γn}(t) + R^M x(t), (4)<br />
with R⁰x(t) = x(t). There are two ways of stopping the<br />
iterative process: one is to use a pre-specified limiting<br />
number M of TF atoms, and the other is to verify the<br />
energy of the residue R^M x(t). A very high value of M<br />
and a zero value for the residual energy will decompose<br />
the signal completely at the expense of increased compu-<br />
tational complexity.<br />
In this work, M was chosen to be 1000 atoms, the<br />
residual energy limit was set to be zero, and only coherent<br />
structures were extracted (i.e., components determined to<br />
be noise by the MP algorithm were rejected).<br />
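The iterative projection just described can be condensed into a short greedy loop. The sketch below is our own simplified, real-valued implementation with a toy Gabor-like dictionary (all names are hypothetical); it illustrates both stopping rules, a maximum atom count and a residual-energy check:

```python
import numpy as np

def matching_pursuit(x, dictionary, max_atoms=50, energy_tol=1e-6):
    """Greedy MP over a dictionary of unit-norm atoms (rows).
    Stops after max_atoms iterations or when the residual energy
    falls below energy_tol, mirroring the two stopping rules above."""
    residual = np.asarray(x, dtype=float).copy()
    coeffs, picks = [], []
    for _ in range(max_atoms):
        projections = dictionary @ residual        # <R^n x, g> for all atoms
        n = int(np.argmax(np.abs(projections)))    # best-matching atom
        coeffs.append(projections[n])
        picks.append(n)
        residual = residual - projections[n] * dictionary[n]   # R^(n+1) x
        if residual @ residual < energy_tol:
            break
    return np.array(coeffs), picks, residual

# Toy dictionary of Gaussian-windowed cosines (real Gabor-like atoms).
N = 128
t = np.arange(N)
atoms = []
for s in (8, 16, 32):                              # scales
    for f in (0.05, 0.1, 0.2):                     # frequencies
        g = np.exp(-np.pi * ((t - N // 2) / s) ** 2) * np.cos(2 * np.pi * f * t)
        atoms.append(g / np.linalg.norm(g))
D = np.array(atoms)

x = 3.0 * D[2] + 0.5 * D[6]                        # two-atom test signal
c, idx, r = matching_pursuit(x, D)
```

Because the dictionary is redundant rather than orthogonal, the residual shrinks over the iterations instead of vanishing after two picks; that is exactly the trade-off between M, residual energy, and computational cost noted above.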
Now, the Wigner distribution of x(t) based on the<br />
TF atoms is given as [6]:<br />
W(t,ω) = Σ_{n=0}^{M−1} |⟨Rⁿx, g_{γn}⟩|² W g_{γn}(t,ω)<br />
+ Σ_{n=0}^{M−1} Σ_{m=0, m≠n}^{M−1} ⟨Rⁿx, g_{γn}⟩ ⟨Rᵐx, g_{γm}⟩* W[g_{γn}, g_{γm}](t,ω),<br />
Fig. 1. (a) A normal VAG signal. (b) TFD of (a) obtained using<br />
the matching pursuit method. au: Acceleration units.<br />
Fig. 2. (a) The VAG signal of a pathological knee with chon-<br />
dromalacia of the patella. (b) TFD of (a) obtained using the<br />
matching pursuit method. au: Acceleration units.<br />
where W g_{γn}(t,ω) is the Wigner transform of the Gaus-<br />
sian window function. The double sum corresponds to<br />
the cross-terms of the Wigner distribution, and should<br />
be removed in order to obtain an interference-free energy<br />
distribution of x(t) in the TF plane. Thus only the first<br />
term is retained, and the interference-free TFD is given<br />
by<br />
W′(t,ω) = Σ_{n=0}^{M−1} |⟨Rⁿx, g_{γn}⟩|² W g_{γn}(t,ω). (5)<br />
Figs. 1(b) and 2(b) show the TFDs of the normal<br />
VAG signal in Fig. 1(a) and the abnormal VAG signal in<br />
Fig. 2(a). The TFD of the normal signal was computed<br />
using 392 TF atoms, and the abnormal signal's TFD<br />
was computed using 441 TF atoms. The TFDs obtained<br />
are of very high resolution, and with synthetic signals of<br />
known TF dynamics, the TFDs based on the MP algo-<br />
rithm showed very good localization in both time and<br />
frequency. The bright spots in the TFD figures corre-<br />
spond to the TF atoms; the brightness increases with<br />
energy. In the illustration, the TFD of the abnormal sig-<br />
nal shows more high-frequency activity than the TFD of<br />
the normal signal. However, this need not be always true<br />
(especially in the case of normal noisy knees), and mere<br />
visual interpretation of the TFDs will not help in discrim-<br />
inating pathological knees from normal knees. In order<br />
to differentiate the signals, features of diagnostic value<br />
need to be extracted from the TF plane.<br />
C. Time-Frequency <strong>Signal</strong> Features<br />
As the TFD obtained using the TF atoms is an<br />
interference-free distribution and is always positive, fea-<br />
tures derived from the TF plane will possess a high de-<br />
gree of accuracy as compared to features obtained with<br />
Cohen’s class TF transforms. The features used in the<br />
present work were computed as marginal calculations of<br />
the TFDs. The four TF features of relevance derived<br />
from the TFDs of VAG signals were:<br />
Instantaneous energy (IE): As the TFDs were ob-<br />
tained using TF atoms that were coherent with the<br />
signal structure and the signal components that were<br />
determined to be noise by the MP algorithm were re-<br />
jected, the IE obtained as afunction of time will have<br />
a high signal-to-@se ratio. The IE was computed<br />
as the mean of W (t,w) along each time slice, which<br />
gives a measure of energy variation with time. Sig-<br />
nals generated by pathological knees will be highly<br />
time-variant (i.e., they are highly nonstationary) be-<br />
cause of the differences in cartilage roughness and<br />
nonuniformity. Thus the IE of an abnormal signal is<br />
expected to show large variations with time.<br />
Instantaneous energy spread (IES): IES measures<br />
the spread of energy over frequency for each time<br />
slice. This was computed as the standard deviation<br />
131 1<br />
of W’(t,w) along each time slice. This is a good<br />
measure if the signal is multicomponent in nature.<br />
Abnormal VAG signals generated as a result of fric-<br />
tion between rough cartilage surfaces may have more<br />
components because of the nonuniformity of the sur-<br />
faces, and a high signal energy spread is expected<br />
around the IE.<br />
Instantaneous mean frequency (IMF): IMF was com-<br />
puted as the first moment along each time slice:<br />
IMF(t) = ∫ ω W′(t,ω) dω / ∫ W′(t,ω) dω.<br />
IMF measures the frequency dynamics of the sig-<br />
nal. The movement of the knee during signal acqui-<br />
sition may cause some linear or nonlinear frequency<br />
modulation of the signal, with the modulation in-<br />
dex depending on the state of lubrication, stiffness,<br />
and roughness of the cartilage surfaces. Pathologi-<br />
cal knees have less lubricated and rougher cartilage<br />
surfaces than normal knees, and hence the IMF of<br />
pathological knees will be different from that of nor-<br />
mal knees.<br />
Instantaneous mean frequency spread (IMFS): IMFS<br />
is given by the second central moment along each<br />
time slice:<br />
IMFS(t) = ∫ (ω − IMF(t))² W′(t,ω) dω / ∫ W′(t,ω) dω.<br />
IMFS gives the spread of frequency about the mean<br />
frequency for each time instant. The spread of fre-<br />
quency at a time instant arises as a result of am-<br />
plitude modulation. Amplitude modulation is pos-<br />
sible in VAG signals, and may be dependent on the<br />
quality and intensity of sound produced due to joint<br />
vibration. IMFS could be an excellent feature in<br />
identifying noisy knees.<br />
The four features derived above are dependent on the<br />
functional state of the cartilage surfaces in the knee joint,<br />
and are expected to be suitable for discriminating patho-<br />
logical knees from normal knees.<br />
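The four marginal features can be computed directly from a discretized TFD matrix. The sketch below is our own reading of the definitions above (frequency is kept in bin units, the square root in IMFS reflects our interpretation of "spread", and all names are hypothetical); it also packs the mean and standard deviation of each feature into the eight-parameter signature used for classification:

```python
import numpy as np

def tf_features(W):
    """Marginal features of a positive TFD W[t, f], per the text:
    IE   - mean energy along each time slice,
    IES  - standard deviation of energy along each time slice,
    IMF  - first moment (mean frequency) per time slice,
    IMFS - square root of the second central moment about the IMF.
    Frequency is in bin units."""
    f = np.arange(W.shape[1], dtype=float)
    ie = W.mean(axis=1)
    ies = W.std(axis=1)
    tot = W.sum(axis=1)
    tot = np.where(tot > 0, tot, 1.0)          # guard against empty slices
    imf = (W * f).sum(axis=1) / tot
    imfs = np.sqrt((W * (f - imf[:, None]) ** 2).sum(axis=1) / tot)
    return ie, ies, imf, imfs

def eight_parameters(W):
    """Mean and standard deviation of each feature: the 8-parameter
    signature used per VAG signal."""
    return np.array([g(x) for x in tf_features(W) for g in (np.mean, np.std)])
```

For a signal whose TFD is a single constant-frequency line, IMF is flat at that frequency and IMFS is zero, matching the intuition that spread only appears with amplitude modulation or multiple components.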
D. Pattern Classification<br />
The features discussed in the previous section are time<br />
dependent. This could be easily observed in the wave-<br />
forms shown in Fig. 3, which were derived from the TFD<br />
shown in fig.2(b). In order to facilitate a global deci-<br />
sion on the signal, the mean and variance of the features<br />
were calculated. Therefore, a given VAG signal will have<br />
eight parameters. The classification of knees as normal or<br />
pathological was achieved using a statistical pattern clas-<br />
sifier based on discriminant analysis of the parameters [7].<br />
The database was randomly split into two (almost) equal<br />
parts. The features of signals in one part of the database<br />
Fig. 3. Features of the abnormal signal in Fig. 2. (a) IE waveform.<br />
(b) IES waveform. (c) IMF waveform. (d) IMFS waveform.<br />
were used to train the classifier. The classifier was then<br />
tested on the second part of the database. The signals<br />
used in the test stage were different from those in the<br />
training stage. The classification accuracy is given as the<br />
percentage of the number of correctly classified signals to<br />
the number of signals in the group.<br />
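The discriminant analysis itself was done with SPSS [7]; as a stand-in, the sketch below implements a two-class Fisher linear discriminant on synthetic eight-parameter vectors, with a split mirroring the 19-signal training / 18-signal test division reported next. The data here are random placeholders, not VAG measurements, and all names are ours:

```python
import numpy as np

def fisher_lda(X0, X1):
    """Two-class Fisher linear discriminant. Returns weights w and a
    midpoint threshold c so that (x @ w - c) > 0 flags class 1. A
    numpy stand-in for the SPSS discriminant analysis of [7]."""
    m0, m1 = X0.mean(axis=0), X1.mean(axis=0)
    # Pooled within-class scatter, with a tiny ridge for stability.
    Sw = (np.cov(X0, rowvar=False) * (len(X0) - 1)
          + np.cov(X1, rowvar=False) * (len(X1) - 1))
    w = np.linalg.solve(Sw + 1e-6 * np.eye(Sw.shape[0]), m1 - m0)
    c = 0.5 * (m0 + m1) @ w
    return w, c

# Synthetic 8-parameter vectors standing in for VAG signatures:
rng = np.random.default_rng(1)
train_normal = rng.normal(0.0, 1.0, size=(9, 8))     # 9 training normals
train_abnormal = rng.normal(2.5, 1.0, size=(10, 8))  # 10 training abnormals
w, c = fisher_lda(train_normal, train_abnormal)
test_abnormal = rng.normal(2.5, 1.0, size=(8, 8))    # held-out abnormals
pred = (test_abnormal @ w - c) > 0                   # True -> abnormal
```

Keeping the test signals disjoint from the training signals, as in the paper, is what makes the reported test-stage accuracy an honest estimate of generalization.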
III. RESULTS AND DISCUSSION<br />
The classifier was trained with the TF parameters of<br />
19 signals (9 normal and 10 abnormal), and was tested<br />
with 18 signals (10 normal and 8 abnormal). The classifi-<br />
cation accuracy in training was 89.5%, and the test stage<br />
accuracy was 77.8%.<br />
Among the features used, IE and IES contributed<br />
significantly to the classification accuracy. This shows<br />
that abnormal VAG signals possess highly time-varying<br />
energy as compared to normal VAG signals. The perfor-<br />
mance of IES leads to the conclusion that VAG signals<br />
of pathological knees are more multicomponent in na-<br />
ture than those of normal knees. This may be due to<br />
the possibility that the roughness of cartilage surfaces in<br />
pathological knees gives rise to multiple sources of vibra-<br />
tion signals. Though IMF did not show high discrimina-<br />
tion, it can be used to detect linear frequency-modulated<br />
components or chirp signals (if any) in VAG signals. We<br />
are currently working on detection of chirp components<br />
in TFDs from an image processing perspective. While<br />
IMFS may not be a very good feature for classification of<br />
knees as normal or abnormal, it could be used to study<br />
the sound patterns of knees, and to investigate whether<br />
the sound patterns of pathological knees differ from those<br />
of normal knees.<br />
The use of the variance of the features could allow<br />
this method to be applied on a larger set of signals, and<br />
could avoid any bias as a result of variations in transducer<br />
placement, amplifier-filter settings, etc.<br />
IV. CONCLUSIONS<br />
A novel approach to analyze nonstationary VAG sig-<br />
nals was used. The method does not require any joint an-<br />
gle information to label the components of a VAG signal,<br />
and is independent of patient information such as age,<br />
activity level, and gender. <strong>Signal</strong> features of diagnostic<br />
relevance were extracted from their TFDs obtained using<br />
the MP method. The TF features extracted have shown<br />
the good potential of this method for screening of VAG sig-<br />
nals.<br />
Acknowledgements: We gratefully acknowledge support<br />
from the Alberta Heritage Foundation for Medical Re-<br />
search and the Arthritis Society of Canada.<br />
REFERENCES<br />
[1] M.L. Chu, I.A. Gradisar, and R. Mostardi. A noninvasive electro-<br />
acoustical evaluation technique of cartilage damage in patho-<br />
logical knee joints. Medical and Biological Engineering and<br />
Computing, 16:437-442, 1978.<br />
[2] W.G. Kernohan, D.E. Beverland, G.F. McCoy, A. Hamilton,<br />
P. Watson, and R.A.B. Mollan. Vibration arthrometry. Acta<br />
Orthop. Scand., 61(1):70-79, 1990.<br />
[3] N.P. Reddy, B.M. Rothschild, M. Mandal, V. Gupta, and<br />
S. Suryanarayanan. Noninvasive acceleration measurements<br />
to characterize knee arthritis and chondromalacia. Annals of<br />
Biomedical Engineering, 23:78-84, 1995.<br />
[4] Z.M.K. Moussavi, R.M. Rangayyan, G.D. Bell, C.B. Frank,<br />
K.O. Ladly, and Y.T. Zhang. Screening of vibroarthrographic<br />
signals via adaptive segmentation and linear prediction modeling.<br />
IEEE Transactions on Biomedical Engineering, 43:15-23,<br />
1996.<br />
[5] S. Krishnan, R.M. Rangayyan, G.D. Bell, C.B. Frank, and K.O.<br />
Ladly. Screening of knee joint vibroarthrographic signals by<br />
statistical analysis of dominant poles. In CDROM Proceedings,<br />
18th Annual International Conference of the IEEE Engineering<br />
in Medicine and Biology Society, Amsterdam, The Netherlands,<br />
October 1996.<br />
[6] S.G. Mallat and Z. Zhang. Matching pursuit with time-<br />
frequency dictionaries. IEEE Trans. on <strong>Signal</strong> Processing,<br />
41(12):3397-3415, 1993.<br />
SPSS Inc., Chicago, IL. SPSS Advanced Statistics User's<br />
Guide, 1990.
Detection of Chirp and Other Components in the Time-Frequency Plane<br />
using the Hough and Radon Transforms<br />
Sridhar Krishnan and Rangaraj M. Rangayyan<br />
Dept. of Electrical and Computer Engineering, The University of Calgary,
2500 University Drive NW, Calgary, Alberta T2N 1N4, CANADA.
Email: {krishnan, ranga}@enel.ucalgary.ca
Abstract - We propose a novel approach to detect chirp
(linear frequency modulated) components in multicomponent<br />
nonstationary signals in the time-frequency (TF) plane.<br />
The approach, based on the Hough and Radon transforms<br />
of TF distributions, can be used to detect chirp components<br />
with varying energy in unknown signal-to-noise ratio environments.<br />
In addition to detection of chirps, the proposed<br />
technique could also be used as a tool to evaluate the TF<br />
resolution provided by different TF analysis methods.<br />
I. INTRODUCTION<br />
Time-frequency distributions (TFD) give the energy dis-<br />
tribution of a signal in the time-frequency (TF) plane, and<br />
are suitable for analyzing nonstationary signals. In particu-<br />
lar, a TFD gives information about the time, frequency, and<br />
combined TF dynamics of a signal. Stationary, Dirac, and<br />
chirp (linear frequency modulated or FM) components of a<br />
signal appear as directional components in the TF plane.<br />
The directional components may be narrow or broad in the<br />
TF plane depending upon the resolution of the TF transfor-<br />
mation used and the energy spread of the component. If the<br />
signal energy is oriented only horizontally in the TF plane<br />
(i.e., a stationary component) or only vertically (i.e., a Dirac<br />
component), then signal detection is easy, and optimal de-<br />
tection can be achieved by computing the marginal densities<br />
of the TFD. However, in practice, chirp components may<br />
occur with arbitrary TF orientations.<br />
Detection of chirp components helps in understanding<br />
the underlying TF dynamics of a signal. Many methods of<br />
chirp detection have been proposed in the literature; typ-<br />
ical applications of chirp detection are found in synthetic<br />
aperture radar, communication over time-varying multipath<br />
channels, and seismology. A method for optimal detection<br />
of chirp components based on a maximum likelihood approach was proposed by Kay and Boudreaux-Bartels [1].
This method of chirp detection is equivalent to the Radon<br />
transform (RT) of the TFD obtained using the Wigner dis-<br />
tribution. The Radon-Wigner method of chirp detection is<br />
computationally expensive, and an efficient implementation<br />
based on a dechirping method was proposed by Wood and Barry [2]. The Hough transform (HT) could be used instead
of the RT to detect arbitrary shapes in TF planes which<br />
are not necessarily straight lines (chirps). A Wigner-Hough<br />
method to detect chirp and nonlinear FM components was<br />
0-7803-3905-3/97/$10.00 © 1997 IEEE
Fig. 1. Block diagram of the proposed method: nonstationary signal → matching pursuit algorithm → time-frequency (TF) atoms → Wigner distribution of TF atoms (TFD) → Hough-Radon (HR) transform → threshold → detection of chirps.
proposed by Barbarossa and Lemoine [3], [4].<br />
The motivation of this work is to apply a combined<br />
Hough-Radon (HR) transform (HRT) on TFDs obtained us-<br />
ing the matching pursuit (MP) method to detect energy-<br />
varying chirps with unknown parameters. The block dia-<br />
gram of the proposed method is shown in Fig.1. The use of<br />
the MP technique facilitates application of this method in<br />
environments with unknown signal-to-noise ratio (SNR).
II. THE HOUGH-RADON TRANSFORM
The TFD of a multicomponent nonstationary signal can<br />
be obtained using the MP method proposed by Mallat and<br />
Zhang [5]. In MP, the given signal is decomposed into a linear<br />
expansion of waveforms, known as TF atoms, selected<br />
from a large dictionary of Gabor functions. The TF atoms<br />
corresponding to only the coherent structures of the signal<br />
can be extracted, and the SNR of the signal with unknown<br />
noise power can be increased. The TFD obtained as a summation<br />
of the Wigner transforms of TF atoms is of high TF<br />
resolution and free of interference.<br />
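The MP step described above can be sketched in a few lines. This is a toy illustration, not the authors' implementation: the dictionary is a small grid of real-valued Gabor atoms whose centres, widths, and frequencies are arbitrary choices.

```python
import numpy as np

def gabor_atom(n, center, width, freq):
    """Real Gabor atom: Gaussian envelope times a cosine, normalized to unit energy."""
    t = np.arange(n)
    g = np.exp(-0.5 * ((t - center) / width) ** 2) * np.cos(2 * np.pi * freq * t)
    return g / np.linalg.norm(g)

def matching_pursuit(x, n_atoms=10):
    """Greedy MP: repeatedly pick the dictionary atom best correlated
    with the residual and subtract its projection."""
    n = len(x)
    # Small illustrative dictionary over a coarse grid of centers, widths, frequencies.
    dictionary = [gabor_atom(n, c, w, f)
                  for c in range(0, n, n // 8)
                  for w in (4, 16, 64)
                  for f in np.linspace(0.01, 0.45, 12)]
    residual = x.astype(float).copy()
    atoms = []
    for _ in range(n_atoms):
        coeffs = [residual @ g for g in dictionary]
        best = int(np.argmax(np.abs(coeffs)))
        atoms.append((coeffs[best], dictionary[best]))
        residual = residual - coeffs[best] * dictionary[best]
    return atoms, residual

# A noisy sinusoid: the coherent structure is captured first, leaving
# mostly noise in the residual (this is the SNR gain described above).
rng = np.random.default_rng(0)
t = np.arange(256)
x = np.cos(2 * np.pi * 0.1 * t) + 0.3 * rng.standard_normal(256)
atoms, residual = matching_pursuit(x, n_atoms=20)
print(np.linalg.norm(residual) < np.linalg.norm(x))  # residual energy strictly decreases
```

Each selected atom would then contribute its Wigner distribution to the interference-free TFD.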
To detect chirp components (straight lines at arbitrary<br />
orientations in the TF plane), the HT may be used. The HT<br />
is a common technique to detect lines and curves that satisfy<br />
a parametric constraint [6].<br />
The HT is most commonly used as follows. Consider a point (xi, yi) in the image plane (note that the image plane here denotes the TF plane). The general equation of a straight line in slope-intercept form is yi = m xi + b, where m is the slope, b is the intercept with the y axis, and the x and y axes correspond to the t and ω axes, respectively. There are an infinite number of lines that pass through a point (xi, yi) and still satisfy the equation yi = m xi + b for varying values of the parameters m and b. Parameterizing the TF plane into the (m, b) parameter space poses a problem because of the unbounded nature of m and b. One way to avoid this problem is to use the normal representation of a line, given by

x cos θ + y sin θ = ρ. (1)

The parameter space (θ, ρ), also known as the Hough domain, is now bounded in θ to the interval [0, π] and in ρ by the Euclidean distance to the farthest point in the image from the centre of the image.
From Eqn.1, for a specific point (ti, ωi) in the TF plane, we obtain a sinusoidal curve in the Hough domain. All of the sinusoids resulting from the mapping of a line in the TF plane have a common point of intersection in the Hough domain. Thus, chirps in the TF plane correspond to high-intensity points in the Hough domain.
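The voting scheme implied by Eqn.1 can be sketched numerically. The following is an illustrative gray-weighted Hough accumulation over an image array; the grid resolutions and the test image are arbitrary choices, not the authors' implementation.

```python
import numpy as np

def hough_gray(image, n_theta=180, n_rho=None):
    """Gray-weighted Hough transform using the normal line form
    x*cos(theta) + y*sin(theta) = rho. Each nonzero pixel votes into
    one (theta, rho) cell per theta, weighted by its gray value."""
    h, w = image.shape
    rho_max = np.hypot(h, w)
    if n_rho is None:
        n_rho = int(np.ceil(rho_max))
    thetas = np.linspace(0, np.pi, n_theta, endpoint=False)
    acc = np.zeros((n_theta, n_rho))
    ys, xs = np.nonzero(image)
    for x, y in zip(xs, ys):
        g = image[y, x]
        rhos = x * np.cos(thetas) + y * np.sin(thetas)
        idx = np.round((rhos + rho_max) * (n_rho - 1) / (2 * rho_max)).astype(int)
        acc[np.arange(n_theta), idx] += g
    return acc, thetas

# A diagonal line in a 64x64 image maps to a single bright point
# in the (theta, rho) plane, as a chirp would in a TFD.
img = np.zeros((64, 64))
for i in range(64):
    img[i, i] = 1.0          # line y = x, i.e. theta = 3*pi/4 in normal form
acc, thetas = hough_gray(img)
peak_theta = thetas[np.unravel_index(np.argmax(acc), acc.shape)[0]]
print(int(round(np.degrees(peak_theta))))  # prints 135
```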
A. Hough-Radon Algorithm for Chirp Detection<br />
The computational attractiveness of the HT arises from the subdivision of the Hough domain into accumulator cells. The cell at coordinates (i, j), with accumulator value A(i, j), corresponds to the square associated with the parameter coordinates (θi, ρj). Initially, the cells are set to zero. For every point (tk, ωk) in the TF plane, we let the parameter θ equal each of the allowed subdivision values on the θ axis and solve for the corresponding ρ using Eqn.1. The resulting ρ values are then rounded off to the nearest allowed value on the ρ axis. If a particular θi value results in the solution ρj, the corresponding accumulator A(i, j) is incremented. At the end of the procedure, a value of M in A(i, j) corresponds to M points in the TF plane lying on the line t cos θi + ω sin θi = ρj. It is evident that more subdivisions in the Hough domain will lead to a more accurate estimate of collinear points, but at the expense of additional computational complexity. In this work, the full ranges of θ and ρ were used.

The main drawback of the HT is that it is usually performed on binary images, and hence may not be appropriate for gray-level images and TFDs. As the energy values of chirps vary in the TF plane, they occupy different gray levels, with 255 corresponding to the highest scaled energy component. It is not appropriate to binarize the TF image, since the HT will then not be able to detect energy-varying chirp components. This drawback can be avoided by using the combined HRT.

With the HRT, the algorithm is exactly similar to the one discussed earlier, except that instead of counting the number of collinear points in a cell, the gray values of the collinear points are added to each cell and then multiplied by the total number of collinear points in that cell. In this way, chirp components appear as high-intensity points in the HR domain, and the brightness increases with the energy of the chirp.

The HRT is a powerful way of determining directional elements (such as chirps) in gray-level images (such as TFDs), but lacks by itself the capability to eliminate components that do not contribute coherently to a particular directional pattern. The high-intensity components (features of interest) in the HR domain can be extracted by using gradient operators. A gradient operator that may be used in the Hough domain is the simple 3x3 mask shown below [7]:

0 -2 0
1  2 1
0 -2 0

A drawback with this approach is that the filter was designed for detecting lines of one pixel width, and cannot be used to detect broad directional components. As chirps are normally broad components because of the inherent tradeoff between time and frequency resolution of TF transforms, the above filter may often fail.

The problem discussed above can be overcome by not using a convolution mask, but instead detecting the peaks corresponding to broad components by applying a suitable threshold in the HR domain. Values in cells greater than the threshold are retained, and values lower than the threshold are set to zero. The threshold is selected based on local statistics in the HR domain. First, the histogram (probability density function) of the HR image is calculated, and the mean is computed as M = (1/(rc)) Σ_g g·h(g), where h(g) is the frequency of occurrence of the gth gray level of the HR image with r rows and c columns. Then, the threshold is computed as

threshold = signal factor × M. (3)

The signal factor is dependent on the type of the signal being analyzed.

B. Mathematical Proof

It can be mathematically shown that the HRT (or, more generally, the RT) of a TFD provides maximum likelihood (ML) detection of a chirp signal. Wood and Barry [2] have stated that the RT of the general Wigner TFD is equivalent to the ML estimate of a chirp. In this paper, the above statement is mathematically proved, and the results can be directly extended to the interference-free TFD obtained using MP. The ML detection of a linear chirp is given by [1]:

L = max_{ω0, m} ∫_{−∞}^{∞} W(t, ω0 + mt) dt > η, (4)

where

W(t, ω) = (1/2π) ∫_{−∞}^{∞} x*(t + τ/2) x(t − τ/2) exp(−jωτ) dτ (5)
is the Wigner distribution of the signal x(t), and m is the slope of the path of integration in the TF plane, as shown in Fig.2.

Fig. 2. Graphical illustration of the HRT of a TFD.

Geometrically, the line integration in Eqn.4 is similar to the RT of W(t, ω), given by

R{W(t, ω)} = ∫_{−∞}^{∞} W(ρ cos θ − s sin θ, ρ sin θ + s cos θ) ds, (6)

where R is the Radon operator. Using Eqn.5 in Eqn.6,

R{W(t, ω)} = (1/2π) ∫_{−∞}^{∞} ∫_{−∞}^{∞} x*(ρ cos θ − s sin θ + τ/2) x(ρ cos θ − s sin θ − τ/2) exp(−j(ρ sin θ + s cos θ)τ) dτ ds. (7)

From Fig.2, m = −cot θ and ω0 = ρ/sin θ. For ML estimation of a chirp, ω = ω0 + mt from Eqn.4. Therefore,

ω0 + mt = ρ/sin θ − (cos θ/sin θ)(ρ cos θ − s sin θ) = (ρ − ρ cos²θ + s sin θ cos θ)/sin θ = ρ sin θ + s cos θ. (8)

Now, transforming Eqn.7 to Cartesian coordinates and using Eqn.8, we get

R{W(t, ω)} = (1/2π) ∫_{−∞}^{∞} ∫_{−∞}^{∞} x*(t + τ/2) x(t − τ/2) exp(−j(ω0 + mt)τ) dτ dt = ∫_{−∞}^{∞} W(t, ω0 + mt) dt. (9)
This proves that the RT or the HRT of a general Wigner<br />
TFD (or the TFD obtained by MP) is equivalent to ML<br />
detection of chirps.<br />
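The detection rule of Eqn.4 can be illustrated numerically. In the sketch below, an idealized TFD array (a unit-amplitude ridge plus background noise) stands in for the Wigner distribution, and the (f0, m) search grids and threshold are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(1)

# Idealized TFD grid (time x frequency) with a noisy linear chirp ridge.
n_t, n_f = 128, 128
tfd = 0.1 * rng.random((n_t, n_f))
true_f0, true_slope = 20, 0.5            # instantaneous frequency f(t) = f0 + slope * t
for t in range(n_t):
    f = int(true_f0 + true_slope * t)
    if 0 <= f < n_f:
        tfd[t, f] += 1.0

def detect_chirp(tfd, f0_grid, slope_grid):
    """Search over (f0, m): integrate the TFD along each candidate line
    and return the parameters with the largest line integral (cf. Eqn.4)."""
    best, best_params = -np.inf, None
    t = np.arange(tfd.shape[0])
    for f0 in f0_grid:
        for m in slope_grid:
            f = np.round(f0 + m * t).astype(int)
            valid = (f >= 0) & (f < tfd.shape[1])
            score = tfd[t[valid], f[valid]].sum()
            if score > best:
                best, best_params = score, (f0, m)
    return best_params, best

params, score = detect_chirp(tfd, range(0, 60, 5), np.linspace(-1, 1, 21))
print(params)  # the grid point closest to (20, 0.5)
```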
C. Analysis of TF Resolution
The proposed technique can also be used to evaluate<br />
the TF resolution provided by different TFDs. A test signal<br />
composed of a sine and a Dirac component is passed through<br />
the TFD generator (e.g., Choi-Williams, spectrogram), and<br />
the HRT of the output is computed. The TF resolution of<br />
Fig. 3. Testing the thresholded HR method with an image. (a) Original image, (b) HR image, (c) after applying the gradient mask operator to (b), and (d) after thresholding (b).
Fig. 4. Directional concentration plots provided by two methods (x axis: angle in degrees). (a) Gradient mask method, (b) threshold method.
the TFD can be evaluated by observing the peaks at 0 and 90<br />
degrees in the HR domain. A TFD with good time resolution<br />
will have a narrow component at 90 degrees, whereas a TFD<br />
with good frequency resolution will have a narrow component<br />
at 0 degrees. At present, there is no technique available to
check the TF resolution provided by different TFDs. The<br />
proposed method should be a good tool to evaluate the TF<br />
resolution performance of different TF methods.<br />
III. RESULTS
Experiment 1: The proposed method was tested on the non-<br />
TF image shown in Fig.3(a). The image has broad direc-
tional features, comparable to what is expected with chirp<br />
signals in the TF plane. The HR image is shown in Fig.3(b),<br />
from which it is evident that the broad directional compo-<br />
nents in the test image correspond to bright, broad com-<br />
ponents in the HR domain. The HR image also has other<br />
less intense components that do not relate to the directional
Fig. 5. Results with synthetic signal 1. (a) Synthetic signal (x axis: time samples, y axis: amplitude), (b) TFD of (a) (x axis: time, y axis: frequency), (c) HR image after thresholding (x axis: θ from 0 to π, y axis: ρ).
features. The less-intense components are reduced to some<br />
extent by using the 3x3 mask. The mask did not perform<br />
well in extracting the broad components. By thresholding<br />
the parameter space using the threshold as in Eqn.3, the<br />
broad components were extracted, as illustrated in Fig.3(d).<br />
Figs.4(a) and 4(b) show the directional concentration of the HR distributions in Figs.3(c) and 3(d). From Fig.4(b) it is evident that the directional components are better resolved by the threshold method than by the gradient mask method.
Experiment 2: The proposed method of chirp detection was<br />
tested on two synthetic signals of known time and frequency<br />
dynamics. The synthetic signals were computed using a si-<br />
nusoid (frequency dynamics), and two chirps (TF dynamics)<br />
with some random noise. Both the synthetic signals had the<br />
above components, but with different time durations. Each<br />
signal was decomposed into TF atoms (Gabor functions) by<br />
using the MP algorithm, with the maximum number of TF<br />
atoms allowed being 200. The TFDs of the signals were com-<br />
puted by adding the Wigner distributions of the TF atoms.<br />
The TFDs of the two signals are shown in Figs.5 and 6. It<br />
is interesting to note that frequency dynamics (sine com-<br />
ponent) and TF dynamics (chirp components) are treated<br />
equally well by this representation. The TFD obtained us-<br />
ing MP is interference-free, and gives the best possible TF<br />
resolution among all the TFD methods available.<br />
The HRT was applied to the TFDs in Figs.5(b) and<br />
6(b). The two chirp components appear as bright spots in<br />
the HR image at angles of about 60° and 120°. The TFD of Fig.5(b) has broader components than that in Fig.6(b);
this is because of the lower TF resolution of shorter-duration<br />
signals. This difference can be seen in Figs.5(c) and 6(c).<br />
Fig. 6. Results with synthetic signal 2. (a) Synthetic signal, (b) TFD<br />
of (a), (c) HR image (after thresholding).<br />
IV. CONCLUSION<br />
A novel approach to detect chirps in unknown SNR<br />
environments has been proposed in this paper. The com-<br />
bined Hough and Radon transform of a TFD obtained using<br />
the MP method provides maximum likelihood detection of<br />
chirps. The problem of identifying directional components<br />
and their dynamics in the TF plane simplifies to locating<br />
high-intensity spots in the HR plane. The method could be<br />
extended to detect arbitrary patterns in the TF plane by<br />
using the generalized HT.<br />
Acknowledgements: We gratefully acknowledge support from<br />
the Alberta Heritage Foundation for Medical <strong>Research</strong> and<br />
the Natural Sciences and Engineering <strong>Research</strong> Council of<br />
Canada.<br />
REFERENCES<br />
[1] S. Kay and G.F. Boudreaux-Bartels. On the optimality of the Wigner distribution for detection. In Proc. IEEE ICASSP, pages 2331-2334, 1986.
[2] J.C. Wood and D.T. Barry. Radon transformation of time-frequency distributions for analysis of multicomponent signals. IEEE Transactions on Signal Processing, 42(11):3166-3177, November 1994.
[3] S. Barbarossa. Analysis of multicomponent LFM signals by combined Wigner-Hough transform. IEEE Transactions on Signal Processing, 43(6):1511-1515, June 1995.
[4] S. Barbarossa and O. Lemoine. Analysis of nonlinear FM signals by pattern recognition of their time-frequency representation. IEEE Signal Processing Letters, 3(4):112-115, April 1996.
[5] S.G. Mallat and Z. Zhang. Matching pursuit with time-frequency dictionaries. IEEE Transactions on Signal Processing, 41(12):3397-3415, 1993.
[6] R.C. Gonzalez and P. Wintz. Digital Image Processing. Addison-<br />
Wesley, Inc., Reading, MA, second edition, 1987.<br />
[7] W.A. Rolston. Directional image analysis. Master's thesis, Dept.<br />
of Elect. and Comp. Engg., The Univ. of Calgary, Calgary, AB,<br />
Canada, 1994.
Spatial-Temporal Decorrelating Decision-Feedback Multiuser Detector<br />
for Synchronous Code-Division Multiple- Access Channels<br />
Sridhar Krishnan and Brent R. Petersen<br />
Dept. of Electrical and Computer Engineering, The University of Calgary,
2500 University Drive NW, Calgary, Alberta T2N 1N4, CANADA.
Email: {krishnan, bp}@enel.ucalgary.ca
Abstract - In this paper, a new multiuser detector<br />
for synchronous code-division multiple-access channels is<br />
developed, and the performance is compared with other<br />
multiuser detectors. The proposed multiuser detector is<br />
based on spatial-temporal filtering and decision-feedback<br />
techniques, hence the name spatial-temporal decorrelat-<br />
ing decision-feedback (STDF) detector. An optimum<br />
STDF detector is expected to have exponential complexity as the number of users grows. A suboptimum
STDF detector shows a better performance in terms of<br />
probability of error (or SNR) and asymptotic efficiency<br />
as compared to the other suboptimum detectors. Simu-<br />
lation results under diverse channel conditions show that<br />
STDF is a bandwidth efficient technique, which is an es-<br />
sential requirement for modern wireless communications.<br />
Also the results indicate that STDF performance gains<br />
are more significant for relatively weak users.<br />
I. INTRODUCTION<br />
Multiuser communications has been an active area<br />
of research and has numerous applications, especially in<br />
the area of wireless communications. There are several<br />
different ways in which multiple users can communicate<br />
through the channel to the receiver. One method is to<br />
allow more than one user to share a channel or a sub-<br />
channel by use of a unique code sequence or signature<br />
waveforms that allows the user to spread the informa-<br />
tion signal across the assigned bandwidth. This multiple-<br />
access method is called the code-division multiple-access<br />
(CDMA). The objective of this work is to develop an effi-<br />
cient multiuser detector based on spatial filtering (beam-<br />
formers) and decision feedback for synchronous CDMA<br />
channels.<br />
In a CDMA system, the receiver may have an idea<br />
about the assigned signature waveforms and observes the<br />
sum of the transmitted signals in additive white Gaussian<br />
noise. The conventional single-user detector consists of a<br />
bank of single-user matched filters followed by thresh-
old detectors. If the assigned signature waveforms are<br />
orthogonal and if the powers of the users are not very<br />
different, then the conventional detector would achieve optimum demodulation [1]. It has been shown that the
optimum maximum likelihood receiver employing a gen-<br />
eralization of the Viterbi algorithm significantly outper-<br />
0-7803-3905-3/97/$10.00 © 1997 IEEE
forms the conventional single-user detector at the expense<br />
of a high computational complexity [2]. Since these conditions are often difficult to satisfy in practice, several suboptimum detectors have been proposed in the literature [1], [3], [4], [5].
A. The Linear Decorrelating Detector
The linear decorrelating detector, also known as the decorrelator, can significantly outperform the conventional single-user detector [1]. This is because the decorrelator takes into account the outputs of all the matched filters in making a decision, as opposed to the single matched-filter output used in the conventional detector. As the outputs of all the matched filters are considered in making decisions, the multiuser interference among the users can be easily exploited. The signal at the output of the matched filter is given as

y = RWb + n, (1)

where R is the crosscorrelation matrix of the signature waveforms, W is a diagonal matrix with Wk,k = √Ek, k = 1, . . . , K, where Ek denotes the received energy of
the users, and is explained in the next section. Decorrelating decision-feedback (DF) detectors for synchronous and asynchronous CDMA channels have also been proposed in the literature [3], [4]. The performance gains of DF detectors are more significant than those of the decorrelator, especially for relatively weak users [3], [4].
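A minimal numeric sketch of the model in Eqn.1 and the decorrelator, assuming Wk,k = √Ek and a strong interfering user; the crosscorrelation value, energies, and noise level are illustrative choices, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(2)

# Two-user synchronous CDMA at the matched-filter output: y = RWb + n (Eqn.1).
R = np.array([[1.0, 0.7],
              [0.7, 1.0]])                 # signature crosscorrelation matrix
W = np.diag(np.sqrt([9.0, 1.0]))           # W[k,k] = sqrt(E_k): strong user 1, weak user 2
b = np.array([1.0, -1.0])                  # transmitted bits
Rinv = np.linalg.inv(R)
sigma, n_trials = 0.3, 10_000
errors_conv = errors_dec = 0
for _ in range(n_trials):
    y = R @ W @ b + sigma * rng.standard_normal(2)
    # Conventional detector: sign of each matched-filter output.
    errors_conv += np.any(np.sign(y) != b)
    # Decorrelator: apply R^-1 first, removing the multiuser interference.
    errors_dec += np.any(np.sign(Rinv @ y) != b)
print(errors_conv, errors_dec)
```

The strong user's interference overwhelms the weak user at the conventional detector, while the decorrelator's error count depends only on the (enhanced) Gaussian noise, illustrating its near-far resistance.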
B. Spatial-Temporal Decorrelating (STD) Detector
Integrated spatial-temporal processing of the re-<br />
ceived signal has been shown to provide significant per-<br />
formance improvement over the decorrelator [5]. In most<br />
cases the multiple users are distributed in space in such<br />
a way that they are intercepted at the detector from var-<br />
ious directions. By exploiting the signals' spatial dis-<br />
tribution (direction matrix) and the temporal properties<br />
(crosscorrelation matrix R) followed by linear decorrela-<br />
tion (decorrelator) a uniformly superior performance has<br />
been achieved. The signal at the output of the matched<br />
filter is given as<br />
y = MWb + n. (3)<br />
The above equation is similar to Eqn.1 except for the matrix M, which here is the spatial-temporal crosscorrelation matrix. The next stage of processing is similar to the decorrelator, and the output of the matrix filter in this case is

ỹ = M⁻¹y = Wb + ñ, (4)

where ñ is a Gaussian noise vector with autocorrelation matrix σ²M⁻¹. Comparative results with
other suboptimum detectors have shown superior perfor-<br />
mance gains of STD detectors [5]. The superior perfor-<br />
mance of STD detectors has been the motivation of this<br />
work, in which spatial filtering is combined with decision-<br />
feedback (STDF). The STDF detector is derived in the next
section and its performance is compared with the other<br />
suboptimum detectors.<br />
II. SPATIAL-TEMPORAL DECORRELATING
DECISION-FEEDBACK (STDF) DETECTOR<br />
It has already been shown that for STD the complex-<br />
ity of the detector grows exponentially as the number of<br />
users increases [5]. Hence, the same complexity can be ex-
pected for optimum STDF detectors. Here only a subop-<br />
timum STDF is considered. The suboptimum STDF de-<br />
tector is derived by exploiting the spatial-temporal cross-<br />
correlation matrix M given in Eqn.3. The matrix M can<br />
be written as the Hadamard (or Schur) product of the matrices R and AᴴA:

M = R ∘ (AᴴA), (5)

where ∘ denotes the Hadamard product of matrices [6]. A is the direction matrix comprising the direction vectors a1, a2, . . . , aK, i.e., A = [a1 a2 . . . aK], and the
direction vector is ak = [1 e^{jθk1} . . . e^{jθkP}]ᵀ. The direction vector ak expresses the phases and gains of the P sensors relative to a reference sensor in the direction of arrival of the wavefront of user k. Aᴴ is the conjugate transpose of A. Analysis shows that R is a positive-definite matrix.
Also, it can be easily shown that AᴴA is always a positive-definite matrix, which implies that the matrix M is also positive definite. If M is a positive-definite matrix, then it can be decomposed as

M = LᴴL, (6)

where Lᴴ is an upper triangular matrix and L is a lower triangular matrix. This method of matrix decomposition is known as the Cholesky decomposition [7]. The matrices Lᴴ and L correspond to causal and stable matrix filters. If the filters are represented as a spectrum (frequency-domain representation), then the Cholesky decomposition can be viewed as a spectral factorization theorem [8].
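One way to obtain the M = LᴴL factorization stated above is a "flipped" Cholesky, since numerical libraries normally return the M = CCᴴ form. The sketch below assumes a uniform linear array for the direction vectors; the angles, correlation values, and array size are illustrative choices.

```python
import numpy as np

# Spatial-temporal crosscorrelation matrix M = R ∘ (A^H A) (Eqn.5),
# factored as M = L^H L with L lower triangular, as in the text.
K, P = 2, 4                              # users, sensors
R = np.array([[1.0, 0.7],
              [0.7, 1.0]])               # temporal crosscorrelation matrix
angles = np.radians([10.0, 60.0])        # assumed directions of arrival
p = np.arange(P)[:, None]
A = np.exp(1j * np.pi * p * np.sin(angles)[None, :])   # ULA direction matrix
M = R * (A.conj().T @ A)                 # Hadamard (elementwise) product

# "Reverse" Cholesky: numpy gives JMJ = C C^H with C lower triangular;
# flipping with the exchange matrix J yields L lower with M = L^H L.
J = np.eye(K)[::-1]
C = np.linalg.cholesky(J @ M @ J)
L = (J @ C @ J).conj().T

assert np.allclose(L.conj().T @ L, M)    # M = L^H L
assert np.allclose(np.triu(L, 1), 0)     # L is lower triangular
print("ok")
```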
In STDF, the sampled output y of the matched filter is passed through the feedforward filter (Lᴴ)⁻¹, and the resulting output vector is given as

ỹ = (Lᴴ)⁻¹y = LWb + ñ, (7)

where ñ is a white Gaussian noise vector whose components have variance σ². The feedforward filter (Lᴴ)⁻¹ is therefore a whitening filter. The model given in Eqn.7 is a white-noise model of the CDMA channel, and the expression in Eqn.7 also makes the analysis simpler. The kth component of the vector
203<br />
Authorized licensed use limited to: <strong>Ryerson</strong> <strong>University</strong> Library. Downloaded on July 6, 2009 at 16:08 from IEEE Xplore. Restrictions apply.<br />
bK
24<br />
ỹ can be written as the sum of a desired term, an interference term, and a white noise term, from which the SNR of the kth user can be obtained.
The above model gives rise to the decision-feedback technique used in STDF. The users' signals in STDF are arranged in decreasing order of energy (E1 ≥ E2 ≥ . . . ≥ EK), so that user 1 is the strongest user and user K the weakest.
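The ordering and feedback described above can be sketched as a successive-decision loop on the whitened model of Eqn.7; the matrix values, user energies, and noise level below are illustrative, and the real-valued case is used for simplicity.

```python
import numpy as np

rng = np.random.default_rng(3)

# Whitened model (Eqn.7): y~ = L W b + n~, with L lower triangular and
# users ordered strongest first, so user k sees feedback only from
# the already-detected (stronger) users 1..k-1.
K = 3
M = np.array([[1.0, 0.5, 0.3],
              [0.5, 1.0, 0.4],
              [0.3, 0.4, 1.0]])            # assumed spatial-temporal matrix
# Factor M = L^T L (real case) via a flipped Cholesky.
J = np.eye(K)[::-1]
L = (J @ np.linalg.cholesky(J @ M @ J) @ J).T
W = np.diag(np.sqrt([4.0, 2.0, 1.0]))      # E1 >= E2 >= E3
b = np.array([1.0, -1.0, 1.0])             # transmitted bits
y_t = L @ W @ b + 0.1 * rng.standard_normal(K)

# Successive decisions with interference cancellation (decision feedback).
b_hat = np.zeros(K)
for k in range(K):
    feedback = L[k, :k] @ (np.diag(W)[:k] * b_hat[:k])
    b_hat[k] = np.sign(y_t[k] - feedback)
print(b_hat)  # recovers the transmitted bits at this noise level
```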
Fig. 2. Probability of error curves for weaker user in a two-user system for different combinations of signal cross correlations and angular<br />
separations. Legend: corr- correlation, sep- separation, SD- spatial decorrelator, STDF- spatial-temporal decorrelating decision<br />
feedback, STD- spatial-temporal decorrelating, DF- decision feedback.<br />
received energy of the user divided by the power spectral<br />
density level (No) of the background thermal white Gaus-<br />
sian noise (not including interference from other users).<br />
In essence, the efficiency represents the performance loss<br />
due to multiuser interference. The desirable figure of<br />
merit is the asymptotic efficiency ηk, in which the background Gaussian noise level goes to zero, i.e.,

ηk = lim_{N0 → 0} (SNRm,k / SNRs,k),

the ratio of the multiuser SNR to the single-user SNR of user k,
which characterizes the underlying performance loss<br />
when the dominant impairment is the existence of other<br />
users rather than the additive channel noise.<br />
For the STDF detector, it follows from the expression in Eqn.17 that the ideal probability of error does not depend on the noise level or the power of the interferers, so the ideal asymptotic efficiency of STDF is unity.
IV. SIMULATION RESULTS

In the following simulation experiments, the performance figures of the proposed STDF detector have been compared with those of the other suboptimum detectors.

A. Signal-to-Noise Ratio

For comparing the SNR of the multiuser detectors, two experiments were performed. In experiment 1, a two-user
system was considered. For the two-user system different<br />
types of channel and direction matrix combinations were<br />
tried. The two channels considered were<br />
R1 has a low crosscorrelation factor, whereas the crosscorrelations between the users' signature waveforms are relatively high in the case of R2; in simple words, R2 simulates a bandwidth-efficient channel. Two different spatial distributions of the users were also considered.
A1 corresponds to a low angular separation (13°) between the two users, whereas in the case of A2 the users are separated by an angle of 67.5°.
Fig.2 shows the probability of error graphs for the weaker user. In Fig.2(a), the users have low signal crosscorrelations and low angular separation between them. STDF performs slightly better than STD, and DF performs better than the decorrelator. The advantage of spatial filtering is clearly evident from the superior performance of STDF and STD over DF and the decorrelator. The same observations hold in Fig.2(b), but the spatial decorrelator (SD) [5] shows some performance improvement, as expected for highly separated users.
Figs.2(c) and 2(d), corresponding to highly correlated user signature waveforms, show interesting results.
Fig. 3. Probability of error curves for stronger user in a two-user system for different combinations of signal cross correlations and<br />
angular separations. Legend same as in Fig.2.<br />
The STDF detector clearly outperforms the other detec-<br />
tors. The results indicate that STDF can be used in<br />
bandwidth-efficient CDMA channels where the signature<br />
waveforms have significant crosscorrelations. In case of<br />
Fig.2(d) the SD shows a slight improvement over the<br />
decorrelator.<br />
Fig.3 shows the graphs for the stronger user of the<br />
two. The graphs clearly indicate that there is no perfor-<br />
mance difference between STDF and STD, and also be-<br />
tween DF and the decorrelator. The only factor that makes<br />
STDF or STD better is the spatial filtering. As there is<br />
no feedback involved for the strongest user in STDF and DF,<br />
the error rates are identical to those of STD and the decor-<br />
relator, respectively (this agrees with theory). Also<br />
in Fig.3(d) the SD shows a slightly better performance<br />
than the decorrelator, further confirming that in case of<br />
highly correlated and highly separated users, spatial fil-<br />
tering will make a significant contribution towards SNR<br />
improvement.<br />
In experiment 2, a four-user system was considered.<br />
The signal crosscorrelation matrix was given by<br />
       [ 1.0  0.5  0.4  0.2 ]<br />
R3  =  [ 0.5  1.0  0.8  0.6 ]<br />
       [ 0.4  0.8  1.0  0.3 ]<br />
       [ 0.2  0.6  0.3  1.0 ]<br />
and the direction matrix of a sensor with respect to a<br />
reference sensor in the direction of arrival of the users'<br />
wavefronts is<br />
Fig.4 shows the probability of error curves for the<br />
above four-user system. The four-user system simulates a<br />
multiuser environment better than the two-user examples<br />
considered earlier. From the graphs, it is clearly evident<br />
that the STDF detector shows a better performance than<br />
the other suboptimum detectors considered in this work.<br />
As the users become stronger the performance difference<br />
between STDF and STD narrows down. The results con-<br />
firm that STDF is a very powerful detection technique<br />
for relatively weak users. Also, in case of the strongest<br />
user there is no performance difference between STDF<br />
and STD, which coincides with our earlier observations.<br />
B. Asymptotic Efficiency<br />
In experiment 3, the asymptotic efficiencies of differ-<br />
ent detectors were compared. As asymptotic efficiency<br />
basically measures the performance loss of a detector due<br />
to multiuser interference, a four-user system (R3 and<br />
A3) was considered: a four-user system with different<br />
signal crosscorrelations simulates a hostile environment<br />
with different levels of multiuser interference better than<br />
a two-user system does. Fig.5 shows the histogram plot<br />
of asymptotic efficiencies for all four users. The plot shows that STDF<br />
is always more efficient than the other suboptimum de-<br />
tectors, except in case of the strongest user, where STDF
[Figure panels: probability of error versus SNR (in dB) for the individual users, e.g. (a) weakest user, 2nd strongest user.]<br />
18th Annual International Conference of the IEEE Engineering in Medicine and Biology Society, Amsterdam 1996<br />
4.2.2: Time-varying <strong>Analysis</strong> of Various <strong>Signal</strong>s<br />
Screening of Knee Joint Vibroarthrographic <strong>Signal</strong>s by Statistical<br />
Pattern <strong>Analysis</strong> of Dominant Poles<br />
S. KRISHNAN¹, R.M. RANGAYYAN¹,², G.D. BELL¹,³, C.B. FRANK²,³, K.O. LADLY³<br />
¹Dept. of Electrical and Computer Engineering, ²Dept. of Surgery, ³Sport Medicine Centre<br />
The <strong>University</strong> of Calgary, Alberta, T2N 1N4, CANADA. (Email: ranga@enel.ucalgary.ca)<br />
Abstract-<strong>Analysis</strong> of human knee joint vibration signals<br />
or vibroarthrographic (VAG) signals could lead to a non-<br />
invasive method for the diagnosis of cartilage pathol-<br />
ogy. In this study, the nonstationary VAG signals<br />
were adaptively segmented into locally stationary seg-<br />
ments. Autoregressive (AR) model coefficients were de-<br />
rived from the stationary segments by using the Burg-<br />
lattice method. The dominant poles of the models ex-<br />
tracted from the AR polynomials and a signal variability<br />
parameter were used as VAG signal features. The VAG<br />
signal features with a few relevant clinical parameters<br />
were used as feature vectors in statistical pattern classifi-<br />
cation experiments based on logistic regression analysis.<br />
The results indicated a classification accuracy of 81.7% in<br />
screening 90 VAG signals with no restriction imposed on<br />
the type of abnormal signals, and an accuracy of 93.7%<br />
in classifying 71 VAG signals with abnormal signals re-<br />
stricted to a specific type of articular cartilage pathology<br />
known as chondromalacia patella.<br />
I. INTRODUCTION<br />
Vibration signals emitted from human knee joints<br />
during normal movement of the leg, known as vi-<br />
broarthrographic (VAG) signals, are expected to be as-<br />
sociated with roughness, softening, or the state of lubri-<br />
cation of the cartilage surfaces, and may be useful indi-<br />
cators of early joint degeneration or disease. VAG sig-<br />
nal analysis could decrease the need for diagnostic use of<br />
arthroscopy. A variety of VAG signal analysis techniques<br />
have been proposed in the literature (for a review of pre-<br />
vious publications, please see Moussavi et al. [1]). The<br />
present work investigates, with a large database of sig-<br />
nals, the diagnostic potential of VAG based on pattern<br />
classification experiments performed using signal model<br />
parameters and a few clinical parameters as features.<br />
II. METHODS<br />
A. Data Acquisition<br />
In order to detect the VAG signal, a Dytran ac-<br />
celerometer (model 3115a) was placed on the surface of<br />
the skin at the mid-patella position of the knee, and the<br />
signal was recorded during swinging movement of the leg<br />
from 135° to 0° to 135° in a total time period of 4 s. An<br />
electronic goniometer was placed on the lateral side of the<br />
knee to measure the angle of motion. Before digitizing the<br />
signal at a sampling rate of 2.5 kHz and 12 bits/sample,<br />
the signal was amplified and conditioned using a 10 Hz<br />
to 1 kHz bandpass filter.<br />
Auscultation was performed during swinging move-<br />
ment of the leg by placing a stethoscope on the lateral,<br />
medial, and anterior surfaces of the knee. The sounds<br />
heard were categorized and coded along with the ap-<br />
proximate corresponding joint angle for use as features in<br />
classification experiments. For subjects who underwent<br />
arthroscopy, the location of the pathology was used to es-<br />
timate the joint angle at which the pathological surfaces<br />
would come in contact and contribute to the VAG signal.<br />
Two databases were used in this study: (1) Database<br />
A consists of 51 normal and 39 abnormal signals, with<br />
no restriction on the type of cartilage pathology; and (2)<br />
Database B, extracted from database A, consists of 51<br />
normal and 20 abnormal signals, with the abnormals re-<br />
stricted to chondromalacia patella only. (Chondromala-<br />
cia patella is a common type of articular cartilage pathol-<br />
ogy in which the cartilage softens, fibrillates, and finally<br />
the bone is exposed.)<br />
0-7803-3811-1/97/$10.00 © IEEE<br />
B. Feature Extraction<br />
Like many other biological signals, VAG signals are<br />
nonstationary. Hence, in order to apply standard sig-<br />
nal processing methods such as parametric modeling or<br />
spectral analysis, the signals have to be segmented into<br />
quasi-stationary segments. In this work, VAG signals were<br />
adaptively segmented into stationary segments by using a<br />
recursive least-squares lattice algorithm [2]. An example<br />
of a VAG signal of an abnormal subject with chondroma-<br />
lacia patella, along with the final segment boundaries, is<br />
illustrated in figure 1. The VAG signal segments were au-<br />
toregressive (AR) modeled using the Burg-lattice method<br />
[2]. The transfer function of the AR or the "all-pole" filter<br />
may be written as<br />
H(z) = 1 / (z^M + a_1 z^(M-1) + a_2 z^(M-2) + ... + a_M),   (1)<br />
where M is the model order, and a_k are the AR coeffi-<br />
cients [3]. By factorizing the denominator, Eq. 1 can be<br />
rewritten as<br />
H(z) = 1 / [(z - p_1)(z - p_2)(z - p_3) ... (z - p_M)],   (2)<br />
Fig. 1. VAG signal of an abnormal subject with chondromalacia<br />
patella. The vertical lines represent segment boundaries. au:<br />
Arbitrary units.<br />
where p_1, p_2, ..., p_M are the complex poles of the model. A<br />
model order of 40 was used [2]. Since the model order was<br />
an even number, the poles occurred in conjugate pairs.<br />
The distance r of a pole from the origin in the complex<br />
z-plane determines its spectral bandwidth, f_B, as<br />
f_B = cos^(-1) [ (1 + r^2 - 2(1 - r)^2) / (2r) ].<br />
Poles with a large r contribute to the dominant peaks<br />
in the signal spectrum [4]. The superior performance of<br />
poles in tracking the frequency or spectral behavior of a<br />
signal makes them an appropriate choice for parametric<br />
representation of signals with multi-peaked spectra, such<br />
as VAG signals.<br />
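As a concrete illustration of the dominant-pole selection and the bandwidth relation above, here is a minimal Python sketch (the function names, the clamping guard, and the fs/(2π) scaling that converts the arccos value to Hz at the 2.5 kHz sampling rate are our assumptions, not taken from the paper):<br />

```python
import math

def pole_bandwidth(r, fs=2500.0):
    # Half-power bandwidth of a z-plane pole at radius r, from
    # cos(w_B) = (1 + r^2 - 2*(1 - r)^2) / (2r); the fs/(2*pi)
    # factor (assumed) converts the angle to Hz.
    arg = (1.0 + r * r - 2.0 * (1.0 - r) ** 2) / (2.0 * r)
    arg = max(-1.0, min(1.0, arg))  # guard against rounding drift
    return fs / (2.0 * math.pi) * math.acos(arg)

def dominant_poles(poles, n=6):
    # Keep one pole from each complex-conjugate pair (Im >= 0),
    # then pick the n poles with the largest radius |p|; these
    # produce the sharpest peaks in the AR model spectrum.
    upper_half = [p for p in poles if p.imag >= 0.0]
    return sorted(upper_half, key=abs, reverse=True)[:n]
```

Poles nearer the unit circle (larger r) yield a smaller f_B, i.e., narrower spectral peaks, which is why the radius is used to rank dominance.<br />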
C. Pattern <strong>Analysis</strong><br />
From the twenty poles (complex conjugate pole pairs<br />
were represented by only one pole from each pair) of the<br />
model of each VAG signal segment, six poles with the<br />
highest r were selected as the dominant poles. The six<br />
dominant poles; a signal variability parameter computed<br />
as the variance of the means (VM) of the segments of a<br />
VAG signal record; and a few clinical parameters such as<br />
the type of sound heard during auscultation, age, gender,<br />
and activity level of the subject were used to form feature<br />
vectors for use in classification experiments.<br />
The accuracy rate in classification of VAG signal<br />
segments into normal and abnormal groups was deter-<br />
mined by applying logistic regression analysis [5] on ran-<br />
dom splits of the databases into training and test sets.<br />
The VAG signal segments used in the test sets were to-<br />
tally different from and independent of the VAG signal<br />
segments used in the training sets. The overall accuracy<br />
rate was calculated as the percentage of the correctly-<br />
classified segments in the test set.<br />
III. RESULTS<br />
Several random split experiments were conducted<br />
with databases A and B. Table I shows the best test<br />
results obtained.<br />
TABLE I<br />
CLASSIFICATION ACCURACY WITH DATABASES A AND B<br />
Database | Normal Segments  | Abnormal Segments | Overall<br />
A        | 201/211 (95.3%)  | 36/79 (45.6%)     | 237/290 (81.7%)<br />
B        | 188/195 (96.4%)  | 35/43 (81.4%)     | 223/238 (93.7%)<br />
The use of poles instead<br />
of the AR model coefficients [2] has provided an increase<br />
in classification accuracy of about 2 to 3%.<br />
IV. DISCUSSION<br />
The results confirm that VAG signal analysis is in-<br />
deed a potential tool for noninvasive diagnosis of artic-<br />
ular cartilage pathology. The results with database B<br />
further indicate that the proposed methods have poten-<br />
tial in detecting chondromalacia patella with noninvasive<br />
procedures.<br />
The use of AR model poles has the advantage that<br />
the pole frequencies could be directly related to domi-<br />
nant frequency components present in the signals. Such a<br />
parametric representation of signals should facilitate bet-<br />
ter description and understanding of signal and system<br />
characteristics than the use of more abstract parameters<br />
such as the AR model coefficients.<br />
Future work will be directed towards wavelet analysis<br />
for improved feature analysis of the nonstationary VAG<br />
signals, which may overcome some of the approximations<br />
involved in our current segmentation-based approach.<br />
ACKNOWLEDGEMENT<br />
We gratefully acknowledge support of this project<br />
with grants from the Arthritis Society of Canada and the<br />
Alberta Heritage Foundation for Medical <strong>Research</strong>.<br />
REFERENCES<br />
[1] Z.M.K. Moussavi, R.M. Rangayyan, G.D. Bell, C.B. Frank,<br />
K.O. Ladly, and Y.T. Zhang. Screening of vibroarthrographic<br />
signals via adaptive segmentation and linear prediction model-<br />
ing. IEEE Transactions on Biomedical Engineering, 43:15-23,<br />
1996.<br />
[2] S. Krishnan. Adaptive filtering, modeling, and classification of<br />
knee joint vibroarthrographic signals. Master's thesis, Dept. of<br />
Electrical and Computer Engineering, The <strong>University</strong> of Cal-<br />
gary, Calgary, AB, Canada, April 1996.<br />
[3] S. Haykin. Adaptive filter theory. Prentice-Hall, Englewood<br />
Cliffs, N.J., 2nd edition, 1990.<br />
[4] O. Paiss and G.F. Inbar. Autoregressive modeling of surface<br />
EMG and its spectrum with application to fatigue. IEEE<br />
Transactions on Biomedical Engineering, BME-34(10):761-<br />
769, 1987.<br />
[5] SPSS Inc., Chicago, IL. SPSS Advanced Statistics User's<br />
Guide, 1990.<br />
RECURSIVE LEAST-SQUARES LATTICE-BASED ADAPTIVE<br />
SEGMENTATION AND AUTOREGRESSIVE MODELING OF KNEE JOINT<br />
VIBROARTHROGRAPHIC SIGNALS<br />
S. Krishnan¹, R.M. Rangayyan¹,², G.D. Bell²,³, C.B. Frank²,³, K.O. Ladly³<br />
¹Department of Electrical and Computer Engineering, ²Department of Surgery, ³Sport Medicine Centre<br />
The <strong>University</strong> of Calgary, Alberta, T2N 1N4, Canada. (Email: ranga@enel.ucalgary.ca)<br />
Abstract : Vibration signals emitted during movement<br />
of the knee, known as vibroarthrographic (VAG) sig-<br />
nals, may bear diagnostic information. We propose a<br />
new technique for adaptive segmentation based on the<br />
recursive least-squares lattice algorithm to segment the<br />
non-stationary VAG signals into locally-stationary com-<br />
ponents, which were then autoregressive modeled using<br />
the Burg-Lattice method. Classification of 90 VAG sig-<br />
nals as normal or abnormal using the signal and clini-<br />
cal parameters provided an accuracy of 71.1% with the<br />
leave-one-out method. When the abnormal signals were<br />
restricted to chondromalacia patella only, the classifica-<br />
tion accuracy increased to 80.3%. The results indicate<br />
that VAG is a potential tool for non-invasive screening<br />
for chondromalacia patella.<br />
1 Introduction<br />
Based on the many investigations that have been carried<br />
out on vibroarthrographic (VAG) signal analysis in the<br />
past few years, there is good evidence to suggest that<br />
the VAG or knee joint sound signal has an exciting poten-<br />
tial for distinguishing between normal and abnormal car-<br />
tilage surfaces [1]. However, in previous studies on VAG,<br />
signal classification experiments were performed on a lim-<br />
ited number of signals. Using different adaptive signal<br />
processing techniques, the present work closely investi-<br />
gates the diagnostic potential of VAG based on extensive<br />
pattern classification experiments.<br />
In this paper, utilizing a reasonably large database<br />
of 90 subjects, the following approaches and techniques<br />
are addressed :<br />
• Improved adaptive segmentation of the non-<br />
stationary VAG signals using the recursive least-<br />
squares lattice (RLSL) algorithm;<br />
• Improved autoregressive (AR) modeling of VAG sig-<br />
nal segments using the Burg-Lattice method; and<br />
• Classification of VAG signals into two groups - Nor-<br />
mal and Abnormal - using logistic regression analysis<br />
and the leave-one-out method.<br />
CCECE'96 0-7803-3143-5/96/$4.00 © 1996 IEEE<br />
The proposed methods should be useful as clinical<br />
tools for diagnosis of cartilage pathology or as tests before<br />
arthroscopy or major surgery.<br />
2 Clinical Data Acquisition<br />
Subjects sat on a rigid table with both legs suspended,<br />
and repeatedly extended and flexed their legs at an ap-<br />
proximate angular speed of 67°/s; the range of motion was<br />
approximately 135° to 0° to 135° in a total time period<br />
of 4 s [1]. It has been found that this rate of movement is<br />
the most comfortable rate for subjects to move their legs<br />
smoothly with consistency [2].<br />
Auscultation was performed during swinging move-<br />
ment of the leg by placing a stethoscope on the medial,<br />
lateral, and anterior surfaces of the knee. Sounds such as<br />
pops, clicks, grinds, and clunks heard during auscultation<br />
were coded along with the approximate corresponding<br />
joint angle for use as discriminant features in classification<br />
experiments. For patients who underwent arthroscopy,<br />
the position of the observed pathology was used to esti-<br />
mate the joint angle at which the affected surfaces could<br />
come into contact and contribute to VAG or sound sig-<br />
nals. For all subjects who participated in the study, the<br />
following information was also documented : age, gender,<br />
and number of times the subject exercised per week.<br />
3 VAG <strong>Signal</strong> Recording Setup<br />
The VAG signal was detected by a Dytran (Dytran,<br />
Chatsworth, CA) miniature accelerometer (model 3115a)<br />
placed on the skin over the mid-patella of the subject dur-<br />
ing dynamic movement of the knee. The signal was am-<br />
plified and conditioned by a bandpass filter of bandwidth<br />
10 Hz to 1 kHz using Gould (Gould, Cleveland, OH) iso-<br />
lation pre-amplifiers (model 11-5407-58) and Gould uni-<br />
versal amplifiers (model 13-4615-18), and recorded on a<br />
Hewlett Packard (Hewlett Packard, San Diego, CA) in-<br />
strumentation recorder (model 3968A). The bandpass fil-<br />
ter minimizes low-frequency movement artifacts and also<br />
prevents aliasing effects. A National Instruments (Na-<br />
tional Instruments, Austin, TX) AT-MIO-16L data ac-<br />
Authorized licensed use limited to: <strong>Ryerson</strong> <strong>University</strong> Library. Downloaded on July 6, 2009 at 15:27 from IEEE Xplore. Restrictions apply.
quisition board and Lab Windows (National Instruments,<br />
Austin, TX) software on a Zenith (Zenith, Los Angeles,<br />
CA) 386 computer were used to digitize the signals at a<br />
sampling rate of 2.5 kHz and 12 bits/sample. The data<br />
were then transferred to a SUN (SUN, Cupertino, CA)<br />
Sparcstation for processing.<br />
An electronic goniometer to measure the angle of the<br />
limb during movement was placed on the lateral aspect of<br />
the knee with the axis of rotation at the joint line. The<br />
signal from the goniometer was converted after digitiza-<br />
tion to the real angle in degrees based on the voltage of<br />
the goniometer at 0° and 90°. In this study, two databases<br />
were used :<br />
• Database AB, which consists of VAG signals of 51<br />
normal subjects, includes historically normal as well<br />
as symptomatic subjects who underwent arthroscopy<br />
and were found to be normal, and VAG signals<br />
of 39 symptomatic subjects with arthroscopically-<br />
confirmed cartilage pathology; and<br />
• Database C, extracted from database AB, which con-<br />
sists of 51 normal VAG signals and 20 abnormal<br />
VAG signals (restricted to chondromalacia patella<br />
only). Among the 20 chondromalacia patella cases,<br />
17 had additional pathology such as meniscal tears<br />
and chondromalacia of the tibial plateau.<br />
4 Adaptive Segmentation<br />
VAG signals are recorded during swinging movement of<br />
the knee, over a range of motion of 135° to 0° (exten-<br />
sion) and 0° to 135° (flexion). This kind of movement<br />
causes the joint surfaces to rub against each other, and<br />
also against the under-surface of the patella. The regions<br />
of the joint surfaces coming in contact are different at<br />
each position during the swing. The contact area may<br />
not be the same for every swing even for the same angle<br />
position; further, the quality of the joint surfaces com-<br />
ing in contact may change with joint angle. This means<br />
that signals of different characteristics are expected at<br />
different joint angles. As the statistical characteristics of<br />
the VAG signals are time-variant, the signals are non-<br />
stationary in nature. Hence, in order to apply standard<br />
signal processing techniques such as parametric modeling<br />
or spectral analysis on VAG signals, the signals have to<br />
be first adaptively segmented into locally-stationary seg-<br />
ments or components.<br />
Adaptive segmentation of VAG signals has already<br />
been reported in the literature by Tavathia et al. [3]<br />
and Moussavi et al. [l]. The new adaptive segmenta-<br />
tion method developed in the present work is based on<br />
the RLSL algorithm. The advantage in using a lattice fil-<br />
ter for segmentation of VAG signals is that the statistical<br />
211<br />
Authorized licensed use limited to: <strong>Ryerson</strong> <strong>University</strong> Library. Downloaded on July 6, 2009 at 15:27 from IEEE Xplore. Restrictions apply.<br />
340<br />
changes in the signals are well reflected in the filter pa-<br />
rameters, and hence segment boundaries can be detected<br />
by monitoring any one of the filter parameters such as<br />
the mean squared error, conversion factor, or the reflec-<br />
tion coefficients. Also, under certain circumstances, the<br />
required segmentation filter order can be predicted from<br />
the forward prediction error power. It was found that for<br />
VAG signals, the ensemble-averaged forward prediction<br />
error power (computed using 35 primary VAG signals)<br />
reaches a constant value after a model order of six. In<br />
this study, the conversion factor (γ) has been used to<br />
monitor statistical changes in the VAG signals. In a sta-<br />
tionary environment, γ starts with an initial value of zero<br />
and remains small during the early part of the initializa-<br />
tion period. After a few iterations, γ begins to increase<br />
rapidly towards a final value of unity [4]. In the case of<br />
non-stationary signals such as VAG, γ will fall from its<br />
steady value of unity whenever a change occurs in the<br />
statistics of the signal. This can be used in segmenting<br />
the VAG signal into locally-stationary components. The<br />
segmentation algorithm, in brief, is as follows:<br />
1. The VAG signal is passed twice through the segmen-<br />
tation filter. The first pass is used to allow the filter<br />
to converge, and the second pass is used to test the γ<br />
value at each sample against a preferred fixed threshold<br />
value for detection of segment boundaries.<br />
2. Whenever γ at a particular time sample during the<br />
second pass is less than the threshold value, a pri-<br />
mary segment boundary (PSB) is marked.<br />
3. If the difference between a PSB and the previous<br />
PSB of the same signal is greater than or equal to the<br />
minimum desired segment length of 120 data points<br />
[l], the PSB is marked as a final segment boundary;<br />
if not, the PSB is deleted and the process continued<br />
until all the PSBs are tested.<br />
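A minimal Python sketch of steps 2 and 3 above, assuming the second-pass conversion-factor trace is available as a plain sequence `gamma` (the function name and this representation are our assumptions, not the paper's implementation):<br />

```python
def final_segment_boundaries(gamma, threshold, min_len=120):
    # Step 2: wherever gamma falls below the fixed threshold during
    # the second pass, a primary segment boundary (PSB) is marked.
    # Step 3: a PSB is kept as a final boundary only if it lies at
    # least min_len samples after the previously kept boundary;
    # otherwise it is deleted.
    boundaries = []
    last = 0  # the start of the record acts as the previous boundary
    for n, g in enumerate(gamma):
        if g < threshold and n - last >= min_len:
            boundaries.append(n)
            last = n
    return boundaries
```

For example, threshold dips closer together than min_len samples collapse into a single final boundary, which is how spurious short segments are suppressed.<br />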
Test results of the RLSL-based adaptive segmenta-<br />
tion method on different non-stationary synthetic signals<br />
indicated the high efficiency of this method in detecting<br />
rapid and gradual changes in signals [5]. The main advan-<br />
tage of the new method of adaptive segmentation is that<br />
the threshold is a fixed value as opposed to a variable<br />
value that was adopted in the previous study of Mous-<br />
savi et al. [1]. For some signals, especially normal VAG<br />
signals, it was found that the adaptive segmentation pro-<br />
cedure gave almost the same results as manual segmen-<br />
tation based on auscultation and/or arthroscopy. Figure<br />
2 shows the plot of γ for the corresponding VAG signal<br />
in figure 1. The dashed lines in figure 1 show the final<br />
segment boundaries for the corresponding VAG signal.<br />
On the average, eight segments per VAG signal were ob-<br />
tained.
Figure 1: VAG signal of an arthroscopically abnormal<br />
subject (chondromalacia grade III) with the final segment<br />
boundaries shown by vertical dashed lines. The final seg-<br />
ment boundaries were determined by the RLSL adaptive<br />
segmentation algorithm. au: Arbitrary units.<br />
Figure 2: Plot of the conversion factor (γ) for the abnor-<br />
mal VAG signal shown in figure 1. The horizontal dashed<br />
line is the fixed threshold line.<br />
5 Autoregressive Modeling<br />
Modeling techniques such as autoregressive (AR) mod-<br />
eling, also referred to as “all-pole” modeling, provide<br />
parameters which could potentially be correlated with<br />
the physiological system producing the signals. The AR<br />
model is a linear, second-moment stationary model. Al-<br />
though VAG signals are neither linear nor stationary,<br />
second-moment stationarity holds over VAG signal seg-<br />
ments. Hence, appropriate analysis of VAG segments<br />
may be based on an AR model to extract all linearly-<br />
retrievable information from the signal in a minimum-<br />
variance manner. Some of the common ways to estimate<br />
the AR parameters are the autocorrelation or the Yule-<br />
Walker method [6], the covariance method [6], Cholesky de-<br />
composition [4], the least squares method [4], and the<br />
Burg-Lattice method [4]. In this study on VAG signals,<br />
an AR modeling method based on the Burg-Lattice algorithm is<br />
investigated.<br />
The Burg-Lattice method was applied to stationary<br />
VAG signal segments and the AR prediction coefficients<br />
were derived. The model order used was 40. This order<br />
was chosen based on application of the Akaike Informa-<br />
tion Criterion (AIC), and models of this order were ob-<br />
served to predict the VAG signal segments well [l]. How-<br />
ever, a performance analysis of AR model coefficients in<br />
terms of the classification accuracy rate indicated that<br />
the first six AR coefficients of VAG signal segments are<br />
adequate for pattern classification of VAG signals [5].<br />
6 VAG Pattern Classification<br />
As described in the previous section, the AR prediction<br />
coefficients were derived by modeling each VAG segment<br />
by the Burg-Lattice method. One of the obvious visual<br />
differences between normal and abnormal signals was that<br />
the abnormal signals were much more variable in ampli-<br />
tude across a swing cycle than the normal signals. How-<br />
ever, this difference is lost in the process of dividing the<br />
signals into segments and considering each segment as a<br />
separate signal. To overcome this problem, the means<br />
(time averages) of the segments of each subject’s signal<br />
were computed, and the variance of these means (VM)<br />
was computed across the various segments of the same<br />
signal. The first six AR model coefficients, the VM pa-<br />
rameter, and a few clinical parameters such as sound,<br />
age, gender, and activity level were used as discriminant<br />
features in the classification experiments.<br />
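The VM computation just described amounts to the following sketch (population-variance normalization is our assumption; the paper does not state which normalization was used):<br />

```python
def variance_of_means(segments):
    # VM parameter: variance, across the segments of one subject's
    # VAG record, of each segment's mean (time-average) amplitude.
    means = [sum(seg) / len(seg) for seg in segments]
    mu = sum(means) / len(means)
    return sum((m - mu) ** 2 for m in means) / len(means)
```

A record whose segments all have similar mean amplitude yields VM near zero, while the amplitude variability typical of abnormal signals yields a large VM.<br />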
In this study, the classification of signals was done<br />
using the logistic analysis subroutine available in the Sta-<br />
tistical Package for Social Sciences (SPSS) software [7],<br />
and the leave-one-out method [8] was used to estimate<br />
the correct classification accuracy rate. In applying this<br />
method, all the segments of the VAG signal of one sub-<br />
Table 1: Comparison of different classification experi-<br />
ments by using the accuracy rates determined by applying<br />
the leave-one-out method, and the best test classification<br />
results obtained with the random split method.<br />
ject were excluded from the database, the classifier was<br />
designed with the segments of the VAG signals of the re-<br />
maining subjects, and then the VAG signal segments of<br />
the excluded subject were tested by the classifier. This<br />
operation was repeated to test all the subjects in each<br />
database. If segments spanning more than 10% of the<br />
duration of a subject’s signal were classified as abnormal,<br />
the subject was labeled as an abnormal subject; other-<br />
wise the subject was labeled as a normal subject. The<br />
number of correctly-classified subjects was then counted<br />
to estimate the classification accuracy rate. Since each<br />
test subject is excluded from the training sample set in<br />
turn, independence between the test set and the training<br />
set is maintained.<br />
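The subject-level decision rule above can be sketched as follows (the function and label names are illustrative; per-segment sample counts stand in for segment durations):<br />

```python
def label_subject(seg_labels, seg_lengths, frac=0.10):
    # Declare the subject abnormal when the segments classified as
    # abnormal span more than frac (here 10%) of the total signal
    # duration; otherwise declare the subject normal.
    abnormal_len = sum(length for lab, length in zip(seg_labels, seg_lengths)
                       if lab == "abnormal")
    return "abnormal" if abnormal_len > frac * sum(seg_lengths) else "normal"
```

Weighting by segment length (rather than counting segments) matches the stated rule, since the adaptive segments are of unequal duration.<br />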
Further, in another procedure, the accuracy rate in<br />
classifying the VAG signal segments into two groups was<br />
determined by applying logistic analysis on random splits<br />
of the databases into training and test sets. The VAG<br />
signal segments used in the test sets were totally different<br />
and independent from the VAG signal segments used in<br />
the training sets. The overall accuracy rate of a training<br />
or a test set was given as the percentage of the number of<br />
correctly-classified segments in the training/test stage to<br />
the total number of segments in the training/test stage.<br />
7 Results<br />
Table 1 shows the classification results with database AB<br />
and database C. Several random split experiments were<br />
conducted [5], and Table 1 shows the best test classifi-<br />
cation results obtained with the random split method.<br />
From the results of the leave-one-out and random split<br />
methods, we can infer that the proposed method shows a<br />
better classification result with database C, and is sensi-<br />
tive to chondromalacia patella cases.<br />
8 Discussion and Further Work<br />
Substantial numbers of normal and abnormal VAG sig-<br />
nals were analyzed in this work, and the results ascer-<br />
tain that VAG signal analysis is indeed a potential tool<br />
for non-invasive diagnosis of articular cartilage pathology.<br />
The proposed method has also shown strong potential for non-invasive detection of<br />
chondromalacia patella (results with database C).<br />
Future work will be directed towards time-scale/time-frequency analysis for<br />
improved feature analysis of the non-stationary VAG signals. Such analysis may<br />
overcome the approximations involved in our current parametric approach and the<br />
difficulties in segment-based analysis, and could lead to improved identification<br />
of different types of cartilage pathology.<br />
Acknowledgements<br />
We gratefully acknowledge support of this project<br />
with grants from the Arthritis Society of Canada and the<br />
Alberta Heritage Foundation for Medical <strong>Research</strong>.<br />
References<br />
[1] Z.M.K. Moussavi, R.M. Rangayyan, G.D. Bell, C.B. Frank, K.O. Ladly, and Y.T.<br />
Zhang. Screening of vibroarthrographic signals via adaptive segmentation and<br />
linear prediction modeling. IEEE Transactions on Biomedical Engineering,<br />
43(1):15-23, 1996.<br />
[2] K.O. Ladly. <strong>Analysis</strong> of patellofemoral joint vibration signals.<br />
Master's thesis, The <strong>University</strong> of Calgary, Calgary, AB, Canada, 1992.<br />
[3] S. Tavathia, R.M. Rangayyan, C.B. Frank, G.D. Bell, K.O. Ladly, and Y.T.<br />
Zhang. <strong>Analysis</strong> of knee vibration signals using linear prediction.<br />
IEEE Transactions on Biomedical Engineering, 39(9):959-970, 1992.<br />
[4] S. Haykin. Adaptive Filter Theory. Prentice-Hall, Englewood Cliffs, NJ,<br />
2nd edition, 1990.<br />
[5] S. Krishnan. Adaptive filtering, modeling, and classification of knee joint<br />
vibroarthrographic signals. Master's thesis (submitted), Dept. of Electrical and<br />
Computer Engineering, The <strong>University</strong> of Calgary, Calgary, AB,<br />
Canada, April 1996.<br />
[6] J. Makhoul. Linear prediction: A tutorial review.<br />
Proc. IEEE, 63(4):561-580, 1975.<br />
[7] SPSS Inc., Chicago, IL. SPSS Advanced Statistics<br />
User’s Guide, 1990.<br />
[8] K. Fukunaga. Introduction to Statistical Pattern<br />
Recognition. Academic Press, Inc., San Diego, CA.,<br />
2nd edition, 1990.
Other Refereed Conference Papers<br />
T. Tabatabaei, S. Krishnan and A. Guergachi, Speech-based emotion recognition using<br />
sequence discriminant Support Vector Machines, 4 pages in CDROM Proc. Canadian Medical<br />
and Biological Engineering Conference (CMBEC), Toronto, Ontario, May 2007.<br />
O. Nedjah, A. Hussein, S. Krishnan, R. Sotudeh, CN Tower lightning current derivative Heidler<br />
model for the validation of wavelet de-noising algorithm, in Proc. 18th International Wroclaw<br />
Symposium and Exhibition on Electromagnetic Compatibility, Wroclaw, Poland, pp. 282-287,<br />
June 2006.<br />
A. Morrison, S. Krishnan, A. Anpalagan and B. Grush, Receiver autonomous mitigation of GPS<br />
non-line-of-sight multipath errors, 6 pages in Proc. ION National Technical Meeting, Monterey,<br />
California, January 2006.<br />
A. Ramalingam and S. Krishnan, Video fingerprinting using space-time and Gaussian mixture<br />
models, 4 pages in Proc. Canadian Workshop on Information Technology (CWIT), Montreal,<br />
Quebec, June 2005.<br />
K. Momen, S. Krishnan, D. Beal, E. Bouffet, B. Kavanagh, T. Chau, Self-organization of the<br />
communication space based on user range-of-motion: a framework for configuring non-contact<br />
augmentative communication devices, 4 pages in Proc. Canadian Medical And Biological<br />
Engineering Conference, Quebec City, Quebec, September 9-11, 2004.<br />
J. Lukose and S. Krishnan, EEG signal analysis for screening alcoholics, 4 pages in Proc.<br />
International Dynamics of Continuous, Discrete, and Impulsive Systems (DCDIS) Conference,<br />
Guelph, Ontario, May 2003.<br />
K. Umapathy and S. Krishnan, Pathological voice screening using local discriminant bases,<br />
4 pages in Proc. International Dynamics of Continuous, Discrete, and Impulsive<br />
Systems (DCDIS) Conference, Guelph, Ontario, May 2003.<br />
S. Erkucuk (and S. Krishnan, M. Zeytinoglu), A novel technique for digital audio watermarking,<br />
Student Contest Presentation at the IEEE International Conference on Multimedia and Expo<br />
(ICME), Lausanne, Switzerland, August 2002. (Won the IBM T.J. Watson <strong>Research</strong> award for<br />
innovative ideas)<br />
K. Umapathy, S. Krishnan, and S. Jimaa, Time-frequency analysis of wideband speech and<br />
audio, 2 pages in Proc. Micronet Annual Workshop, Aylmer, Quebec, April 2002.<br />
S. Krishnan, R.M. Rangayyan, and K. Umapathy, A time-frequency approach for auditory<br />
display of time-varying signals, in Proc. IASTED International Conference on <strong>Signal</strong> and Image<br />
Processing, Hawaii, USA, pp 236-241, August 2001.<br />
K. Umapathy and S. Krishnan, Joint time-frequency coding of audio signals, in Proc. 5th<br />
WSES/IEEE Multiconference on Circuits, Systems, Communications, and Computers, Crete,<br />
Greece, pp 32-36, July 2001.<br />
K. Umapathy and S. Krishnan, Low bit-rate time-frequency coding of wideband audio signals, in<br />
Proc. IASTED International Conference on <strong>Signal</strong> Processing, Pattern Recognition and<br />
Applications, Rhodes, Greece, pp 101-105, July 2001.<br />
R.M. Rangayyan, S. Krishnan, G.D. Bell, and C.B. Frank, Computer-aided auscultation of knee<br />
joint vibration signals. In Proc. European Medical and Biological Engineering Conference,<br />
Vienna, Austria, pp: 464-465, Nov. 1999.<br />
S. Krishnan and R.M. Rangayyan, Knee joint vibration signal analysis using adaptive<br />
time-frequency distributions. In Proc. European Medical and Biological Engineering Conference,<br />
Vienna, Austria, pp: 466-467, Nov. 1999.<br />
S. Krishnan and R.M. Rangayyan, Feature identification in the time-frequency distributions of<br />
knee joint vibroarthrographic signals using Hough and Radon transforms. In Proc. International<br />
Conference on Robotics, Vision, and Parallel Processing, Tronoh, Malaysia, pp: 82-89, July<br />
1999.<br />
R.M. Rangayyan, S. Krishnan, G.D. Bell, C.B. Frank, and K.O. Ladly, Impact of muscle<br />
contraction interference cancellation on vibroarthrographic screening, Proc. International<br />
Conference on Biomedical Engineering, Kowloon, Hong Kong, pp 16-19, June 1996. (invited<br />
paper)<br />
S. Krishnan, R.M. Rangayyan, G.D. Bell, C.B. Frank, and K.O. Ladly, Adaptive segmentation<br />
and cepstral analysis of vibroarthrographic signals for non-invasive diagnosis of knee joint<br />
cartilage pathology, Proc. 22nd Canadian Medical and Biological Engineering Conference,<br />
Charlottetown, PEI, Canada, pp 8-9, June 1996.<br />
N. Kumaravel and S. Krishnan, Knowledge based biosignal processing system for diagnosing<br />
heart disorders, Proc. International Conference on Robotics, Vision, and Parallel Processing,<br />
Ipoh, Malaysia, pp 602-609, May 1994.<br />
S. Krishnan, An expert diagnostic system using signal processing tool, in Proc. International<br />
conference on expert systems for development, Bangkok, Thailand, March 1994.<br />