Msp.ewi.tudelft.nl

The tangent kernel SVM for calibration-stable histogram
discrimination
Information and Communication Theory Group Faculty of Electrical Engineering, Mathematics and Computer Science Mekelweg 4, 2628 CD Delft, The Netherlands Keywords: Similarity; Invariant representation; Classification; Kernels; SVM
Abstract
tween objects. More specifically one can use tan-gent distance [1, 2] (dissimilarity) or tangent kernel In many practical situations the precise calibra- tion of a sensor is difficult or impractical. If the his- In this paper we study how the prior knowledge tograms of initial measurements are used as an input about the noise in a sensor helps to build a better of a recognition system, it leads to noisy affine trans- classifier. We consider a histogram-like class of data formations of their arguments. The straightforward which consists of, but is not limited to histograms way of histogram standardization is not suitable be- themselves (biomedical data [5], color histograms of cause of the presence of different histogram classes. the conventional images or channel histograms of We studied the tangent kernel approach to build a sim- multi-spectral images, etcetera) and any other data ilarity measure robust to imprecise calibration. This represented by non-negative vectors with fixed sum technique is tested on both artificial and real-world of elements (say, equal to one), e.g. normalized spec- data. Advantages and difficulties are discussed. tra. It is assumed that the imprecisions in the calibra-tion are small and can be considered as global shiftand scaling, i.e. affine imprecisions. The problem Introduction
is solved by developing an appropriate tangent kernelSVM classifier.
In solving pattern recognition problems one may A short review on the tangent kernel technique is consider at least two possibilities to improve the clas- presented in the next section. Then, in section 3 we sification performance: larger training sets and the define the problem in question and possible ways of use of a prior knowledge. The first way is the uni- versal one and the second depends on the problem at datasets. It is followed by the explanation and discus- sion of numerical experiments (section 5) and conclu- Prior knowledge helps to eliminate irrelevant vari- ability of data and thus lightens the classification task.
Examples of prior knowledge: class-conditional dis-tributions are Gaussian, hand-written letters are in- Review on the tangent kernel tech-
variant to (not too large) rotations, some shape prop- erties are invariant to affine transformations.
There are different ways of taking prior knowledge This section is a short account of ideas from [3, 4] into account. At first, one may build a relevant prob- adapted to our goals. Suppose that we have a two- ability density model. Another approach consists of class classification problem: having an object x ∈ extracting features which are invariant or robust to Rd, a label y ∈ {−1, +1} should be assigned defining transformations which have nothing to do with class to which one of two classes it belongs. In other words, the task consists of building a classification function This approach can be rethought of as a modi- fication of a definition of a similarity measure be- Let us assume that f (x) = sign(g(x)), where g : Rd → R is the smooth discriminant function.
Suppose also that we are given the prior knowledge Rγ(g; X, y) = (1 − γ)R(g; X, y) + γr(g; X) Note that r does not depend on object labels y. Thus, the additional not labeled objects may be used for its does not change the membership of the object. Thus it The formulation given by Eq.(3) is theoretically seems natural to demand that the classification func- suitable for any learning criterion R and any smooth function g. To make the problems tractable one needsto select some specific form of them. We selected the To simplify the task one can demand that this propertydoes not hold only for f but also for g, i.e which is stronger. Now, for the smooth function g it is possible using properties of the transformation Lt to reformulate this invariance property as As a learning algorithm the SVM classifier [6] was After rewriting the left side, it transforms into selected. After substitution of conventional SVM cri-terion R = 1 w 2 and Eq.(4) into Eq. (3) the new for all x. Note, that M is not necessarily a linear The typical pattern recognition setup of finding a discriminant function consists of solving the mini- we recover the original form of the SVM regularizer w 2. The discriminant function is also in- where R is the learning criterion (e.g. inverse margin, noise to signal ratio, etc.), X = (x1, . . . , xN )T is thetraining data set and y = (y Thus, in order to implement the technique we need bels of training objects. In order to incorporate prior only to redefine the matrix of inner products (kernel knowledge into this approach one should perform the minimization using the additional restriction (2). On the other hand, it is possible instead of requiring the strict equality (2) to minimize the weighted sum of So, this approach can be interpreted as a selection of a the original criterion and the regularizer new similarity measure that is robust to the unwanted If there is overlap between classes, then a soft- margin version of SVM algorithm should be used.
The modifications are straightforward and do not change Eq. (5). For the sake of brevity we use LTK- SVM (Linear Tangent Kernel SVM) abbreviation to 1Actually, set of equations (1) defines one-parametric Abelian name the soft margin SVM algorithm with kernel de- (commutative) group of transformations.
It is possible to derive a similar result for the non- and only one class of histograms were present. Af- linear discriminant function g which can be repre- ter applying standardization, it may in general hap- sented in a kernelized form (and thus for SVM with pen that all histograms from the same class will have non-linear kernels) [3]. In such situation one must approximately the same scaling (if the ”biological” typically deal with the subspace of an infinite dimen- variability is not so large) but histograms from the sional space. It means that kernel PCA [3] should be different classes will be scaled differently. Two possi- used for the calculation of matrix inversion.
ble ways to deal with these problems are 1) select thecommon mean and variance for all classes in such a Robustness to linear imprecisions in
way that the inter-intra distance or some other separa- sensor calibration
bility criterion is optimized 2) combine scaling with(the preliminary) classification, i.e decide to which Suppose that the input of the recognition system is class these histogram belongs before standardization a normalized one-dimensional histogram x ∈ Rd (what in fact implies a vicious circle). Another ap-proach [7] makes use of a dissimilarity representation namically selected to guarantee the smallest distance Here {η1, . . . , ηN } are samples of the random vari- Extraction of invariant features
0, . . . , ed} are the edges of histogram bins It is possible to eliminate calibration dependence from histogram by mapping it to the feature space in- variant to the unwanted changes in scaling and shift-ing (the above mentioned histogram standardization is Let us assume that η can be expressed as the particular case of such mapping). It can be donee.g. by the calculation of the mathematical expecta- tion of a function of the normalized argument where ξ is a measurement made on a perfectly cal- ibrated device, θ is a set of parameters of ξ distri- bution, and and a > 0 and b define possible affinecalibration imprecision. Parameters a, b and θ ran- where µ and σ are the mean and the standard devia- domly change from one histogram to another intro- tion of the distribution defined by the probability den- ducing calibration noise (a and b), ”biological” intra- class and inter-class variability of histograms (θ).
The main problem of this method is how to choose For the sake of simplicity, we assume that the num- the appropriate functions pi. Moreover, all possible sizes are selected in a such way that we can treat this are invariant to the calibration. Improper selection of the particular set of {u1, .} can lead us to a problemformulation in terms of features for which we have no It means that there are undesirable shift and scaling induced by a and b. Below we consider possible waysof dealing with this problem.
Derivation of the tangent kernel
Histogram standardization
In this article we apply the alternative approach based on the tangent kernel technique. Unlike in the Histograms with different (a, b) values can be put histogram standardization we do not change the sup- on the top of each other if we scale them in a such port of the histogram and also we do not need explic- way that means and variances of the distributions rep- resented by them are equal to some predefined values In our case there are two invariant transformations.
This approach would solve the problem if there wereno ”biological” variability between the histograms the first peak will be decreased while the height of thesecond peak will be increased. Also, the division of tetraploid cells produces the peak at the position four times larger than the first one. These changes specifythe tetraploid class of hystograms. It is also possi- In reality we deal with histograms not with densi- ble that two or three cells may stick together. This ties. Thus we should redefine the M operators aggregation process is responsible for the changes inthe peak heights and the appearance of the peak at a position three times larger than the first one.
Here, Z is the diagonal matrix which diagonal ele-ments contain the positions of histogram bin cenbters The data set used was kindly provided by Dr.
and ∆ is the operator calculating the (smoothed) fi- Marius Nap and Nick van Rodijnen from Pathology nite differences of the histogram x as a function of Department of ”De Wever” Hospital, Heerlen, The the histogram bin centers. Thus, we have the com- Netherlands. The data were obtained from multipa- rameter flow cytometric (MP-FCM) DNA analysis ofbreast cancer samples that were collected consecu- tively during routine diagnostic procedures from Jan- uary 2000 until November 2004. The resection spec- imens of the breast cancer samples were all routinely fixed in formalin and embedded in paraffin for diag- nostic workup. One paraffin block of representative tumor tissue was selected for MP-FCM and used to The parameter δ defines which one of the two trans- create a single cell suspension according to a proce- dure described in [9]. The DNA content of the sin- Finally, the new similarity measure reads as gle cell suspension was stained with propidium iodide(PI) and contamination with RNA was prevented by k(x, y) = xT{(1 − γ)I + γ[(1 − δ)Ctr + δCsc]}−1y pretreatment of the solution with RNase. The mea-surements were performed on a DAKO-galaxy flow Data description
cytometer with 4 photomultiplier tubes (PMT’s), pro-vided by DAKO Cytomation. The data were stored in Real-world data: flow cytometry
FCS2.0 format [10]. The resolution of the DNA his-tograms was scaled from 4096 values to 256 by apply- In our study we used a real-world flow cytomet- ing the unbiased summation method [5]. All the cor- ric DNA content data set (RD). The flow cytometric rections for signal crossover (from two other dyes), DNA analysis is performed on cells stained with an aggregation and debris in raw FCS2.0 files were man- appropriate fluorochrome specific for DNA. The pop- ually made by an experienced operator, using the soft- ulation of such cells passes an excitation source and ware package Flomax 2.2 (PARTEC). The labeling the fluorescent light emitted by stained cells is regis- was performed on the compensated data in Modfit LT tered by a photomultiplier tube (PMT). The result of 2.0 (Verity Software House, Inc.) on the DNA his- an experiment is a histogram of these signals recorded tograms. All histograms were first analyzed in an au- for a population of cells. Provided an appropriate de- tomatic mode and manually re-analyzed if the opera- vice calibration, this histogram can be interpreted as tor considered the outcome of the automated model a distribution of the DNA content over the cells.
The typical DNA content histograms are shown on intention to measure 20 000 counts per FCS2.0 file, Figure 1. One can observe a number of peaks. In this was not always possible due to variations in the healthy tissues or when a disease does not change the amount of single cells obtained from each tumor sam- distribution of the DNA content the first peak cor- responds to diploid cells (normal state) and the sec-ond peak is produced by cells with a doubled DNAamount because of the division process (tetraploidcells). The position of the second peak is two times In this paper we used the output of the third PMT.
larger than the position of the first one. This describes The first and the last histogram bins which contain the diploid type of histogram. It may happen that dur- only debris and noise were set to zero. No outlier de- ing division the DNA content of the cell is doubled tection was applied. For experiments 446 histograms but cell division does not occur. Thus, some part of were used: 205 from the diploid class and 241 from the cells stays in the tetraploid state and the height of The weights of mixture components πk stay the same for all histograms of the same class and actually define the difference between classes.
Each of the four components represents one of the peaks of the real-world flow cytometric data. We tried to set parameters in order to imitate both diploid (class A) and tetraploid (class B) histograms: only the first two components are present in class A histograms and the second and forth components are given more im- portance in class B histograms. To simplify the prob- lem we completely omitted the third peak for both In our experiments we used two types of synthetic datasets: SD0 a synthetic dataset having a perfect cal-ibration and SDσ - a dataset with σt = σb = σ. All Figure 1: Flow cytometric real-world data, selected the synthetic datasets consist of a large test dataset (5000 examples of each class) and 50 training datasets(1000 examples of each class). All the parametersused for the generating of synthetic datasets are sum- Synthetic data: mixture of lognor-
The real-world data may contain many sources of variability such as possible mislabelings and devicefaults. Also, heavy-tailed histograms may be clipped.
Because of this we test proposed technique on the synthetic data set SD which mimics basic properties of the real world data described above. Objects of this data set are normalized histograms (see Eq. (6)) of a random variable η which is a linear mixture of K log- normally distributed random variables {ζ1, . . . , ζK}with shift and scaling imprecisions. Namely, the ran- Figure 2: Selected samples from the synthetic data set without calibration imprecision (SD0). The differ-ence between histograms of the same class is caused The number of samples Ns per histogram and calibra- only by the intrinsic intra-class variation.
tion imprecision parameters t and b were randomlydrawn for each histogram from the distributions Experiments
We studied the difference in the performance of the conventional linear SVM classifier and its tan- gent kernel version. To see how useful the usage of prior knowledge can be for different sizes of training set the learning curves were computed. For synthetic To introduce the ”biological” variation we added datasets we used one large test set (5000 objects per Gaussian noise to the component parameters (µ∗, σ∗) class) and 50 independently generated training sets.
The results for the real-world dataset were obtained by averaging over 200 hold-out experiments: for each experiment we randomly took out 10% of all objects, the rest 90% objects were used as the training set.
The learning curves were obtained for each indepen- dent training set (for synthetic data) or hold-out ex- Table 1: Synthetic datasets parameters. For lognormal distributions (sample size Ns and components ζk) we havespecified not means and standard deviations of the underlying normal distributions (µ∗, σ∗) but their own (µ, σ)which can be expressed as µ = exp(µ∗ + 0.5σ∗2), σ = exp(2µ∗ + 2σ∗2) − exp(2µ∗ + σ∗2).
nel SVM were treated as follows: δ parameter was al- ways set to 0.5; in one part of experiments classifier was calculated at constant γ values, in the rest of ex- periments γ was optimized along with the ν parame- ter by the same procedure. For the derivative smooth- ing we used Savitsky-Golay filter of the first order.
The size of the filter window w was set to 5 in the ex- periments with constant γ values and was optimized by grid search together with ν and γ in the other ex- Figure 4 shows the learning curves of the conven- tional linear SVM and LTK-SVM. It can be seen that with increasing imprecisions in calibration the classi- fication performance of conventional SVM becomesworse. The applying of modified similarity measure Figure 3: Selected samples from the synthetic data alleviate this effect. It is interesting that the usage set with calibration imprecisions included (SD0.05).
of LTK-SVM for slightly deteriorated data SD0.01 The intra-class variation is larger than the one of the resulted in the smaller errors than the application of conventional SVM to the perfectly calibrated dataSD0. It can be probably explained by the fact thatthe intra-class ”biological” variations are component periment (for real-world data) by training classifiers specific shifts and scalings. It means that our tech- on the sequence of the nested training sets generated nique is not only able to soften the effect of the im- precise calibration but also to reduce specific types of In all experiments the ν regularization parameter intra-class variations. This topic deserves additional [11] of SVM/LTK-SVM classifier was optimized by grid search over the set of predefined values. For each The influence of the selection of parameters γ and candidate value the preliminary classifier was trained w of the modified similarity measure on classification on 75% data selected for the training procedure. The error was studied. For the data set with small im- other 25% were used to measure the performance and precisions SD0.01 (see Fig. 5) the best results were select the actual ν value. Using this value the final achieved at γ = 0.9. At small sizes of training set the classifier was trained on the whole current training learning curve with γ = 0.99 has the same behaviour but its performance becomes worse at large training The additional parameters of the linear tangent ker- data sets. Thus we observe an over-regularization. On Figure 4: Comparison of the learning curves of SVM and LTK-SVM on synthetic data sets. In all LTK-SVM algorithms parameters were set to γ = 0.9, w =5.
amount of imprecisions. The probable reason for thisbehaviour is that the linear kernel was used, althoughcalibration transformations are not linear. Thus the the other hand, the learning curve with optimized γ use of an appropriate nonlinear kernel may improve becomes only at large training data sets significantly better than the one of conventional SVM. This re-sult suggests that the procedure used for the similarity measure optimization heavily depends on the sample size. Although the method itself is able to compensate calibration imprecisions even at the beginning of thelearning curve, the automated procedure of parameter selection requires further improvement. One of thepossible approaches to try is kernel-target alignment [12]. For data with larger imprecisions SD0.05 thesame behaviour can be observed (see Fig. 5). How- ever, the effect of the over-regularization at γ = 0.99 Figure 7: Classification error at fixed training set size(45 objects per class) as a function of amount of cali- The results of experiments on the real-world flow cytometric data set (see Fig. 8) are in agreement withthe ones on synthetic data. The value of the parameter γ of the best learning curve (0.99) suggests that thecalibration imprecisions in the data are rather high.
However, for this data set we have only a limited num- ber of samples. Because of that it is impossible to as-sert (in contrast to synthetic data sets for which very large test sets are available) that γ = 0.99 is opti-mal not only for this data but also for future measure- Fig. 7 shows how the amount of imprecisions in- fluences the the performance of SVM and LTK-SVMclassifiers. In ideal case we would expect that the per- Conclusions
formance of the LTK-SVM does not change with theincrease of calibration imprecisions. But in our ex- We studied the possibility of the usage of the tan- periments the classification error increases with the gent kernel approach to stabilize the recognition of [3] B. Sch¨olkopf. Support Vector Learning. PhD [4] B. Sch¨olkopf, P. Y. Simard, A. J. Smola, and V. N. Vapnik. Prior knowledge in support vec- tor kernels. In M. I. Jordan, M. J. Kearns, andS. A. Solla, editors, Advances in Neural Infor- mation Processing Systems 10, pages 640–646,Cambridge, MA, 1998. MIT Press.
C. D. Bray, and M. Langweiler. Effects of res- olution reduction on data analysis. Cytometry,53A(2):103–111, 2003.
Figure 8: Learning curves for the real-world flow cy-tometric data set [6] V. Vapnik. The Nature of Statistical Learning Theory. Springer, New York, USA, 2000.
histogram-like data liable to affine calibration impre- [7] E. Pekalska and Lecis P. Duin, R. P. W. Dis- cision. It has been demonstrated on the synthetic data similarity bases breas cancer classification. in set that the application of the tangent kernel approach to deteriorated data allows to reach a performancewhich is close to the one of conventional SVM algo- [8] E. Pekalska and R. P. W. Duin. Dissimilarity rithm on perfectly calibrated data. The experiments representations allow for building good classi- on the real-world data show that the tangent kernel ap- fiers. Pattern Recognition Letters, 23(8):943– proach helps to decrease the classification error. The tangent kernel approach has significant advantages at [9] M.P.G. Leers, B. Schutte, P.H.M.H. Theunissen, smaller sample sizes. On the other hand, the auto- F.C.S Ramaekers, and M. Nap. Heat pretreat- matic procedure of selecting the trade-off parameter ment increases resolution in dna flow cytometry γ becomes inefficient with a decreasing size of the of paraffin-embedded tumor tissue. Cytometry, training dataset. It is also expected that the use of non-linear kernels can improve the performance fur-ther. The studied approach seems to be a promising [10] Data file standards committe of the Society for creases resolution in dna flow cytometry of Acknowledgments
Authors would like to thank Dr. Mauris Nap and Nick van Rodijnen for the possibility to use the flow [11] B. Sch¨olkopf, A. Smola, R.C. Williamson, and cytometric data and for their contribution to section P.L. Bartlett. New support vector algorithms.
4. We also thank E. Pekalska for fruitful discussions.
Neural Computation, 12:1207–1245, 2000.
This research was supported by the Technology Foun- [12] N. Cristianini, A. Elisseeff, J. Shawe-Taylor, dation STW, applied science division of NWO and the and J. Kandola. On kernel-target alignment. In technology program of the Ministry of Economic Af- T. G. Dietterich, S. Becker, and Z. Ghahramani, editors, Advances in Neural Information Pro-cessing Systems 14, Cambridge, MA, 2002. MIT References
[1] P. Simard, Y. Le Cun, and Denker J. Efficient pattern recognition using a new transformationdistance. In S. Hanson, J. Cowan, and Gile L.,editors, Advances in Neural Information Pro-cessing Systems 5, pages 50–58, San Mateo CA,1993. Morgan Kaufmann.
[2] P. Simard, Y. Le Cun, J. S. Denker, and B. Vic- Transformation Invariance in Pattern Recognition - Tangent Distance and TangentPropagation, volume 11, pages 181–197. 2000.

Source: http://msp.ewi.tudelft.nl/sites/default/files/verzakov_hist_asci.pdf

142 setembre:83 abril.qxd.qxd

CASO CLÍNICO Crisis convulsivas desencadenadas por los antidepresivos. Interacción entre los antidepresivos y los antiepilépticos Nogueras Ormazábal E1, Sarrasqueta Sáenz P1, Otamendi Murillo A2, González Alen P1, Presentamos el caso de una paciente que sufriócada 12 horas) porque había presentado nuevosun síndrome comicial con crisis parciales complejassecundariamente general

Neues aus der herpes-forschung

Melissenextrakt wirkt zweifach gegen Lippenbläschen Seit Jahren wird geforscht, womit dem unangenehmen Herpes simplex-Virus – Auslöser schmerzhafter Lippenbläschen – zu Leibe gerückt werden kann. Eine gute Nachricht kommt jetzt aus dem Universitätsklinikum Heidelberg*. Virologen und Biologen haben nachgewiesen, dass Melissenextrakt in gleich doppelter Weise einem Herpesausbruch

Copyright © 2010 Find Medical Article