Variable Selection for Sickle Cell Anemia
Sivanarayana Gaddam, Haofei Fang, and Shanthi Potla
Abstract—Variable selection is the process of deriving a new subset of variables from the original features in order to increase classifier efficiency and allow higher classification accuracy. In general, because of high dimensionality, deterministic algorithms cannot be applied unless the original pattern vectors are mapped to new vectors of lower dimensionality. Many feature extraction techniques, however, have their own disadvantages. Popular techniques such as Principal Component Analysis (PCA) and Linear Discriminant Analysis (LDA) conduct feature extraction independently of the classifier, which may degrade the performance of classifiers. Here, we follow the wrapper model of feature extraction, in which feature selection, feature extraction, and classifier training are performed simultaneously using a genetic algorithm (GA) and artificial neural networks (ANN). We tested our algorithm on sickle cell patient data and found that 10 of the 23 parameters are sufficient to train ANNs for classification. Furthermore, we compared the performance of the GA with the PCA feature extraction technique and with selection based on the contribution.

Index Terms—Genetic Algorithm, Artificial Neural Network, Principal Component Analysis, Linear Discriminant Analysis, Feature Selection, Feature Extraction, Wrapper Model.

THE purpose of feature selection is to design a compact classifier with high classification accuracy. The selection process should remove useless and redundant features [1]. Over the years, the problem has been investigated thoroughly: several variable selection algorithms have been proposed [3], along with comparative studies [4]. Finding an optimal feature subset is usually intractable, and many problems related to feature selection have been shown to be NP-hard [2]. Thus, an efficient search strategy is required. Recently, genetic algorithms have drawn attention due to their capability of finding approximate solutions.

GAs are a particular class of evolutionary algorithms that use techniques inspired by evolutionary biology, such as inheritance, mutation, crossover, and selection [5]. GAs are parallel, iterative optimizers and have been successfully applied to many optimization problems. Typically, given an n-dimensional space, the GA's task is to identify the m-dimensional space (m < n) that maximizes the classification accuracy [1,2,6]. The GA maintains a pool of competing feature matrices, and each member of the pool is sent to a classifier for fitness evaluation. The fitness information is then used to guide the search.

A GA typically requires a solution to be encoded in the form of a chromosome, and a fitness function for evaluation. Siedlecki and Sklansky [7] proposed a simple, direct GA approach for feature selection. They used a binary chromosome of length n (dimensions), where each gene (bit) is associated with a feature. If the ith bit is 0, the corresponding ith feature is discarded; otherwise it is considered for classification. Each chromosome is evaluated on a set of test data using k-nearest neighbor classification. This technique was further extended to allow linear feature extraction [6].

Here, we followed the simple genetic algorithm approach, but we used artificial neural networks as the wrapper method for fitness evaluation. We did not use k-NN classification because it is sensitive to redundant features and is not suitable for difficult classification tasks. Moreover, k-NN is outperformed by ANNs on many difficult classification tasks [8]. We tested our approach on sickle cell anemia data and compared our results with PCA feature extraction combined with selection based on the contribution.

We used sickle cell anemia patient data to test our algorithms in this project. Among the different treatments, hydroxyurea, the first approved drug for the causative treatment of sickle-cell anemia, was shown to decrease the number and severity of attacks in a 1995 study (Charache et al.) [12]. Patients respond differently to hydroxyurea [9]; therefore, clinicians have to classify patients into two classes, responding and non-responding, in order to choose a more effective treatment. In the original dataset, some of the patients did not have data for all 23 parameters, and the missing values were represented with zero. To alleviate the effect of zero data on back-propagation neural networks, we shifted those values by one, because ANNs associate a special meaning with the number zero. The shifting principle is applied to all β-globin haplotypes (Bantu, Benin, Cameroon, Senegal) and to nucleated red blood cell counts (NRBC). The ANN algorithm looks for differences between input values; therefore, the differences between 1, 2, 3 are tantamount to the differences between 0, 1, 2.
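The shifting step can be illustrated with a short sketch. This is not the authors' preprocessing code: the record layout, the `Age` column, the toy values, and the use of `None` to mark a missing measurement are all assumptions made for illustration. Only the haplotype names and NRBC come from the text.

```python
# Columns that are shifted by one. The names come from the text; treating
# them as dictionary keys is an assumption of this sketch.
SHIFTED_COLUMNS = ("Bantu", "Benin", "Cameroon", "Senegal", "NRBC")


def encode_record(record, columns, shifted=SHIFTED_COLUMNS):
    """Turn one patient record (a dict) into a list of ANN input values.

    A missing measurement (None) is encoded as 0. Columns listed in
    `shifted` are moved up by one, so a genuine 0 is not confused with
    "no data" and the differences between values are preserved.
    """
    row = []
    for name in columns:
        value = record.get(name)
        if value is None:
            row.append(0)             # 0 is reserved for missing data
        elif name in shifted:
            row.append(value + 1)     # genuine 0, 1, 2 become 1, 2, 3
        else:
            row.append(value)
    return row


# Toy record; "Age" is a hypothetical unshifted parameter.
columns = ["Bantu", "Benin", "NRBC", "Age"]
patient = {"Bantu": 0, "Benin": 2, "NRBC": None, "Age": 31}
encoded = encode_record(patient, columns)
```

In this toy record the two haplotype values move from 0 and 2 to 1 and 3, so their difference is unchanged, while the missing NRBC value stays at 0.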
The performance of a chromosome is determined according to its classification accuracy; the fitness function F(Ch) is to be maximized.

Fig. 1. A binary chromosome of length N (number of dimensions).

Before crossover:
C1: 111000 100000 000011 001000 000011
C2: 000100 010000 000011 000100 001100
After crossover:
C1: 111000 100000 000011 000100 001100
C2: 000100 010000 000011 001000 000011
Fig. 2. Crossover operator exchanges the information between chromosomes.

Before mutation:
C1: 000100 010000 000011 000100 001100
After mutation:
C1: 000100 010000 100011 000100 001100
Fig. 3. Mutation operator helps to escape from local minima.

PCA is one of the popular techniques for dimensionality reduction. In general, principal component analysis can be carried out in the following steps.
Step 1: For PCA to work correctly, we have to subtract the mean from each of the data dimensions. This produces a data set whose mean is zero. The mean subtraction is part of PCA in order to minimize the mean square error of approximation.
Step 2: Calculate the covariance matrix of the mean-adjusted data.
Step 3: Calculate the eigenvalues and eigenvectors of the covariance matrix. These are very important because they give us useful information about the data. In fact, the eigenvector with the highest eigenvalue is the principal component of the data set.
Step 4: Derive the new data set. Once we have chosen the components we wish to keep in our data and formed a feature vector, we derive the new data set by taking the transpose of the feature vector multiplied by the original data set.

A simple variable selection method [10] was applied to the data obtained using PCA. This method takes every feature into account when computing each variable's contribution. We then sort the resulting vector to obtain a ranking of the variables. A threshold can then be used to select the variables. In this project, we used a 95% threshold to select the variables.
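The four PCA steps can be sketched in plain Python. This is an illustrative sketch under stated assumptions, not the implementation used in the project: the toy data set is invented, and Step 3 is carried out with power iteration, which recovers only the leading eigenvector. Power iteration is used here solely to avoid depending on a linear-algebra library; a full eigendecomposition would be needed to obtain all components.

```python
import math


def mean_center(data):
    """Step 1: subtract the per-dimension mean from every sample."""
    n, d = len(data), len(data[0])
    means = [sum(row[j] for row in data) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in data]
    return centered, means


def covariance(centered):
    """Step 2: sample covariance matrix of the mean-adjusted data."""
    n, d = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
            for i in range(d)]


def leading_eigenpair(matrix, iterations=100):
    """Step 3 (leading component only): power iteration."""
    d = len(matrix)
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Rayleigh quotient v^T M v gives the matching eigenvalue.
    eigenvalue = sum(v[i] * sum(matrix[i][j] * v[j] for j in range(d))
                     for i in range(d))
    return eigenvalue, v


def project(centered, component):
    """Step 4: express each mean-adjusted sample along the component."""
    return [sum(a * b for a, b in zip(row, component)) for row in centered]


# Toy data lying exactly on the line y = 2x (an assumption for illustration).
data = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
centered, means = mean_center(data)
eigenvalue, component = leading_eigenpair(covariance(centered))
scores = project(centered, component)
```

Because the toy points lie on a single line, the leading eigenvector points along that line (direction proportional to (1, 2)) and carries all of the variance.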
There are three design considerations when implementing a GA to solve a particular problem. First, a solution must be encoded on a GA chromosome. Second, an objective function must be identified to evaluate the fitness of a chromosome. Finally, the GA run parameters must be specified, including the genetic operators and their probabilities.

Chromosome
For the GA feature extractor, the definition of the chromosome is fairly straightforward: a binary chromosome (Fig. 1) of length n (dimensions), in which a gene value of "0" indicates that the corresponding feature is discarded. That is, if the ith gene contains 0, the corresponding ith feature is discarded; otherwise it is included in the data set for classification.
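The binary encoding can be sketched as a mask over the feature columns, with the fitness of a chromosome taken as the accuracy of a classifier trained on the selected columns only. The sketch below is an assumption-laden illustration: the paper evaluates fitness with an ANN, whereas this example substitutes a nearest-centroid classifier purely to stay short, and the function names and toy data are hypothetical.

```python
def select_features(row, chromosome):
    """Keep the ith value of `row` only where the ith gene is 1."""
    return [value for value, gene in zip(row, chromosome) if gene == 1]


def nearest_centroid_accuracy(train, test):
    """Stand-in classifier (NOT the ANN used in the paper).

    Fits one centroid per class on `train` and returns the fraction of
    `test` samples assigned to the correct class.
    """
    sums, counts = {}, {}
    for features, label in train:
        total = sums.setdefault(label, [0.0] * len(features))
        for i, x in enumerate(features):
            total[i] += x
        counts[label] = counts.get(label, 0) + 1
    centroids = {label: [s / counts[label] for s in total]
                 for label, total in sums.items()}

    def predict(features):
        return min(centroids, key=lambda label: sum(
            (a - b) ** 2 for a, b in zip(features, centroids[label])))

    return sum(predict(f) == y for f, y in test) / len(test)


def fitness(chromosome, train, test):
    """Classification accuracy obtained with the features the chromosome keeps."""
    if not any(chromosome):
        return 0.0                      # no feature selected: nothing to classify
    masked_train = [(select_features(x, chromosome), y) for x, y in train]
    masked_test = [(select_features(x, chromosome), y) for x, y in test]
    return nearest_centroid_accuracy(masked_train, masked_test)


# Toy data: feature 0 separates the classes, feature 1 is noise.
train = [([0.0, 5.0], 0), ([1.0, -5.0], 0), ([9.0, 5.0], 1), ([10.0, -5.0], 1)]
test = [([0.5, 9.0], 0), ([9.5, -9.0], 1)]
informative = fitness([1, 0], train, test)
noise_only = fitness([0, 1], train, test)
```

On this toy data the chromosome that keeps only the informative feature scores higher than the one that keeps only the noise feature, which is the signal the GA relies on when ranking chromosomes.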
[Fig. 4. Genetic algorithm as implemented in the project: "While (Termination Condition is not satisfied) do ..."]

An artificial neural network is a system based on the operation of biological neural networks; in other words, it is an emulation of a biological neural system [14]. The most common type of artificial neural network consists of three layers [15]. Each perceptron in the layers has its own weights and activation function. The weights are updated according to the error between the calculated output and the desired output. The effect of the error is propagated back from the output layer to the hidden layer(s). This kind of network is also called a back-propagation network. The activation function can also vary according to need. Here we use the sigmoid function for both the hidden layer and the output layer.
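A three-layer network with sigmoid units and one back-propagation update can be sketched as follows. This is a minimal illustration, not the network used in the project: the class name, the single output unit, the learning rate, and the zero initial weights are assumptions. Zero weights are chosen only to make the example deterministic; practical networks start from small random weights so that hidden units can learn different features.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


class ThreeLayerNet:
    """Input layer, one hidden layer, one output unit; sigmoid throughout."""

    def __init__(self, n_inputs, n_hidden):
        # The last weight in each list is the bias weight.
        self.w_hidden = [[0.0] * (n_inputs + 1) for _ in range(n_hidden)]
        self.w_output = [0.0] * (n_hidden + 1)

    def forward(self, inputs):
        x = list(inputs) + [1.0]
        hidden = [sigmoid(sum(w * v for w, v in zip(ws, x)))
                  for ws in self.w_hidden]
        h = hidden + [1.0]
        output = sigmoid(sum(w * v for w, v in zip(self.w_output, h)))
        return hidden, output

    def train_step(self, inputs, target, rate=0.5):
        """One gradient step on squared error; returns the pre-update output."""
        hidden, output = self.forward(inputs)
        x = list(inputs) + [1.0]
        h = hidden + [1.0]
        # Error signal at the output unit (includes the sigmoid derivative).
        delta_out = (target - output) * output * (1.0 - output)
        # Propagate the error back to each hidden unit through its output weight.
        delta_hidden = [delta_out * self.w_output[j] * hidden[j] * (1.0 - hidden[j])
                        for j in range(len(hidden))]
        self.w_output = [w + rate * delta_out * v
                         for w, v in zip(self.w_output, h)]
        for j, ws in enumerate(self.w_hidden):
            self.w_hidden[j] = [w + rate * delta_hidden[j] * v
                                for w, v in zip(ws, x)]
        return output


net = ThreeLayerNet(n_inputs=2, n_hidden=2)
before = net.train_step([1.0, 0.0], target=1.0)
_, after = net.forward([1.0, 0.0])
```

After one update the output for the training sample moves from 0.5 toward the target of 1, which is the effect of propagating the output error back through the weights.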
A chromosome Ch is promoted to the next generation if it satisfies
F(Ch) > min(mean(prev population), θ)
where θ is a threshold that serves as a control parameter of our genetic algorithm. In this project, we experimentally chose 70% as the threshold, but it can also be adapted automatically.
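The promotion rule can be sketched in a few lines. The function name, the representation of fitness as a fraction (0.70 for 70%), and the toy values are assumptions made for illustration.

```python
from statistics import mean


def promote(chromosomes, fitnesses, previous_fitnesses, theta=0.70):
    """Keep chromosomes with F(Ch) > min(mean(previous population), theta)."""
    cutoff = min(mean(previous_fitnesses), theta)
    return [ch for ch, f in zip(chromosomes, fitnesses) if f > cutoff]


# Toy fitness values (fractions of correctly classified samples).
previous = [0.60, 0.80, 0.90]          # mean is above theta, so theta is the cutoff
current = {"10110": 0.65, "01101": 0.70, "11100": 0.75}
survivors = promote(list(current), list(current.values()), previous)

# With a weak previous population the mean becomes the cutoff instead.
lenient = promote(list(current), list(current.values()), [0.40, 0.50, 0.60])
```

Taking the minimum of the two values means the threshold θ caps the bar for promotion: a strong previous population cannot push the cutoff above θ, while a weak one lowers it.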
We implemented the standard operators, crossover and mutation, with a high crossover probability (0.7) and a low mutation probability (0.01). The crossover operator (Fig. 2) aims to interchange information and genes between chromosomes. It therefore combines two or more parents to produce new children, one of which may collect all the good features that exist in its parents. Mutation (Fig. 3) is a genetic operator used to maintain genetic diversity from one generation of a population of chromosomes to the next. The purpose of the mutation operator is to allow the algorithm to avoid local minima by preventing the population of chromosomes from becoming too similar to each other. The genetic algorithm in this project is implemented as shown in Fig. 4. We terminated the algorithm based on the number of generations.

Fig. 5. Exp1: Genetic Algorithm terminated at 50 generations (x-axis: Generations).
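The two operators and the generation-based termination can be sketched as follows. Several choices here are assumptions rather than details from the text: single-point crossover is inferred from the example in Fig. 2, parents are picked from the fitter half of the population instead of by the promotion rule, and the toy fitness simply counts 1-bits. Only the probabilities 0.7 and 0.01 and the Fig. 2 bit strings come from the document.

```python
import random


def crossover(parent_a, parent_b, point):
    """Single-point crossover: the two parents swap tails after `point`."""
    return (parent_a[:point] + parent_b[point:],
            parent_b[:point] + parent_a[point:])


def mutate(chromosome, rate, rng):
    """Flip each bit independently with probability `rate`."""
    return "".join(
        ("1" if bit == "0" else "0") if rng.random() < rate else bit
        for bit in chromosome)


def evolve(population, fitness, generations, p_cross=0.7, p_mut=0.01, seed=0):
    """Run for a fixed number of generations and return the best chromosome."""
    rng = random.Random(seed)
    for _ in range(generations):        # termination: generation count
        ranked = sorted(population, key=fitness, reverse=True)
        parents = ranked[:max(2, len(ranked) // 2)]
        children = []
        while len(children) < len(population):
            a, b = rng.sample(parents, 2)
            if rng.random() < p_cross:
                a, b = crossover(a, b, rng.randrange(1, len(a)))
            children += [mutate(a, p_mut, rng), mutate(b, p_mut, rng)]
        population = children[:len(population)]
    return max(population, key=fitness)


# The parent chromosomes shown in Fig. 2, with the spaces removed.
c1 = "111000 100000 000011 001000 000011".replace(" ", "")
c2 = "000100 010000 000011 000100 001100".replace(" ", "")
child1, child2 = crossover(c1, c2, point=18)

# Toy fitness for demonstration only: number of selected features.
best = evolve([c1, c2, child1, child2], lambda ch: ch.count("1"), generations=5)
```

Cutting after the 18th gene reproduces the two offspring shown in Fig. 2, which is why a single crossover point is assumed here.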
[Figure: "... Algorithm terminated at 100 ..."; x-axis: Generations]

Table 1. Selected variables and accuracy from methods.

Feature selection using the wrapper model produced better results than the other two approaches. For the sickle cell anemia data, the integrated approach of GA and ANN selected 10 relevant variables and obtained a classification accuracy of 87.5%. The GA can be better than the recursive elimination used in [9] in the following two respects: (1) the execution time of the GA is controllable, since we can terminate it at any generation, and (2) the result of the GA can be improved by repeating trials and by varying the parameter values. In this regard, the GA seems preferable for all large-scale problems in which n > 100. For example, to get the optimal solution, the recursive elimination algorithm evaluates 2^100 - 1 combinations. In our GA setting, we need only 5000 evaluations for n > 100 to find an approximate solution.

The results show that there is no intersection between the feature subset selected by the KLE (Karhunen-Loève expansion) and the feature subset selected using exhaustive search [9]. Our approach involves the classifier in finding the best subset of features. The KLE, on the other hand, works independently of the classifier, which may degrade classifier performance.

The proposed GA is suitable for large-scale selection problems and has a high probability of finding better solutions. Despite its several advantages, our approach has drawbacks compared with other methodologies. Another common disadvantage of GAs is premature convergence. We overcame this problem by carefully designing the algorithm so that the diversity of the population is maintained. In this research project, we encoded the solution in a binary chromosome whose length equals the number of dimensions. This approach may not be a good way of representing a solution for microarray data or other high-dimensional problem domains. Despite these minor disadvantages, the GA is a good technique for variable selection.

[1] Il-Seok Oh, Jin-Seon Lee, and Byung-Ro Moon, "Hybrid Genetic Algorithms for Feature Selection," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 26, no. 11, November 2004.
[2] Huan Liu and Lei Yu, "Toward Integrating Feature Selection Algorithms for Classification and Clustering," IEEE Transactions on Knowledge and Data Engineering, vol. 17, no. 4, April 2005.
[3] Anil Jain and Douglas Zongker, "Feature Selection: Evaluation, Application, and Small Sample Performance," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 19, no. 2, February 1997.
[4] F. J. Ferri, P. Pudil, M. Hatef, and J. Kittler, "Comparative Study of Techniques for Large-Scale Feature Selection."
[5] Joachim Stender, "Introduction to Genetic Algorithms," Brainware
[6] Michael L. Raymer, William F. Punch, Erik D. Goodman, Leslie A. Kuhn, and Anil K. Jain, "Dimensionality Reduction Using Genetic Algorithms," IEEE Transactions on Evolutionary Computation, vol. 4, no.
[7] W. Siedlecki and J. Sklansky, "A note on genetic algorithms for large-scale feature selection," Pattern Recognit. Lett., vol. 10, pp. 335–347, 1989.
[8] Pádraig Cunningham and Sarah Jane Delany, "k-Nearest Neighbour
[9] Homayoun Valafar, Faramarz Valafar, Alan Darvill, Peter Albersheim, Abdullah Kutlar, Kristy F. Woods, and John Hardin, "Predicting the effectiveness of hydroxyurea in individual sickle cell anemia patients," Artificial Intelligence in Medicine, 18 (2): 133-148, February 2000.
[10] Faramarz Valafar, San Diego State University, "Lecture Notes: Methods in Bioinformatics and Medical Informatics."
[11] Abhilash Alexander Miranda, Yann-Aël Le Borgne, and Gianluca Bontempi, "New Routes from Minimal Approximation Error to Principal Components."
[12] Wikipedia, the free encyclopedia. Available: http://en.wikipedia.org/wiki/Sickle-cell_disease
[13] Artificial Neural Networks – A neural network tutorial. Available: http://www.learnartificialneuralnetworks.com/
http://www.doc.ic.ac.uk/~nd/surprise_96/journal/vol4/cs11/report.html#Feed-forward%20networks