Paper.dvi

Dynamically Weighted Hidden Markov Model for Spam Deobfuscation
Seunghak Lee
Iryoung Jeong
Seungjin Choi
Abstract
and its effect is limited. Other approaches involve varioussearch strategies. The Viterbi decoding is a time-consuming Spam deobfuscation is a processing to detect obfus- task for HMMs which contain a large vocabulary (e.g., 20,000 cated words appeared in spam emails and to convert words). The beam search algorithm [Jelinek, 1999] speeds up them back to the original words for correct recog- the Viterbi decoding by eliminating improper state paths in nition. Lexicon tree hidden Markov model (LT- the trellis with a threshold. However, it still has a limitation HMM) was recently shown to be useful in spam in performance gain if a large number of states are involved deobfuscation. However, LT-HMM suffers from a (e.g., 100,000 states) since the number of states should still huge number of states, which is not desirable for be taken into account in the algorithm.
Spam deobfuscation is an important pre-processing task a complexity-reduced HMM, referred to as dy- for content-based spam filters which use words of contents in namically weighted HMM (DW-HMM) where the emails to determine whether an incoming email is a spam or states involving the same emission probability are not. For instance, the word ”viagra” indicates that the email grouped into super-states, while preserving state containing such word is most likely a spam. However, spam- transition probabilities of the original HMM. DW- mers obfuscate words to circumvent spam filters by inserting, HMM dramatically reduces the number of states deleting and substituting characters of words as well as by in- and its state transition probabilities are determined troducing incorrect segmentation. For example, ”viagra” may in the decoding phase. We illustrate how we con- be written as ”vi@graa” in spam emails. Some examples of vert a LT-HMM to its associated DW-HMM. We obfuscated words in real spam emails are shown in Table 1.
confirm the useful behavior of DW-HMM in the Thus, an important task for successful content-based spam fil- task of spam deobfuscation, showing that it signifi- tering is to restore obfuscated words in spam emails to origi- cantly reduces the number of states while maintain- nal ones. Lee and Ng [Lee and Ng, 2005] proposed a method of spam deobfuscation based on a lexicon tree HMM (LT-HMM), demonstrating promising results. Nevertheless, the Introduction
LT-HMM suffers from a large number of states (e.g., 110,919 Large vocabulary problems in hidden Markov models states), which is not desirable for practical applications.
(HMMs) have been addressed in various areas such as hand-writing recognition [A. L. Koerich and Suen, 2003] and Table 1: Examples of obfuscated words in spam emails.
speech recognition [Jelinek, 1999]. As the vocabulary sizeincreases, the computational complexity for recognition and con. tains forwa. rdlook. ing sta. tements decoding dramatically grows, making the recognition system impractical. In order to solve the large vocabulary problem, various methods have been developed, especially in speechand handwriting recognition communities.
For example, one approach is to reduce the lexicon size by using the side information such as word length and word shape. However, this method reduces the global lexicon toonly its subset so that the true word hypothesis might be dis- In this paper, we present a complexity-reduced structure of carded. An alternative approach is to reduce a search space in HMM as a solution to the large vocabulary problem. The the large lexicon, since the same initial characters are shared core idea is to group states involving the same emission by the lexical tree, leading to redundant computations. Com- probabilities into a few number of super-states, while pre- pared to the flat lexicon where whole words are simply cor- serving state transition probabilities of the original HMM.
rected, these methods reduce the computational complexity.
The proposed complexity-reduced HMM is referred to as dy- However, the lexical tree still has a large number of nodes namically weighted HMM (DW-HMM), since state transition probabilities are dynamically determined using the data struc- ture which contains the state transition probabilities of the In the DW-HMM, the number of states is reduced from 8 to to construct a DW-HMM, given a LT-HMM, reducing the The data structure Φ (as shown in the right of the bottom in number of states dramatically. We also explain conditions Figure 1) is constructed, in order to preserve the state transi- for equivalence between DW-HMM and LT-HMM. We apply tion probabilities defined in the original HMM. Each node DW-HMM to the task of spam deobfuscation, emphasizing in Φ is labeled by the super-state which it belongs to and its reduced-complexity as well as performance.
state transition probabilities are stored, following the origi-nal HMM.
Dynamically Weighted Hidden Markov
Model

A hidden Markov model is a simple dynamic Bayesian net- work that is characterized by initial state probabilities, state transition probabilities, and emission probabilities. Notations 1. Individual hidden states of HMM are denoted by 1 , q2, q3, . . . , qK }, where K is the number of states.
The hidden state vector at time t is denoted by q 2. Observation symbols belong to a finite alphabet, {y1, y2, . . . , yM }, where M is the number of distinct ob- servation symbols. The observation data at time t is de- 3. The state transition probability from state q 4. The emission probability of observation symbol y 5. The initial state probability of state q With these definitions, HMM is characterized by the joint dis- Figure 1: Transition diagrams for the 8-state HMM, its asso-ciated 4-state DW-HMM, and the data structure Φ, are shown.
The original HMM contains 8 different states with 4 different emission probabilities. The nodes with the same emission probabilities are colored by the same gray-scale. DW-HMM We first illustrate the DW-HMM with a simple example.
contains 4 different super-states and the data structure Φ is Figure 1 shows a transition diagram of a 8-state HMM and its constructed in such a way that the transition probabilities are associated DW-HMM that consists of four super-states, s1, s2, s3, and s4. In this example, the original 8-state HMM(as shown in the top in Figure 1) has four distinct emission The trellis of the 4-state DW-HMM is shown in Figure 2.
Transition probabilities are determined by searching Φ using a hypothesis, s1:t. To show the relation between DW-HMM and HMM, we define that super-state sequence of DW-HMM, s1:T , is corresponding to state sequence of HMM, q1:T , if The states involving the same emission probability, are Thus, in Figure 2, the hypothesis of the 8-state HMM cor- grouped into a super-state, denoted by s in the DW-HMM (as shown in the left of the bottom in Figure 1). In such a = {s1, s2, s3, s4} is the state sequence, case, we construct a DW-HMM containing 4 super-states, s 2 , s3, and s4, where the following is satisfied: sequence and the observation sequence are as follows: P (y|s2) = P (y|q2) = P (y|q4) = P (y|q6), Figure 2: Trellis of the 4-state DW-HMM converted from the 8-state HMM. Transition probabilities are determined by searchinga data structure, Φ, with a hypothesis, s1:t. The search process at a time instance is represented by the thick line in Φ.
yT } of HMMs equals that of the corresponding super-state sequence s1:T = {s1, s2, . . . , sT } and the same P (q1:T , y1:T ) = P (s1:T , y1:T ).
emission probabilities involved in the joint probabilities are The state transition probability of DW-HMM is defined as where s1:t is state sequence decoded so far, s1:t = The DW-HMM searches Φ with the hypothesis, mining the transition probabilities as follows: returns transition probabilities from Φ. The weight function,ω(s1:t), traces a hypothesis, s1:t, in Φ, and returns a state t. For example, in Figure 2, ω(s1:3) determines the state transition probability, P (s3|s2) at t = 3. The thick line in Φ shows that ω(s1:3) traces the hypothesis s1:3, and re- turns the transition probability, P (s3|s2), stored in the node, s3, which is the same as the value of P (q5|q2) as shown in By the above equalities, the 8-state HMM and the 4-state DW- Figure 1. If a node contains several state transition probabil- HMM have the same joint probability as follows: ities, we choose the correct one according to the node vis-ited at t − 1. It should be noted that DW-HMM determines the transition probability by searching the data structure of Φ In searching of a probability, P (st|st 1), we assume that whereas HMMs find them in the transition probability matrix.
state sequence s1:t is unique in Φ. The constraint is not so DW-HMMs converted from HMMs have the following strong in that if there are several state sequences which have exactly the same emission probabilities for each state, it is which have a very large state transition structure and com- very possible that the model has redundant paths. Therefore, mon emission probabilities among a number of states, such it is usual when the constraint is kept in HMMs.
as LT-HMMs. When we convert a HMM to its associated DW-HMM, the computational complexity significantly de- creases because the number of states is dramatically reduced.
Second, there is no need to maintain a transition probability matrix since the weight function dynamically gives transitionprobabilities in the decoding phase. Third, DW-HMMs are soflexible that it is easy to add, delete, or change any states inthe state transition structure since only the data structure, Φ, needs to be updated. Fourth, the speed and accuracy are con- figurable by using beam search algorithm and N-best search[Jelinek, 1999] respectively, thus securing a desirable perfor-mance.
Conversion algorithm
To convert a HMM to DW-HMM, we should make a set Figure 3: Representation of self-transitions of HMMs in Φ.
of super-states, S, from the states of HMM which repre-sent unique emission probabilities.
a small number of super-states and the original HMM’s Conditions for equivalence
state transition structure is large, the DW-HMM is more In the conversion process, DW-HMM conserves transition efficient than HMMs. For example, a LT-HMM is a good and emission probabilities. However, there is a difference candidate to convert to a DW-HMM, because it only has a between HMM and its associated DW-HMM when we use few states with unique emission probabilities and contains straightforward Viterbi algorithm. Figure 4 illustrates trellis large trie dictionary. The following explains the algorithm of of DW-HMM at t = 3 and the state s1, the only state path {s1, s2, s1} that can propagate further since {s1, s3, s1} hasa smaller probability than {s1, s2, s1}. Therefore, the path{s1, s3, s1} is discarded in a purging step of the Viterbi algo- rithm. However, in case of HMM shown in Figure 5, at t = 3and the state q1, both state paths {q1, q2, q1} and {q1, q3, q1}can propagate to the next time instant because there are two 1. Make a set of super-states which have unique emission To ensure that the results of DW-HMM and HMM are the this step, the number of states of HMM is reduced as same, we adapt N-best search for DW-HMMs and we choose the best path at the last time instant in the trellis. Figure 6shows that the two-best search makes two hypotheses at s1 2. If there are any loops (self-transitions) in the HMM, and t = 3. Both state paths {s1, s2, s1} and {s1, s3, s1} are a loop, an additional sj state is made.
possible to distinguish between self-transitions and 3. Construct the DW-HMM associated with the HMM between super-states if there exists a state transition in the HMM, from which the super-states are made. Forexample, in Figure 1, the 4-state DW-HMM has a state transition from s2 to s3 because q2 may transition to q5in the 8-state HMM.
4. Make a data structure, Φ, to define a weight function, ω(s1:t), which gives the transition probability of the DW-HMM. Φ contains the transition structure of the HMM and stores transition probabilities in each node.
In making Φ, self-transitions in the HMM are changed as shown in Figure 3. The super-state that has a looptransitions to an additional super-state made from step 2 and it transitions to states where the original stategoes. The structure of Φ and the state transition struc- ture of the HMM may be different due to self-transitions.
Figure 7 illustrates the case when two-best search is useful 5. Define emission probabilities of the DW-HMM’s super- in LT-HMM. We denote the physical property of the states states which are the same as of the corresponding states in the nodes, showing which states have the same emission the same observation symbols as the model of Lee and Ng.
Transition probabilities of the DW-HMM are determined in the decoding phase by a weight function equipped with Φ, which is a data structure of lexicon tree containing transitionprobabilities of the LT-HMM. Null transitions are allowed and their transition probabilities are also determined by the weight function. It recovers deletion of characters in the in- The set of individual hidden super-states of the DW-HMM Figure 6: Trellis for two-best search.
where s1 is an initial state and states in {s2, . . . , s27} are probability. q1 and qN are the initial and final state of the LT- match states. States in {s28, . . . , s54} are insert states and s55 HMM, respectively. Assuming that there exist two hypothe- is the final state. A match state is the super-state representing ses at t = 4, {q1, a, b, a}, {q1, a, c, a}, the LT-HMM allow the letters of the English alphabet and an insert state is the them to propagate further. However, the DW-HMM cannot state stems from the self-transitions of the LT-HMM. Since preserve both at t = 4 if we use one-best search. Since the self-transitions of the LT-HMM reflect insertion of characters states labeled ”a” in the LT-HMM are grouped into a super- in obfuscated words, the state is named an insert state. The state, one hypothesis should be selected at t = 4. Thus, final state is the state which represents the end of words.
in case of the DW-HMM, if the answer’s state sequence is Transition probability of the DW-HMM is as follows: {q1, a, c, a, c, i, a, qN }, and only one hypothesis, {q1, a, b, a}, is chosen at t = 4, we will fail to find the answer. We address the problem with N-best search. If we adapt two-best search, two hypotheses, {q1, a, b, a}, {q1, a, c, a}, are able to propa- The structure of Φ is made using lexicon tree and each node gate further, thereby preserving the hypothesis for the answer.
of Φ has the transition probability of the LT-HMM.
HMM and converted DW-HMM give exactly the same re- Here, a hypothesis s1:t starts from s1 which is the initial sults if we adapt the N-best search where N is the maximum state decoded. The DW-HMM converted from the LT-HMM number of states of HMMs that compose the same super-state has only one initial state and it may appear many times in and propagate further at a time instant. However, for practi- the hypothesis since the input data is a sentence. To reduce cal purposes, N does not need to be large. Our application the computational cost in searching Φ, we can use the sub- for spam deobfuscation shows that the DW-HMM works well sequence of s1:t to search Φ since it is possible to reach the same node of Φ if a subsequence starts from the initial stateat any time instant.
We deobfuscate spam emails by choosing the best path us- ing decoding algorithms given observation characters. For example, given the observation characters, ”vi@”, the best state sequence, {s1, s23, s10, s2}, is chosen which represents We define emission probabilities as follows: Figure 7: Transition diagram of LT-HMM. By using two-best search, its associated DW-HMM is able to keep two states Application
A lexicon tree hidden Markov model(LT-HMM) for spam deobfuscation was proposed by Lee and Ng [Lee and Ng, 2005]. It consists of 110,919 states and 70 observation sym- t is not a letter of the English alphabet bols, such as the English alphabet, the space, and all otherstandard non-control ASCII characters. We transform the LT- 1More than one consecutive character deletion is not considered HMM to the DW-HMM which consists of 55 super-states and due to the severe harm in readability.
An emission probability mass function is given according the parameter for the self-transition should be significantly to the type of the observation. Here, a corresponding charac- smaller than the parameter for non-self-transition, because ter is a physical property of each node. For example, if a node the insertion of characters is less frequent than correctly writ- represents a character ”a”, the letter ”a” is the corresponding ten letters. From the starting point, we optimize each value character. A similar character is given by Leet, 2 which lists in θ and get θ0 which is locally maximized around the initial analogous characters. For example, ”@” is a similar character Decoding
Experimental results
The straightforward Viterbi algorithm is used to find the best state sequence, s1:T = {s1, s2, . . . , sT }, for a given observa- P (qt|qt ), and make Φ using a English dictionary (83,552 words) and large email data from spam corpus. 4 Parameters be different due to null transitions.
of our model are optimized with actual spam emails contain- The accuracy and speed are configurable for DW-HMM by ing 65 lines and 447 words. We perform an experiment with using the N-best search and the beam search algorithm. N- actual spam emails which contain 313 lines and 2,131 words, best search makes it possible to improve the accuracy in that including insertion, substitution, deletion, segmentation, and multiple super-states are preserved in the decoding phase. For the mixed types of obfuscation. Almost all the words are in- our model, we select the best path rather than N most prob- cluded in the lexicon that we use. Table 2 exhibits some ex- able paths at the final time instant in the trellis. However, it amples of various types of obfuscation.
increases computational complexity at a cost of the improvedaccuracy.
The time complexity of the Viterbi algorithm is O(K2T ), Table 2: Some examples of various types of spam obfusca- where K is the number of states and T is the length of the input. When the N-best search is used, the time complexityis O((N K)2T ). To address the speed issue, the beam search algorithm is used and it improves the speed without losing much accuracy. Although LT-HMMs are also able to speed up by using beam search algorithm, the large number of states In our experiment, when we use the Viterbi algorithm, the rate of process was 246 characters/sec. We speed up the de- obfuscation process at a rate of 2,038 characters/sec with abeam width of 10 by using the beam search algorithm.
When it comes to the complexity of the weight function, Our experiments are performed using various decoding methods. We use the one-best and two-best searches with var- s1:t), it is almost negligible since trie has O(M ) complex- ity where M is the maximum length of words in the lexicon.
ious beam widths and evaluate the results in terms of the ac-curacy and speed. Table 3 shows the accuracy of the results of Parameter learning
spam deobfuscation when two-best search with a beam widthof five is used. It represents that our model performs well for Our model has several parameters, θ 3, which should be the insertion, substitution, segmentation, and the mixed types optimized. We adapt greedy hillclimbing search to get local of obfuscation. However, considering that the deletion type maxima. By using a training set which consists of obfuscated of obfuscation is rare in real spam emails, our model has not words and corresponding answers, we find the parameter set which locally maximize the log likelihood.
Table 3: Accuracy of DW-HMM with two-best search and a where (s1:t, y1:t) is a pair of obfuscated observations in thetraining data and corresponding answer’s state sequence. θ is the locally optimized parameter set and n is the number of lines of the training set. η and ǫ determine the probability of the self-transition and null transition respectively [Lee and We start optimizing parameters, θ, from initial values which are set according to their characteristics. For example, 2Leet is defined as the modification of written text, see (en.wikipedia.org/wiki/Leet) website.
4We use 2005 TREC public spam corpus. For more information, 3θ = {η, ǫ, ρ1, ρ2, ρ3, σ1, σ2, ψ1, ψ2, ψ3}.
see (plg.uwaterloo.ca/˜gvcormac/treccorpus/) website.
Figure 8 shows the effect of N-best search and beam width can also use DW-HMM where the state transition structure for the overall accuracy and computation speed 5 of our frequently changes since it is easy to maintain such changes model. Overall accuracy is defined as the fraction of correctly deobfuscated words and the computation speed is the numberof processed characters per second. As the beam width in- Acknowledgment
creases, the overall accuracy rises a little only when the one- Portion of this work was supported by Korea MIC under best search with a beam width of five is used. However, it ITRC support program supervised by the IITA (IITA-2005- turns out that the two-best search significantly improves the overall accuracy compared to the one-best search. The two-best search with a beam width of five shows the accuracy of96.9% and the processing speed of 2,136 characters/sec. The References
experimental results show that two-best search with a beam [A. L. Koerich and Suen, 2003] R. Sabourin A. L. Koerich width of five is a desirable configuration maintaining the high and C. Y. Suen. Large vocabulary off-line handwriting accuracy and a low computational cost.
recognition: A survey. Pattern Analysis and Applications,6(2):97–121, 2003.
[Chow and Schwartz, 1989] Y. Chow and R. Schwartz. The n-best algorithm: An efficient procedure for finding topn sentence hypotheses.
Language Workshop, pages 199–202, Cape Cod, Mas-sachusetts, 1989.
One-best Computation SpeedTwo-best Overall Accuracy [H. Murveit and Weintraub, 1993] V. Digalakis H. Murveit, J. Butzberger and M. Weintraub. Large vocabulary dic- tation using SRIs DECIPHER speech recognition system:Progressive search techniques. In Proc. ICASSP, pages [Jelinek, 1999] F. Jelinek. Statistical Methods for Speech [Lee and Kim, 1999] H. Lee and J. Kim. An HMM-based threshold model approach for gesture recognition. IEEE Figure 8: Effect of beam width and N-best search on the Transactions on Pattern Analysis and Machine Intelli- gence, 21(10):961–973, October 1999.
[Lee and Ng, 2005] H. Lee and A. Y. Ng. Spam deobfusca- tion using a hidden Markov model. In Proc. 2nd Confer- Conclusions
ence on Email and Anti-Spam, Stanford University, CA,USA, 2005.
We have presented dynamically weighted hidden Markovmodel (DW-HMM) which dramatically reduced the number [Lifchitz and Maire, 2000] A. Lifchitz and F. Maire. A fast of states when a few sets of states had distinct emission prob- lexically constrained Viterbi algorithm for on-line hand- abilities. The states sharing the same emission probabilities were grouped into super-states in DW-HMM. State transi- shop on Frontiers in Handwriting Recognition (IWFHR- tion probabilities in DW-HMM were determined by a weight 7), pages 313–322, Amsterdam, The Netherlands, Septem- function which reflects the original state transitions main- tained in the data structure Φ, rather than a large transition [Rabiner, 1989] L. R. Rabiner. A tutorial on hidden Markov probability matrix. We have shown how an HMM is con- models and selected applications in speech recognition.
verted into its associated DW-HMM, retaining a few num- Proceedings of the IEEE, 77(2):257–286, February 1989.
ber of super-sates. We have applied DW-HMM to the task of [S. Procter and Mokhtarian, 2000] J. Illingworth S. Procter spam deobfuscation, where the LT-HMM was replaced by the and F. Mokhtarian. Cursive handwriting recognition using DW-HMM. Our experimental results showed that it improves hidden Markov models and a lexicon-driven level build- the speed from 10 characters/sec to 207 characters/sec when a Vision, Image, and Signal Processing, straightforward Viterbi algorithm is applied. DW-HMM can be applied to diverse areas where a highly structured HMMis used with a few distinct emission probabilities. For exam-ple, in speech and handwriting recognition areas, our modelmay be used to address the large vocabulary problems. We 5Computation speed is evaluated when DW-HMM is executed on PC with Pentium 4 1.66GHz and 512MB RAM.

Source: http://mlg.postech.ac.kr/publications/inter_conf/2007/ijcai07.pdf

Tsdnc proceedings 2010 - adjusted.pmd

Nutritional and Animal Welfare Implications to Lameness Jan K. Shearer1 Department of Veterinary Diagnostic and Production Animal Medicine Abstract metalloproteinase enzymes and peripartumhormones, such as estrogen and relaxin. Theimplications of this are that in addition to feedingfermentive disorders occurring secondary to theand nutrition, dairy farmers must pay particularconsu

Microsoft word - 090902 press release-astelin jv purchase.doc

News Release MEDPOINTE ACQUIRES 100% OF ASTELIN® RIGHTS IN US AND CANADA Acquires Joint Venture Interest, Intellectual Property Not Already Owned SOMERSET, NJ – September 9, 2002 – In its third strategic move this year designed to capture the full commercial potential of its flagship product, the anti-allergy nasal spray ASTELIN®, MedPointe announced today it has acquired ex

Copyright © 2010 Find Medical Article