Dynamically Weighted Hidden Markov Model for Spam Deobfuscation Seunghak Lee Iryoung Jeong Seungjin Choi Abstract
and its effect is limited. Other approaches involve varioussearch strategies. The Viterbi decoding is a time-consuming
Spam deobfuscation is a processing to detect obfus-
task for HMMs which contain a large vocabulary (e.g., 20,000
cated words appeared in spam emails and to convert
words). The beam search algorithm [Jelinek, 1999] speeds up
them back to the original words for correct recog-
the Viterbi decoding by eliminating improper state paths in
nition. Lexicon tree hidden Markov model (LT-
the trellis with a threshold. However, it still has a limitation
HMM) was recently shown to be useful in spam
in performance gain if a large number of states are involved
deobfuscation. However, LT-HMM suffers from a
(e.g., 100,000 states) since the number of states should still
huge number of states, which is not desirable for
be taken into account in the algorithm.
Spam deobfuscation is an important pre-processing task
a complexity-reduced HMM, referred to as dy-
for content-based spam filters which use words of contents in
namically weighted HMM (DW-HMM) where the
emails to determine whether an incoming email is a spam or
states involving the same emission probability are
not. For instance, the word ”viagra” indicates that the email
grouped into super-states, while preserving state
containing such word is most likely a spam. However, spam-
transition probabilities of the original HMM. DW-
mers obfuscate words to circumvent spam filters by inserting,
HMM dramatically reduces the number of states
deleting and substituting characters of words as well as by in-
and its state transition probabilities are determined
troducing incorrect segmentation. For example, ”viagra” may
in the decoding phase. We illustrate how we con-
be written as ”vi@graa” in spam emails. Some examples of
vert a LT-HMM to its associated DW-HMM. We
obfuscated words in real spam emails are shown in Table 1.
confirm the useful behavior of DW-HMM in the
Thus, an important task for successful content-based spam fil-
task of spam deobfuscation, showing that it signifi-
tering is to restore obfuscated words in spam emails to origi-
cantly reduces the number of states while maintain-
nal ones. Lee and Ng [Lee and Ng, 2005] proposed a method
of spam deobfuscation based on a lexicon tree HMM (LT-HMM), demonstrating promising results. Nevertheless, the
Introduction
LT-HMM suffers from a large number of states (e.g., 110,919
Large vocabulary problems in hidden Markov models
states), which is not desirable for practical applications.
(HMMs) have been addressed in various areas such as hand-writing recognition [A. L. Koerich and Suen, 2003] and
Table 1: Examples of obfuscated words in spam emails.
speech recognition [Jelinek, 1999]. As the vocabulary sizeincreases, the computational complexity for recognition and
con. tains forwa. rdlook. ing sta. tements
decoding dramatically grows, making the recognition system
impractical. In order to solve the large vocabulary problem,
various methods have been developed, especially in speechand handwriting recognition communities.
For example, one approach is to reduce the lexicon size
by using the side information such as word length and word
shape. However, this method reduces the global lexicon toonly its subset so that the true word hypothesis might be dis-
In this paper, we present a complexity-reduced structure of
carded. An alternative approach is to reduce a search space in
HMM as a solution to the large vocabulary problem. The
the large lexicon, since the same initial characters are shared
core idea is to group states involving the same emission
by the lexical tree, leading to redundant computations. Com-
probabilities into a few number of super-states, while pre-
pared to the flat lexicon where whole words are simply cor-
serving state transition probabilities of the original HMM.
rected, these methods reduce the computational complexity.
The proposed complexity-reduced HMM is referred to as dy-
However, the lexical tree still has a large number of nodes
namically weighted HMM (DW-HMM), since state transition
probabilities are dynamically determined using the data struc-
ture which contains the state transition probabilities of the
In the DW-HMM, the number of states is reduced from 8 to
to construct a DW-HMM, given a LT-HMM, reducing the
The data structure Φ (as shown in the right of the bottom in
number of states dramatically. We also explain conditions
Figure 1) is constructed, in order to preserve the state transi-
for equivalence between DW-HMM and LT-HMM. We apply
tion probabilities defined in the original HMM. Each node
DW-HMM to the task of spam deobfuscation, emphasizing
in Φ is labeled by the super-state which it belongs to and
its reduced-complexity as well as performance.
state transition probabilities are stored, following the origi-nal HMM. Dynamically Weighted Hidden Markov Model
A hidden Markov model is a simple dynamic Bayesian net-
work that is characterized by initial state probabilities, state
transition probabilities, and emission probabilities. Notations
1. Individual hidden states of HMM are denoted by
1 , q2, q3, . . . , qK }, where K is the number of states.
The hidden state vector at time t is denoted by q
2. Observation symbols belong to a finite alphabet,
{y1, y2, . . . , yM }, where M is the number of distinct ob-
servation symbols. The observation data at time t is de-
3. The state transition probability from state q
4. The emission probability of observation symbol y
5. The initial state probability of state q
With these definitions, HMM is characterized by the joint dis-
Figure 1: Transition diagrams for the 8-state HMM, its asso-ciated 4-state DW-HMM, and the data structure Φ, are shown.
The original HMM contains 8 different states with 4 different
emission probabilities. The nodes with the same emission
probabilities are colored by the same gray-scale. DW-HMM
We first illustrate the DW-HMM with a simple example.
contains 4 different super-states and the data structure Φ is
Figure 1 shows a transition diagram of a 8-state HMM and its
constructed in such a way that the transition probabilities are
associated DW-HMM that consists of four super-states, s1,
s2, s3, and s4. In this example, the original 8-state HMM(as shown in the top in Figure 1) has four distinct emission
The trellis of the 4-state DW-HMM is shown in Figure 2.
Transition probabilities are determined by searching Φ using
a hypothesis, s1:t. To show the relation between DW-HMM
and HMM, we define that super-state sequence of DW-HMM,
s1:T , is corresponding to state sequence of HMM, q1:T , if
The states involving the same emission probability, are
Thus, in Figure 2, the hypothesis of the 8-state HMM cor-
grouped into a super-state, denoted by s in the DW-HMM
(as shown in the left of the bottom in Figure 1). In such a
= {s1, s2, s3, s4} is the state sequence,
case, we construct a DW-HMM containing 4 super-states, s
2 , s3, and s4, where the following is satisfied:
sequence and the observation sequence are as follows:
P (y|s2) = P (y|q2) = P (y|q4) = P (y|q6),
Figure 2: Trellis of the 4-state DW-HMM converted from the 8-state HMM. Transition probabilities are determined by searchinga data structure, Φ, with a hypothesis, s1:t. The search process at a time instance is represented by the thick line in Φ.
yT } of HMMs equals that of the corresponding
super-state sequence s1:T = {s1, s2, . . . , sT } and the same
P (q1:T , y1:T ) = P (s1:T , y1:T ).
emission probabilities involved in the joint probabilities are
The state transition probability of DW-HMM is defined as
where s1:t is state sequence decoded so far, s1:t =
The DW-HMM searches Φ with the hypothesis,
mining the transition probabilities as follows:
returns transition probabilities from Φ. The weight function,ω(s1:t), traces a hypothesis, s1:t, in Φ, and returns a state
t. For example, in Figure 2, ω(s1:3) determines the
state transition probability, P (s3|s2) at t = 3. The thick line
in Φ shows that ω(s1:3) traces the hypothesis s1:3, and re-
turns the transition probability, P (s3|s2), stored in the node,
s3, which is the same as the value of P (q5|q2) as shown in
By the above equalities, the 8-state HMM and the 4-state DW-
Figure 1. If a node contains several state transition probabil-
HMM have the same joint probability as follows:
ities, we choose the correct one according to the node vis-ited at t − 1. It should be noted that DW-HMM determines
the transition probability by searching the data structure of Φ
In searching of a probability, P (st|st 1), we assume that
whereas HMMs find them in the transition probability matrix.
state sequence s1:t is unique in Φ. The constraint is not so
DW-HMMs converted from HMMs have the following
strong in that if there are several state sequences which have
exactly the same emission probabilities for each state, it is
which have a very large state transition structure and com-
very possible that the model has redundant paths. Therefore,
mon emission probabilities among a number of states, such
it is usual when the constraint is kept in HMMs.
as LT-HMMs. When we convert a HMM to its associated
DW-HMM, the computational complexity significantly de-
creases because the number of states is dramatically reduced.
Second, there is no need to maintain a transition probability
matrix since the weight function dynamically gives transitionprobabilities in the decoding phase. Third, DW-HMMs are soflexible that it is easy to add, delete, or change any states inthe state transition structure since only the data structure, Φ,
needs to be updated. Fourth, the speed and accuracy are con-
figurable by using beam search algorithm and N-best search[Jelinek, 1999] respectively, thus securing a desirable perfor-mance. Conversion algorithm
To convert a HMM to DW-HMM, we should make a set
Figure 3: Representation of self-transitions of HMMs in Φ.
of super-states, S, from the states of HMM which repre-sent unique emission probabilities.
a small number of super-states and the original HMM’s
Conditions for equivalence
state transition structure is large, the DW-HMM is more
In the conversion process, DW-HMM conserves transition
efficient than HMMs. For example, a LT-HMM is a good
and emission probabilities. However, there is a difference
candidate to convert to a DW-HMM, because it only has a
between HMM and its associated DW-HMM when we use
few states with unique emission probabilities and contains
straightforward Viterbi algorithm. Figure 4 illustrates trellis
large trie dictionary. The following explains the algorithm of
of DW-HMM at t = 3 and the state s1, the only state path
{s1, s2, s1} that can propagate further since {s1, s3, s1} hasa smaller probability than {s1, s2, s1}. Therefore, the path{s1, s3, s1} is discarded in a purging step of the Viterbi algo-
rithm. However, in case of HMM shown in Figure 5, at t = 3and the state q1, both state paths {q1, q2, q1} and {q1, q3, q1}can propagate to the next time instant because there are two
1. Make a set of super-states which have unique emission
To ensure that the results of DW-HMM and HMM are the
this step, the number of states of HMM is reduced as
same, we adapt N-best search for DW-HMMs and we choose
the best path at the last time instant in the trellis. Figure 6shows that the two-best search makes two hypotheses at s1
2. If there are any loops (self-transitions) in the HMM,
and t = 3. Both state paths {s1, s2, s1} and {s1, s3, s1} are
a loop, an additional sj state is made.
possible to distinguish between self-transitions and
3. Construct the DW-HMM associated with the HMM
between super-states if there exists a state transition in
the HMM, from which the super-states are made. Forexample, in Figure 1, the 4-state DW-HMM has a state
transition from s2 to s3 because q2 may transition to q5in the 8-state HMM.
4. Make a data structure, Φ, to define a weight function,
ω(s1:t), which gives the transition probability of the
DW-HMM. Φ contains the transition structure of the
HMM and stores transition probabilities in each node.
In making Φ, self-transitions in the HMM are changed
as shown in Figure 3. The super-state that has a looptransitions to an additional super-state made from step
2 and it transitions to states where the original stategoes. The structure of Φ and the state transition struc-
ture of the HMM may be different due to self-transitions.
Figure 7 illustrates the case when two-best search is useful
5. Define emission probabilities of the DW-HMM’s super-
in LT-HMM. We denote the physical property of the states
states which are the same as of the corresponding states
in the nodes, showing which states have the same emission
the same observation symbols as the model of Lee and Ng. Transition probabilities of the DW-HMM are determined in
the decoding phase by a weight function equipped with Φ,
which is a data structure of lexicon tree containing transitionprobabilities of the LT-HMM. Null transitions are allowed
and their transition probabilities are also determined by the
weight function. It recovers deletion of characters in the in-
The set of individual hidden super-states of the DW-HMM
Figure 6: Trellis for two-best search.
where s1 is an initial state and states in {s2, . . . , s27} are
probability. q1 and qN are the initial and final state of the LT-
match states. States in {s28, . . . , s54} are insert states and s55
HMM, respectively. Assuming that there exist two hypothe-
is the final state. A match state is the super-state representing
ses at t = 4, {q1, a, b, a}, {q1, a, c, a}, the LT-HMM allow
the letters of the English alphabet and an insert state is the
them to propagate further. However, the DW-HMM cannot
state stems from the self-transitions of the LT-HMM. Since
preserve both at t = 4 if we use one-best search. Since the
self-transitions of the LT-HMM reflect insertion of characters
states labeled ”a” in the LT-HMM are grouped into a super-
in obfuscated words, the state is named an insert state. The
state, one hypothesis should be selected at t = 4. Thus,
final state is the state which represents the end of words.
in case of the DW-HMM, if the answer’s state sequence is
Transition probability of the DW-HMM is as follows:
{q1, a, c, a, c, i, a, qN }, and only one hypothesis, {q1, a, b, a},
is chosen at t = 4, we will fail to find the answer. We address
the problem with N-best search. If we adapt two-best search,
two hypotheses, {q1, a, b, a}, {q1, a, c, a}, are able to propa-
The structure of Φ is made using lexicon tree and each node
gate further, thereby preserving the hypothesis for the answer.
of Φ has the transition probability of the LT-HMM.
HMM and converted DW-HMM give exactly the same re-
Here, a hypothesis s1:t starts from s1 which is the initial
sults if we adapt the N-best search where N is the maximum
state decoded. The DW-HMM converted from the LT-HMM
number of states of HMMs that compose the same super-state
has only one initial state and it may appear many times in
and propagate further at a time instant. However, for practi-
the hypothesis since the input data is a sentence. To reduce
cal purposes, N does not need to be large. Our application
the computational cost in searching Φ, we can use the sub-
for spam deobfuscation shows that the DW-HMM works well
sequence of s1:t to search Φ since it is possible to reach the
same node of Φ if a subsequence starts from the initial stateat any time instant.
We deobfuscate spam emails by choosing the best path us-
ing decoding algorithms given observation characters. For
example, given the observation characters, ”vi@”, the best
state sequence, {s1, s23, s10, s2}, is chosen which represents
We define emission probabilities as follows:
Figure 7: Transition diagram of LT-HMM. By using two-best
search, its associated DW-HMM is able to keep two states
Application
A lexicon tree hidden Markov model(LT-HMM) for spam
deobfuscation was proposed by Lee and Ng [Lee and Ng,
2005]. It consists of 110,919 states and 70 observation sym-
t is not a letter of the English alphabet
bols, such as the English alphabet, the space, and all otherstandard non-control ASCII characters. We transform the LT-
1More than one consecutive character deletion is not considered
HMM to the DW-HMM which consists of 55 super-states and
due to the severe harm in readability.
An emission probability mass function is given according
the parameter for the self-transition should be significantly
to the type of the observation. Here, a corresponding charac-
smaller than the parameter for non-self-transition, because
ter is a physical property of each node. For example, if a node
the insertion of characters is less frequent than correctly writ-
represents a character ”a”, the letter ”a” is the corresponding
ten letters. From the starting point, we optimize each value
character. A similar character is given by Leet, 2 which lists
in θ and get θ0 which is locally maximized around the initial
analogous characters. For example, ”@” is a similar character
Decoding Experimental results
The straightforward Viterbi algorithm is used to find the best
state sequence, s1:T = {s1, s2, . . . , sT }, for a given observa-
P (qt|qt ), and make Φ using a English dictionary (83,552
words) and large email data from spam corpus. 4 Parameters
be different due to null transitions.
of our model are optimized with actual spam emails contain-
The accuracy and speed are configurable for DW-HMM by
ing 65 lines and 447 words. We perform an experiment with
using the N-best search and the beam search algorithm. N-
actual spam emails which contain 313 lines and 2,131 words,
best search makes it possible to improve the accuracy in that
including insertion, substitution, deletion, segmentation, and
multiple super-states are preserved in the decoding phase. For
the mixed types of obfuscation. Almost all the words are in-
our model, we select the best path rather than N most prob-
cluded in the lexicon that we use. Table 2 exhibits some ex-
able paths at the final time instant in the trellis. However, it
amples of various types of obfuscation.
increases computational complexity at a cost of the improvedaccuracy.
The time complexity of the Viterbi algorithm is O(K2T ),
Table 2: Some examples of various types of spam obfusca-
where K is the number of states and T is the length of the
input. When the N-best search is used, the time complexityis O((N K)2T ). To address the speed issue, the beam search
algorithm is used and it improves the speed without losing
much accuracy. Although LT-HMMs are also able to speed
up by using beam search algorithm, the large number of states
In our experiment, when we use the Viterbi algorithm, the
rate of process was 246 characters/sec. We speed up the de-
obfuscation process at a rate of 2,038 characters/sec with abeam width of 10 by using the beam search algorithm.
When it comes to the complexity of the weight function,
Our experiments are performed using various decoding
methods. We use the one-best and two-best searches with var-
s1:t), it is almost negligible since trie has O(M ) complex-
ity where M is the maximum length of words in the lexicon.
ious beam widths and evaluate the results in terms of the ac-curacy and speed. Table 3 shows the accuracy of the results of
Parameter learning
spam deobfuscation when two-best search with a beam widthof five is used. It represents that our model performs well for
Our model has several parameters, θ 3, which should be
the insertion, substitution, segmentation, and the mixed types
optimized. We adapt greedy hillclimbing search to get local
of obfuscation. However, considering that the deletion type
maxima. By using a training set which consists of obfuscated
of obfuscation is rare in real spam emails, our model has not
words and corresponding answers, we find the parameter set
which locally maximize the log likelihood.
Table 3: Accuracy of DW-HMM with two-best search and a
where (s1:t, y1:t) is a pair of obfuscated observations in thetraining data and corresponding answer’s state sequence. θ
is the locally optimized parameter set and n is the number of
lines of the training set. η and ǫ determine the probability of
the self-transition and null transition respectively [Lee and
We start optimizing parameters, θ, from initial values
which are set according to their characteristics. For example,
2Leet is defined as the modification of written text, see
(en.wikipedia.org/wiki/Leet) website.
4We use 2005 TREC public spam corpus. For more information,
3θ = {η, ǫ, ρ1, ρ2, ρ3, σ1, σ2, ψ1, ψ2, ψ3}.
see (plg.uwaterloo.ca/˜gvcormac/treccorpus/) website.
Figure 8 shows the effect of N-best search and beam width
can also use DW-HMM where the state transition structure
for the overall accuracy and computation speed 5 of our
frequently changes since it is easy to maintain such changes
model. Overall accuracy is defined as the fraction of correctly
deobfuscated words and the computation speed is the numberof processed characters per second. As the beam width in-
Acknowledgment
creases, the overall accuracy rises a little only when the one-
Portion of this work was supported by Korea MIC under
best search with a beam width of five is used. However, it
ITRC support program supervised by the IITA (IITA-2005-
turns out that the two-best search significantly improves the
overall accuracy compared to the one-best search. The two-best search with a beam width of five shows the accuracy of96.9% and the processing speed of 2,136 characters/sec. The
References
experimental results show that two-best search with a beam
[A. L. Koerich and Suen, 2003] R. Sabourin A. L. Koerich
width of five is a desirable configuration maintaining the high
and C. Y. Suen. Large vocabulary off-line handwriting
accuracy and a low computational cost.
recognition: A survey. Pattern Analysis and Applications,6(2):97–121, 2003.
[Chow and Schwartz, 1989] Y. Chow and R. Schwartz. The
n-best algorithm: An efficient procedure for finding topn sentence hypotheses. Language Workshop, pages 199–202, Cape Cod, Mas-sachusetts, 1989.
One-best Computation SpeedTwo-best Overall Accuracy
[H. Murveit and Weintraub, 1993] V. Digalakis H. Murveit,
J. Butzberger and M. Weintraub. Large vocabulary dic-
tation using SRIs DECIPHER speech recognition system:Progressive search techniques. In Proc. ICASSP, pages
[Jelinek, 1999] F. Jelinek. Statistical Methods for Speech
[Lee and Kim, 1999] H. Lee and J. Kim. An HMM-based
threshold model approach for gesture recognition. IEEE
Figure 8: Effect of beam width and N-best search on the
Transactions on Pattern Analysis and Machine Intelli-gence, 21(10):961–973, October 1999.
[Lee and Ng, 2005] H. Lee and A. Y. Ng. Spam deobfusca-
tion using a hidden Markov model. In Proc. 2nd Confer-Conclusions ence on Email and Anti-Spam, Stanford University, CA,USA, 2005.
We have presented dynamically weighted hidden Markovmodel (DW-HMM) which dramatically reduced the number
[Lifchitz and Maire, 2000] A. Lifchitz and F. Maire. A fast
of states when a few sets of states had distinct emission prob-
lexically constrained Viterbi algorithm for on-line hand-
abilities. The states sharing the same emission probabilities
were grouped into super-states in DW-HMM. State transi-
shop on Frontiers in Handwriting Recognition (IWFHR-
tion probabilities in DW-HMM were determined by a weight
7), pages 313–322, Amsterdam, The Netherlands, Septem-
function which reflects the original state transitions main-
tained in the data structure Φ, rather than a large transition
[Rabiner, 1989] L. R. Rabiner. A tutorial on hidden Markov
probability matrix. We have shown how an HMM is con-
models and selected applications in speech recognition.
verted into its associated DW-HMM, retaining a few num-
Proceedings of the IEEE, 77(2):257–286, February 1989.
ber of super-sates. We have applied DW-HMM to the task of
[S. Procter and Mokhtarian, 2000] J. Illingworth S. Procter
spam deobfuscation, where the LT-HMM was replaced by the
and F. Mokhtarian. Cursive handwriting recognition using
DW-HMM. Our experimental results showed that it improves
hidden Markov models and a lexicon-driven level build-
the speed from 10 characters/sec to 207 characters/sec when a
Vision, Image, and Signal Processing,
straightforward Viterbi algorithm is applied. DW-HMM can
be applied to diverse areas where a highly structured HMMis used with a few distinct emission probabilities. For exam-ple, in speech and handwriting recognition areas, our modelmay be used to address the large vocabulary problems. We
5Computation speed is evaluated when DW-HMM is executed
on PC with Pentium 4 1.66GHz and 512MB RAM.
Nutritional and Animal Welfare Implications to Lameness Jan K. Shearer1 Department of Veterinary Diagnostic and Production Animal Medicine Abstract metalloproteinase enzymes and peripartumhormones, such as estrogen and relaxin. Theimplications of this are that in addition to feedingfermentive disorders occurring secondary to theand nutrition, dairy farmers must pay particularconsu
News Release MEDPOINTE ACQUIRES 100% OF ASTELIN® RIGHTS IN US AND CANADA Acquires Joint Venture Interest, Intellectual Property Not Already Owned SOMERSET, NJ – September 9, 2002 – In its third strategic move this year designed to capture the full commercial potential of its flagship product, the anti-allergy nasal spray ASTELIN®, MedPointe announced today it has acquired ex