Using Data Mining methodology for text retrieval
Institute of Computer Science, Warsaw University of Technology ul. Nowowiejska 15/19, 00-665 Warsaw, Poland
Abstract:
Futurologists and science-fiction writers have been predicting an information explosion for many
years, but only in recent decades have we been able to experience it ourselves. Thanks to the rapid development of the Internet,
printing technology and the widespread use of multimedia, most of us have almost instant access to tremendous
amounts of information. Unfortunately, for a long time these advances in data storage and distribution technology were not
accompanied by corresponding research in data retrieval technology. In short: we are
now being flooded with data, yet we are starving for knowledge.
This need has created an entirely new approach to data processing - data mining, which concentrates on
finding important trends and meta-information in huge amounts of raw data.
In this paper the main concepts of data mining and automatic knowledge discovery in databases are presented
(clustering, finding association rules, categorisation, statistical analysis). Special emphasis has been put on
possible applications of these methodologies in full text information retrieval and processing.
Keywords: data mining, text mining, web searching, natural language processing, machine learning
1. Introduction
More than five hundred years ago Johannes Gutenberg started an avalanche that has recently been given the name
"information explosion". Human knowledge has been cumulative since the invention of writing, but high
duplication costs meant that this knowledge remained only potentially available to most of humankind.
Gutenberg's printing press dramatically reduced the cost of spreading information. This gave a
potentially unlimited number of people access to accumulated knowledge.
That information revolution is, I daresay, in many respects similar to what is happening now. Widespread use of
electronic media - such as telecommunication networks and CD-ROMs - has slashed the cost of spreading information
almost to zero. Charges for Internet connections are minimal in most developed countries; in fact, web access
will probably be regarded in the near future almost as a human right, much like access to the telephone
network and one's own phone number. The user interfaces needed to download information have evolved into
such simple yet powerful tools - like contemporary web browsers - that almost everyone can use them. Most of
us have potential access to much more information than we can read (not to mention understand) in our entire
lifetimes.
The cost reduction process described above applies not only to information retrieval, but also to the creation
of data sources. In recent years publishing tools - DTP programs, web page creation and multimedia authoring tools
- have become simple to use and very cheap. This, together with the almost magical property of electronic media -
the zero cost of the medium itself1 - really gives everyone the possibility of contributing directly to humankind's
knowledge repositories.
The power to publish is of course not the only reason for the information explosion. The amount of valuable
knowledge that we discover about our world and ourselves really does grow with new facts every day. An increasing
number of researchers generate more and more research information, which in turn, thanks to a synergy effect,
accelerates the development of science.
In 1960 the number of articles published monthly in academic journals was roughly 5000, according to
[18]. In the early nineties this number reached 30000, and the growth trend seems to be even stronger towards the
end of the century.
Fig.1 Information explosion's last phase, according to [18] (axis label: Articles)
The above figures may be shocking for some of us, yet they do not represent the entire amount of data that is being created. For centuries, data sources containing pre-processed2 information were produced only by humans. Now an entirely new class of information has emerged: machine-generated data. This could be telemetry data, satellite images, watchdog program reports and so on. In the near future even some economic data - like stock market figures - will be created not by humans, but by autonomous agents operating on virtual markets. All this information can be made accessible to the entire community, either by putting it on the Web3, or by more traditional means of distribution such as CD-ROM. Although this data is mostly numerical, not textual, it can also be a source of important knowledge. The problem that emerges here is the huge amount of such information, which renders any kind of purely manual analysis impossible.
1 As opposed to traditional publishing, where medium costs - be it paper, audio tape, or photographic paper - represent a significant contribution to the publisher's expenditures.
2 That is, as opposed to information coming directly from our senses and therefore "created" by nature, such as air temperature.
Some researchers argue that the information explosion phenomenon is mostly a psychological effect. For centuries people have been concocting - and sometimes even writing down - an innumerable amount of stories, tales, and philosophical and scientific findings. Human inventiveness in producing data (putting aside its usefulness) seems to be more or less constant. Only the recent advent of telecommunication systems and the aforementioned dramatic drop in publication costs allowed people to suddenly realise how much information is actually being produced by humanity. This may be true, but as we are not going to revert to our previous unawareness of the amount of available information, we must learn to navigate this new ocean of data.
This can be a real barrier, as our navigational aids - library indices, search engines, software agents - are still very primitive and ineffective. Very often when we try to find a piece of information on the Web, search engines return thirty, fifty, even hundreds of "hits". We suspect that the information crucial to us is probably buried somewhere among the returned pages - but where exactly? In most cases it is impossible to read and assess all these pages, throwing out the irrelevant "hits".
The information explosion is related not only to the increasing number of publications available, but also to their declining overall quality. Publishing professionals - librarians, editors, even censors - have been filtering out incomplete or misleading information for most of our history. Now the publishing channel has been greatly shortened, so our knowledge repositories (mainly the Internet) contain a lot of garbage data. How can we distinguish it from valuable information?
The information explosion emerged as data storage and transfer technology reached maturity. Now researchers should concentrate more and more on devising new ways of dealing with such huge amounts of data, ways that would allow us to retrieve the necessary information effectively and to extract real knowledge from it. In short, we need methods for "distilling" the data.
2. Corporate perspective
One of the favourite buzzwords in the economics and management science communities is globalisation. Indeed,
the trend to create super corporations4 that span not only country boundaries, but also continents
has been significant during recent decades. Paraphrasing Marshall McLuhan's "global village" concept we can now
also speak about a "global market".
I do not want to discuss whether that economic evolution was a step in the right direction or not. One thing is
certain - it would not have been possible without computers and efficient communication networks. Moreover, the
globalisation trend has increased the amount of information processed by organisations, effectively reproducing the
"information explosion" phenomenon on a smaller scale.
In modern organisations - especially those operating in FMCG markets - information is "the blood of a
company". To survive, an organisation must constantly analyse all data that could influence its operations.
As bigger and bigger corporations were created, the amount of important data increased to the point
where existing analysis methods became insufficient. Imagine for example a worldwide trading company. All its
transactions should be recorded in a central database - at least for accounting purposes. Even if we register
only whole trucks of products, not individual packages, there could be hundreds of such transactions every
day in each country. That is perhaps still manageable, but for the entire corporation this means several thousand
transactions every day, as information from all countries needs to be gathered for strategic decision making.
Information buried within that data could be very precious. For example, some interesting trends could emerge
that should drive decisions on a corporate scale. This could be the number of products sold decreasing in some
countries, or the number of production faults increasing, and so on. Unfortunately the sheer amount of information
means that it is not possible for humans - even for an army of analysts - to analyse that raw data without some
kind of computer - or rather artificial intelligence - assistance.
The "processing power" needed for analysis is not the only problem for a company's data management division. Some important trends can only be observed over long periods of time - yet data storage space in transactional databases is very expensive, as they have to be quick and reliable. Historical data therefore has to be removed from them - such systems can very rarely cope with keeping information even about last year's events.
3 This process could even be automatic: we already have printers, cameras and hi-fi equipment incorporating web servers.
4 The name "transnationals" has already been conceived for such organisations.
The first step in solving this problem was the Data Warehouse concept and Online Analytical Processing (OLAP). Data Warehouses are very large databases, separated from transactional systems, that preserve information for a long time and in a uniform format - ready for analysis. Such capabilities come at a price - most Data Warehouse systems are quite slow, but as they do not have to operate in real time, this is not a great disadvantage.
A typical Data Warehouse does not only store huge amounts of historical data - for example all daily sales volumes since the company was established. An even more important property is that the information it contains has been purified: transactional errors have been removed, monetary units have been recalculated, and some basic aggregations have already been done. Historical data can therefore be safely removed from the company's transactional systems and uploaded to the Data Warehouse, which overcomes the storage cost problem.
However - what about the analysis? This is where OLAP comes to the rescue. OLAP tools allow analysts to apply many statistical and visualisation functions to data from the Data Warehouse in real time. These include a variety of presentation methods (like charts), aggregation functions and statistical methods such as regression. The most popular example of an OLAP tool is probably the data cube, which presents information in a three-dimensional space, with dimensions representing variables such as sales volume, time, geographic region, market segment etc. Such a cube also includes aggregate information - like total sales volume in one month - and moreover allows for "drilling down" into the data - that is, expanding aggregates to look at data from the lower level.
Data Warehouses and OLAP are currently "hot" topics among large corporations, as are "reengineering", "market orientation" and so on. While very effective in some applications, they still rely on human reasoning, and can therefore fail when confronted with really big volumes of data and the need to identify subtle trends. They are also not useful in real-time systems5 for obvious reasons. Fortunately the research needed to create the first Data Warehouses and OLAP tools laid a foundation for more automatic methods, which have been given the common name Data Mining.
3. Data mining - automatic knowledge discovery
In the previous sections I have tried to show that although the amount of information available to us is
increasing at an incredible pace, our ability to put that information to practical use - and to extract knowledge from
it - is still very limited.
This is well illustrated by a proverb popular in the data mining community that I shall cite here:
Although we have large sets of information at our disposal - we are still starving for knowledge.

But - what is knowledge? I will try to define that concept graphically:
5 Such as automatic credit card fraud detection, or telephone network monitoring.
Let us explain the above concepts using the example of a telephone directory. We would be dealing with such a directory in electronic format, so one of the lowest semantic levels would be bytes. These, together with ASCII code interpretation, represent strings of characters. For example, we can encounter a sequence of bytes that after decoding gives us the following sequence of characters: 6133560, which can definitely be regarded as some kind of data. This number could mean anything. It could represent the number of people in the world wearing red jackets, but because we know that we are dealing with a telephone directory, we can interpret that string as a telephone number, thereby jumping to a higher semantic level and obtaining a piece of information.
Now we can start analysing further relationships between objects within this telephone directory. That can lead us to the discovery that 6133560 is the telephone number of Piotr Gawrysiak, living in the Wawer district of Warsaw. Moreover, further analysis shows that all telephone numbers that begin with the digits 613 belong to people living in that district or in its proximity. This conclusion definitely has higher semantic importance than the raw data analysed, and therefore we would classify it as newly discovered knowledge.
We of course do not know whether this particular piece of knowledge is useful. Even if it is - we have no idea how to use it. Such decisions do not seem to be amenable to computerisation and in fact could be called "wisdom".
The above example is very crude, but it illustrates the point of Data Mining - using raw data that per se does not have any visible underlying meaning, we extracted important semantic information. That information enriched our knowledge about the external world.
Now we can define Data Mining more precisely: Data Mining (DM) is understood as a process of automatically extracting meaningful, useful, previously unknown and ultimately comprehensible information from large databases6.
Two words are crucial in the above definition: DM is an automatic process that - once tailored and started - can run without human intervention (as opposed to OLAP), and the databases that DM mines knowledge from are very large, and therefore not amenable to human analysis.
Data Mining is not a single method or algorithm - it is rather a collection of various tools and approaches sharing a common purpose - to "torture the data until they confess". The results of Data Mining analysis can be varied, ranging from discovering customer behaviour, to fraud detection and automatic market segmentation, to full-text document analysis.
4. Main methods of data mining
4.1 Association rules
Association rule finding is perhaps the most spectacular example of Data Mining, because it can quickly contribute to sales volume or profit when correctly implemented. Association models find items that occur together in a given event or record. They try to discover rules of the form: if an event includes object A, then with a certain probability7 object B is also part of that event.
Consider for example a large supermarket chain using association rule finding to analyse its databases. These databases contain information about transactions made by customers: articles bought, volume, transaction time etc. During the analysis the following hypothetical rules could be discovered:
If a male customer buys beer, then in 80% of cases he also buys potato chips
or
If a customer is paying at cash desks 1-5, then in 60% of cases he is not buying the daily newspaper.
Using these rules some strategic decisions could be made. The potato chips stand could be moved away from the beer stand, to force customers to visit more of the supermarket space. Special "beer plus chips" bundles could be introduced for customers' convenience. The newspaper stand could probably be installed near cash desks 1-5, and so on.
6 Definition has been taken from [13].
7 This probability value is called the confidence factor.
4.2 Classification & Clustering
The data that we are dealing with is very rarely homogeneous. In most cases it can be categorised using various criteria. For example, a company's customers can be divided into segments according to their weekly purchase volume; scientific texts can be divided by discipline, and further into full papers and abstracts, and so on. The characteristics of such segments and their number provide us with substantial information about the nature of our data. Moreover, even the sole fact that our data can be divided into different segments can sometimes be important.
In data mining we distinguish two types of such segmentation. The first is classification, a learning process aimed at determining a function that maps - in other words classifies - a data object into one, or several, predefined classes. Classification employs a set of pre-classified examples to develop a model that can classify other records. Clustering, on the other hand, also maps a data object into one of several categorical classes, but in this case the classes have to be determined from the examined data. The data clusters that emerge during the clustering process are defined by finding natural groupings of data items based on similarity metrics or probability density models.
In classical data mining, classification and clustering are used most often for purely marketing purposes, such as market or competitor segmentation. These methods have also proved to be very useful in text mining (see section 5).
4.3 Statistical analysis
Statistical analysis is usually regarded as the most traditional method used in data mining. Indeed, many statistical methods used to build data models were known and used many years before the name Data Mining was invented. We must however remember that these simple techniques cannot be used in Data Mining without modification, as they have to be applied to much larger data sets than is common in statistics. As a result, a whole new breed of advanced artificial intelligence methods, combining conventional statistical tools with neural networks, rough sets and genetic algorithms, has recently been created.
The most widely used simple statistical method is regression. Regression builds models based on existing values to forecast what other values, not present in the input data set, could be. There are many possible applications of regression, the most obvious being product demand forecasts or simulation of natural phenomena.
The three methods presented above are perhaps the most common tools used in data mining, mainly because they are especially good at dealing with numerical data. Extracting useful information from large amounts of text needs a slightly different approach, which does not mean that the experience gained from classical data mining research cannot be reused there.
5. Full text documents analysis
Full text document analysis is one of the most difficult problems in modern computer science, mainly because it
is closely related to natural language processing and understanding. Processing of human language has proved to
be a much more challenging task than it seemed in the early sixties or seventies, and is still - as a technology - in its
infancy.
Fortunately, a lot of problems related to the "information explosion" can be coped with using quite simple and
even crude approaches that do not need the computer system to understand the text being processed. Data
Mining methods - like clustering and categorisation - can be effective here, because they do not rely on external
information (such as extensive use of text semantics), and organise data using only relationships contained
within it.
Below I present a quick overview of the most important problems related to full text document retrieval, together
with examples of solutions using data mining-like approaches.
5.1 Problems
Among all problems related to full text analysis two currently seem to be the most important. These are: the poor quality of search engines - especially Internet search engines - and the lack of automatic text categorisation tools that would allow for quick assessment of large document collections.
5.1.1 Internet search engines
Almost everyone agrees that, given the current state of the art in Internet search engine technology, extracting information from the Web is an art in itself. Widely used search engines, such as [W2] and [W9], are plagued either by a lack of precision or by an inadequate recall rate. They tend to return thousands of answers even for specific queries, while from time to time failing to find appropriate documents although these exist and are accessible through the net.
Almost all commercial search engines use classical keyword-based methods for information retrieval. That means that they try to match a user-specified pattern (i.e. the query) against the texts of all documents in their database, returning those documents that contain terms from the query. Such methods are quite effective for well-controlled collections - such as bibliographic CD-ROMs or handcrafted scientific information repositories. Unfortunately the Internet was not designed; rather, it evolved, and therefore cannot be treated as a well-controlled collection. It contains a lot of garbage and redundant information and, what is maybe even more important, it does not have any kind of underlying semantic structure that could facilitate navigation.
Some of the above issues are the result of improper query construction. The questions directed to search engines are often too general (like "water sources" or "capitals"), and this produces millions of returned documents. The texts that the user was interested in are probably among them, but cannot be singled out, as human attention is limited - one hundred documents is generally regarded as the maximum amount of information that can still be useful in such situations. On the other hand, documents sometimes cannot be retrieved because the specified pattern was not matched exactly. This can be caused by inflection in some languages, or by confusion introduced by synonyms and complex idiom structures (the English word "mike" is often given as an example of this, as it can be a male name or a shortened form of the noun "microphone")8.
Most search engines also have very poor user interfaces. Computer-aided query construction systems are very rare, and the presentation of search results concentrates mostly on individual documents, not allowing for a more general overview of the retrieved data (which could be very important when the number of returned documents is huge).
The last group of problems is created by the nature of the information stored on the Internet. Search tools must deal not only with hypertext documents (in the form of WWW pages) but also with free-text repositories (message archives, e-books etc.), FTP and Usenet servers, and with many sources of non-textual information such as audio, video and interactive content.
5.1.2 Text categorisation
It would be much easier to cope with the "information explosion" and digest all the data that is flooding us if we could at least identify the main subjects of all documents at our disposal, and further organise these subjects into some kind of structure, preferably hierarchical. A classical approach to this problem would involve building a handcrafted index, and in fact such indices are in widespread use in the Internet [W5], [W6] and legal communities. Unfortunately they simply cannot cope with the number of new documents created every day. As a result they tend to become more and more incomplete, as the amount of available information increases faster than index creators can analyse and classify it. The need for automatic categorisation is clearly strong here.
5.2 Solutions
I will not try to present all research results related to text mining here, as this would be an impossible task.
Instead I will focus on innovative technologies developed especially with the Internet, or a similar hyperlinked
environment9, in mind. I also do not present here those new search methods that do not have much in
common with data mining. Such techniques include a new generation of web page presentation tools [W7],
autonomous software agents, and topic-oriented search engines [W3].
Practically all new document retrieval and analysis methods fall into one of two groups. The first includes
techniques that exploit practically only hyperlink information and are not much concerned with the actual text
contents. This approach is possible because hyperlinks are human-created entities, and therefore represent
an additional layer of semantic information, describing relations between document contents.
The second group comprises tools that deal only with raw text and perform mainly some kind of statistical or associative analysis. These methods do not rely on hyperlinks and therefore have a wider scope of possible applications.
8 More detailed analysis is available in [11] and [12].
9 These could also be scientific papers, with citations treated as hyperlinks between them.
5.2.3 Link-based methods

5.2.3.1 PageRank
As I already mentioned, the hyperlink structure of the Web provides a lot of semantic information that can be
used when assessing web page quality. The most obvious method, adopted from the field of bibliometrics, would
assign an authority index (or "weight") to a page based on the number of hyperlinks (in other words "citations")
pointing to this page. This method is simple and straightforward, but can easily be misled. Consider for
example the following network, representing part of the World Wide Web:
If a classical algorithm is used, Page A would be assigned a very low authority value, as opposed to pages B and C. However, we intuitively know that Page A could be important, because it is relatively easy to get there using hyperlinks from such different, not directly connected and widely cited parts of the Web as Page B and Page C.
The PageRank index was conceived as a solution to this problem. Its calculation simulates the behaviour of a so-called "random surfer". Such a hypothetical user starts browsing the Web from a randomly selected page and navigates it by clicking on hyperlinks, writing down the addresses of visited pages. After a certain amount of time (represented in this model as a number of "clicks") the user gets bored, and starts anew from a freshly selected random page. The PageRank index value is defined by the probability that our random surfer visits a given page. The exact definition of PageRank is given below:
$$PR(A) = (1 - d) + d \left( \frac{PR(T_1)}{C(T_1)} + \dots + \frac{PR(T_n)}{C(T_n)} \right)$$
where PR(A) - PageRank of page A; C(A) - number of outlinks from A; d - simulates random surfer path length;
T_1, ..., T_n - pages linking to A

Practical experiments have shown that in most cases a strong correlation exists between the PageRank index and
the human-assigned "authority score" of a page. In other words, the most valuable and trusted pages tend to have high
PageRank indices. This allows for easy categorisation of Web pages and can be especially effective in sorting
search engine results. A practical implementation of such a sorting method is currently being tested at Stanford University
[W1].
For a detailed description of PageRank calculations and its other possible applications see [3] and [4].
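The random-surfer idea can be illustrated with a minimal Python sketch. It assumes the commonly published form of the definition, PR(A) = (1 - d) + d * (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn)), and simply repeats that update a fixed number of times. The function name pagerank, the value d = 0.85, the iteration count and the small example network are all assumptions made for illustration; they are not taken from [3] or [4].

```python
def pagerank(links, d=0.85, iterations=50):
    """Iteratively apply PR(A) = (1 - d) + d * sum(PR(T) / C(T)).

    links maps each page to the list of pages it links to.
    d = 0.85 is an assumed value; the text does not specify it.
    """
    # Collect every page that appears as a source or as a target.
    pages = set(links)
    for targets in links.values():
        pages.update(targets)

    # C(T): number of distinct outlinks from each page.
    out_count = {p: len(set(links.get(p, ()))) for p in pages}

    # For each page A, the pages T that link to it.
    inbound = {p: [] for p in pages}
    for source, targets in links.items():
        for target in set(targets):
            inbound[target].append(source)

    pr = {p: 1.0 for p in pages}  # arbitrary starting values
    for _ in range(iterations):
        pr = {
            a: (1 - d) + d * sum(pr[t] / out_count[t] for t in inbound[a])
            for a in pages
        }
    return pr


# Hypothetical network: B and C are each cited by three pages,
# while A is cited only by B and C.
example_links = {
    "B": ["A"],
    "C": ["A"],
    "X1": ["B"], "X2": ["B"], "X3": ["B"],
    "Y1": ["C"], "Y2": ["C"], "Y3": ["C"],
}
ranks = pagerank(example_links)
for page, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(f"{page}: {score:.3f}")
```

In this hypothetical network page A receives fewer incoming links than B or C, yet it ends up with the highest score, because the links it does receive come from well-cited pages.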
5.2.3.2 HITS
Link structure has also been used for automatic identification of strongly interconnected web page clusters. Such
emergent groups of pages often share the same topic, and can be treated as a kind of "Web community".
The first approach to automatic isolation of such thematic Web collections was J. Kleinberg's HITS algorithm,
later developed into a full-blown information retrieval system called CLEVER.
One of Kleinberg's most important contributions was the concept of authority and hub pages. In classical
bibliometrics the number of citations contained in a document is rarely seen as a significant contribution to that
document's importance10.
10 The most important is the number of documents citing the source being analysed.
However, in the chaotic structure of the Internet such pages rich in outgoing hyperlinks act as important landmarks, providing tables of contents and "road directions" for surfers. Kleinberg calls such pages "hubs". Accordingly, the pages that contain mostly valuable information, and are therefore pointed to by many pages, are called "authorities".
In the HITS algorithm we repeatedly assign each page two weight values, an authority score and a hub score, defined as follows:
$$a(p) = \sum_{q \to p} h(q), \qquad h(p) = \sum_{p \to q} a(q)$$
where a(p) - authority score of page p; h(p) - hub score of page p; q -> p means that page q links to page p; the scores are normalised after every iteration.
Practical experiments show that after several iterations these weights stabilise, thanks to the mutually
reinforcing hub-authority influence. The pages with the highest authority or hub scores represent the most important sources
of knowledge and of related hierarchical information, and are closely interconnected.
Of course the above approach would not be very helpful in categorising the entire contents of the Web, but it is quite effective
with semantically restricted sets of pages. We can for example use it to quickly find the most important pages
within search engine results, filtering out the rubbish. This can lead to spectacular results with very general
queries (like "bicycles", "aviation" etc.), as the HITS algorithm tends to identify pages created by special-interest
groups or indexes of web resources on a given topic.
More details about the HITS and CLEVER algorithms can be found in [4], [8] and [9].
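The mutually reinforcing hub and authority weights can be sketched in a few lines of Python. This is only an illustration under stated assumptions: a page's authority score is taken as the sum of the hub scores of the pages linking to it, its hub score as the sum of the authority scores of the pages it links to, and both score vectors are rescaled to unit length after each round. The function name hits, the iteration count and the example link structure are hypothetical and are not taken from [4], [8] or [9].

```python
import math


def hits(links, iterations=20):
    """Return (authority, hub) score dictionaries for a link graph.

    links maps each page to the list of pages it links to.
    """
    # Collect every page that appears as a source or as a target.
    pages = set(links)
    for targets in links.values():
        pages.update(targets)

    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}

    for _ in range(iterations):
        # Authority: sum of hub scores of the pages pointing to the page.
        auth = {p: 0.0 for p in pages}
        for source, targets in links.items():
            for target in set(targets):
                auth[target] += hub[source]

        # Hub: sum of authority scores of the pages the page points to.
        hub = {p: sum(auth[t] for t in set(links.get(p, ()))) for p in pages}

        # Rescale both vectors to unit length so the values stay bounded
        # (an assumed normalisation step).
        a_norm = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        h_norm = math.sqrt(sum(v * v for v in hub.values())) or 1.0
        auth = {p: v / a_norm for p, v in auth.items()}
        hub = {p: v / h_norm for p, v in hub.items()}

    return auth, hub


# Hypothetical link structure: three link-rich pages pointing at two
# content pages.
example_links = {
    "hub1": ["auth1", "auth2"],
    "hub2": ["auth1", "auth2"],
    "hub3": ["auth1"],
}
authority, hub_score = hits(example_links)
print("best authority:", max(authority, key=authority.get))
print("best hub:", max(hub_score, key=hub_score.get))
```

In practice the input would be the pages returned for a query together with their neighbours, as the text describes for filtering search engine results; the tiny dictionary above only stands in for such a set.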
5.2.3.3 Automatic Resource Compilation
The above methodology has also been used successfully in automatic creation of Yahoo-like indices. P. Raghavan
has set up an experimental system called ARC at Stanford University. This system uses a HITS-like methodology,
together with anchor text11 analysis, to create authoritative indices for general topics.
The process used by ARC to create an index has three phases: a search and growth phase, a weighting phase and
an iteration and reporting phase. In the first phase a classical search engine is used to create a set of pages related to
a given topic. In the growth phase this base set is extended by incorporating pages pointed to by its
members. Then the slightly modified12 HITS-like weights are computed and the 15 pages with the strongest hub or
authority scores are returned in the output index.
The ARC results were evaluated by human volunteers and in most cases were highly rated. In fact some of
the machine-created indices were of better quality than the corresponding parts of the Yahoo or Infoseek hierarchies. ARC's
advantage over these systems seems even stronger when we realise that it can be run automatically - for
example on a daily basis - thus keeping index contents up to date, which is impossible for human-created indices.
A more detailed description together with exact evaluation results can be found in [15].
5.2.4 Content analysis methods

5.2.4.1 Document similarity based classification
To perform effective object classification we must be able to compute some kind of distance metric between
objects. Internet pages give us many more possibilities here than raw text documents, because when we try to
determine the level of similarity between hypertext documents we can use additional information such as the
number of hyperlinks, frequency of viewing, depth of children and so on. A very interesting attempt to use this
information in classification was made by Peter Pirolli and James Pitkow from the Xerox Palo Alto Research
Center.
They tried to assign documents from the Xerox intranet to one of the following classes: index, source index,
reference, destination, content, head, personal home page and organisational home page13. Their method
involved checking the strength of several page properties (such as size, or number of hyperlinks) and
using the following table to perform classification:
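11 This is the text describing the hyperlink in the source page.
12 The modification takes into account the aforementioned anchor text analysis.
13 The class descriptions are available in [6].

[Table: expected strength of page properties by node type. Recoverable column headings: Node type, Outlinks, Children, Similarity, Precision, frequency, children. Recoverable row labels: index, Reference, Destination, Org. Home Page, Pers. home. The + / - cell values are not recoverable; see [6].]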

+ means that the property should be strong for the given node type, - means that the property should be weak
This analysis, accompanied by simple statistical comparisons between pages and topology computations, resulted
in categorisation with quite high precision. Almost all content pages were classified correctly, and in
practically all other cases more than 50% of the analysed pages were assigned to the correct group. Note that
this approach does not deal at all with the semantic meaning of documents, which at first seems necessary to
distinguish, for example, a personal home page from a content page (i.e. a page that actually delivers information
rather than facilitating navigation).
A complete description of this method, together with an explanation of the node types and properties, is available in [6].
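The idea of matching pages against a table of strong and weak properties can be sketched as follows. This is only an illustration under stated assumptions: the `PROFILES` values below are invented for the example and are not the table from [6], the property names are hypothetical, and scoring by a signed sum of standardised property values is one simple choice, not necessarily the procedure used by the authors.

```python
from statistics import mean, pstdev

# Hypothetical profiles: +1 = property should be strong, -1 = should be weak.
# These values are invented for illustration; they are NOT the table from [6].
PROFILES = {
    "index": {"size": -1, "outlinks": +1},
    "content": {"size": +1, "outlinks": -1},
}

def standardise(pages):
    """Convert raw property values to z-scores computed across all pages."""
    props = sorted({prop for feats in pages.values() for prop in feats})
    stats = {}
    for prop in props:
        values = [feats[prop] for feats in pages.values()]
        # Guard against zero spread when every page has the same value.
        stats[prop] = (mean(values), pstdev(values) or 1.0)
    return {
        name: {prop: (feats[prop] - stats[prop][0]) / stats[prop][1] for prop in props}
        for name, feats in pages.items()
    }

def classify(pages, profiles):
    """Assign each page the node type whose +/- profile it matches best."""
    result = {}
    for name, feats in standardise(pages).items():
        scores = {
            node_type: sum(sign * feats[prop] for prop, sign in profile.items())
            for node_type, profile in profiles.items()
        }
        result[name] = max(scores, key=scores.get)
    return result

# Toy measurements (hypothetical): size in kilobytes, number of outgoing links.
pages = {
    "toc.html": {"size": 2, "outlinks": 40},
    "report.html": {"size": 90, "outlinks": 3},
}
labels = classify(pages, PROFILES)
print(labels)
```

Standardising the properties first means that "strong" and "weak" are judged relative to the other pages in the collection and not against fixed thresholds.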
5.2.4.2 Co-occurrence analysis
Another very promising method for computing the strength of relationships between lexical objects (not only documents,
but also smaller entities such as paragraphs or even words) is Latent Semantic Analysis (LSA). The primary assumption
of this method is that there exists some underlying structure in the pattern of word usage among documents,
which can be discovered using statistical methods.
Latent Semantic Analysis applies singular value decomposition to the co-occurrence matrix of lexical objects to discover
relationships between words (or phrases etc.) that appear in similar contexts. Consider for example¹⁴ the
following two sentences:
1) The U.S.S Nashville arrived in Colon harbour with 42 marines
2) With the warship in Colon harbour, the Colombian troops withdrew
Classical text analysis systems (that is, systems not equipped with a thesaurus) will not be aware of the semantic similarity of
the terms "U.S.S Nashville" and "warship". LSA, however, is able to capture this relationship, because
both terms appear in a similar context of words such as "Colon" and "harbour".
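The effect described above can be reproduced on a toy corpus. The sketch below is an illustration, not a full LSA pipeline: it assumes NumPy is available, uses raw term counts without any weighting, reduces the two example sentences to their content words, adds two unrelated sentences (invented here) so that there is a second topic, and keeps k = 2 latent dimensions. The helper names are hypothetical.

```python
import numpy as np

# Content words of the two example sentences plus two unrelated ones (hypothetical).
docs = [
    "nashville arrived colon harbour marines",
    "warship colon harbour troops withdrew",
    "stock market prices fell",
    "market traders sold stock",
]

vocab = sorted({word for doc in docs for word in doc.split()})
index = {word: i for i, word in enumerate(vocab)}

# Term-document matrix: A[i, j] = number of times term i occurs in document j.
A = np.zeros((len(vocab), len(docs)))
for j, doc in enumerate(docs):
    for word in doc.split():
        A[index[word], j] += 1.0

# Singular value decomposition; keep only the k strongest latent dimensions.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
term_vectors = U[:, :k] * s[:k]  # one k-dimensional vector per term

def cosine(a, b):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

def raw_similarity(w1, w2):
    """Similarity based on direct co-occurrence in the same documents."""
    return cosine(A[index[w1]], A[index[w2]])

def latent_similarity(w1, w2):
    """Similarity in the reduced space produced by the SVD."""
    return cosine(term_vectors[index[w1]], term_vectors[index[w2]])

print("raw    nashville/warship:", raw_similarity("nashville", "warship"))
print("latent nashville/warship:", latent_similarity("nashville", "warship"))
print("latent nashville/stock:  ", latent_similarity("nashville", "stock"))
```

"nashville" and "warship" never occur in the same document, so their raw similarity is zero, yet in the reduced space they end up close together because both co-occur with "colon" and "harbour". The number of retained dimensions k is a tuning choice; real collections typically need far more than two.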
Latent Semantic Analysis can have many applications in text retrieval. The most interesting seem to be:
- automatic thesauri building and query expansion: as LSA is able to grasp the semantic relationships between lexical units, it could be used to build a base frame for a thesaurus. Of course, very careful selection of the documents fed to the LSA algorithm is necessary to ensure high quality of such thesauri, which have to be rechecked by human experts afterwards anyway.
- automatic document grouping and topographic text visualisation: the similarity between documents calculated by LSA is a good distance measure that can be used in classical clustering algorithms to discover topic-focused groups in large collections of documents. Such techniques can be used, for example, in the analysis of corporate email archives or Internet newsgroups. Some companies (see [W4] and [W8]) are also experimenting with using these similarity metrics to construct three-dimensional maps representing the document space.
- finding semantically similar documents (a text mining application of case-based reasoning), such as matching abstracts to full papers or identifying examination fraud.
A more detailed description of LSA is presented in [1], [2] and [10].
6. Conclusion
Plato wrote in the Phaedrus that the art of writing may be lethal to our knowledge and wisdom, as human beings
will no longer rely on their memory and will therefore recall everything from potentially misleading external
sources. This prophecy has almost been fulfilled, as the amount of information available to us has increased
enormously while the methods of retrieving that information have remained relatively ineffective. The main source of
difficulties in text retrieval research was the natural language understanding barrier, which proved to be much more
challenging than anyone had envisaged.
Fortunately, it turned out that a lot of useful full-text analysis can be performed without any need to understand
the contents of the analysed text, in a way similar to emerging Data Mining techniques. The grouping and retrieval algorithms
that have been briefly presented in this paper extract the underlying semantic information directly from the
structure of the analysed documents. Their simplicity and robustness give us hope that a new generation of
information retrieval tools will appear in the near future.
7. Bibliography
[1] M. Berry, S. Dumais, T. Letsche: "Computational Methods for Intelligent Information Access", University
of Tennessee, 1995
[2] Peter W. Foltz: "Latent Semantic Analysis for Text Based Research", Behavior Research Methods,
Instruments and Computers 28, 1996
[3] S. Brin, L. Page: "Anatomy of a large-scale hypertextual Web search engine", WWW7 Conf. Proceedings,
1998
[4] D. Gibson, J. Kleinberg, P. Raghavan: "Inferring Web communities from link topology", Proceedings of the
9th ACM Conference on Hypertext and Hypermedia, 1998
[5] "Survey of the State of the Art in Human Language Technology", European Commission 1996
[6] Peter Pirolli, James Pitkow, Ramana Rao: "Silk from a Sow's Ear: Extracting Usable Structures from the
Web", Xerox Palo Alto Research Center, 1996
[7] Daniel P. Dabney: "The Curse of Thamus: An Analysis of Full Text Legal Document Retrieval", American
Association of Law Libraries, 1986
[8] S. Chakrabarti: "Experiments in topic distillation", IBM Almaden Research Center, 1998
[9] J. Kleinberg: "Authoritative Sources in a Hyperlinked Environment", ACM SIAM'98 proceedings, 1998
[10] T. Landauer, S. Dumais: "A solution to Plato's Problem: The Latent Semantic Analysis Theory of
Acquisition, Induction and Representation of Knowledge", Psychological Review 98, 1998
[11] Marcin Frelek, Piotr Gawrysiak, Henryk Rybiński: "A method of retrieval in flexion-based language text
databases", IIS'99 conference proceedings, 1999
[12] Marcin Frelek, Piotr Gawrysiak: "Przeszukiwanie tekstowych baz danych w języku polskim", Master
thesis, Politechnika Warszawska 1998
[13] W. Daszczuk, J. Mieścicki, M. Muraszkiewicz, H. Rybiński: "Towards Knowledge Discovery in Large
Database", Politechnika Warszawska 1999
[14] P. Gawrysiak: "Information retrieval and the Internet", PWII Information Systems Institute Seminars, 1999
[15] P. Raghavan: "Automatic Resource Compilation by Analysing Hyperlink Structure and Associated Text",
IBM Almaden Research Center, 1998
[16] C. Westphal, T. Blaxton: "Data Mining Solutions", Wiley Computer Publishing, 1998
[17] V. Dhar, R. Stein: "Seven methods for transforming corporate data into business intelligence", Prentice Hall,
1997
[18] "Designing the next generation of knowledge management centers", Bar-Ilan University, Department of
Library and Information Studies, 1999
Web resources:
[W1] www.google.com
[W2] www.altavista.com
[W3] www.ask.com
[W4] www.cartia.com
[W5] www.yahoo.com
[W6] www.polska.pl
[W7] www.inxight.com
[W8] www.newsmaps.com
[W9] www.infoseek.com

Source: http://bolek.ii.pw.edu.pl/~gawrysia/publ/DIBSarticle.pdf
