2. MOTIVATION
Modern search engines produce a large number of relevant results for a given search term [111]. Unfortunately, comparing the results for different search terms – for example, comparing the results for different categories in a taxonomy – is difficult. To compare the categories, users must consult hundreds of documents, and they soon suffer information overload: they quickly reach a futility point [11] beyond which they will not search. Current research aimed at categorizing documents or search results typically has as its goal narrowing down from a large collection of documents to progressively smaller collections, until a relevant document or set of documents is found [5, 15, 34, 47, 73, 74, 79, 80, 81, 136, 141, 143]. In contrast, the motivation for the current research is to provide aggregations of search results across multiple categories, to facilitate comparison of the categories themselves.

Our goal is to allow the user to evaluate and compare categories using aggregate statistics, rather than the conventional goal of allowing users to find a relevant document more rapidly. Effectively, we aim to create new summary information for categories by categorizing, ranking, and filtering millions of documents, rather than simply finding facts already explicitly encoded in single documents. The summary information allows a management user to determine the prevalence of a search phrase in millions of relevant documents for thousands of industries, products, places, time periods, or other categories. This allows the manager to rank industries, products, places, or topics, which, we suggest, could be helpful, for instance, in devising a market rollout strategy.

Of course, the analysis of documents from multiple categories is only useful if the comparison it yields is a fair reflection of reality. Our goal is to test the validity of the comparison generated from unstructured text, and to determine, using external evidence, whether the comparison is a sensible and viable proxy for other available comparative data on the categories.
3. RELATED WORK
Manual and automatic categorization of documents using classification schemes is a well-studied field of research [70]. In this section, we provide a survey of the existing literature on document classification, and we explain how our work relates to this field.
Basic approaches to document classification involve manual tagging by humans: for example, Yahoo! [82] and domain-specific subject gateways [134], both of which are quality-controlled document collections organized by subject. Numerous document classification approaches instead use automatic classification by machine. Automatic text categorization research is many decades old [8, 9], and dozens of text categorization and topic identification approaches have been suggested in the literature. Automatic text categorization approaches can broadly be broken down into unsupervised and supervised approaches:
3.1 Unsupervised automatic document classification
Unsupervised approaches for document classification include, inter alia, the use of Kohonen self-organizing feature maps or neural nets to cluster documents [26, 44, 45, 56], hierarchical clustering [23, 66, 133], Cluster Abstraction Models [59], and Suffix Tree Clustering based on shared phrases [18, 140]. These approaches are termed ‘unsupervised’ because the algorithms are not trained on a historic set of correctly classified documents but, rather, learn by finding similarities among documents and clustering them accordingly.
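To make the unsupervised pattern concrete, the following minimal sketch clusters a handful of invented documents by similarity alone, with no training labels. It assumes the scikit-learn library and uses hierarchical (agglomerative) clustering over TF-IDF vectors; it is an illustration of the general approach, not a reproduction of any of the cited systems.

```python
# Minimal sketch of unsupervised document clustering: documents are
# vectorized and grouped purely by similarity; no labels are used.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import AgglomerativeClustering

# Invented documents for illustration only.
docs = [
    "A guide to wind speeds and storm fronts",
    "Forecast: rain, wind, and falling temperatures",
    "Gone with the Wind remains a classic film",
    "The film festival screened restored classics",
]

# Represent each document as a TF-IDF vector.
X = TfidfVectorizer(stop_words="english").fit_transform(docs)

# Hierarchical (agglomerative) clustering into two groups; the
# algorithm sees only document similarity, never category labels.
labels = AgglomerativeClustering(n_clusters=2).fit_predict(X.toarray())

for doc, label in zip(docs, labels):
    print(label, doc)
```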
3.2 Supervised automatic document classification
In supervised approaches, a small number of documents are manually tagged with a limited set of category identifiers. A machine learning approach is then used to infer general patterns so that unseen documents can automatically be assigned to the categories. In Neural Network approaches [27, 32, 88, 94, 113, 129, 135], a training set of documents and their categories is used to adjust the arc weights in a graph of nodes composed of mathematical functions, until a desired classification accuracy is obtained. Bayesian approaches [9, 21, 61, 69, 78, 80, 83, 100, 124, 130] make use of conditional probabilities to determine the likelihood of a document being in a class given certain features of that document. In K-Nearest-Neighbor approaches [19, 49, 50, 58, 80, 84, 98], a document is compared to a set of pre-classified documents, and its class is taken as the majority class of the few most similar documents. With rule induction approaches [1, 25, 76, 89], a set of rules is induced from a historic set of documents pre-tagged with their categories, and the rules are then used to classify unseen documents. The induced rules have the form “if (x AND y AND z AND …) then (document is in category P)”. Decision trees [2, 19, 90] proceed similarly, except that individual rule clauses – that is, simple condition evaluations – are applied in turn, starting with a root clause and branching to different clauses depending on the outcome of each evaluation. With Support Vector Machine approaches [65, 85], vectors of n document features are defined, and the hyperplane that provides maximal separation in the n-dimensional space is used to divide the documents into separate categories. Genetic programming approaches [28, 126] start with preliminary candidate classification rules; the best candidate rules are identified and combined to produce successive generations of improved rules. Some automatic document classification approaches employ Rocchio Relevance Feedback [64, 80]; these are based on Rocchio’s algorithm [110], which attempts to find a query vector that maximizes the similarity of documents in the same class while minimizing their similarity to (i.e. maximizing the distance from) documents in the alternate class. Miscellaneous other automatic document classification approaches also exist: see, for example, [67, 75, 91, 92, 131]. For comparisons and surveys of text categorization approaches, see [47, 86, 117, 118, 137, 138, 139], and for a list of some software implementations see [43].
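As a concrete illustration of the supervised pattern, the following minimal sketch trains a multinomial naïve Bayes classifier (one of the Bayesian approaches above) on a tiny labelled set and then classifies an unseen document. It assumes the scikit-learn library; the training documents and labels are invented for illustration.

```python
# Minimal sketch of supervised classification: a naive Bayes model
# learns the likelihood of each category given document features from
# a small labelled set, then assigns categories to unseen documents.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Invented training data for illustration only.
train_docs = [
    "wind speeds and barometric pressure",   # weather
    "storm warnings and gale-force wind",    # weather
    "wind turbines and renewable energy",    # energy
    "solar panels and wind power output",    # energy
]
train_labels = ["weather", "weather", "energy", "energy"]

vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_docs)

model = MultinomialNB().fit(X_train, train_labels)

# Classify an unseen document.
X_new = vectorizer.transform(["offshore wind energy projects"])
print(model.predict(X_new))  # likely ['energy'], given the training data
```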
Automatic document classification approaches typically make their inferences using internal document content, external information, or user behaviors. Mechanisms that use document content [26] may make use of keywords, word strings / n-grams [54], linguistic phrases [106, 107, 108, 109], word clusters [124], or multi-word features [101] within the document. In contrast, some schemes use external information in hyperlinked documents that point to the target document [40]. Finally, some approaches make inferences from search logs and/or the behavioral patterns of users searching and accessing the document set [22], or take account of user preferences [131, 132] when categorizing documents. Modern search engines, such as Google, employ a variety of means – including internal document content [7], external information [7], and user behaviors [142] – to determine whether a document should be included in a particular result set, and how it should be ranked.
By employing modern search engines to construct our CDBs (see Section 4.2), the CDB construction approach presented in this paper benefits from the state-of-the-art composite classification algorithms that are effectively invoked by sending search terms for multiple categories to Google, Yahoo, or any other search engine.
Many practical software implementations of automatic classification systems exist. GERHARD (German Harvest Automated Retrieval and Directory) automatically classifies web content according to the Universal Decimal Classification (UDC) scheme [97]. WebDoc [130] uses a naïve Bayes approach to assign documents to the Library of Congress classification scheme. Scorpion automatically assigns Dewey Decimal Classifications or Library of Congress Classifications to documents [48, 120, 128]. Commercial text mining software, such as SAS Text Miner, clusters documents into categories by extracting key terms from each document. Commercial document content mining tools – often known as customer experience intelligence software products – such as Aubice, Clarabridge, IslandData, QL2 MarketVoice, and similar products, are targeted at monitoring market trends, customer feedback, or potential fraud by scouring, categorizing, and summarizing web documents. Larkey and Croft’s system [80] uses a variety of techniques to automatically assign ICD9 codes (i.e. diagnosis categories) to dictated inpatient discharge summaries. Multiple systems for automatically categorizing news stories exist [10, 57]. Though the foregoing list is not exhaustive, it demonstrates that automatic document classification systems have enjoyed broad application.
A number of authors [5, 15, 28, 34, 47, 73, 74, 79, 80, 81, 131, 136, 141, 143] have proposed techniques for clustering search results by category. The intention is typically to disambiguate different senses or uses of the search term, or simply to visualize common topics in the search results, to facilitate narrowing in on the correct document subset. For instance, “wind” results may be separated into categories like:
- “Weather” (containing documents like “A guide to wind speeds”)
- “Movies” (containing documents like “Gone with the Wind”)
- “Energy” (containing documents like “Wind Energy”)
Some approaches involve the post-processing of search results to organize them by category, supporting more rapid navigation to the desired results [5, 24, 140]. For example, commercial search engines such as Northern Light and Vivisimo.com / Clusty.com use post-processing to categorize search results. In other cases, the document base is organized by category before the search is conducted (e.g. [105]).
In the field of Faceted Search [51, 52, 60] (also known as View-Based Search [104], Exploratory Search [95, 114], or Hierarchical Search [39]), objects are catalogued by annotating them with attribute values or classifications, to facilitate navigation through or exploration of the catalogue. Objects in the catalogue may be documents, web pages, products (e.g. consumer products), books, artworks, images, people, houses, or other items. For example, the Flamenco faceted search engine demonstrates the use of faceted search to browse catalogues of recipes [51], architectural images [52], Nobel prizewinners, and other domain-specific catalogues, with the ability to drill down by category (e.g. recipes may be divided by ingredient, such as celery, onion, or potato). In a faceted search system, annotations may be assigned manually or automatically using information extraction techniques. The annotations, or tags, allow the objects in the catalogue to be progressively filtered, to assist the user in finding a particular type of item, and also allow the user to easily determine how many items of a certain type exist in the catalogue. For example, a catalogue of products may be annotated to allow the user to find all products of a certain brand within a certain price range. Faceted search techniques are very useful for navigating a catalogue of a certain type of item through progressive filtering, as the sketch below illustrates.
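The following minimal sketch, in plain Python with invented data, illustrates this progressive filtering and per-facet counting; the function names (filter_by, facet_counts) and the recipe catalogue are our own illustrative choices, not part of any cited system.

```python
# Minimal sketch of faceted filtering: each catalogued item carries
# attribute annotations, and the catalogue is progressively narrowed
# while per-facet counts are reported. (Illustrative data only.)
from collections import Counter

catalogue = [
    {"name": "Potato soup",   "ingredient": "potato", "course": "starter"},
    {"name": "Celery salad",  "ingredient": "celery", "course": "starter"},
    {"name": "Potato gratin", "ingredient": "potato", "course": "main"},
    {"name": "Onion tart",    "ingredient": "onion",  "course": "main"},
]

def filter_by(items, facet, value):
    """Narrow the catalogue to items annotated with facet == value."""
    return [item for item in items if item[facet] == value]

def facet_counts(items, facet):
    """Count how many items exist for each value of the given facet."""
    return Counter(item[facet] for item in items)

# A user drills down to 'main' courses, then sees ingredient counts.
mains = filter_by(catalogue, "course", "main")
print(facet_counts(mains, "ingredient"))  # Counter({'potato': 1, 'onion': 1})
```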
The mechanism we propose in this paper differs in a number of ways from the prior art. Firstly, we gather documents into categories by obtaining the top results, by relevance, for each category from any existing search tool (e.g. Google, Yahoo, etc.). The authors currently have a patent pending on this CDB construction and usage approach. Secondly, our method is focused on exploring the aggregate statistics obtained from documents organized in a classification scheme, not on searching for a particular document. Unlike faceted search techniques, the CDB described in this paper is focused on comparing categories using different metrics, rather than on finding a particular type of item by category or on counting the number of items in each category. In contrast to faceted search systems, where every category is populated with a certain type of item, CDBs populate categories with the most relevant documents for those categories.
The contrast between faceted search techniques and the CDBs we describe in this paper is best illustrated with an example. Consider a user wishing to compare US locations to see where fishing is popular. Faceted search engines that catalogue artworks, people, houses, or books are obviously not relevant, so the user looks for a faceted search engine that provides a catalogue of web pages or images. The user types ‘fishing’ as the search term, but the results are categorized according to classification schemes mandated by the faceted search engine. Figure 1 shows the results produced, for example, by searching for ‘fishing’ on clusty.com. Figure 2 shows the results of the same search performed using an alternative search results visualization engine, Kartoo.com. As can be seen from both figures – which are indicative of the output produced by search engines able to cluster or categorize search results – information about fishing locations is haphazard, and locations cannot be compared. The user attempts another search, ‘US locations fishing’. Figure 3 shows the results produced by clusty.com. Again, the result categories do not allow US locations to be compared to see where fishing is popular.

What the user really needs is, first, a dynamically constructed catalogue of US locations, and then the ability to compare these locations for fishing. This is exactly what a CDB provides: the ability to first generate an appropriate catalogue, and then to compare categories using an additional search term. In our approach, the user begins by obtaining a classification scheme of US locations from the United States Geographic Names Service, and then feeds the classification scheme to the CDB. The CDB populates each category (US location) with data, and the user is then able to run a search term (‘fishing’) against the categories. Figure 4 shows the results produced by the CDB. The US locations have been grouped by state, and the states have then been ranked by the number of hits per state. The user can drill down to see the number of hits per US location by clicking on a US state. The report allows for easy comparison of US locations by fishing popularity. In short, CDBs allow the user to catalogue a type of item of their own choosing (by feeding a classification scheme to the CDB), whereas faceted search engines catalogue a certain type of item pre-chosen by the developer of the catalogue. The CDB thus presents a new way of creating faceted search engines: by using the taxonomies themselves in creating the CDBs, we introduce a level of flexibility and scope of application that is materially beyond that of present systems.
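The workflow just described can be sketched as follows. The sketch is illustrative only: search_top_n stands in for whichever search engine API is used in practice (real APIs and result formats differ), and build_cdb and rank_categories are our own illustrative names rather than components of the actual system.

```python
# Minimal sketch of CDB construction and querying, as described above.

def search_top_n(category_term, n):
    """Hypothetical: return the text of the n most relevant documents
    for a category's search term, from any existing search engine."""
    raise NotImplementedError  # placeholder for a real search API call

def build_cdb(taxonomy, n=100):
    """Populate each category with its n most relevant documents."""
    return {category: search_top_n(category, n) for category in taxonomy}

def rank_categories(cdb, phrase):
    """Rank categories by the number of retrieved documents ('hits')
    that contain the user's search phrase."""
    hits = {
        category: sum(phrase.lower() in doc.lower() for doc in docs)
        for category, docs in cdb.items()
    }
    return sorted(hits.items(), key=lambda kv: kv[1], reverse=True)

# Usage (with a taxonomy of US locations, per the example above):
# cdb = build_cdb(["Boise, Idaho", "Jackson, Mississippi", ...], n=100)
# for location, hit_count in rank_categories(cdb, "fishing"):
#     print(location, hit_count)
```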
As should be evident from the discussion above, CDB results are aggregates over various categories, and are not record-oriented. For instance, for the term “wind power”, a traditional search engine would lead the user to a document such as “Wind power for the home”. In contrast, a CDB exploration would suggest that “wind power” is a greater concern in Idaho than in Mississippi, as it is more prevalent in documents in the Idaho category than in documents in the Mississippi category. For a thorough comparison of aggregate versus record-oriented document search results, see [31]. In this paper we aim to show that useful category-comparative aggregate information, heretofore unobtainable, can be delivered by an appropriately constructed CDB.
Our work can, in some sense, be seen as partially analogous to work in the field of Online Analytic Processing (OLAP) [33, 42]. In OLAP, structured data is organized by category and aggregated: for instance, to find total sales by product, customers by city, or complaints by franchise outlet. OLAP technologies are widely deployed in practice – examples include Microsoft Excel PivotTables and PivotCharts, Microsoft Access CrossTab reports, SAS JMP Overlay Charts, and similar aggregate charts in special purpose reporting tools like Crystal Reports, Cognos, Oracle OLAP, Hyperion Analyzer (now part of Oracle), Microsoft Data Analyzer, and other products. Rather than providing tabular and graphical aggregates of structured (tabular) data, CDBs produce category-by-category aggregate statistics from document collections.

Producing aggregate statistics from unstructured text data is not new. For example, a pioneering system by Spangler et al. [6, 53, 121, 122] computes aggregate statistics for various categories and features of documents in a document collection, in order to provide exploratory insights into the constitution of the collection and to infer relationships between categories found within it. Their process, which is implemented in IBM Unstructured Information Modeler, involves first finding a subset of documents to analyze (e.g. problem tickets relating to a particular brand). Summary statistics – word and phrase occurrence counts – are then computed for each document in the collection. The documents are clustered using a variation of the k-means algorithm, and one or more user-editable taxonomies are created which categorize the document set. Finally, the relationships between various categories are analyzed to determine correlations – for example, to determine whether a particular call center representative has substantially more unresolved tickets than their colleagues, or whether particular product complaints are associated with a particular reseller.

The process used to construct CDBs is markedly different. CDBs are constructed by populating each category in a user-supplied taxonomy with the n most relevant documents for that category. The categories can then be ranked by hit-count or term frequency for a user-specified search phrase, using only the most highly relevant documents for each category as source data. Restricting each category to its most relevant documents allows the user to compare categories against each other using an arbitrary comparison metric of their choosing (e.g. the relative frequency of a particular word or phrase across categories).
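As a concrete illustration of one such metric, the minimal sketch below computes a relative term frequency per category. It is illustrative only, reusing the hypothetical cdb mapping from the earlier sketch, and is not the paper’s actual implementation.

```python
# Minimal sketch of one alternative comparison metric mentioned above:
# relative term frequency, i.e. occurrences of a phrase per thousand
# words across a category's retrieved documents. (Illustrative only.)

def relative_frequency(docs, phrase):
    """Phrase occurrences per 1,000 words over a category's documents."""
    total_words = sum(len(doc.split()) for doc in docs)
    occurrences = sum(doc.lower().count(phrase.lower()) for doc in docs)
    return 1000.0 * occurrences / total_words if total_words else 0.0

# Categories can then be ranked by this metric instead of hit count:
# scores = {cat: relative_frequency(docs, "wind power")
#           for cat, docs in cdb.items()}
# ranking = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```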