Creating and exploiting aggregate information from text using a novel Categorized Document Base
Document classification has long been a popular field of research in information retrieval. Classification of documents is typically used to help end users locate relevant documents faster. In this paper, we present a method for constructing a Categorized Document Base (CDB) and assess whether categorizing a document collection in this manner can be helpful for another purpose: understanding and comparing the categories in the document collection, through the use of aggregate statistics from the documents in each category. Our experimental results indicate that, for a convenience sample of properties for which we were able to obtain independent assessments of their values, there is a relatively clear association, in non-population-sensitive industries, between the results of the CDB-based methods and the independent data. This is evidence that CDB-based methods could be useful for gauging non-population-sensitive properties for which independent data does not exist.
Text Mining, Information Retrieval, Categorization
Modern search engines have demonstrated their ability to retrieve and rank documents relevant to a given search term and are well suited to finding documents relating to different topics. However, what if an end user would like to compare topics instead of merely retrieving a document? It is relatively straightforward to take a categorization scheme and pass each category (topic) in the scheme as a search term to a search engine, in order to gather a certain number of relevant documents for hundreds or even thousands of categories. Once the few most relevant documents for each category have been obtained, categories (topics) can be compared by computing aggregates on the documents for each category, such as hit counts or relative frequencies for a specific word or phrase in the documents for each category. But is this information sensible? Will it correlate with what is known about those categories (topics) from other data sources? Let us take a specific example: can we reliably determine, from text documents for multiple locations gathered and analyzed in the fashion described above, how those locations compare with regard to, for instance, environmental characteristics (sunshine, warmth, natural resources like oil and coal, existence of mountains, forests, or fishing) or market characteristics (demand for different products and services across those locations)?
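To make the aggregate step concrete, the following is a minimal sketch of how a relative frequency for one phrase might be computed per category once the documents have been gathered. It is an illustration under stated assumptions, not the implementation used in this paper: the function names, the per-1,000-token normalization, the simple regex tokenization, and the toy documents for "Location A" and "Location B" are all hypothetical, and the retrieval step (querying a search engine per category) is assumed to have already happened.

```python
import re


def relative_frequency(documents, phrase):
    """Occurrences of `phrase` per 1,000 word tokens across `documents`.

    Assumption: case-insensitive whole-word matching and a simple
    regex tokenizer; a real system may tokenize differently.
    """
    pattern = re.compile(r"\b" + re.escape(phrase.lower()) + r"\b")
    hits = 0
    tokens = 0
    for text in documents:
        lowered = text.lower()
        hits += len(pattern.findall(lowered))
        tokens += len(re.findall(r"\w+", lowered))
    return 1000.0 * hits / tokens if tokens else 0.0


def compare_categories(docs_by_category, phrase):
    """Rank categories by the relative frequency of `phrase`, highest first."""
    scores = {
        category: relative_frequency(docs, phrase)
        for category, docs in docs_by_category.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Invented toy documents, standing in for the top search results per category.
toy_docs = {
    "Location A": [
        "Fishing boats leave the harbor daily.",
        "Salmon fishing drives the local economy.",
    ],
    "Location B": [
        "The desert casinos attract visitors.",
        "Mining and tourism dominate.",
    ],
}

ranking = compare_categories(toy_docs, "fishing")
for category, score in ranking:
    print(f"{category}: {score:.1f} per 1,000 tokens")
```

Normalizing by token count rather than using raw hit counts is one possible design choice; it keeps categories with longer documents from dominating the comparison.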
Our goal in this paper is to determine if the analysis obtained from unstructured text documents using the above-described means is comparable in quality to data for those locations obtained from conventional structured sources, such as traditional environmental surveys or market statistics. This paper posits that it is indeed possible to obtain comparative information on locations by employing search engines in the simple but unusual manner described above. Specifically, our objective is to test the following hypothesis:
Hypothesis H1: aggregate information on various natural and market phenomena, extracted from text documents for United States locations in the fashion described above, can provide better-than-random rankings of those locations based on the environmental or market characteristics of those locations.
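One way to read "better-than-random rankings" operationally is as a positive rank correlation between the text-derived aggregate for each location and an independent reference value. The sketch below computes Spearman's rank correlation in plain Python. The choice of Spearman's coefficient is an assumption made here for illustration and is not stated in this section, and all function names are hypothetical.

```python
def average_ranks(values):
    """Return 1-based ranks for `values`, giving tied values their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    return pearson(average_ranks(x), average_ranks(y))
```

Here `x` would hold the CDB-derived aggregate for each location and `y` the independent measurement for the same locations, in the same order. A coefficient near zero would be consistent with a random ranking; judging whether a positive value is statistically significant requires a separate test that this sketch does not include.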
As shown in the Experimental Results (Section 6), we find some support for this hypothesis in industries that are not sensitive to population: in these industries, the Categorized Document Base (CDB) is able to fruitfully compare locations that diverge widely on some environmental or market characteristic. However, we find that the hypothesis is not supported for population-sensitive industries, where the CDB is not able to distinguish subtle per-capita differences in market characteristics by location.
This paper, then, describes a novel method for creating and exploiting aggregate information from CDBs, and assesses the quality of the information so produced. In a CDB, document sets are organized by topic. CDBs occupy a useful middle ground between highly structured (e.g. tabular) data, and completely unstructured (textual) data, by introducing some order and organization into the document set. We aim to show that aggregate information can be derived from categorized document sets, and that this information is valuable.
We begin with a motivation for this research, and a discussion of related work. We show how our CDB approach differs from prior work in document classification: in particular, the prior art focuses on the use of categorization to narrow down search results, whereas our CDB approach is targeted at aggregate analyses. Next, we describe the process for creating and querying a CDB: from creation and population of the classification scheme, and construction of the search term, to tabulation of the results for the sub- and super-categories, and integration with other external data for the categories. We document the technical implementation of our CDB prototype, including the system architecture and data structure. To validate our approach for the creation and analysis of categorized document collections, we conduct experiments in a variety of industries. In these experiments, aggregate information is generated from the CDB, categorized by location. Our experiments reveal that the CDB can act as a rough instrument for discerning differences between locations: for industries where the differences between locations are substantial, the aggregates computed by the CDB for each location are statistically correlated with quantitative measures, drawn from traditional structured data sources, of related natural and commercial phenomena in those industries. In contrast, for industries where the differences between locations are subtle, the CDB does not appear to be able to discern them. Following our experiments, we discuss a number of useful applications of the CDB, and conclude with limitations of the approach, areas for future work, and a summary of our process and results.