4. PROCESS
The process of constructing and using a Categorized Document Base can be separated into the following phases:
- Creating a classification scheme
- Populating the classification scheme with documents
- Creating an aggregate search term (comparison metric)
- Determining the aggregate results for each category
- Integrating the aggregate results per category with external data for the categories
- Collaborative annotation of CDB results, if users in a team wish to share with each other their comments on particular categories of interest they have found
We discuss each of these phases in depth in the following subsections.
4.1 Creating a classification scheme
Classification schemes are often referred to as hierarchies, taxonomies, or coding schemes; we use these terms interchangeably in this paper. A classification scheme can be created by importing any existing taxonomy, or by defining a new taxonomy13. Table 1 provides examples of some popular classification schemes. The administrator of the CDB may use one of these classification schemes, or may import or create a proprietary classification scheme, such as a product hierarchy from a product catalogue. Using a standard classification scheme is helpful because other data providers typically publish statistics coded according to such schemes, and these statistics can easily be integrated with category-specific aggregates from the CDB (see Section 4.5). For example, profit margins by industry, coded according to the NAICS classification, are published by the United States Internal Revenue Service, and could be integrated with aggregate document statistics for each industry from the CDB. Popular taxonomies can be organized by type (e.g. product hierarchies, industry hierarchies, place hierarchies, activity hierarchies, time hierarchies, …) to aid in targeted exploration (see Table 1).
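For concreteness, a classification scheme can be held in memory as a simple parent-child structure. The sketch below is illustrative only: it assumes the taxonomy is available as a flat CSV export with code, name, and parent_code columns, which is an assumption of ours rather than a format prescribed by the CDB.

import csv
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Category:
    code: str                   # e.g. a NAICS code such as "722511"
    name: str                   # e.g. "Full-Service Restaurants"
    parent_code: Optional[str]  # None for top-level categories
    children: List["Category"] = field(default_factory=list)

def load_classification_scheme(path):
    """Import a taxonomy from a CSV file with code,name,parent_code columns."""
    categories = {}
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            categories[row["code"]] = Category(
                code=row["code"],
                name=row["name"],
                parent_code=row["parent_code"] or None,
            )
    # Link children to parents so the hierarchy can be traversed top-down.
    for cat in categories.values():
        if cat.parent_code and cat.parent_code in categories:
            categories[cat.parent_code].children.append(cat)
    return categories

The same structure accommodates an imported standard taxonomy or a proprietary one (for example, a product hierarchy exported from a product catalogue).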
4.2 Populating the classification scheme with documents
The Categorized Document Base is created by populating the classification scheme with relevant documents. Our method iterates through every category in the classification scheme and provides the category name, or other identifying features of the category14, as a search term to a standard Information Retrieval tool or search engine (such as Yahoo, Google, Lycos, Medline, Lexis/Nexis, etc.). The search engine returns a list of matching documents, and the n most relevant documents it returns are stored under that category in the CDB. The full text of each document is stored. Note that a document may be assigned to more than one category if it is relevant to more than one category. However, search engine rankings usually ensure that the documents most relevant to a given category do not appear in the top results for any sibling category. Figure 5 shows schematically how categories in a taxonomy are populated with documents to create the CDB: in this example, a taxonomy of markets (industries) is populated with the most relevant documents for each industry.
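The population step itself is a simple loop over the categories. The sketch below is a minimal illustration, assuming a generic search(query, n) callable that wraps whichever search engine is chosen and a fetch_full_text(url) helper that retrieves the page text; both names are ours, not part of any particular engine's API.

def populate_cdb(categories, search, fetch_full_text, n=10):
    """Populate each category with the full text of its top-n search hits.

    `categories` maps a category identifier to its query string (e.g. the
    category name); `search(query, n)` is assumed to return a ranked list of
    result URLs, and `fetch_full_text(url)` to return the page text.
    """
    cdb = {}
    for cat_id, query in categories.items():
        urls = search(query, n)  # the n most relevant hits for this category
        cdb[cat_id] = [{"url": url, "text": fetch_full_text(url)} for url in urls]
    # Note: a document naturally ends up in more than one category if the same
    # URL appears in the top-n results for several categories.
    return cdb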
The reader may be concerned that web-pages are long, and that information on multiple categories may co-occur within the same page, leading to the page being classified under several categories and requiring the use of short snippets of text from each page so that the text in a category's document set relates only to that category. While this concern is reasonable, note that, by design, our process populates each category only with the few most relevant documents for that category returned by a search engine. Each document in a category should be wholly (or at least predominantly) relevant to that category, provided the category name is unambiguous and sufficient content exists on the internet for the search engine to easily retrieve highly relevant documents for it. Consider an algorithm like Google’s PageRank [7] retrieving documents for the category “Blacksburg, Virginia”. Documents that have content on Blacksburg, Virginia alone, and whose inbound hyperlinks refer to Blacksburg, Virginia alone, will rank higher than documents that also discuss Philadelphia, Pennsylvania or have inbound hyperlinks from pages about Philadelphia, Pennsylvania. Thus, by using only the top-ranked search engine hits for each category, our process typically ensures that each document for a category refers predominantly to that single category, and the use of snippets is therefore ordinarily not required.
Having said that, we must qualify our explanation by pointing out that a well-constituted CDB requires that categories be unambiguous and that sufficient documents exist even for obscure categories. When either the category name is ambiguous or insufficient documents exist for an obscure category, the documents stored for that category can contain irrelevant information, or information for multiple categories. Consider the United States Patent and Trademark Office (USPTO) patent classification scheme, which contains approximately 155,000 classes and subclasses. Our analysis of the USPTO scheme revealed approximately 2,650 ambiguous categories: sub-categories of the same name occurring under different parent categories. Major offenders were “CLASS-RELATED FOREIGN DOCUMENTS”, “MISCELLANEOUS”, “PLURAL”, “ADJUSTABLE”, “PROCESSES”, and dozens of others, each of which appeared under multiple parent categories. Clearly, these category names would have to be disambiguated in order to intelligently populate each category only with documents relevant to the correct sub-category. For obscure categories, the content available on the specific category is sufficiently sparse that only a few documents mention the category, and those that do also mention other sibling categories. In the USPTO scheme, which contains dozens of obscure categories (e.g. “Using mechanically actuated vibrators with pick-up means”), a general-purpose search engine such as Yahoo is unable to find documents highly relevant to the specific category and yields, instead, pages containing excerpts from the USPTO scheme itself, which mention dozens of other USPTO categories and are not particularly relevant to any one of them. We conclude that using the USPTO scheme with a general-purpose search engine such as Yahoo will not yield a well-constituted CDB: the USPTO scheme does not supply unambiguous category descriptors, and a general-purpose search engine cannot find sufficient relevant documents for obscure USPTO categories.
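The ambiguity check we applied to the USPTO scheme amounts to grouping sub-category names and flagging those that occur under more than one parent. A minimal sketch, assuming the scheme has been flattened into (parent_name, subcategory_name) pairs; the example parent names below are invented for illustration.

from collections import defaultdict

def find_ambiguous_subcategories(pairs):
    """Return sub-category names that occur under more than one parent.

    `pairs` is an iterable of (parent_name, subcategory_name) tuples.
    """
    parents_by_name = defaultdict(set)
    for parent, name in pairs:
        parents_by_name[name].add(parent)
    return {name: parents
            for name, parents in parents_by_name.items()
            if len(parents) > 1}

# Illustrative (invented) parents: "MISCELLANEOUS" is flagged, "Lures" is not.
scheme = [("Fishing", "MISCELLANEOUS"), ("Weaving", "MISCELLANEOUS"), ("Fishing", "Lures")]
print(find_ambiguous_subcategories(scheme))
# -> {'MISCELLANEOUS': {'Fishing', 'Weaving'}} (set ordering may vary)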
Turning to another classification scheme, the United States General Services Administration Geographic Locator Codes (US GSA GLC) list of locations, we found the problems encountered above with the USPTO scheme to be surmountable. While many ambiguous category names exist in the US GSA GLC list of locations – for example, there are cities named “Philadelphia” in PA, IL, MO, MS, NY, and TN – this was easily rectified by simply appending the state name to the location name in order to disambiguate. Furthermore, a small sampling of various obscure towns indicated that sufficient pages exist on the internet (e.g. local tourism pages for those towns) to yield highly relevant results when we requested the top few hits for the specific town and state on a general-purpose search engine such as Yahoo. We therefore proceeded with the use of the location classification scheme as a viable taxonomy on which to conduct further experiments to more fully assess CDB usefulness and robustness. As mentioned, we were careful to always include the state name with the city or town name, when populating each category, to reduce ambiguity.
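In code, this disambiguation is simply a matter of building the query from both the place name and the state name before it is submitted to the search engine; the helper below is a trivial illustration of that convention.

def location_query(city, state):
    """Build an unambiguous search phrase for a place category."""
    # Quoting the phrase keeps the city and state together in most engines.
    return f'"{city}, {state}"'

print(location_query("Philadelphia", "Mississippi"))  # "Philadelphia, Mississippi"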
As we have seen, in the absence of sufficient content relevant to a specific category (as in the USPTO case described above), general-purpose search engines cannot populate each category with wholly relevant documents. Furthermore, it can in certain circumstances be a challenge (again, as in the USPTO case) to provide unambiguous search phrases for each category, especially when tens of thousands of categories need to be populated. This can result in irrelevant documents being included in the document set for a category. A human user might consider removing documents from categories, editing the documents, or reassigning them to different categories, if they believe a document was incorrectly assigned, is from an unreliable source, is inaccurate, or is otherwise inappropriate to that category. For example, a user may remove document 6780.html from the “Chicago” sub-category of “Illinois” upon noticing that the document pertains to the movie “Chicago” rather than to the place. In another example, a user may notice that document 13581.html has been assigned to both the “Tilapia” and “Trout” categories. On further investigation, the user realizes that the document is merely an alphabetic grouping of fish, and therefore contains content relevant to more than one type of fish. The user could decide to remove the document from the CDB and replace it with two edited versions, one that refers only to tilapia and another that refers only to trout, so that the trout-related content of the document does not affect the “Tilapia” category, and vice versa. Similarly, the user may strip advertisements or sponsor links from a web-page document, if the user believes these items introduce content not relevant to the category.
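If such manual curation is performed at all, the operations described above (removing a document from a category, or replacing a multi-topic document with per-category edited versions) could be expressed as simple functions over the category-to-documents mapping built earlier. The sketch below is illustrative; the function names and document representation are our own assumptions, not part of the CDB specification.

def remove_document(cdb, category, url):
    """Drop one document from one category, e.g. a page about the movie
    'Chicago' that was filed under the place 'Chicago'."""
    cdb[category] = [doc for doc in cdb[category] if doc["url"] != url]

def split_document(cdb, url, edited_versions):
    """Replace a multi-topic document with per-category edited versions.

    `edited_versions` maps category -> edited text, e.g. a tilapia-only
    excerpt for "Tilapia" and a trout-only excerpt for "Trout".
    """
    # Remove the original document wherever it appears.
    for category in cdb:
        cdb[category] = [doc for doc in cdb[category] if doc["url"] != url]
    # Add each edited version only to its own category.
    for category, text in edited_versions.items():
        cdb.setdefault(category, []).append({"url": url + "#edited", "text": text})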
While human editing of categories is permissible, it has drawbacks. Firstly, it is extremely time-intensive, particularly when thousands of categories must be edited. Secondly, if it is not done in a systematic and consistent fashion, different biases can be introduced into different categories of the document set. We therefore do not advise human editing of categories. Rather, we recommend that any taxonomy be assessed before it is fed to the CDB, and that taxonomies with ambiguous or obscure categories be avoided, so that manual human editing is never needed. In our experiments (Section 6), we avoided the need for human editing by using a classification scheme that could be unambiguously populated to a satisfactory extent without any manual intervention. Specifically, by automatically appending the state name to the location name for every location, we could ensure that, for instance, results pertained to the location “Chicago, Illinois” rather than to the movie “Chicago”. Further, for the reasons described above, we could be confident that the few top-ranking documents for each category were highly relevant to that category alone, did not mention sibling categories, and therefore needed to be neither manually excerpted (‘snippeted’) nor edited for relevance.