4.3 Creating an aggregate search term (comparison metric)


The Categorized Document Base is queried by specifying a search term (which can be thought of as a ‘comparison metric’), and optionally some additional parameters. The search term typically consists of one or more words, and aggregate statistics computed for the search term in each category of documents allow the user to compare categories. In a simple case, the user asks the CDB to compile aggregate statistics for each category that show the relative frequency of the search term in each category. Composite search terms can be created so that counts tally hits on any of the terms, on all of the terms, or only on the exact phrase (i.e. the terms in that specific sequence).
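By way of illustration, a minimal sketch of this kind of counting is given below (in Python, using only the standard library). The `tokenize` and `count_hits` helpers and the naive whole-word tokenization are our own simplifying assumptions, not the prototype's actual implementation.

```python
import re
from typing import List

def tokenize(text: str) -> List[str]:
    """Lower-case, whole-word tokenization; a simplification of real IR tokenizers."""
    return re.findall(r"[a-z0-9]+", text.lower())

def count_hits(doc_text: str, terms: List[str], mode: str = "any") -> int:
    """Count hits for a composite search term in one document.

    mode = "any"    -> tally occurrences of any of the terms
    mode = "all"    -> tally occurrences only if every term appears in the document
    mode = "phrase" -> tally occurrences of the terms as an exact, contiguous phrase
    """
    tokens = tokenize(doc_text)
    terms = [t.lower() for t in terms]
    if mode == "any":
        return sum(tokens.count(t) for t in terms)
    if mode == "all":
        return sum(tokens.count(t) for t in terms) if all(t in tokens for t in terms) else 0
    if mode == "phrase":
        n = len(terms)
        return sum(1 for i in range(len(tokens) - n + 1) if tokens[i:i + n] == terms)
    raise ValueError(f"unknown mode: {mode}")

# Example: the composite term "foam reduction" counted as an exact phrase
print(count_hits("Additives for foam reduction in water-based inks.", ["foam", "reduction"], mode="phrase"))
```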

There are a variety of known ways to augment a search term to be used for purposes of searching with an Information Retrieval engine [36]. These techniques are often known as query expansion (or query augmentation). The expansion is typically intended to improve precision and/or recall, by finding “hits” that do not match the literal search term. For instance, for the search term “dry” a user may be offered the following expansions:



  • Synonyms: exsiccate, dehydrate, dry up, desiccate

  • Antonyms: wet, moisten, wash, dampen

  • Related words: desiccant, drier, drying agent, siccative

  • Stems / truncations (e.g. “dry”), and derived / inflected forms (e.g. “dried”, “drier”)

  • Troponyms / Hypernyms / Hyponyms: a troponym is a word that denotes a particular manner of doing something – for example “dehydrate” is a manner of “drying”, so it may be helpful to search on “dehydrate” when searching on “dry”. A hyponym is a word that denotes a subclass of a more general superclass: for example, “freeze drier”, “vacuum drier”, “spray drier”, and “oven” are all hyponyms of “drier” since they are all types of drier. The word for the superclass (“drier”) is called a hypernym.

  • Meronyms / Holonyms: a meronym is a word that names a constituent part of a larger item. The word for the larger item is called a holonym. E.g. “fan” is a meronym of “oven” as a fan is a constituent part of an oven. Similarly, “oven” is a holonym of “fan”. As fans may be used for drying, and fans are part of ovens, a user searching on “drying” may also be interested in “fan” and “oven”.

It is also desirable that the user select the specific sense of the word that they wish to search on. This is known as query disambiguation. For instance, ‘dry’ has, inter alia, the following different senses: ‘lacking moisture’ (as in ‘dry clothes’), ‘ironic or wry’ (as in ‘dry humor’), or ‘having a large proportion of strong liquor’ (as in ‘dry martini’) [38]. In the absence of a sense-sensitive search engine, a monosemous synonym – i.e. a synonymous word that has only a single sense – can be chosen, provided the synonym is in sufficiently popular use. For example, ‘dehydrate’ is preferable to ‘dry’, as the latter is highly polysemous. Though ‘dehydrate’ is less commonly used than ‘dry’, it is still in sufficiently popular use that we can expect to regularly find hits for it in a document collection. Compare ‘siccative’, which is monosemous but rarely used (and thus seldom found in a given document collection), and may therefore not be an appropriate alternative search term for ‘dry’.

The user is able to perform both query expansion and query disambiguation in our prototype via a run-time interaction we have provided with the WordNet lexical database [38]. Figure 6 shows a pop-up screen from our prototype software which allows the user to perform query disambiguation or query expansion for their chosen search term using WordNet. As Figure 6 illustrates, the user is shown synonyms and related terms for their search term, allowing them to choose a less ambiguous search term in the event that their chosen term is highly ambiguous. For example, a user contemplating the use of the term “dry” (which has many senses – e.g. “dry skin” vs “dry humor”) might instead choose the related term “desiccant”, which is less ambiguous. A caveat, as mentioned above, is that “desiccant” is comparatively rare, and perhaps less likely to produce a significant number of hits if, for instance, document authors prefer the term “drying agent” to “desiccant”. An ideal search term is one that is both unambiguous and in common usage.
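As an illustration of the kind of WordNet interaction involved, the sketch below uses NLTK's WordNet corpus reader rather than the prototype's own integration (which is not reproduced here); it assumes the WordNet data has been fetched via `nltk.download('wordnet')`.

```python
from nltk.corpus import wordnet as wn  # requires: nltk.download('wordnet')

def describe_senses(word: str) -> None:
    """List the senses of a word so the user can pick one (query disambiguation)."""
    for synset in wn.synsets(word):
        print(synset.name(), "-", synset.definition())

def expand(word: str) -> dict:
    """Gather candidate expansion terms for a word, across all of its senses."""
    synonyms, hypernyms, hyponyms = set(), set(), set()
    for synset in wn.synsets(word):
        synonyms.update(synset.lemma_names())
        for h in synset.hypernyms():
            hypernyms.update(h.lemma_names())
        for h in synset.hyponyms():
            hyponyms.update(h.lemma_names())
    return {"synonyms": synonyms, "hypernyms": hypernyms, "hyponyms": hyponyms}

def is_monosemous(word: str) -> bool:
    """A crude monosemy check: the word belongs to exactly one WordNet synset."""
    return len(wn.synsets(word)) == 1

describe_senses("dry")            # the many senses of "dry"
print(expand("drier")["hyponyms"])  # more specific kinds of drier
print(is_monosemous("siccative"))
```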

While we provide run-time, user-driven disambiguation and query expansion facilities via WordNet, as shown in Figure 6, we do not currently prescribe or provide any additional automatic means for semantic-sense-sensitive (context-sensitive) search, though many are available: see, for example, the literature on Word Sense Disambiguation (WSD) [3, 62, 116]. In our experiments, reported in Section 6, we relied on appropriate word choice by the end user (i.e. choice of monosemous, commonly used words), assisted where necessary by the integrated WordNet feature illustrated in Figure 6.

It might be suggested that a possible, though labor-intensive, means of ensuring that only documents pertaining to the correct sense of the category name are associated with that category is to have human readers, skilled in linguistics, manually remove documents that pertain to homonyms (i.e. different senses of a word or phrase that share the same spelling). However, manual intervention quickly becomes impractical given the enormous number of documents in the CDB, and use of our computer-assisted word choice facility (Figure 6), or of supplemental automatic word sense disambiguation (WSD) techniques as suggested above, is preferable.


4.4 Determining the aggregate results for each category


To create aggregate statistics for all categories in the CDB, the search terms from the previous section are compared to the documents in each category. Figure 7 shows the basic process: a search term, in this case “foam reduction”, is run against all documents in each category – in this case, only the “Inks” sub-category (under the “Printing” category) has hits.

A number of basic statistics can be computed for every category in the classification scheme (a simple computation of these is sketched after the list):



  • total number of hits (word / phrase hits) in top n documents for category

  • number of documents with one or more hits, amongst top n documents in that category

  • hits per thousand words (a.k.a. “relative term frequency”), for top n documents in that category
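A simple computation of these statistics, reusing the hypothetical `count_hits` and `tokenize` helpers from the earlier sketch and assuming the documents in each category are already relevance-ranked, might look as follows:

```python
from typing import Dict, List

def category_statistics(docs: List[str], term: List[str], n: int = 100) -> Dict[str, float]:
    """Basic aggregate statistics for one category, over its top-n documents."""
    top_docs = docs[:n]                                   # assumes docs are relevance-ranked
    hits_per_doc = [count_hits(d, term, mode="phrase") for d in top_docs]
    total_words = sum(len(tokenize(d)) for d in top_docs)
    total_hits = sum(hits_per_doc)
    return {
        "total_hits": total_hits,                                        # word/phrase hits in top n docs
        "docs_with_hits": sum(1 for h in hits_per_doc if h > 0),         # docs with one or more hits
        "hits_per_thousand_words": 1000.0 * total_hits / max(total_words, 1),  # relative term frequency
    }
```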

Table 2 shows the absolute number of hits for the terms “network”, “monitoring”, “devices”, etc., in the document categories “Minibuses”, “Buses”, “Automobiles and Cars”, etc., for a sample CDB. Shading is used to indicate where the word occurs with unusually high or unusually low frequency in the category.

More advanced aggregate statistics can also be created. For example, we can compute the relative prevalence (a.k.a. lift) of the search term in that subcategory, as compared to similar categories. A lift of 2 indicates that a word is two times as prevalent in the current category as it is in other categories chosen for comparison – i.e. it is found two times as often as expected. A lift of 1 indicates the word is as common as expected: its prevalence is the same in that category as it is on average in the other categories chosen for comparison. A lift of ½ indicates the word is half as common as expected. Lift is a useful indicator of interestingness [96]. For example, common words like “small” may have a high number of absolute hits in a category, but this may not be interesting, as the relative prevalence, when compared to other categories, may not be significant or unusual, if the other categories also have a similar number of absolute hits for “small”.
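One way to compute lift, consistent with the description above and building on the hypothetical `category_statistics` sketch, is as the ratio of the relative term frequency in the category of interest to the average relative term frequency across the comparison categories:

```python
from typing import List

def lift(category_stats: dict, comparison_stats: List[dict]) -> float:
    """Relative prevalence (lift) of a term in one category versus comparison categories."""
    baseline = sum(s["hits_per_thousand_words"] for s in comparison_stats) / len(comparison_stats)
    if baseline == 0:
        return float("inf")  # term never appears in the comparison categories
    return category_stats["hits_per_thousand_words"] / baseline
```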

Formal tests of statistical significance, such as chi-squared tests, can also be conducted to determine whether the relative prevalence (the difference between actual and expected prevalence) is statistically significant. A category is ‘interesting’ or ‘unusual’ if it has significantly greater prevalence of the search term than expected or, alternatively, significantly lower prevalence of the search term than expected. For example, documents in the category Philadelphia may be interesting because “murder” is mentioned more frequently than in documents pertaining to other cities in Pennsylvania. Similarly, documents in the category Erie may be interesting because “murder” is mentioned comparatively less frequently than in documents for other cities in Pennsylvania.
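A chi-squared test of this kind can be sketched as follows, here using a 2×2 contingency table of hit and non-hit word counts for the category versus the pooled comparison categories; `scipy` is assumed to be available, and the counts in the example are hypothetical:

```python
from scipy.stats import chi2_contingency

def prevalence_significant(hits_a: int, words_a: int,
                           hits_b: int, words_b: int,
                           alpha: float = 0.05) -> bool:
    """Test whether a term's prevalence in category (a) differs significantly
    from its prevalence in the pooled comparison categories (b)."""
    table = [[hits_a, words_a - hits_a],
             [hits_b, words_b - hits_b]]
    chi2, p_value, dof, expected = chi2_contingency(table)
    return p_value < alpha

# Hypothetical counts: 40 hits in 20,000 words vs 55 hits in 80,000 words elsewhere
print(prevalence_significant(40, 20000, 55, 80000))
```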

Figure 8 shows the relative prevalence of the terms “smoothness”, “strength”, and [“wet” or “damp”] in various segments of the stone quarrying industry. Three bars are shown for each industry: from left to right, they are “smoothness”, “strength”, and [“wet” OR “damp”] for that industry. As shown by the left-most bar for each industry, “smoothness” is mentioned almost twice as frequently in Crushed and Broken Limestone mining as in the other segments. “Strength” (the middle bar for each industry in the chart) is mentioned almost twice as frequently in Dimension stone mining as in the other stone quarrying industry segments. Finally, looking at the right-most bar for each industry in the chart, we see that [“wet” or “damp”] is mentioned half as frequently in Crushed and Broken Granite mining as in the other segments. The baseline (i.e. average absolute number of hits out of total words in the documents) in each category for each search term is obviously relevant, since a moderate lift off a low baseline (e.g. a baseline of one or two absolute hits out of thousands of words in the documents) would not be statistically significant. It is therefore important that chi-squared tests, with the relevant degrees of freedom, be applied to ascertain whether the lift is statistically significant given the baseline. In Figure 8, the various baselines for each term are omitted for readability, but different shading is used to indicate lift that is statistically significantly higher than expected (at the 95% confidence level), or statistically significantly lower than expected (at the 95% confidence level).

Aggregate statistics for any parent category (that is, a category that has sub-categories), can be obtained either from:


  1. the document collection for that parent category (e.g. a document collection obtained by finding all documents relevant to parent category “Pennsylvania”), or from

  2. aggregates of statistics from the document collections of its children (e.g. aggregates of statistics from all documents relevant to child categories “Philadelphia”, “Pittsburgh”, “Erie”, etc. which are children of the parent category “Pennsylvania”)

Both types of statistics are interesting, since the former is obtained from documents directly related to the parent category, and the latter is obtained from documents which relate to descendants (i.e. children, grandchildren, etc.) of that category.
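A sketch of the second option, rolling statistics up from all descendants of a parent category, is given below. The `CategoryNode` structure is an illustrative assumption, and the `count_hits` and `tokenize` helpers are reused from the earlier sketches:

```python
from typing import Dict, List, Optional

class CategoryNode:
    """A node in the classification hierarchy, holding its own documents and children."""
    def __init__(self, name: str, docs: Optional[List[str]] = None):
        self.name = name
        self.docs = docs or []
        self.children: List["CategoryNode"] = []

def rolled_up_statistics(node: CategoryNode, term: List[str]) -> Dict[str, float]:
    """Aggregate a parent category's statistics from its descendants' document collections."""
    total_hits = docs_with_hits = total_words = 0
    stack = list(node.children)
    while stack:
        child = stack.pop()
        stack.extend(child.children)                  # include grandchildren, etc.
        hits_per_doc = [count_hits(d, term, mode="any") for d in child.docs]
        total_hits += sum(hits_per_doc)
        docs_with_hits += sum(1 for h in hits_per_doc if h > 0)
        total_words += sum(len(tokenize(d)) for d in child.docs)
    return {"total_hits": total_hits,
            "docs_with_hits": docs_with_hits,
            "hits_per_thousand_words": 1000.0 * total_hits / max(total_words, 1)}
```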

Figure 10 shows a hierarchical drill-down view of hits per category, for the word “biodegradable” across various categories in the United Nations Standard Products and Services Code (UNSPSC)15.

The absolute number of hits can also be normalized or standardized in various ways. For example, a large number of hits for “dogs” in Los Angeles CA, as compared to, say, Blacksburg VA, is unsurprising, as Los Angeles CA has a substantially higher population. Normalizing the hits, by dividing by the population size in this case to produce per-capita popularity, can provide an alternative statistic for comparison.
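A minimal sketch of this normalization (the hit and population figures are hypothetical):

```python
def per_capita_hits(total_hits: int, population: int, per: int = 100_000) -> float:
    """Normalize absolute hit counts by an external statistic such as population."""
    return per * total_hits / population

# Hypothetical figures: 4,800 hits for "dogs" in a city of 3.9 million people
print(per_capita_hits(4800, 3_900_000))   # hits per 100,000 residents
```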

Statistics can also be calculated for combinations of categories taken from different classification schemes, in much the same way as On-Line Analytical Processing (OLAP) statistics are calculated on structured data. For example, taking a CDB of documents categorized by both Place and Time, we could find the statistics for “New York, September 2001” documents (i.e. documents that appear both under “New York” and under “September 2001”). This set of documents would have a high number of hits for “trade center”, compared to documents for, say, “Boston, January 1992”, as September 2001 was the time of the World Trade Center attacks in New York.
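A sketch of this cross-classification, treating each classification scheme as a mapping from category name to the set of document identifiers filed under it (the identifiers and category contents below are hypothetical):

```python
from typing import Dict, Set

def cross_category_docs(by_place: Dict[str, Set[str]],
                        by_time: Dict[str, Set[str]],
                        place: str, time: str) -> Set[str]:
    """Documents classified under BOTH a Place category and a Time category
    (analogous to an OLAP cell defined by two dimensions)."""
    return by_place.get(place, set()) & by_time.get(time, set())

# Hypothetical CDB slices keyed by document identifier
by_place = {"New York": {"d1", "d2", "d3"}, "Boston": {"d4"}}
by_time = {"September 2001": {"d2", "d3", "d5"}, "January 1992": {"d4"}}
print(cross_category_docs(by_place, by_time, "New York", "September 2001"))  # {'d2', 'd3'}
```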

More sophisticated numerical scores for each category can also be computed, using the data in the document base alone, or in conjunction with other data sources. In general, a numeric score for a category is any quantitative measure that can be derived from the contents of (i.e. documents in) that category, or from, or in combination with, a structured data source that associates that category with some statistic (e.g. ‘population’ is a statistic for the category “Los Angeles, CA”, that can be obtained from a structured data source).

For our software implementation, we initially implemented a Microsoft Excel interface, which would download a comma-separated-value text file from the server farm, and allow the results to be viewed graphically, using tree views and charts. The Excel interface included intuitive expandable and collapsible tree-views of the various taxonomies, to allow easy visualization of the aggregate statistics for each category in the CDB by end users. Figure 9 below shows the Excel-based interface – in this case the user, a molecular engineer from a chemical company, is viewing the hits for the phrase “foam reduction” amongst a number of product categories in a taxonomy of different product types, in an attempt to identify relevant product applications for a new foam reducing surfactant she has developed. Categories which have more hits than a defined threshold are shown shaded: in this case, the threshold is arbitrarily defined as 2 hits per category. Our Excel interface was eventually retired, in favor of a web-based interface (see Figure 10).



