Creating and exploiting aggregate information from text using a novel Categorized Document Base abstract

Date	18.10.2016

5.2 CDB Database Structure

The database tables used by the CDB can be divided into three major areas:

Tables used by the Population routines (§4.2) to represent and index the documents gathered by the CDB – see Figure 12.
Tables used by the Query routines (§4.4) to store and cache queries and their results (aggregate statistics) from the CDB – see Figure 13.
Tables used to implement the Collaborative Annotation interface (§4.6) – see Figure 14.

As mentioned earlier, we found that use of a relational database structure by the Population routines for the representation and indexing of documents resulted in significant performance impediments. As clients found the 2 to 3 day delay in result compilation to be excessive, we have begun experiments with the storage of documents in simple directory folders and our initial results indicate that we can obtain aggregate statistics for 20,000 categories within a few minutes using a traditional file storage scheme.

6. EXPERIMENTAL EVALUATION

To evaluate our CDB approach, we set up a number of experiments in varied industries, some relating to natural phenomena and some to commercial phenomena. We observed that our studies could be segregated into:

population-sensitive studies (§6.1) – such as burgers, pizza, and hotels – where the phenomenon varies by population

versus

non-population-sensitive studies (§6.2) – such as solar, wind and rain – which are governed by natural phenomena, rather than driven by human population.

In all experiments, we made use of a taxonomy of United States place names, taken from the United States General Services Administration Geographic Locator Codes (US GSA GLC) / Geographic Names Service^¹⁹. This taxonomy was then populated with the ten most relevant documents for each place, from Yahoo^²⁰. A total of 27,547 individual places were investigated, and the ten most relevant documents acquired for each. In each experiment, we gathered summary statistics for industry-specific search terms across these top-ranked documents for each location. We then aggregated the data by state, and made use of a public, structured data source for each state to validate whether the findings from the exploration of the document-based data in the CDB were sensible. In all cases we used Pearson correlation (Pearson’s r) [102]^²¹ to determine whether the ranking of states as provided by aggregate statistics from the textual CDB was correlated with the ranking we obtained for those states from an alternative structured, quantitative data source, and hence whether the aggregate statistics from the textual CDB provided a reliable proxy for alternative, widely-accepted quantitative data^²².
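The comparison of two state rankings described above can be sketched as follows. This is a minimal illustration only, not the implementation used in the experiments: the state labels and values are made-up toy inputs, the helper names `rank_descending` and `pearson_r` are hypothetical, and the handling of ties (average rank) is an assumption.

```python
from math import sqrt

def rank_descending(values):
    """Return 1-based ranks (1 = largest value); tied values share the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the group while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        avg_rank = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            ranks[order[k]] = avg_rank
        pos = end + 1
    return ranks

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Toy inputs (not real data): four placeholder states A, B, C, D.
cdb_hits_per_1000_words = [0.9, 0.2, 0.7, 0.3]   # term frequency from the CDB
external_metric = [6.5, 3.6, 6.1, 3.5]           # value from a structured source

cdb_ranks = rank_descending(cdb_hits_per_1000_words)
ext_ranks = rank_descending(external_metric)
r = pearson_r(cdb_ranks, ext_ranks)
print(cdb_ranks, ext_ranks, round(r, 3))
```

Applying Pearson’s r to ranks rather than raw values makes the result depend only on the ordering of the states, which matches the ranking comparison described in the text.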

6.1 Population-Sensitive Studies

We regard population-sensitive studies as those where the phenomenon is likely to vary with population. To further investigate population-sensitive industries, we selected the top 100 American franchise corporations from Entrepreneur Magazine’s Franchise 500 listing^²³ for the year 2008. As some of the franchises were in the same industry, we selected only the largest franchise from each of the 37 unique industries found^²⁴. For each industry, we visited the website for the largest franchise corporation in that industry, viewed the home-page source code, and selected up to 4 META tag keywords listed by the company, which described the company’s main product or products^²⁵. We then ran each keyword against the above-described CDB (populated with documents for each United States location from the US GSA GLC), aggregated the hits for each keyword by state, and ranked the states from the most hits per 1,000 words^²⁶ for that keyword down to the fewest. To determine whether the CDB ranking correlated with independent data about the popularity of each product in each state, we visited a commercial data provider, InfoUSA^²⁷. For each corporation, we submitted the name of the corporation to InfoUSA, and obtained a count of the number of outlets operated by the corporation in each state. To remove population biases and attempt to ascertain relative demand for each product in each state, we divided the number of outlets in the state by the population of that state from the US Census Bureau^²⁸, and then ranked the states by number of outlets per capita for each corporation. Finally, we compared the ranking of states from the CDB (relative term frequency for the product keyword in each state^²⁹) to the ranking of the states by number of outlets per capita operated by the corporation in each state. Our results are shown in Table 4^³⁰ – correlations significant at the 90% confidence level are shown in bold.
For brevity, and because these experiments did not generate significant correlations, we have shown only a small representative selection of industries. As can be seen in column 5 of Table 4, all industries were indeed strongly population-sensitive as we had expected, with franchise outlets per state being highly correlated with the population of that state. Column 4 of Table 4 shows, however, that statistically significant positive correlations between term frequency for the search term and per-capita franchise outlets per state for the industry were seldom found. This indicates that the CDB is not a credible tool for discerning differences in per-capita demand for different products between states in the population-sensitive industries in our experiments.
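The two normalizations used in this section, keyword hits per 1,000 words and outlets per capita, can be sketched as below. This is an illustrative sketch under stated assumptions: the record layout, the function names, and the choice to pool hits and words across all places in a state before dividing are hypothetical, and the numbers are toy values rather than data from the study.

```python
from collections import defaultdict

def hits_per_1000_words_by_state(place_records):
    """Aggregate place-level keyword hits into a per-state rate.

    place_records: iterable of (state, hits, word_count) tuples, one per place.
    Assumption: hits and words are pooled per state before computing the rate.
    """
    hits = defaultdict(int)
    words = defaultdict(int)
    for state, place_hits, place_words in place_records:
        hits[state] += place_hits
        words[state] += place_words
    return {s: 1000.0 * hits[s] / words[s] for s in hits if words[s] > 0}

def outlets_per_capita(outlets_by_state, population_by_state):
    """Divide outlet counts by state population to remove the population effect."""
    return {
        s: outlets_by_state[s] / population_by_state[s]
        for s in outlets_by_state
        if population_by_state.get(s)
    }

def rank_states(metric_by_state):
    """Return state labels ordered from the highest metric value to the lowest."""
    return sorted(metric_by_state, key=metric_by_state.get, reverse=True)

# Toy inputs (not real data): two placeholder states A and B.
records = [("A", 3, 1000), ("A", 1, 3000), ("B", 5, 2000)]
term_rate = hits_per_1000_words_by_state(records)
demand = outlets_per_capita({"A": 50, "B": 20}, {"A": 1_000_000, "B": 500_000})

print(rank_states(term_rate))  # ranking by CDB term frequency
print(rank_states(demand))     # ranking by outlets per capita
```

The two resulting rankings are what would then be compared with a correlation coefficient.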

6.2 Non-Population-Sensitive Studies

We regard non-population-sensitive studies as those which are governed by some natural phenomenon, rather than by human population. Through a process of group brainstorming, we identified a short-list of non-population-sensitive industries. For each industry, we identified an independent external data source that provided state-specific metrics for that industry. We used the CDB system to rank each state by the metric in question, and then we compared these rankings with rankings established by the independent external data sources. The external data sources we gathered for each industry are as follows:

Wind energy: We obtained data from the Department of Energy (DoE), Energy Efficiency and Renewable Energy, on current installed wind power capacity, in Megawatts, per state as of January 31, 2009^³¹. We also obtained data from the National Renewable Energy Laboratory (NREL) annual average wind resource estimates, in Megawatts, in the contiguous United States^³².
Solar energy: We obtained data on annual average daily solar radiation, in British Thermal Units (BTUs) per square meter for a 10 tube solar collector, for each US state^³³.
Rain: We obtained data from the National Oceanic and Atmospheric Administration (NOAA), on total inches of precipitation for 2008, for each state^³⁴. We also obtained data from the National Atlas on average annual precipitation per square mile for each US state from 1961 through 1990^³⁵.
Fishing: We obtained data from the United States Fish and Wildlife Service (USFWS), on the number of non-resident fishing licenses issued per state^³⁶.
Coal, Gemstone, and Gold: We obtained data from the National Mining Association (NMA) State Fact sheets, on the total number of mines, total production, and total revenue, for coal, gemstones, and gold in each US state^³⁷.
Forests: We obtained data from the National Forest Service (NFS) on total forest acres under administration in each state^³⁸.
Oil: We obtained data from the Energy Information Administration (EIA) on oil production for each state^³⁹.
Mountain Climbing: Data on the highest elevations in each state was obtained from the United States Geological Survey (USGS)^⁴⁰.
Eco-tourism and Gambling: Data on employment in these specialty occupations was obtained from US Bureau of Labor Statistics Occupational Employment Statistics (OES) state cross-industry estimates^⁴¹.

After gathering external data for each industry, to allow us to rank and compare the states for that industry, we compared the ranking of states using the external data to a ranking of each state by term frequency for a search term for that industry using the CDB. Table 5 (column 3) shows the search term keywords we chose for each industry. In each case we computed (see column 4) the correlation – using Pearson’s r – between the state ranking using the external data for that industry and the CDB ranking of states by search term frequency (hits for the search term per 1,000 words) for that industry. We also computed (column 5) the correlation between the external data and the population for that state, to confirm whether the industry was indeed non-population-sensitive. As before, correlations significant at the 90% confidence level are shown in bold. For instance, for the mountain climbing industry, we find a strong (0.65) correlation between the ranking of states by USGS Elevation Data and the ranking of states by term frequency for the term “mountain climbing”. As is evident from Table 5, for non-population-sensitive industries, we regularly find statistically significant correlations between the ranking of the states by relative term frequency of the search term using the CDB, and the ranking of the states using the external data. We conclude that the CDB is an approximate, but viable, means of comparing states for the non-population-sensitive industries in our experiments, as the CDB rankings are plausible proxies for rankings of the phenomena obtained from external data.

Given the large number of correlations run, the Bonferroni correction [12, 13] needs to be applied to determine whether each result, when considered alone, is statistically significant. Dividing the desired significance level (p = 0.1) by the number of experiments (17) gives a per-test threshold of approximately 0.0059; the actual p-values obtained are, in most cases, sufficiently low to conclude that the correlation is significant.
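The Bonferroni adjustment described above amounts to a single division. The sketch below shows it with the figures from the text (significance level 0.1, 17 experiments); the function names and the list of p-values are hypothetical placeholders, not results from Table 5.

```python
def bonferroni_threshold(alpha, num_tests):
    """Per-test significance threshold under the Bonferroni correction."""
    return alpha / num_tests

def significant_after_bonferroni(p_values, alpha, num_tests):
    """Flag each p-value that falls at or below the corrected threshold."""
    threshold = bonferroni_threshold(alpha, num_tests)
    return [p <= threshold for p in p_values]

# Figures from the text: overall level 0.1 spread across 17 experiments.
threshold = bonferroni_threshold(0.1, 17)
print(f"per-test threshold: {threshold:.4f}")

# Placeholder p-values for illustration only (not taken from the study).
example_p_values = [0.0001, 0.004, 0.03]
flags = significant_after_bonferroni(example_p_values, alpha=0.1, num_tests=17)
print(flags)
```

A correlation that passes the uncorrected 0.1 level, such as the placeholder p-value of 0.03, no longer counts as significant once the threshold is divided by the number of tests.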

To determine whether the set of correlations, when taken together, is significant, a chi-squared test (χ² test) can be performed^⁴². For instance, at a 90% confidence level it is likely that 10% of studies performed would, by chance, indicate correlations. A χ²-test can be performed to reveal whether the actual number of correlated studies is significantly different from 10%. Of the 17 non-population-sensitive industries investigated, 13 produced statistically significant positive correlations. Though admittedly a small sample, the p-value for this χ²-test (13 actual correlations obtained in 17 studies compared to 1.7 correlations in 17 studies expected) is substantially less than 0.01, indicating strong statistical significance. We conclude that the CDB appears to perform satisfactorily for non-population-sensitive industries.
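The arithmetic behind this test can be reproduced with the counts given above (13 significant results in 17 studies, against 1.7 expected by chance). The sketch below assumes a two-cell goodness-of-fit layout (significant versus not significant) with one degree of freedom, which the text does not spell out. Because the expected count of 1.7 is small for the chi-squared approximation, an exact binomial tail probability is included as a cross-check; that cross-check is an addition here, not part of the original analysis.

```python
from math import comb, erfc, sqrt

def chi_squared_statistic(observed, expected):
    """Pearson chi-squared statistic: sum of (O - E)^2 / E over all cells."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def chi_squared_p_value_1df(statistic):
    """Upper-tail probability of a chi-squared variable with 1 degree of freedom."""
    return erfc(sqrt(statistic / 2.0))

def binomial_upper_tail(n, k, p):
    """Exact probability of k or more successes in n independent trials."""
    return sum(comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(k, n + 1))

num_studies = 17       # non-population-sensitive industries investigated
num_significant = 13   # studies with significant positive correlations
chance_rate = 0.10     # share expected to appear significant by chance

observed = [num_significant, num_studies - num_significant]
expected = [num_studies * chance_rate, num_studies * (1 - chance_rate)]

chi2 = chi_squared_statistic(observed, expected)
p_chi2 = chi_squared_p_value_1df(chi2)
p_exact = binomial_upper_tail(num_studies, num_significant, chance_rate)

print(f"chi-squared = {chi2:.2f}, p = {p_chi2:.3g}, exact binomial p = {p_exact:.3g}")
```

Under these assumptions both p-values fall far below 0.01, consistent with the conclusion stated in the text.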
