Creating and exploiting aggregate information from text using a novel Categorized Document Base




TABLES





Type of Taxonomy, with examples:

Product hierarchies
  • United Nations Standard Products and Services Code (UNSPSC) *
  • United Nations Central Product Classification (CPC)
  • United States Patent Classification (USPTO USPC) *
  • International Patent Classification (IPC)
  • Proprietary corporate product catalogues (e.g. Amazon, Wal-Mart, Sears, or any other catalogue defined by any large or small company)

Industry taxonomies
  • North American Industry Classification System (NAICS)
  • United States Standard Industrial Classification (SIC)
  • International Standard Industrial Classification (ISIC)
  • Standard International Trade Classification, Revision 3 (SITC3)

Company classifications
  • Fortune 500 and Fortune 1000
  • S&P 500
  • Inc. 500 and Inc. 5000
  • Entrepreneur Magazine’s Franchise 500
  • Internet Retailer 500

Activity taxonomies
  • WordNet (verb relationships)
  • United States Bureau of Labor Statistics Standard Occupational Classification (SOC) system

Place (Location) taxonomies
  • United States General Services Administration Geographic Locator Codes (US GSA GLC) / Geographic Names Service *
  • United States Direct Marketing Areas (DMA)
  • Getty Thesaurus of Geographic Names (TGN)

Time taxonomies
  • ISO date and time standards (e.g. ISO 8601)

Topic taxonomies
  • Library of Congress Classification system (LoC)
  • UK Joint Academic Coding System (JACS)
  • UK Higher Education Statistics Agency coding (HESACODE)

Medical taxonomies
  • International Classification of Diseases (e.g. ICD-10)
  • International Classification of Primary Care (ICPC)
  • Current Procedural Terminology (CPT)
  • US FDA Classification of Medical Devices

Table 1: Popular Classification Schemes
* indicates that the taxonomy was imported into our prototype system and all categories were populated with documents. For the other taxonomies that were imported, the category names and relations were imported and a random selection of approximately 10% of categories was populated with documents.

Table 2: Absolute Hits for a Number of Search Terms, by Document Category



Type of Taxonomy, with examples of data indexed using standard taxonomies:

Product hierarchies
  Sales data for each product category, from an internal company database, indexed by product category (e.g. UNSPSC, or UCC Stock Keeping Unit [SKU]).

Industry taxonomies
  Industry size figures from the Bureau of Economic Analysis (BEA.gov), or from the Internal Revenue Service (IRS.gov), indexed by NAICS code.

Company classifications
  Company profit figures, from the Securities and Exchange Commission (SEC.gov), indexed by NAICS code.

Activity / Employee taxonomies
  Salary data for each profession, from the Bureau of Labor Statistics (BLS.gov), indexed by SOC occupation classification.

Place (Location) taxonomies
  Population, land area, and other geographic data from the United States Geological Survey (USGS.gov), indexed by United States General Services Administration Geographic Locator Code (US GSA GLC).

Time taxonomies
  Sales data for each date, from an internal company database, indexed by time.

Topic taxonomies
  Enrollment data for each academic subject, from the National Center for Education Statistics (NCES.ed.gov), indexed by educational field.

Medical taxonomies
  Infection rate, for each illness, in each area, indexed by ICD-9 or ICD-10 disease code.

Table 3: Structured Data Sources for Various Taxonomies

Corporate Franchise | Total US Franchise Outlets | CDB Search Terms Used | Pearson’s ρ: per-capita franchise outlets per state vs CDB search term frequency for state [44] | Population Correlation [45]
McDonalds        | 11,318 | “burger”     | -0.25 | 0.98
                 |        | “hamburgers” | -0.15 |
Pizza Hut        | 5,676  | “pizza”      |  0.04 | 0.34
KFC              | 4,378  | “chicken”    | -0.17 | 0.95
Intercontinental | 3,023  | “hotels”     | -0.34 | 0.93
Starbucks        | 9,869  | “coffee”     | -0.21 | 0.90
RE/MAX           | 4,628  | “property”   |  0.15 | 0.95
Supercuts        | 1,644  | “hair”       | -0.35 | 0.79
Jackson Hewitt   | 2,475  | “tax”        |  0.10 | 0.87
Carlson Wagonlit | 340    | “travel”     |  0.26 | 0.87
                 |        | “flight”     |  0.13 |
Jiffy Lube       | 1,923  | “car”        | -0.10 | 0.82
Miracle Ear      | 1,349  | “hearing”    |  0.02 | 0.89

Table 4: Summary of Experimental Results – Selected Population-Sensitive Industries

Industry | External Data Used | CDB Search Term Used | Pearson’s ρ [46] | Population Correlation [47]
Wind energy       | DoE wind generating capacity              | “windy”             |  0.07 |  0.10
                  | NREL wind resource availability           | “windy”             |  0.25 | -0.36
Solar energy      | Thermomax solar energy (BTUs)             | “warm”              | -0.11 | -0.09
                  |                                           | “sunny”             | -0.28 |
                  |                                           | “sunshine”          |  0.22 |
Rain              | NOAA precipitation per square mile 2008   | “rain”              |  0.29 | -0.02
                  | NationalAtlas.gov 1961-1990               | “rain”              |  0.27 |  0.01
Fishing           | USFWS Non-resident fishing licenses sold  | “fishing”           |  0.46 |  0.19
Coal              | NMA Number of coal mines                  | “coal”              |  0.75 |  0.18
                  | NMA Coal production                       | “coal”              |  0.74 |  0.08
Gemstone          | NMA Gemstone production                   | “gemstone”          |  0.30 |  0.09
Gold              | NMA Gold revenues                         | “gold”              |  0.29 | -0.06
Forests           | NFS Forest area                           | “forest”            |  0.30 |  0.18
Oil               | EIA Oil production                        | “oil”               |  0.39 | -0.04
Mountain climbing | USGS Elevation Data                       | “mountain climbing” |  0.65 | -0.10
Eco-tourism       | USBLS Number of eco-tourism employees     | “ecotourism”        |  0.39 |  0.35
Gaming            | USBLS Number of game dealers              | “gambling”          |  0.29 |  0.32

Table 5: Summary of Experimental Results – Non-Population-Sensitive Industries

1 The reader may be curious as to why compiling aggregate data from text is useful when similar data is already available in structured sources. There are several reasons why the ability to glean this information from text is helpful:

  1. Cost: Aggregate information from text may be cheaper to generate than alternative sources of that information (e.g. gleaning data from text about locations may be cheaper than conducting geological surveys of those locations, for natural data, or market or sociological research on those locations, for commercial or social data).

  2. Time: Aggregate information from text may be more current if the text is current (alternative sources, typically constructed by manual market research, may be years out of date).

  3. Breadth: Aggregate information from text may provide data on a broad range of phenomena that have not yet been the subject of manual surveys. For example, while the economic effects of federal stimulus money on various locations can be readily assessed from tax return data, social effects are more difficult to observe rapidly and cheaply, and would traditionally require manual surveys that may be prohibitively costly. If aggregate information from text can be shown to be plausible, textual sources may more readily be consulted to assess such social effects, particularly where manual surveys of a given social phenomenon do not exist.

2 http://www.sas.com/technologies/analytics/datamining/textminer/

3 http://www.aubice.com/

4 http://www.clarabridge.com/

5 http://www.islanddata.com/

6 http://www.ql2.com/

7 nlresearch.com, US Patent No. 5,924,090.

8 http://flamenco.berkeley.edu/demos.html (Accessed on 13 March 2009).

9 United States Patent Application 20070106662 “Categorized Document Bases”.

10 Kartoo’s results, not shown for brevity, are similarly haphazard.

11 For now, the reader can assume the results produced are sensible; in §6 EXPERIMENTAL EVALUATION we evaluate the validity of the results produced by the CDB, and find that the CDB produces results that are correlated with United States Fish and Wildlife Service data on fishing popularity in those states.

12 http://www.alphaworks.ibm.com/tech/uimodeler

13 Though, as we will see later (§4.2), certain taxonomies are more amenable to analysis by a CDB – specifically, taxonomies with relatively unambiguous category descriptors are better suited to analysis using a CDB – while others are, at the present time, not.

14 As category names may be ambiguous, it is preferable that additional identifying features of the category be specified, or that some other mechanism be employed to obtain documents for the correct category. For example, the place “Reading” in Pennsylvania, in a location taxonomy, is different from the category “reading” in an activity taxonomy, so documents from these two categories should not be mixed. Supplying the search term “Reading” to, for instance, Yahoo would typically return documents from both categories, whereas the CDB should store only documents relevant to the place “Reading” when populating documents for “Reading, Pennsylvania” into the CDB. One method of disambiguation is to append the parent category name to the child category name: for example, “reading” becomes “Reading, Pennsylvania” or “reading activity”. Capitalization may be used to distinguish a proper noun from a common noun, and using a quoted phrase to ensure co-occurrence of the city and state name reduces ambiguity further. Also, because modern search engines often consult recent search history for the purposes of results personalization and query disambiguation (see Google’s United States Patent Application 20050222989), the results returned to the CDB are likely to be further disambiguated as the CDB repeatedly requests search results for similar categories. A vast number of other disambiguation techniques exist [3, 62, 116] and can be used.
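The parent-category heuristic lends itself to a short sketch. The Python fragment below is illustrative only: the function name and its parameters are our own, not the CDB’s actual API, and show merely how quoting, capitalization, and parent-name appending combine.

```python
# Illustrative sketch of the parent-category disambiguation heuristic in this
# footnote. The function and its parameters are hypothetical, not the CDB's API.

def disambiguated_query(category: str, parent: str, is_proper_noun: bool) -> str:
    """Build a search query that reduces category-name ambiguity by
    (a) preserving capitalization only for proper nouns,
    (b) appending the parent category name, and
    (c) quoting the phrase to force co-occurrence of both names."""
    name = category if is_proper_noun else category.lower()
    separator = ", " if is_proper_noun else " "  # "Reading, Pennsylvania" vs "reading activity"
    return f'"{name}{separator}{parent}"'

# The same surface form "reading" yields two distinct, unambiguous queries:
print(disambiguated_query("Reading", "Pennsylvania", is_proper_noun=True))   # "Reading, Pennsylvania"
print(disambiguated_query("reading", "activity", is_proper_noun=False))      # "reading activity"
```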

15 http://www.unspsc.org/

16 Excel was used as it allowed us to easily create bubble charts, and it also allowed us to easily integrate financial data for various industries with the aggregate statistics for those industries from the CDB.

17 The increase is not exactly proportionate since the size of each new document varies, and document size also affects the rate at which term occurrence counts are computed.

18 For example, for the reasons reported in §4.2, we observed that the quality of documents in the US Patent taxonomy was poor, and this taxonomy was therefore abandoned as unusable.

19 http://www.gsa.gov/glc

20 http://www.yahoo.com/

21 We chose to use Pearson’s ρ [102] instead of Spearman’s rank correlation coefficient (Spearman’s ρ) [115, 123], since Spearman’s measure does not cater for ties. Pearson’s ρ is equivalent to Spearman’s rank correlation coefficient when computed on ranks. In many cases, we also computed Kendall’s tau rank correlation coefficient (Kendall’s τ) [71, 72]. However, for brevity, we have not shown Kendall’s τ in our results, as Kendall’s metric consistently showed similar statistical significance to Pearson’s ρ and so does not provide substantial additional information.
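The equivalence noted here is easy to verify numerically. A minimal sketch using SciPy (an illustrative tooling choice; the paper does not specify the statistics software used), with made-up numbers:

```python
# Illustration of footnote 21 using SciPy (illustrative tooling choice).
from scipy.stats import pearsonr, spearmanr, kendalltau, rankdata

external = [3.1, 7.4, 2.2, 9.8, 5.5, 4.1]   # e.g. external per-state data (made-up)
cdb_freq = [2.9, 6.1, 2.5, 8.7, 5.0, 4.8]   # e.g. CDB term frequencies (made-up)

# Pearson's rho computed on ranks equals Spearman's rank correlation.
r_on_ranks, _ = pearsonr(rankdata(external), rankdata(cdb_freq))
rho_spearman, _ = spearmanr(external, cdb_freq)
assert abs(r_on_ranks - rho_spearman) < 1e-12

# Kendall's tau, reported in the paper as similarly significant.
tau, _ = kendalltau(external, cdb_freq)
print(r_on_ranks, rho_spearman, tau)
```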

22 The source data used for our experiments – including summary statistics for each keyword in each state obtained from the CDB, and the statistical calculations we performed – is too large for inclusion here, but is available separately for download from the authors, should the reader wish to validate our findings. Due to copyright restrictions on the quantitative data which we obtained for each state from the external data sources, the authors are unable to redistribute the external data sources. However, we have provided hyperlinks in footnotes in all cases, to allow the reader to obtain the data themselves. Also, owing to copyright restrictions on the Yahoo search results we used to populate the CDB, the authors are unable to make the source documents available, but the reader can compile a similarly-arranged data set using the techniques described in this paper, albeit for a different point in time.

23 http://www.entrepreneur.com/franchise500/

24 In the case of Dunkin’ Donuts (a coffee and donuts franchise), ranked 3rd in the Franchise 500, we noticed that the franchise had predominantly East Coast penetration in the United States, attributable to the franchise’s unique and peculiar roll-out strategy; we therefore substituted it with Starbucks Corporation, which had a more representative national penetration of stores across all states.

25 In the rare event that META keywords were not available, we manually selected appropriate product keywords for the company.

26 We divided the raw hit count for each state by the total number of words in the documents for that state, to remove the population bias that results from some states having more locations, and hence more documents and words, than other states.

27 http://www.infousa.com/

28 http://www.census.gov/popest/states/NST-ann-est.html

29 Originally, we computed raw total hit count for the product keyword in each state. However, we found that the strong correlation between raw total hit count and number of franchise outlets was spurious, since states with more locations had more documents, and hence more words and keywords. The authors are grateful to the reviewers for pointing out this issue. We therefore made use of an alternative metric – term frequency (hits per 1,000 words) – which is not skewed by population.
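A minimal sketch of the normalization described in footnotes 26 and 29; the per-state counts below are made-up stand-ins for the CDB’s statistics:

```python
# Sketch of the term-frequency normalization in footnotes 26 and 29.
# raw_hits and total_words are hypothetical stand-ins for per-state CDB statistics.

def hits_per_thousand_words(raw_hits: dict, total_words: dict) -> dict:
    """Term frequency per state: keyword hits per 1,000 words of state documents.
    Dividing by document volume removes the bias from states that simply have
    more locations, hence more documents and more words."""
    return {state: 1000.0 * raw_hits[state] / total_words[state]
            for state in raw_hits}

raw_hits    = {"Alaska": 420, "California": 3900}            # hits for "fishing" (made-up)
total_words = {"Alaska": 1_200_000, "California": 18_000_000}
print(hits_per_thousand_words(raw_hits, total_words))
# Alaska's smaller raw count becomes the larger normalized frequency:
# {'Alaska': 0.35, 'California': 0.2166...}
```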

30 As the table is large, it has been split into 4 parts for readability.

31 http://www.eere.energy.gov/windandhydro/windpoweringamerica/wind_installed_capacity.asp

32 http://rredc.nrel.gov/wind/pubs/atlas/maps/chap2/2-01m.html

33 http://www.thermomax.com/usdata.htm

34 http://cdo.ncdc.noaa.gov/cgi-bin/climaps/climaps.pl?directive=quick_search&subrnum=

35 http://nationalatlas.gov/printable/precipitation.html

36 http://www.fws.gov/news/newsreleases/R9/A2D9B201-0350-4BD4-A73477A70A25FC69.html?CFID=3980850&CFTOKEN=92935320

37 http://www.nma.org/statistics/states_econ.asp

38 http://www.fs.fed.us/publications/documents/report-of-fs-2002-low-res.pdf

39 http://tonto.eia.doe.gov/dnav/pet/pet_crd_crpdn_adc_mbbl_a.htm Note: this document is no longer accessible on the EIA website but can be found in Google’s cache by searching for the URL.

40 http://erg.usgs.gov/isb/pubs/booklets/elvadist/elvadist.html

41 http://www.bls.gov/oes/oes_dl.htm

42 The Bonferroni adjustment is notoriously conservative and lacking in power, so a χ-squared test here is dispositive.

43 http://www.sec.gov/. Similar data, listing the companies in a given industry, could also have been obtained from commercial sources, such as Hoovers, Dun & Bradstreet, Microsoft Money, Yellow Pages, or other alternatives.

44 For each study shown in Table 4, fifty-one (51) pairs of data points – one pair of rankings for each of the 50 U.S. states plus the District of Columbia – were compared. For 51 data points, the Pearson ρ score required for statistical significance at the weaker 90% confidence level is ρ > 0.231, and at the stronger 99% confidence level the threshold is ρ > 0.354. Therefore, in all cases where ρ > 0.354 we can conclude that it is highly unlikely that the correlation between the two rankings being compared occurred by chance.
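These thresholds follow from the standard t-test for a correlation coefficient. A sketch, assuming df = n - 2 (small differences from the quoted figures may stem from table rounding or a df = 50 convention):

```python
# Deriving the critical Pearson rho for n = 51 state pairs (footnote 44),
# using the standard t-test for a correlation coefficient with df = n - 2.
from math import sqrt
from scipy.stats import t

def critical_rho(n: int, confidence: float) -> float:
    """Smallest |rho| significant at the given two-sided confidence level."""
    df = n - 2
    t_crit = t.ppf(1 - (1 - confidence) / 2, df)
    return t_crit / sqrt(t_crit**2 + df)

print(critical_rho(51, 0.90))  # ~0.233 (the paper quotes 0.231)
print(critical_rho(51, 0.99))  # ~0.358 (the paper quotes 0.354)
```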

45 ‘Population Correlation’ is the Pearson ρ-value found when correlating the ranking of the states by US franchise outlets in that industry with the ranking of the states by their population. Population figures were obtained from the US Census Bureau.

46 Pearson’s ρ here is the correlation obtained by comparing the state ranking using external data to the CDB ranking of states by search term frequency (search term hits per 1,000 words).

47 Population Correlation is the Pearson ρ-value found when correlating the ranking of the states from the external data with the ranking of the states by their population. Population figures were obtained from the US Census Bureau. Note that the eco-tourism and gaming industries display some population sensitivity, though much milder than the population-sensitive industries studied in Table 4.
