
4.5 Integrating the aggregate results per category with external data for the categories: A new kind of ‘mash-up’


‘Mash-ups’ are web-based services that weave together data from different sources, creating an interesting and useful report from the synthesized data [14]. For example, data on coal reserves by state from a public data source, such as the National Mining Association, can be integrated with geographic data (e.g. from a mapping service like Google Maps) to create a visual map showing which states have the highest coal reserves. Similarly, oil output for each state, obtained for example from the US Energy Information Administration, could be overlaid onto the map to illustrate visually which states supply the most energy from fossil fuels.

Categorized Document Bases represent a new source of data on categories, and therefore an additional source of data for mash-ups. A simple example of a mash-up built from a CDB would be to take the hits for ‘coal’ and for ‘oil’ in the document sets stored for multiple states and integrate these counts with geographic data, producing a visual map of how frequently each term is mentioned in the documents stored for each state.

Let us consider a more sophisticated example, which illustrates the mashing of data from both an unstructured source (the documents in the CDB) and a structured source (a spreadsheet). Assume we have a CDB populated with document sets for multiple industries. For market research purposes, we may want to glean information on those industries from the CDB and show it alongside information on those industries from other sources. Figure 15 gives a simple example of such a ‘mash-up’: here, a molecular engineer has constructed a bubble chart, using our Excel-based prototype implementation, to explore possible applications of a new biodegradable compound her company has developed. In Figure 15, the Y axis is the relative prevalence (lift) of the search term “biodegradable” in each of the two industry categories “Oilseed Processing” and “Plastics Packaging”; this data was obtained from the CDB. As is evident, Oilseed Processing has far greater relative prevalence for the search term. The asset turnover and revenues for each industry were obtained from an external structured data source, specifically a spreadsheet obtained from the Internal Revenue Service (IRS), and plotted as the X axis and the bubble size respectively. From Figure 15 it appears that Oilseed Processing is a relatively small industry, by revenue, compared to Plastics Packaging. Thus, while Oilseed Processors are apparently very interested in biodegradable molecules (as shown by the high relative prevalence of the term “biodegradable” in the document set for that category), sales into that industry may not be lucrative, given its relatively small revenues. The mash-up of information from the CDB with information from the IRS has yielded thought-provoking insights into the industries shown.

Note that the CDB serves only as a heuristic for finding candidate applications more quickly; further manual study is typically required to validate or eliminate each suggestion. In one of our commercial trials, molecular engineers and business development managers at a chemical company examined the 30 industries the CDB identified as most promising: the company was already operating in 7 of them; 12 were previously known, but neither the company nor its competitors operated in them because they had already been found to be unviable; 3 were previously unknown, but further investigation showed them to be infeasible; 3 were previously unknown and feasible, but judged not to be promising; and 5 were previously unknown and judged highly promising.
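To make the composition of Figure 15 concrete, below is a minimal Python sketch of how such a bubble chart can be assembled. The numeric values are placeholders for illustration only: in the real workflow the lift values come from CDB queries and the financial figures from the IRS spreadsheet.

```python
# Sketch of the Figure 15 mash-up: lift from the CDB (unstructured source)
# on the Y axis, IRS financials (structured source) on the X axis and as
# bubble sizes. All numbers below are illustrative placeholders.
import matplotlib.pyplot as plt

industries = ["Oilseed Processing", "Plastics Packaging"]

lift = {"Oilseed Processing": 9.0, "Plastics Packaging": 1.5}            # from CDB queries
asset_turnover = {"Oilseed Processing": 2.1, "Plastics Packaging": 1.2}  # from IRS spreadsheet
revenue = {"Oilseed Processing": 4e9, "Plastics Packaging": 3e10}        # from IRS spreadsheet

for name in industries:
    plt.scatter(asset_turnover[name], lift[name],
                s=revenue[name] / 1e8,   # bubble area scales with revenue
                alpha=0.5)
    plt.annotate(name, (asset_turnover[name], lift[name]))

plt.xlabel("Asset turnover (IRS)")
plt.ylabel("Lift of 'biodegradable' (CDB)")
plt.show()
```

The essential point, independent of the charting library used, is that the Y axis is fed by the unstructured source (the CDB) while the X axis and bubble sizes are fed by the structured source.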

The examples above illustrate that CDBs can be used to compose interesting mash-ups: potentially profound insights into the relationships between categories can be revealed by showing aggregate hit results by category (from the unstructured text documents in the CDB) alongside structured data (e.g. from databases or tabular text files) organized according to the same coding scheme. Table 3 lists some examples of structured data, from both private and public sources, that have been coded according to the standard taxonomies mentioned earlier in Table 1, and that can therefore be integrated with the results of queries on the CDB. Because these coding schemes allow us to cross-reference CDB results for multiple categories with existing structured data for those categories, abundant opportunities exist to create new mash-ups.
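In implementation terms, each such mash-up reduces to a join on the shared coding scheme. A minimal sketch follows, in which the ‘naics_code’ key column and the file and column names are illustrative assumptions:

```python
# Join CDB aggregate results to an external structured dataset through a
# shared coding scheme (here an illustrative NAICS-style code column).
import pandas as pd

cdb_results = pd.DataFrame({          # output of a CDB query (illustrative)
    "naics_code": ["311224", "326160"],
    "word_hits": [412, 37],
    "lift": [9.0, 1.5],
})
external = pd.read_csv("structured_source.csv")   # keyed by the same codes

mashup = cdb_results.merge(external, on="naics_code", how="inner")
print(mashup.head())
```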


4.6 Collaborative annotation of CDB results


In our trials with commercial organizations, our clients requested collaborative annotation facilities to allow business development managers and chemical engineers to share their observations on categories of interest. We therefore rebuilt our visualization tools, this time using a web-based interface, and added collaborative annotation, allowing the technology commercialization team to share comments on interesting categories surfaced by the CDB. Figure 11 illustrates the collaborative annotation interface implemented for the CDB, and shows users sharing comments on possible applications of a biodegradable molecule with foam reduction properties. The CDB exploration and annotation software and interface were code-named Sizatola, meaning “help find” in Zulu.

5. TECHNICAL IMPLEMENTATION


In this section we describe the system architecture and the data structure for our Categorized Document Base.

5.1 CDB System Architecture


Because the data volume of the Categorized Document Base exceeds the capacity of a single machine, we implemented a parallel processing architecture for the CDB, allowing it to be distributed across a number of machines. Partitioning the CDB across machines greatly increases the rate of both document gathering and aggregate statistics compilation. A controller-servant architecture was employed. The controller machine maintained a list of categories, recording for each category the timestamp at which population last started (if it had begun) and ended (if it had completed). To maintain data freshness, the documents in a category could be refreshed periodically, for example every week. Individual servant machines requested categories from the controller, and the controller assigned each servant a list of categories to populate. If any of these categories had not populated within 24 hours, the controller allowed them to be reassigned to another servant. If a category had been reassigned to 3 servants and still had not populated, it was flagged as problematic, so that a programmer could investigate why it was not populating.
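The per-category bookkeeping this protocol implies can be sketched as follows (a minimal sketch in Python, the language of our implementation; class and method names are illustrative assumptions rather than our actual interfaces):

```python
# Controller-side bookkeeping for one category, encoding the weekly
# refresh, 24-hour stall timeout, and 3-reassignment limit described above.
import time

REFRESH = 7 * 24 * 3600   # documents refreshed weekly for freshness
TIMEOUT = 24 * 3600       # reassign if population stalls for 24 hours
MAX_ASSIGNMENTS = 3       # flag as problematic after 3 failed attempts

class CategoryRecord:
    def __init__(self, name):
        self.name = name
        self.started = None       # timestamp population last started
        self.finished = None      # timestamp population last ended
        self.assignments = 0
        self.problematic = False

    def assignable(self, now=None):
        """May this category be (re)assigned to a servant right now?"""
        now = time.time() if now is None else now
        if self.problematic:
            return False
        if self.finished is not None:
            return now - self.finished > REFRESH    # due for a refresh
        if self.started is None:
            return True                             # never assigned yet
        if now - self.started > TIMEOUT:            # servant has stalled
            if self.assignments >= MAX_ASSIGNMENTS:
                self.problematic = True             # manual investigation
                return False
            return True
        return False                                # still being populated

    def assign(self, now=None):
        self.started = time.time() if now is None else now
        self.assignments += 1
```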

Two queues of categories were maintained on the controller: a high-priority queue and a low-priority queue. If there were any categories in the high-priority queue, these were processed first; otherwise the low-priority queue was processed. A category was added to the high-priority queue if a user had attempted a search on that category; the high-priority queue was intended to ensure that documents for categories that users were specifically interested in were imported into the CDB as soon as possible. The low-priority queue contained categories that no user had yet requested, but that could conceivably be requested in the future, giving us a forward cache that could rapidly satisfy new requests.
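A minimal sketch of this two-queue assignment policy (the deque is an illustrative stand-in for whatever structure the controller actually used):

```python
# Drain the high-priority queue (categories a user has actually searched)
# before the low-priority forward cache of speculative categories.
from collections import deque

high_priority = deque()   # categories a user has requested
low_priority = deque()    # speculative forward cache

def next_categories(batch_size):
    """Pop up to batch_size categories, high-priority first."""
    batch = []
    for queue in (high_priority, low_priority):
        while queue and len(batch) < batch_size:
            batch.append(queue.popleft())
    return batch

def on_user_search(category):
    """Promote a category the moment a user searches on it."""
    if category in low_priority:
        low_priority.remove(category)
    high_priority.append(category)
```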

The controller machine was always started first, so that it could listen for category-list requests from the servants. As each servant started, it requested a list of categories from the controller, populated documents into each of those categories, and notified the controller as each category was completed. The servant requested new categories once it had completed all of the categories in its current list. To populate an individual category, the servant requested the top matches for that category from a search engine (e.g. the top 10 documents in the category, by searching Yahoo), then retrieved each of these high-ranking documents and stored them in an indexed database (the database structure is shown in §5.2). The experiments reported in §6 made use of this indexed database structure. During our experiments, we noticed that the index structure imposed significant performance penalties both when populating and when querying the CDB. Population using this index structure proceeded at approximately 40,000 categories (400,000 documents) per month, and a single phrase query to gather the aggregate statistics for 40,000 categories took 2 to 3 days. We have therefore begun experiments with storing documents in simple directory folders, one folder per category, to determine whether population and query performance can be improved in this manner.
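The servant side of this protocol can be sketched as follows; the controller and store objects and the fetch_top_urls() helper are hypothetical stand-ins for the RPC, database, and search engine layers, which we do not specify here:

```python
# A servant's population loop: request a batch of categories, fetch the
# top-ranked documents for each from a search engine, store them, and
# notify the controller as each category completes.
import urllib.request

TOP_N = 10   # top-ranked documents fetched per category

def fetch_top_urls(category, n=TOP_N):
    """Hypothetical wrapper around a search engine API (Yahoo originally)."""
    raise NotImplementedError("stand-in for the search engine API")

def servant_loop(controller, store):
    while True:
        categories = controller.request_categories()    # ask for a batch
        if not categories:
            break
        for category in categories:
            for url in fetch_top_urls(category):
                try:
                    html = urllib.request.urlopen(url, timeout=30).read()
                except OSError:
                    continue        # skip documents that fail to download
                store.save(category, url, html)   # indexed database insert
            controller.notify_completed(category)
        # loop continues: request a fresh batch once this one is finished
```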

Returning now to the controller-servant architecture, the optimal number of categories for a servant to request depends on the length of the controller’s queue and on the number and speed of the servants. If the controller’s queue is long and there are a small number of fast servants, each servant should request many categories, to reduce the number of individual requests to the controller for new categories to populate. If the controller’s queue is short, each servant should request only one category at a time, so as to maximize parallel processing of the short queue amongst the servants: if a single servant requested too many categories, leaving none in the controller’s queue, the other servants would remain idle while the overloaded servant churned slowly through its list. While we implemented only a simple load-balancing scheme, it is clear that CDB population routines would benefit from more sophisticated load-balancing arrangements; the literature is replete with alternatives [87, 119].
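One possible batch-size heuristic capturing this reasoning (the cap of 50 is an illustrative assumption):

```python
# Request many categories when the queue is long relative to the number of
# servants; fall back to one at a time when the queue is short, so the tail
# of the queue is processed in parallel rather than hoarded by one servant.
def batch_size(queue_length, num_servants, max_batch=50):
    if queue_length <= num_servants:
        return 1    # short queue: maximize parallelism
    # long queue: amortize request overhead without starving other servants
    return min(max_batch, queue_length // num_servants)
```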

For robustness, the data on each servant could be replicated on other servants, ensuring that there is no single point of failure: if a servant goes down, another servant either already holds the data, or a team of servants re-gathers it. We did not implement a replication scheme, though one would be advisable for a production-quality system.

When a user entered a search term, wishing to explore document categories and compute aggregate statistics for each category, the search term was received at the controller. The controller determined which servants held the documents for each document category that the user requested. The controller then contacted each relevant servant and requested the aggregate statistics for that search term, for those of the requested categories held on that servant. This was an asynchronous process: the controller did not block while awaiting results, and instead continued with other tasks. When a servant had completed its calculations, it contacted the controller with its results. If a servant went down and later recovered, it completed the requests in its queue only if they were still current (i.e. if the controller had not already reassigned them). Again, a replication scheme would be advisable here, but was not implemented: if a servant did not complete its calculations within a specified time, the controller should contact an alternative servant possessing the data, ensuring that no bottleneck arises when a servant goes down.
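A minimal sketch of this asynchronous scatter-gather, using Python’s concurrent.futures as a stand-in for the messaging layer; locate_servant() and aggregate_stats() are assumed interfaces rather than our actual ones:

```python
# Fan a query out to the servants holding the requested categories and
# gather their aggregate statistics without blocking on any single servant.
from concurrent.futures import ThreadPoolExecutor, as_completed

def query_cdb(term, categories, locate_servant, timeout=300):
    # group the requested categories by the servant that stores them
    plan = {}
    for category in categories:
        plan.setdefault(locate_servant(category), []).append(category)

    results = {}
    with ThreadPoolExecutor(max_workers=max(1, len(plan))) as pool:
        futures = [pool.submit(servant.aggregate_stats, term, cats)
                   for servant, cats in plan.items()]
        for future in as_completed(futures, timeout=timeout):
            results.update(future.result())   # {category: statistics}
    return results
```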

A cache of query results (aggregate statistics) was maintained to speed up repeat queries. For example, if the number of hits on “warm” for the “United States Cities with State Name” hierarchy had been computed recently, we only needed to re-compute the statistic for sub-categories whose document collections had changed since the last query. For the US location hierarchy (approximately 40,000 categories), a complete cache of the number of word hits and document hits in all categories for a single search term, stored in a Comma-Separated Values (CSV) file or Excel spreadsheet, consumes approximately 2 MB of storage space. The CSV file provides near-instantaneous response times to repeat queries involving the same search term in cases where the document base is unchanged since the previous query. The response time for repeat queries grows roughly in proportion to the number of documents that have changed since the last query using that same search term.
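A minimal sketch of this per-term cache with a staleness check (the CSV layout and helper names are assumptions):

```python
# Reuse a cached statistic for a category unless its document collection
# has changed since the cache entry was written.
import csv

def load_cache(path):
    """Read {category: (hits, cached_at)} from a per-term CSV cache file."""
    cache = {}
    with open(path, newline="") as f:
        for category, hits, cached_at in csv.reader(f):
            cache[category] = (int(hits), float(cached_at))
    return cache

def cached_or_recompute(term, categories, cache, last_modified, recompute):
    """Use cache entries newer than the category's last document change."""
    results = {}
    for category in categories:
        entry = cache.get(category)
        if entry and entry[1] >= last_modified(category):
            results[category] = entry[0]                   # cache hit
        else:
            results[category] = recompute(term, category)  # stale or missing
    return results
```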

In our experiments (see §6), we made use of ten Pentium machines, each with a 2.6 GHz or faster processor, 1 GB of RAM, and 500 GB of hard drive space, totaling approximately 2 terabytes of storage. The population and exploration routines were written in Python, and all data was stored in MySQL 5.0 databases. We imported the classification schemes shown in Table 1, and then populated the Categorized Document Base by obtaining the top 10 documents for each category from Yahoo.com. The taxonomy importation process alone took many weeks, as each classification scheme was in its own format and needed to be converted into a standard tabular format. The full text of all HTML documents was imported; other document formats, such as occasional PDF and Word documents, were ignored. After taxonomy importation was completed, we began to populate the various categories with documents. A total of approximately 240,000 categories (2.4 million documents) were populated over a period of 6 months in the latter half of 2007. For the experiments reported in §6 we used only a subset of these categories: approximately 40,000 categories in the taxonomy “United States Cities with State Name”, gathered using 6 machines over a few weeks in summer 2008. The remaining taxonomies and categories were used in industrial trials with client organizations (see §7), or for initial experimentation that was later abandoned.



