As is evident from our earlier experimental evaluation (Section 6), the aggregate statistics obtained from categorized text vary in their usefulness. This can be due to a number of factors, including ambiguous terms, the presence of negation and antonyms, alternative word forms, the tone of the text, intermingling of information from unrelated themes or from multiple taxa in single documents, document duplication, reporting biases, time and location specificity, relativity of the reference points used for comparison (e.g. 40°F may be ‘warm’ for an Alaskan but not for a Floridian), human population biases, precision and recall of the underlying search engine and the relevance of its results, the number of documents per taxon, asymmetry in the number of child categories per parent taxon, disjoint phrases, the difficulty of distinguishing expressions of consumer demand from expressions of market supply, and other issues. Further experiments are required to determine the influence of a number of possible alterations to our technique: for example, using short snippets of text instead of full documents, storing only certain document types for each category (e.g. only news articles or only encyclopedia articles), or using more or fewer documents per category. For reasons of space, we leave to future work a more detailed discussion of the CDB limitations described above, the mechanisms that can be used to mitigate their influence, and the observed effects of alterations to our core technique.
9. SUMMARY
In this paper, we have illustrated an approach for populating and exploring Categorized Document Bases (CDBs). CDBs represent a helpful middle ground between unstructured and structured data, since the documents are well-organized (categorized), though not structured. The CDB is a rough tool, capable of producing plausible comparisons of categories against each other only when the categories are starkly different (as in the case of many non-population-sensitive industries).
When setting up the CDB, it is important that category descriptors are unambiguous and that sufficient highly relevant documents exist for even obscure categories in the classification scheme. The aggregate statistics are independently useful, but can also be integrated with structured data for the categories: for example, by using bubble charts or tables, with category identifiers used to cross-reference from the aggregate statistics for each document category to the traditional structured numeric data.
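The cross-referencing by category identifier described above can be sketched as a simple keyed join. This is only an illustrative sketch: the function name `cross_reference`, the `CategoryRow` record, the category identifiers, and all numeric values are hypothetical and are not drawn from our experiments.

```python
from typing import Dict, List, NamedTuple, Optional


class CategoryRow(NamedTuple):
    """One row of a combined table (hypothetical layout)."""
    category_id: str
    text_aggregate: float              # aggregate derived from the category's documents
    structured_value: Optional[float]  # numeric value from a structured source, if any


def cross_reference(
    text_aggregates: Dict[str, float],
    structured_data: Dict[str, float],
) -> List[CategoryRow]:
    """Join text-derived aggregates to structured data on the category identifier.

    Categories with no structured counterpart are kept, with None in place
    of the missing numeric value.
    """
    return [
        CategoryRow(cid, text_aggregates[cid], structured_data.get(cid))
        for cid in sorted(text_aggregates)
    ]


# Hypothetical inputs, keyed by a made-up category identifier.
text_aggregates = {"AK": 0.12, "FL": 0.81, "NY": 0.40}
structured_data = {"AK": 26.6, "FL": 70.7}

rows = cross_reference(text_aggregates, structured_data)
for row in rows:
    print(row.category_id, row.text_aggregate, row.structured_value)
```

Rows produced this way could feed a table directly, or supply the axes of a bubble chart; categories lacking structured data remain visible rather than being silently dropped.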
We assessed the reasonableness of our CDB approach through a number of experiments that compared our aggregate results for each category to closely related numeric data, to determine whether our proxy measures (aggregates derived from textual data) correlate at all with their quantitative counterparts obtained from well-accepted structured sources. Our experiments indicate that, for a taxonomy such as the US GSA GLC list of US locations, where the CDB can be mechanically populated with relevant content for each category, the CDB appears to produce a plausible reflection of both natural and market phenomena in multiple industries, but only in industries where the locations under comparison diverge substantially. The results, therefore, appear to partially support the hypothesis stated at the start of this paper that our CDBs can allow mountains of text information on locations to be distilled into sensible comparisons of those locations.
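One way to check whether a text-derived proxy ranks categories similarly to a structured numeric source is a rank correlation coefficient. The following is a minimal sketch of Kendall's tau-a (the variant without tie correction), assuming one score per category in each list; it is an illustration only, not the evaluation code used in our experiments, and the function name and all sample values are hypothetical.

```python
from itertools import combinations
from typing import Sequence


def kendall_tau_a(x: Sequence[float], y: Sequence[float]) -> float:
    """Kendall's tau-a: (concordant - discordant) / total number of pairs.

    Tied pairs count as neither concordant nor discordant, so heavy ties
    pull the value toward zero (no tie correction is applied here).
    """
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    n = len(x)
    if n < 2:
        raise ValueError("at least two observations are required")

    concordant = 0
    discordant = 0
    for i, j in combinations(range(n), 2):
        # The sign of the product tells us whether the pair is ordered
        # the same way in both sequences.
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1

    total_pairs = n * (n - 1) / 2
    return (concordant - discordant) / total_pairs


# Hypothetical per-category values: text-derived aggregate vs. reference data.
cdb_scores = [0.9, 0.7, 0.4, 0.1]
reference = [120.0, 95.0, 97.0, 10.0]

tau = kendall_tau_a(cdb_scores, reference)
print(round(tau, 3))
```

In this made-up example, five of the six category pairs are ordered consistently and one is not, giving a tau of about 0.667. A value near 1 would suggest the proxy ranks categories much as the structured source does; a value near 0 would suggest little agreement.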
We described some applications of our research, including market research, sales lead prospecting, and rapidly obtaining insights into new collections of topics or items. We also briefly documented a number of limitations we have found in the CDB population and exploration process. Helpful repairs and alterations that improve the quality of the results are outside the scope of this paper and will be discussed in detail in future work.
In summary, we have described and evaluated a method for the creation and exploration of Categorized Document Bases, and shown, through varied experiments, that our method can be useful. Our experiments indicate that the CDB method should be generally useful when one wants a ranking of categories in non-population-sensitive industries, can tolerate some error, and has no independent data of good quality. The CDB approach we have proposed appears to be a promising way of extracting additional value from textual documents, but much further work is needed to refine the CDB construction and usage method presented.
10. ACKNOWLEDGMENTS
Our thanks to the following research assistants, who assisted with the implementation and evaluation of the features and algorithms described in the main text:
- taxonomy importation scripts: Jason Gurwin, Debbie Chiou.
- population routines: Joseph Leary, Ryan Mark Fleming, Shawn Zhou.
- results visualization features: David Gorski, Ryan Namdar, Adam Altman, Ankit Choudari, Michael Pan, Myron Robinson, Mark Weinberger.
- experimental evaluation: Shawn Zhou, Ava Zhiyang Yang, Aditya Mehrotra, Peng Chen, Anjay Kumar, Anjay Aushij, Erik Malmgren-Samuel.
Thanks are also due to:
- John Ranieri and Ray Miller, of Du Pont Corporation (http://www.dupont.com/), for championing CDB experiments within Du Pont, and providing feedback on the results.
- Chris and Natasha Ashton, of PetPlan USA pet insurance (http://www.gopetplan.com/), for the provision of PetPlan’s dog health insurance sales data.
- The reviewers, whose suggestions were very valuable in improving the content of this paper.