Automated Acquisition of Domain Knowledge and Language Patterns, with Application to Language Metadata
Lois Boggess and Julia Hodges, Principal Investigators
Department of Computer Science
Mississippi State University
Contact Information
Lois Boggess or Julia Hodges
Department of Computer Science
Drawer 9636
Mississippi State, MS 39762
Phone: (601) 325-2756
Fax: (601) 325-8997
Email: lboggess@cs.msstate.edu, hodges@cs.msstate.edu
KUDZU/WebDoc home page
List of Supported Students and Staff
Bo Tang is a Computer Science doctoral student funded by this grant.
Lynellen Perry is a Computer Science doctoral student whose work for the project has been funded in part by this grant and in part by a grant from the Mississippi State University Office of Research.
Janna Shaffer has just joined the Computer Science doctoral program and will be funded by this project beginning this January.
Project Award Information
Award Number: IIS-9734807
Duration: 9/01/1998 - 8/30/2001
Title: Automated Acquisition of Domain Knowledge and Language Patterns, with Application to Language Metadata
Keywords
Content-based document indexing, natural language processing, information retrieval, text categorization.
Project Summary
This system maps World Wide Web documents to Library of Congress subject headings. Specifically, it is designed to automatically classify Web documents into subject headings of the Science (Q) and Technology (T) portions of the Library of Congress classification system, after being trained and tested on thousands of documents previously classified by library scientists and expert document analysts. The project represents a scaling up of knowledge-based methods and natural language processing techniques which have been successful in providing more than 80% of the descriptive phrases produced by expert document analysts for journal articles in a specific scientific domain.
Goals, Objectives, and Targeted Activities
This research furthers concept-based (in contrast to word-based) indexing of documents. In the months since the project started, we have concentrated on subjects related to the Science (Q) sections of the Library of Congress classification system, specifically Astronomy, Physics, Geology, and Mathematics. We have downloaded and stored the contents of several hundred web sites per major subject area, including ftp sites containing large collections of documents. These web sites have been classified into the Library of Congress subject headings by human specialists. The collection of data for training and testing purposes continues: in the coming year we will download and store sites in Biology, Computer Science, and most of the Engineering fields.
Our initial efforts in index generation have focused on building a knowledge base that includes a structured representation of the subject headings used by the Library of Congress and synonyms for those headings. In our earlier work with scientific journal articles, we found that some correct indexes were not recognized if the phrases that would normally serve as indicators for those indexes were buried in compound noun phrases. We are building a more structured representation of the phrases and synonym information, to make use of important sub-phrases within phrases. We have almost completed the development of the initial knowledge base. Once this is complete, we will test the effectiveness of the additional structure for the phrases and synonyms by measuring improvement in the recall and precision rates. We will also modify our statistical program to allow for dynamic updating of the knowledge base so that the index generation component will become more precise in its generation of the Library of Congress subject headings as it processes more and more documents.
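As an illustrative sketch of the sub-phrase idea described above (not the project's actual implementation), the following Python enumerates every contiguous sub-phrase of a compound noun phrase and looks each one up in a small synonym table. The table contents, the function names, and the example phrases are all hypothetical.

```python
def sub_phrases(phrase):
    """Return all contiguous word sequences in a phrase, longest first."""
    words = phrase.lower().split()
    n = len(words)
    spans = []
    for length in range(n, 0, -1):
        for start in range(0, n - length + 1):
            spans.append(" ".join(words[start:start + length]))
    return spans


# Hypothetical synonym table: indicator phrase -> subject heading.
SYNONYMS = {
    "infrared astronomy": "Infrared astronomy",
    "ir astronomy": "Infrared astronomy",
    "volcanoes": "Volcanoes",
    "volcano": "Volcanoes",
}


def headings_for(phrase, synonyms=SYNONYMS):
    """Collect headings whose indicator phrases occur inside the phrase."""
    found = []
    for candidate in sub_phrases(phrase):
        heading = synonyms.get(candidate)
        if heading is not None and heading not in found:
            found.append(heading)
    return found
```

With this table, a lookup on the whole phrase "airborne infrared astronomy observatory" would fail, while headings_for recovers the heading from the embedded sub-phrase "infrared astronomy".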
We also target natural language processing errors which have serious impact on knowledge extraction, especially errors in determining the parts of speech of unrestricted text. We believe that reducing these errors will require expanding the context examined by the classifiers and that hybrid classifiers combining more than one kind of classification technique are the most likely sources of breakthrough in reducing remaining errors. To deal with context, we are training very large neural networks, with emphasis on distributing attention across the networks. In so doing, we have opened the context window from the three or four words typical for current part of speech taggers to as much as twelve words. Our best result so far reduces our most frequent serious error by 13%.
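To make the notion of a context window concrete, here is a minimal sketch of how a fixed-width window of words around a target word might be extracted as classifier input. The padding token, the function name, and the default split of six words to the left and five to the right (twelve words including the target) are assumptions for illustration; the text above does not say how the twelve words are distributed.

```python
PAD = "<pad>"  # hypothetical placeholder for positions outside the sentence


def context_window(tokens, index, left=6, right=5):
    """Return the words around tokens[index], padded at sentence edges.

    The result always has left + 1 + right entries, so it can be fed to a
    classifier that expects a fixed number of inputs.
    """
    window = []
    for pos in range(index - left, index + right + 1):
        if 0 <= pos < len(tokens):
            window.append(tokens[pos])
        else:
            window.append(PAD)
    return window
```

A narrow tagger context corresponds to a call such as context_window(tokens, i, left=1, right=1); widening left and right is what enlarges the input the network must learn to attend to.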
Our formal goals for the first year include the following:
- Develop a structured representation for a synonyms file for the Library of Congress headings, to allow recognition of at least 50% of important phrases within complex phrases.
- Demonstrate improvement of both precision and recall of the system using the structured synonyms file.
- Develop a statistical program to provide information needed for dynamic updating of the knowledge base and demonstrate improvement in precision without a drop in recall.
- Incorporate neural networks with extended contexts into word/tag estimators for unknown words for stochastic (n-gram) part-of-speech labeling.
- Train neural networks to improve on serious tagging errors related to unknown words, with a target of 25% reduction in serious errors.
- Prepare doctoral students for dissertation-level research, with a goal of two accepted dissertation proposals.
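Several of these goals are stated in terms of precision and recall. As a reminder of how the two measures are computed for a single document, the sketch below compares a set of generated headings against the headings assigned by a human classifier. The function name and the sample headings are hypothetical.

```python
def precision_recall(generated, reference):
    """Compute precision and recall of generated headings for one document.

    precision = correct generated headings / all generated headings
    recall    = correct generated headings / all reference headings
    """
    generated = set(generated)
    reference = set(reference)
    correct = len(generated & reference)
    precision = correct / len(generated) if generated else 0.0
    recall = correct / len(reference) if reference else 0.0
    return precision, recall
```

For example, if the system proposes Astronomy, Physics, and Geology for a document whose reference headings are Astronomy, Physics, Mathematics, and Volcanoes, two proposals are correct, giving a precision of 2/3 and a recall of 1/2.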
Indication of Success
One unexpected early development is the possibility of developing a more reliable body of training and test data for machine learning projects based on classified web sites. OCLC (Online Computer Library Center, Inc., a service organization for libraries around the world, and the "parent" organization for the Dewey Decimal Classification System) made available to us thousands of records of web sites classified by Library of Congress subject headings. Web sites are notoriously volatile: Files move or disappear, and sites that remain in place undergo modification, so that classifications are no longer relevant. OCLC is looking into working independently and/or with us on establishing a corpus of accurately classified Web documents which are captured and will not undergo modification.
In the first months a prototype system has been constructed which incorporates subsets of the Library of Congress subject headings into a database that comprises a roughly hierarchical network. This database also contains Library of Congress information about synonymous relationships between phrases related to the subject headings. Study of large neural networks incorporating contexts of up to 12 words has resulted in comparison of approaches to evolving such networks in parallel. A paper detailing a promising parallel approach to evolving these large networks has been submitted (Boggess, 1999, in review). Recurrent neural networks are also being explored. A hybrid stochastic/neural network part of speech tagger is in progress.
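The subject heading database mentioned at the start of this section can be pictured with a small data structure. The sketch below is only an assumption about one possible shape: each heading stores links to broader headings (more than one is allowed, since the network is only roughly hierarchical) and a list of synonymous phrases. The class, the field names, and the sample entries are hypothetical and are not taken from the prototype.

```python
from dataclasses import dataclass, field


@dataclass
class Heading:
    name: str
    broader: list = field(default_factory=list)   # names of broader headings
    synonyms: list = field(default_factory=list)  # alternative phrases


# Hypothetical sample entries, keyed by heading name.
HEADINGS = {
    "Science": Heading("Science"),
    "Astronomy": Heading("Astronomy", broader=["Science"]),
    "Infrared astronomy": Heading(
        "Infrared astronomy",
        broader=["Astronomy"],
        synonyms=["IR astronomy"],
    ),
}


def ancestors(name, headings=HEADINGS):
    """Return every heading reachable by following 'broader' links."""
    seen = []
    stack = list(headings[name].broader)
    while stack:
        current = stack.pop()
        if current in seen or current not in headings:
            continue  # skip repeats and links to headings not in the table
        seen.append(current)
        stack.extend(headings[current].broader)
    return seen
```

Walking the broader links in this way lets a match on a specific heading also count as evidence for the more general headings above it.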
Project Impact
Two of the three doctoral students participating are women, and both have targeted international conferences to which they will submit first-author papers this semester. Issues related to the project are being incorporated into a graduate-level course on natural language processing this semester. The students of that course have the opportunity to implement experiments related to language processing, terminology, and/or classification of documents, using data collected by the project.
This project is possible thanks to donations of data and time from OCLC, a not-for-profit service and research organization affiliating more than 60,000 libraries worldwide. We have regular conference calls with two OCLC researchers. This exchange of ideas has led to plans to partition effort on joint goals. For example, it appears that OCLC will develop a repository of classified "web sites" (downloaded sites that will not change with time). They offer vast expertise in human document classification, and we have exchanged advice, information, and software tools for language processing and classification.
GPRA Outcome Goals
Connections between discoveries and their use in service to society.
A predecessor to this project was successful in providing 85% of the descriptive phrases indexing journal articles by content that highly trained document analysts provided, for the field of physical chemistry. This project seeks to perform the same task, for the fields represented by Library of Congress Q and T call letters (most of science and technology) for much more diverse documents (journal articles are highly constrained as to format, in comparison with web pages). Library of Congress subject headings range from broad (e.g., "Astronomy", "Tobacco", "Volcanoes") to specific (e.g., "Actinomycetales, Research", "Aeronautics in astronomy, United States [and] Infrared Astronomy, Research"). Success of this research will in effect allow web documents to be indexed by a long-established, widely-adopted controlled vocabulary.
Timely and relevant information on the national and international science and engineering enterprise.
If paired with a web crawler, the final product of this research will be a system with the capability to produce far fewer spurious returns from searches for information on the web, without requiring the user to know the exact words that indicate relevance in the target document, and with a high percentage of the relevant sites presented to the user. It should be noted, however, that while the system will have the described capability, it is not within the resources of the research team to collect and establish the training data required for a system to respond to all existing Library of Congress subject headings in the sciences and technology.
Project References
Boggess, Lois. 1999. Evolution of large feedforward networks. Submitted to IJCNN'99.
Boggess, Lois and Lynellen D. S. Perry. 1997. Real world auto-tagging of scientific text. In Proceedings of the Tenth International Florida Artificial Intelligence Research Symposium (FLAIRS-97), 253-257.
Hodges, Julia, Shiyun Yie, Sonal Kulkarni, and Ray Reighart. 1997. Generation and evaluation of indexes for chemistry articles. Journal of Intelligent Information Systems 8(1): 57-76.
Hodges, Julia, Shiyun Yie, Ray Reighart, and Lois Boggess. 1996. An automated system that assists in the generation of document indexes. Natural Language Engineering 2(2): 137-160.
Perry, Lynellen D. S. 1998. Identifying part-of-speech patterns for automatic tagging. In Proceedings of the International Joint Conference on Neural Networks (IJCNN'98), at the IEEE World Congress on Computational Intelligence, 1873-1876.
KUDZU home page, with links to previous work
Area Background
Our work is a confluence of two areas of expertise: corpus-based natural language processing and knowledge bases. It can be seen from at least three perspectives: information retrieval, broad-based study of the characteristics of language, and knowledge bases themselves. Our focus has been on extraction of information from a body of text from some particular domain. Sometimes the purpose of the extraction has been to add to or update existing knowledge already embedded in a knowledge base. Sometimes, as in the present classification task, the purpose has been to identify the most relevant concepts associated with that text. In either case, there are many ways that the same concepts can be expressed by writers who have the full freedom of a natural language such as English in which to express their ideas. Consequently, the system must handle vocabulary and grammar that has not been anticipated, and relate previously unseen entities in a text to known concepts. In doing so, we examine the properties of language in the aggregate. We use clustering algorithms to examine the contexts in which words are used, with the result that we group words and concepts by "the company they keep". An important aspect of the work is to weigh the evidence that favors one classification or interpretation over competing classifications and interpretations.
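The idea of grouping words by "the company they keep" can be illustrated with a small sketch: count the words that occur near each word, then compare the resulting count vectors. This is a generic co-occurrence and cosine-similarity example, not the clustering algorithm used in the project; the function names, the window size, and the sample sentences are hypothetical.

```python
import math
from collections import Counter


def context_vectors(sentences, window=2):
    """Map each word to counts of the words seen within `window` positions."""
    vectors = {}
    for sentence in sentences:
        words = sentence.lower().split()
        for i, word in enumerate(words):
            counts = vectors.setdefault(word, Counter())
            lo = max(0, i - window)
            hi = min(len(words), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[words[j]] += 1
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)
```

Given the sentences "the comet orbits the sun", "the planet orbits the sun", and "the proof uses a lemma", the vectors for "comet" and "planet" are identical, while "comet" and "proof" share only the word "the". A clustering algorithm applied to such similarities would therefore place "comet" and "planet" together.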
Area References
Jurafsky, Daniel, and James H. Martin. 1999. Speech and Language Processing. Prentice Hall. Target publication date of August 1999. It covers most recent developments in corpus-based speech and language processing, and provides mathematical and theoretical background for those who need it. Available on the web: contact Jurafsky or Martin at jurafsky@cs.colorado.edu or martin@cs.colorado.edu
Charniak, Eugene. 1993. Statistical Language Learning. MIT Press. Not as comprehensive or up-to-date, but covers statistical issues well, and has the advantage of brevity.
Allen, James. 1995. Natural Language Understanding, 2nd edition. Benjamin/Cummings. Chapter 7 has the math and worked examples of statistical methods for handling unrestricted text at the lexical and syntactic levels.
Weiss, Sholom, and Casimir A. Kulikowski. 1991. Computer Systems That Learn. Morgan Kaufmann. Very readable intro to machine learning systems as classifiers.