
3.5.2.2. Shallow Semantic Techniques Used in the MUC-Like Systems

In subsection 3.3 above, we mentioned the "semantic" rules used by LINK, an MUC-5 system, to tag strings of capitalized words. These rules are a good example of the special-purpose rules used by almost all the MUC systems to recognize quickly the semantic category of particular phrasal units (company names, places, people's names, dates, acronyms, currencies, and equipment names), relying not on any in-depth syntactic or semantic analysis but on techniques such as local pattern matching.
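The flavor of such special-purpose rules can be conveyed by a small sketch. The patterns and category names below are purely illustrative (they are not taken from LINK or any actual MUC-5 system): each rule is a local regular expression that assigns a semantic category to a phrasal unit without any syntactic analysis of the surrounding sentence.

```python
import re

# Illustrative local-pattern rules in the spirit of the MUC systems'
# special-purpose taggers; the patterns and categories are invented
# for this sketch, not taken from any actual MUC-5 implementation.
PATTERNS = [
    ("company",
     re.compile(r"\b[A-Z][A-Za-z]+(?: [A-Z][A-Za-z]+)* (?:Corp\.|Inc\.|Ltd\.|Co\.)")),
    ("currency",
     re.compile(r"(?:US)?\$\s?\d[\d,]*(?:\.\d+)?(?: (?:million|billion))?")),
    ("date",
     re.compile(r"\b(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)"
                r"[a-z]*\.? \d{1,2}, \d{4}")),
]

def tag_phrases(text):
    """Return (category, matched string) pairs via local pattern matching only."""
    tags = []
    for category, pattern in PATTERNS:
        for m in pattern.finditer(text):
            tags.append((category, m.group(0)))
    return tags

sample = "Nikon Corp. announced on Oct. 5, 1993 an investment of $40 million."
print(tag_phrases(sample))
```

Such rules are fast and easy to write, but, as the text notes, they carry out no real syntactic or semantic analysis and must be redone for each new domain.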

However, the use of "shallow" (and, consequently, at least partially ad hoc) semantic techniques is truly pervasive in the MUC systems. When dealing, as is normal in the knowledge extraction domain, with considerable amounts of linguistic data, an important problem is that of reducing as much as possible the "knowledge engineering" effort, i.e., the effort needed to build up the knowledge sources (mainly dictionaries and rules) required to process the texts correctly. To remain economically viable, many MUC systems have therefore chosen to rely largely on "shallow knowledge", i.e., knowledge that is very domain-specific and therefore difficult to generalize and to reuse in other applications (see also Cowie and Lehnert, 1996). The "shallow knowledge hypothesis" can be formulated as follows: (1) it should be possible to acquire this type of shallow, ad hoc knowledge automatically, at very low cost; (2) if this is correct (i.e., if the automatic acquisition of shallow knowledge is really cost-effective), it may be worthwhile to use this sort of knowledge even if it proves adequate only for the application at hand, in a single domain, and for a single system.

An example of a system that adheres completely to the shallow knowledge hypothesis is CIRCUS, a (relatively) successful MUC-5 system developed jointly by the Department of Computer Science of the University of Massachusetts (UMass) and Hughes Research Laboratories. CIRCUS's general approach to the knowledge extraction problem consists, in fact, of automating as much as possible the construction of domain-specific dictionaries and other language-related resources, so that information extraction can be customized for specific applications with a minimal amount of human assistance. CIRCUS is accordingly organized around seven trainable language components, which can be developed in a few hours by domain experts with no background in natural language processing or machine learning, and which handle (1) lexical recognition and part-of-speech tagging; (2) dictionary generation (using the AutoSlog system); (3) semantic feature tagging; (4) noun phrase analysis; (5) limited coreference resolution; (6) domain object recognition; and (7) relational link recognition.

Of all these components, the best known is AutoSlog, a tool for the automatic construction of dictionaries of "concept nodes" (CNs), developed at UMass and used, in CIRCUS, in conjunction with the sentence analyzer. AutoSlog applies predefined "key templates" to a training corpus and constructs the CN structures from the matches. To give only a very simplified example of the use of the AutoSlog results, let us consider the following CN, generated in a MUC-5 context after examination of a training corpus for the microelectronics (ME) domain:

(CN %ME-ENTITY-NAME-SUBJECT-VERB-AND-INFINITIVE-PLANS-TO-MARKET%)

When dealing with a fragment of "real" microelectronics text like "...Nikon Corp. plans to market the NSR-1755EX8A, a new stepper intended for use in the production of 64-Mbit DRAMs ...", already processed by earlier trainable modules that have tagged, e.g., the sequence "plans to market" as "plans (verb) to (infinitive) market (verb)", the previous CN is triggered by the presence of "plans to market" and validated by the agreement between the tags and the constraints inserted in the CN. "Nikon Corp." is then picked up by the CN and categorized as an "me-entity." This example is sufficient to show that, in this type of approach, there is little that is really general and transferable to other domains. The procedure is, however, very cost-effective: the CIRCUS dictionary used in MUC-5 was based exclusively on the CN definitions obtained from AutoSlog (3017 CN definitions for the "joint venture" dictionary, and 4220 for the "microelectronics" dictionary). No hand-coded or manually altered definitions were added to these results.
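The trigger-and-validate step just described can be sketched as follows. The dictionary format, the tag names, and the rule that the slot filler is simply everything preceding the trigger are all simplifying assumptions made for this illustration; they do not reproduce the actual CIRCUS data structures.

```python
# Toy CN dictionary: a trigger phrase, the part-of-speech constraints that
# validate it, and the semantic slot its subject fills. The format is an
# invented simplification of a CIRCUS-style concept node.
CN_DICTIONARY = {
    ("plans", "to", "market"): {
        "name": "%ME-ENTITY-NAME-SUBJECT-VERB-AND-INFINITIVE-PLANS-TO-MARKET%",
        "tag_constraints": ("verb", "infinitive", "verb"),
        "slot": "me-entity",
    },
}

def apply_concept_nodes(tagged_sentence):
    """Scan a [(word, tag), ...] sentence; fire any CN whose trigger words
    appear with the required tags, and fill its slot with the preceding words
    (a crude stand-in for real noun phrase analysis)."""
    words = [w for w, _ in tagged_sentence]
    tags = [t for _, t in tagged_sentence]
    fills = []
    for i in range(len(tagged_sentence) - 2):
        cn = CN_DICTIONARY.get(tuple(words[i:i + 3]))
        # the CN is validated only if the trigger words carry the right tags
        if cn and tuple(tags[i:i + 3]) == cn["tag_constraints"]:
            fills.append((cn["slot"], " ".join(words[:i])))
    return fills

tagged = [("Nikon", "proper-noun"), ("Corp.", "proper-noun"),
          ("plans", "verb"), ("to", "infinitive"), ("market", "verb"),
          ("the", "det"), ("NSR-1755EX8A", "proper-noun")]
print(apply_concept_nodes(tagged))  # [('me-entity', 'Nikon Corp.')]
```

The sketch makes the text's point concrete: everything of substance here (the trigger phrase, the tag constraints, the slot name) is tied to one domain and one corpus, which is precisely why such dictionaries are cheap to induce but hard to reuse.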

3.5.3. High-Level Languages for Meaning Representation

As already stated, the final aim of an in-depth semantic analysis is to produce an image of the "meaning" of a particular statement, making use of some advanced representation language. This image should consist of some combination of high-level world concepts belonging to a sort of "interlingua" that is independent of the particular type of application (NL interfaces to DBs, populating the knowledge base of an ES, case-based reasoning, intelligent information retrieval, etc.) and, to some extent, of the particular knowledge proper to a specific application domain. Until now, no agreement exists about the characteristics of this representation language, which should be particularly well adapted to the representation of linguistic knowledge. After first mentioning the predicate logic tradition, we will briefly describe two languages, Conceptual Graphs (CGs) and NKRL, that have often been used to represent this particular type of knowledge and that can be considered, at least in part, a distant progeny of the Schankian Conceptual Dependency approach (see subsection 2.3).
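To fix intuitions about what such a "deep" meaning image looks like, in contrast with the shallow slot fills seen above, here is a sketch of an interlingua-style structure for the Nikon sentence. The nesting of predicates, the role names, and the concept labels are invented for illustration; this is neither CG nor NKRL syntax, merely a generic predicate-with-roles encoding.

```python
# An invented interlingua-style structure for "Nikon Corp. plans to market
# the NSR-1755EX8A": high-level predicates (PLAN, MARKET) with semantic
# roles, independent of any one application or extraction template.
meaning = {
    "predicate": "PLAN",
    "agent": {"concept": "company_", "instance": "NIKON_CORP"},
    "theme": {
        "predicate": "MARKET",
        "agent": {"concept": "company_", "instance": "NIKON_CORP"},
        "object": {"concept": "stepper_", "instance": "NSR-1755EX8A"},
    },
}

def to_logic(form):
    """Render the nested structure as a flat predicate-logic-like string."""
    if "predicate" in form:
        args = ", ".join(f"{role}: {to_logic(value)}"
                         for role, value in form.items() if role != "predicate")
        return f"{form['predicate']}({args})"
    return form["instance"]

print(to_logic(meaning))
# PLAN(agent: NIKON_CORP, theme: MARKET(agent: NIKON_CORP, object: NSR-1755EX8A))
```

Unlike a concept node dictionary, a structure of this kind is meant to survive a change of application: the same PLAN/MARKET representation could feed a DB interface, an ES knowledge base, or a retrieval system.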

