The first problem -- which concerns, in short, finding a way of reducing the size of the lexicon -- is classically dealt with by having recourse to a branch of linguistics, i.e., "morphological analysis." This consists of examining how words are built up from more basic meaning units called "morphemes" ("root forms"), where a word may consist of a root form plus additional affixes. We have already mentioned phenomena like the formation of plural forms from singular ones. In more general terms, a word like, e.g., "familiar" can be considered a "root form": from this adjective, we can derive another adjective, "unfamiliar," by adding the prefix "un-"; by adding the suffixes "-ity" and "-ly" we can derive, respectively, the uncountable noun "familiarity" and the adverb "familiarly"; a more complex derivation rule could also allow us to establish a link with the verb "familiarize," etc. We can, therefore, store in the lexicon only the root forms (infinitive for verbs, singular for nouns, singular masculine for adjectives, etc.), and supply the analysis system with a morphological analyzer that makes use of a list of affixes (suffixes and prefixes) and a set of rules describing, for a given language, how these pieces are combined.

Please note that the algorithms implementing morphological analysis are far from trivial, given the presence, in all languages, of numerous exceptions to the standard formation rules: to give a very simple example, in English "go" plus the suffix "-ed" for the past tense becomes "went." The computational techniques used for implementing morphological analysis are usually based on some form of finite-state technology, as, e.g., in Koskenniemi's "two-level model" (1983), which makes use of a set of parallel finite-state transducers. Note also that the result of morphological analysis can consist of several grammatical hypotheses for a single word; see the earlier example of "rose" (singular noun, past-tense verb, adjective). These ambiguities are normally resolved during the syntactic analysis phase (e.g., the presence of an article at the beginning of a sentence normally requires setting up a noun group, and rejecting a verbal form as the immediately following item). Resolving them in the semantic phase is more unusual; it may also be the case that the sentence is inherently ambiguous.
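To make the mechanism concrete, the following Python fragment sketches a naive affix-stripping analyzer of the kind just described. The lexicon, affix rules, and exception table are toy data invented for this illustration; a real analyzer (e.g., one based on Koskenniemi's two-level transducers) is considerably more sophisticated.

    # A minimal sketch of affix-stripping morphological analysis.
    # LEXICON, EXCEPTIONS, and the rule tables are toy data invented
    # for this illustration, not part of any system described here.

    LEXICON = {"familiar": "adjective", "offer": "noun", "go": "verb"}

    # Irregular forms are looked up before any regular rule is tried.
    EXCEPTIONS = {"went": ("go", "past-tense verb")}

    # (suffix, category of the derived word)
    SUFFIX_RULES = [
        ("ity", "uncountable noun"),   # familiar + -ity -> familiarity
        ("ly",  "adverb"),             # familiar + -ly  -> familiarly
        ("ize", "verb"),               # familiar + -ize -> familiarize
        ("ed",  "past-tense verb"),    # offer + -ed     -> offered
    ]
    # (prefix, category of the derived word)
    PREFIX_RULES = [("un", "adjective")]  # un- + familiar -> unfamiliar

    def analyze(word):
        """Return (root form, category) hypotheses for a surface word."""
        if word in EXCEPTIONS:
            return [EXCEPTIONS[word]]
        hypotheses = []
        if word in LEXICON:                  # the word is itself a root form
            hypotheses.append((word, LEXICON[word]))
        for suffix, category in SUFFIX_RULES:
            root = word[: -len(suffix)]
            if word.endswith(suffix) and root in LEXICON:
                hypotheses.append((root, category))
        for prefix, category in PREFIX_RULES:
            root = word[len(prefix):]
            if word.startswith(prefix) and root in LEXICON:
                hypotheses.append((root, category))
        return hypotheses

    print(analyze("unfamiliar"))   # [('familiar', 'adjective')]
    print(analyze("familiarity"))  # [('familiar', 'uncountable noun')]
    print(analyze("went"))         # [('go', 'past-tense verb')]

Note how the exception table is consulted before the regular rules apply, mirroring the special treatment that irregular forms like "went" require.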


TABLE 1
Tagging Rules for Strings of Capitalized Words

1. If a word is a known acronym (e.g., DRAM) or an abbreviation that is normally capitalized (e.g., "Mbit"), then just pass the word as a regular lexeme.
2. If the string ends in a word like "Corp" or "Co," tag the string as a company name.
3. If a string is followed by a word like "President" or "Spokesman," and then another string, make the first part a company name and the rest a person name.
4. If a string is followed by a comma and then a state name, tag it as a city/state pair.

The second problem can be reduced to what is now called "fully automated tagging," i.e., to the problem of finding a set of procedures that, given an input sequence (normally, a sentence), are able to determine automatically the correct part(s) of speech for each word of the sequence, even in the presence of "unknown words." Historically, two main approaches have been used, a "rule-based" approach and a "probabilistic" approach; more recently, some work has been proposed making use of connectionist models.

The rule-based approach covers, in reality, a broad spectrum of techniques. These range from very simple, empirical, and ad hoc sets of rules, often suitable only for particular categories of texts, to complex systems that try to induce not only the morphological characteristics of an unknown word, but also its syntactic and semantic properties. As an example of the first category, we have collected in Table 1 some rules used for MUC-5 (see subsection 2.4) in the LINK system, built at the Artificial Intelligence Laboratory of the University of Michigan (Ann Arbor, MI) by Steven Lytinen and colleagues -- in MUC-5 (1993), LINK was used to analyze articles concerning the microelectronics domain (more exactly, four types of chip fabrication technologies). The rules of Table 1 concern the tagging of strings of capitalized words that have been pre-isolated, within the input text, by a preprocessor called the "Tokenizer."
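The Python sketch below shows how heuristics of this kind might be coded. The word lists, tag names, and function interface are illustrative guesses, not the actual LINK rules or the output format of its Tokenizer.

    # A sketch of the Table 1 heuristics.  All word lists are small
    # illustrative samples invented for this example.

    KNOWN_ACRONYMS = {"DRAM", "Mbit"}
    COMPANY_SUFFIXES = {"Corp", "Co"}
    TITLE_WORDS = {"President", "Spokesman"}
    STATE_NAMES = {"Michigan", "New York", "California"}

    def tag_capitalized_string(words, next_word=None, second_string=None):
        """Tag a string of capitalized words pre-isolated by a tokenizer.

        words:         the capitalized words, e.g. ["Intel", "Corp"]
        next_word:     the token immediately following the string
        second_string: a second capitalized string after next_word, if any
        """
        # Rule 1: known acronyms/abbreviations pass through as lexemes.
        if len(words) == 1 and words[0] in KNOWN_ACRONYMS:
            return [("lexeme", words[0])]
        # Rule 2: a trailing "Corp"/"Co" marks the string as a company name.
        if words[-1].rstrip(".") in COMPANY_SUFFIXES:
            return [("company", " ".join(words))]
        # Rule 3: "<string> President <string>" -> company + person.
        if next_word in TITLE_WORDS and second_string:
            return [("company", " ".join(words)),
                    ("person", " ".join(second_string))]
        # Rule 4: "<string>, <state name>" -> city/state pair.
        if next_word == "," and second_string \
                and " ".join(second_string) in STATE_NAMES:
            return [("city/state",
                     " ".join(words) + ", " + " ".join(second_string))]
        return [("unknown", " ".join(words))]

    print(tag_capitalized_string(["Intel", "Corp"]))
    print(tag_capitalized_string(["Ann", "Arbor"], ",", ["Michigan"]))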

As an example of a more complex approach, we can mention here some work done at the GE Research and Development Center (Schenectady, NY) by Paul Jacobs and Uri Zernik in the late 1980s. In this approach, a variety of linguistic and conceptual clues are used to "guess" the general category of a new word. For example, faced with an input from the domain of corporate takeovers like "Warnaco received another merger offer ...," where all the words in the input string are known except "merger," an initial analysis making use of morphological and syntactic clues produces an "analysis chart" where "merger" is analyzed as (1) a possible comparative adjective meaning more "merge," (2) a noun derived from the nominalization of a verb, or (3) a simple adjective. Introducing these three possibilities into the syntactic analysis of the whole sentence, it is possible to construct a "hypothesis chart" where the noun phrase "another merger offer" is interpreted as (1) a determiner, "another," associated with a "modified noun phrase" (comparative adjective "merger" + noun "offer"); (2) "another" associated with a "compound noun phrase" (noun "merger" + noun "offer"); or (3) "another" associated with a "modified noun phrase" (adjective "merger" + noun "offer").

From this, three conceptual structures can be deduced: (1) an offer that is more "merge" (perhaps larger) than a previously known offer; (2) an offer for a "merger," i.e., some as-yet-unknown type of transaction; (3) an offer having as a property the quality of being "merger." Making use of other conceptual operations, like the exploitation of the previous context of the sentence, where an offer to acquire Warnaco is mentioned, the hypothesis that "merger" is a noun corresponding to a concept in the style of "company-acquisition" is privileged; this last hypothesis is confirmed by a second encounter with "merger" in the form "The merger was completed last week," which allows one, e.g., to rule out the hypothesis of a new offer that is more "merge" than the preceding one.
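A toy rendering of this hypothesis-pruning idea follows; the pattern table, tag names, and "frame" encoding are invented for illustration, and Jacobs and Zernik's actual charts and conceptual operations are much richer.

    # Toy illustration of pruning part-of-speech hypotheses for an
    # unknown word, first via the noun-phrase readings each hypothesis
    # would license, then via a later occurrence of the same word.

    NP_READINGS = {   # readings of "another <X> offer" by tag of X
        "comparative-adj": "modified NP: an offer more 'merge' than before",
        "noun":            "compound NP: an offer for a merger",
        "adj":             "modified NP: an offer that is 'merger'",
    }

    def chart_hypotheses(tags):
        """Each morphologically possible tag yields one NP reading."""
        return [(t, NP_READINGS[t]) for t in tags if t in NP_READINGS]

    def prune_with_later_sentence(tags, frame):
        """Keep only tags consistent with a later occurrence.
        The frame 'det _ verb' (as in "The merger was completed ...")
        means the word heads the NP by itself, so it must be a noun."""
        if frame == "det _ verb":
            return [t for t in tags if t == "noun"]
        return tags

    tags = ["comparative-adj", "noun", "adj"]   # from morphology
    print(chart_hypotheses(tags))               # three competing readings
    print(prune_with_later_sentence(tags, "det _ verb"))  # ['noun']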

Given, however, the general renewal of interest in the use of statistical methods in NLP, provoked mainly by the successes obtained with these models in speech processing, the probabilistic approach to the tagging problem is now widely used.
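As a pointer to what such probabilistic taggers compute, the sketch below implements a tiny hidden-Markov-model tagger via the Viterbi recurrence, choosing the tag sequence that maximizes the product of transition probabilities P(tag_i | tag_{i-1}) and emission probabilities P(word_i | tag_i). The probability tables here are invented toy numbers; a real tagger estimates them from a hand-tagged corpus.

    # A minimal Viterbi tagger over toy probability tables.

    TRANSITION = {  # P(tag2 | tag1); "<s>" marks the sentence start
        ("<s>", "det"): 0.6, ("<s>", "noun"): 0.2, ("<s>", "verb"): 0.2,
        ("det", "noun"): 0.7, ("det", "adj"): 0.3,
        ("adj", "noun"): 0.8, ("adj", "adj"): 0.2,
        ("noun", "verb"): 0.6, ("noun", "noun"): 0.4,
        ("verb", "det"): 0.5, ("verb", "noun"): 0.5,
    }
    EMISSION = {    # P(word | tag); "rose" is deliberately ambiguous
        ("the", "det"): 1.0,
        ("price", "noun"): 0.6,
        ("rose", "noun"): 0.4, ("rose", "verb"): 0.3, ("rose", "adj"): 0.3,
    }

    def viterbi(words, tags=("det", "noun", "verb", "adj")):
        """Return the most probable tag sequence for `words`."""
        # best[t] = (probability, tag path) of the best path ending in t
        best = {t: (TRANSITION.get(("<s>", t), 1e-6)
                    * EMISSION.get((words[0], t), 1e-6), [t])
                for t in tags}
        for w in words[1:]:
            new = {}
            for t in tags:
                p, path = max(
                    (best[prev][0]
                     * TRANSITION.get((prev, t), 1e-6)
                     * EMISSION.get((w, t), 1e-6),
                     best[prev][1])
                    for prev in tags)
                new[t] = (p, path + [t])
            best = new
        return max(best.values())[1]

    # The transition table disambiguates "rose" as a verb here.
    print(viterbi(["the", "price", "rose"]))  # ['det', 'noun', 'verb']

Note how the same mechanism resolves exactly the kind of ambiguity ("rose" as noun, verb, or adjective) that the rule-based approaches above handle with hand-written heuristics.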

