
3.3. MORPHOLOGICAL ANALYSIS

For simplicity's sake, we collect under this label a series of operations that consist of (at least) (1) segmenting an input sequence into words (lexical analysis); (2) performing, where needed, some normalization operations on the segmented words; (3) performing the morphological analysis proper; and (4) producing, as the final result, a sequence of words in which each word is associated, at a minimum, with its corresponding part(s) of speech.
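
The four operations listed above can be sketched as a small pipeline. This is only an illustrative sketch in Python: the toy lexicon, the function names (segment, normalize, analyze, tag_sequence), and the choice of case folding as the only normalization are our own assumptions, not part of any particular system.

```python
import re

# Hypothetical toy lexicon: word form -> set of possible parts of speech.
TOY_LEXICON = {
    "he": {"pronoun"},
    "the": {"article"},
    "says": {"verb"},
    "rose": {"noun", "adjective", "verb"},
    "roses": {"noun"},
}


def segment(text):
    """Step 1: split the input sequence into word tokens (lexical analysis)."""
    return re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", text)


def normalize(word):
    """Step 2: a minimal normalization (case folding only, by assumption)."""
    return word.lower()


def analyze(word):
    """Step 3: stand-in for the morphological analysis proper (a plain lookup)."""
    return TOY_LEXICON.get(word, {"unknown"})


def tag_sequence(text):
    """Step 4: pair each word with its candidate part(s) of speech."""
    return [(w, sorted(analyze(normalize(w)))) for w in segment(text)]
```

A word such as "rose" comes back with several candidate categories; choosing among them is left to the later phases of the procedure.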

Given the limits of this chapter, we cannot deal here in detail with the segmentation problems proper to lexical analysis. In the case of written language, automatic recognition -- i.e., the task of automatically transforming a sequence of spatially organized graphical marks into characters and words -- is satisfactorily handled by off-the-shelf OCR (Optical Character Recognition) software for neatly printed texts. There are also some successes in the field of handwriting recognition, particularly for isolated hand-printed characters and words and in constrained domains like postal addresses and bank checks.

With respect to speech recognition, i.e., the conversion of an acoustic signal captured, e.g., by a microphone, into a stream of words, we have already mentioned in subsection 3.1 some well-known problems that make this task so difficult. In spite of these problems, dramatic improvements in speech recognition technology have been accomplished in the last decade, and applications based on speech input capabilities are now commercially available: voice dialing (e.g., "call home"), call routing (e.g., "I would like to make a collect call"), simple data entry (e.g., entering a credit card number), etc.

The current technologies for passing from an acoustic signal to a sequence of word hypotheses are all based on statistical methods. For example, the best-known technique used to recognize the different speech sounds is the so-called "Hidden Markov Model" (HMM); an HMM is a composition of two stochastic processes: a hidden Markov chain that accounts for temporal variability, and an observable process that describes spectral variability. After a "speech recognizer" has converted the observed acoustic signal into the corresponding orthographic representation, a "recognizer" estimates the identity of the words in the spoken sequence by choosing from a finite vocabulary of words that can be recognized. A statistical model used for these tasks is based on the joint distribution p(W,O) of the sequence of spoken words W and the corresponding observed sequence of acoustic information O.
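
As a gloss on how the joint distribution p(W,O) is commonly exploited, the recognizer can be read as selecting the word sequence that is most probable given the observations. The symbol \hat{W} and the factorization into p(O | W) and p(W) are notation we introduce here as an assumption; the text itself only names p(W,O).

```latex
\hat{W} \;=\; \arg\max_{W} \, p(W \mid O)
        \;=\; \arg\max_{W} \, \frac{p(W, O)}{p(O)}
        \;=\; \arg\max_{W} \, p(W, O)
        \;=\; \arg\max_{W} \, p(O \mid W)\, p(W)
```

The third equality holds because p(O) does not depend on W. In this reading, p(O | W) plays the role of an acoustic model (for which HMMs are a typical choice) and p(W) that of a language model.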

Having isolated the words in the input sequence, the first step of the analysis is to compute their grammatical category (part of speech, "tag"), i.e., to associate them with information like "noun," "verb," "adjective," etc.; this information will then be used in all the subsequent phases of the procedure. If the analysis system could be permanently associated with a list (a "lexicon") including all the words of a given language that the system is capable of dealing with at a given moment, each associated with its proper category (plus, possibly, additional information concerning, e.g., the semantic category, like "animate" vs. "inanimate"), the problem would be reduced to searching the lexicon for the word under examination. Note that, in practice, the lexicon can be structured in a very sophisticated way for efficiency's sake. In reality, determining the grammatical category can be more or less difficult according to the type of word we are dealing with: a "grammatical" word or a "lexical" word.
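
To make the idea of a structured lexicon concrete, here is a minimal sketch in Python. The trie organization, the class names (Entry, TrieNode, Lexicon), and the sample entries are our own assumptions chosen for illustration; real lexicons may be organized quite differently.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Entry:
    categories: tuple      # grammatical categories, e.g. ("noun",)
    features: tuple = ()   # optional semantic information, e.g. ("animate",)


@dataclass
class TrieNode:
    children: dict = field(default_factory=dict)
    entry: Optional[Entry] = None   # set only where a complete word ends


class Lexicon:
    """Character trie: lookup cost grows with word length, not lexicon size."""

    def __init__(self):
        self.root = TrieNode()

    def add(self, word, categories, features=()):
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.entry = Entry(tuple(categories), tuple(features))

    def lookup(self, word):
        """Return the Entry for word, or None if the word is not listed."""
        node = self.root
        for ch in word:
            node = node.children.get(ch)
            if node is None:
                return None
        return node.entry


# Hypothetical sample entries.
lexicon = Lexicon()
lexicon.add("dog", ["noun"], ["animate"])
lexicon.add("stone", ["noun"], ["inanimate"])
lexicon.add("the", ["article"])
```

Here lexicon.lookup("dog") returns an entry carrying both the category and the semantic feature, while a prefix such as "do" or an unlisted word yields None.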

Grammatical words include the adjectives (except the qualitative adjectives), the articles, the predeterminers ("such," "half," "both," etc.), the prepositions, and the pronouns. Generally, they have a very precise grammatical function; their (small) number is stable, i.e., they are not normally affected by the evolutionary phenomena that modify the language. Therefore, computing their grammatical category does not present any particular difficulty: it is quite feasible to insert in the lexicon, a priori, a list of all the grammatical words of a particular language accompanied by their category.

The problem is markedly more difficult for the lexical words: this category includes the adverbs, the qualitative adjectives, the nouns, and the verbs. They account for the large majority of the terms of a given language, and they are strongly affected by the evolutionary phenomena proper to the language. We are now faced with two types of problems:

  • The first is mainly an optimization problem. Even if we give up inserting in the lexicon the whole vocabulary of a given NL, and we specialize the lexicon for a particular application or class of applications, there is still the problem of how to build this tool without being obliged to list all the possible regular word forms that can appear in the input sequence (we do not take into consideration here the phenomena of "ill-formedness" such as misspelling, ungrammaticality, syntactic and semantic constraint violations, etc.). Note that forms like "say" and "says," or "rose" and "roses," are all different word forms that can be found in the input, and that will be associated with different categories: respectively, infinitive verb, third-person verb, singular noun (but "rose" can also be an adjective, or the past tense of "rise"), and plural noun. Verbal forms are much more abundant in the Romance languages than in English; moreover, case phenomena multiply the noun forms in German and in the Slavonic languages, etc.
  • The second, more important problem is a "knowledge acquisition" problem. Given that, in reality, it is practically impossible to circumscribe the vocabulary of an application or class of applications and that, moreover, new lexical words can always be created, how do we deal with an input word that cannot be reduced to one of the forms listed in the lexicon, i.e., a word whose part of speech cannot be "recognized"?
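
Both problems can be made tangible with a small sketch in Python. It assumes one possible approach (storing only base forms and deriving inflected forms with suffix rules); the names BASE_LEXICON, IRREGULAR, SUFFIX_RULES, and analyze, as well as the toy rule set, are hypothetical and cover only the "say"/"says" and "rose"/"roses" examples discussed above.

```python
# Hypothetical base-form lexicon: only uninflected forms are listed.
BASE_LEXICON = {
    "say": {"verb"},
    "rose": {"noun", "adjective"},
    "rise": {"verb"},
}

# Irregular forms that no suffix rule can derive are listed explicitly.
IRREGULAR = {
    "rose": [("rise", "verb, past tense")],
}

# (suffix, category required of the base form, label for the inflected form)
SUFFIX_RULES = [
    ("s", "verb", "verb, third person"),
    ("s", "noun", "noun, plural"),
]

# Label given to a base form found directly in the lexicon.
BASE_LABELS = {
    "verb": "verb, infinitive",
    "noun": "noun, singular",
    "adjective": "adjective",
}


def analyze(word):
    """Return every (base form, label) analysis; an empty list means unrecognized."""
    analyses = [(word, BASE_LABELS[cat]) for cat in sorted(BASE_LEXICON.get(word, ()))]
    analyses.extend(IRREGULAR.get(word, []))
    for suffix, cat, label in SUFFIX_RULES:
        if word.endswith(suffix) and len(word) > len(suffix):
            base = word[: -len(suffix)]
            if cat in BASE_LEXICON.get(base, ()):
                analyses.append((base, label))
    return analyses
```

With three base entries and two rules, the sketch handles "say," "says," "rose," and "roses" without listing each form, which is the point of the first problem. A word such as "blog," absent from the lexicon, yields an empty list: this is exactly the situation described by the second problem, and the sketch deliberately offers no remedy for it.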


Copyright (c) 1996-1999 EarthWeb, Inc. All rights reserved. Reproduction in whole or in part in any form or medium without express written permission of EarthWeb is prohibited.