A well-known, general module in this style is POST (Part-Of-Speech Tagger), developed at BBN Systems and Technologies (Cambridge, MA) and used in their PLUM data extraction system, one of the best MUC systems (see subsection 2.4). The problem to be solved consists, given a particular word sequence W = (w1, w2, ..., wn) (where we assume, for the moment, that all the words wi are known), of finding the corresponding, most likely tag sequence T = (t1, t2, ..., tn). According to Bayes' rule, the conditional probability p(T|W) of a tag sequence T given the word sequence W is proportional to the product of p(T), the a priori probability of the tag sequence T, and p(W|T), the conditional probability that the word sequence W occurs given that the tag sequence T occurred. Solving the problem then requires (1) evaluating all the possible tag sequences in order to calculate p(T|W) for each of them, and (2) choosing as the most likely tag sequence T the one that maximizes p(T|W). Given that p(T|W) can be rewritten as a product Π of n terms, where each term is the product of the conditional probability of the tag ti given all the previous tags and the conditional probability of the word wi given all the previous tags, the computation of p(T|W) can be particularly complex.

In order to reduce this complexity, POST makes two simplifying assumptions. The first is that each word wi in the sequence W depends only on ti; this allows the product Π to use, instead of the conditional probability of the word wi given all the previous tags, only p(wi|ti), i.e., the probability of each word given a tag (the lexical probabilities). The second is the adoption of a sort of "n-gram model" which, for the problem at hand, reduces to evaluating the conditional probability of the tag ti taking into account, instead of the whole sequence of previous tags, (1) only the tag ti-1 assigned to the word wi-1 (bi-tag model), or (2) this last tag plus the tag ti-2 (tri-tag model). While the bi-tag model is faster at processing time, the tri-tag model has a lower error rate.

To make use of this approach, a system like POST must obviously be "trained," i.e., it must have at its disposal a (relatively large) corpus where all the tags have already been manually assigned. In this case, and adopting a tri-tag model, the probability p(t|t2, t1) that a tag t occurs after tags t1 and t2 can be estimated by dividing the number of times t occurs after t1 and t2 by the number of times t1 and t2 are followed by any third tag. A training set is also necessary for estimating p(wi|ti) for an observed word wi. Note that, no matter how large the training corpus, it is not possible to observe all the possible pairs or triples of tags; if a particular triple t1 t2 t3 is not present in the corpus, p(t3|t2, t1) can still be estimated by techniques called "padding," based on the number of times m triples beginning with t1 t2 are observed and on the number j of possible distinct tags. Having obtained all the required conditional probabilities, it is then possible, for a sequence of known words W, to determine the tag sequence T for which p(T|W) is maximized.
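To make the tri-tag estimation and the search for the most likely tag sequence concrete, here is a small illustrative sketch in Python. The three-sentence corpus, the add-one style of padding, and all names are invented for the example; it shows the general technique, not BBN's actual implementation.

```python
# Toy illustration of a tri-tag model: estimate p(t3 | t1, t2) and p(w | t)
# from a hand-tagged corpus, then pick the most likely tag sequence.
from collections import defaultdict
from itertools import product

# A miniature "training corpus": sentences whose tags were assigned by hand.
corpus = [
    [("the", "DET"), ("dog", "N"), ("barks", "V")],
    [("the", "DET"), ("cat", "N"), ("sleeps", "V")],
    [("a", "DET"), ("dog", "N"), ("sleeps", "V")],
]

trigram = defaultdict(int)   # how often tag t3 follows the pair (t1, t2)
pair = defaultdict(int)      # how often (t1, t2) is followed by any third tag
emit = defaultdict(int)      # how often tag t emits word w
tag_total = defaultdict(int)
tags = set()

for sent in corpus:
    for word, tag in sent:
        emit[(tag, word)] += 1
        tag_total[tag] += 1
        tags.add(tag)
    ts = ["<s>", "<s>"] + [t for _, t in sent]   # two start markers
    for t1, t2, t3 in zip(ts, ts[1:], ts[2:]):
        trigram[(t1, t2, t3)] += 1
        pair[(t1, t2)] += 1

J = len(tags)   # number of distinct tags, used to "pad" unseen triples

def p_tag(t3, t1, t2):
    # p(t3 | t1, t2) with simple add-one padding, so that triples never
    # observed in training still receive a small non-zero probability.
    return (trigram[(t1, t2, t3)] + 1) / (pair[(t1, t2)] + J)

def p_word(w, t):
    # Lexical probability p(w | t); unseen (tag, word) pairs get a tiny floor.
    return (emit[(t, w)] + 1e-6) / (tag_total[t] + 1e-6)

def most_likely_tags(words):
    # Exhaustive search over all tag sequences, fine at toy scale; a real
    # tagger would use Viterbi-style dynamic programming instead.
    best, best_p = None, 0.0
    for cand in product(tags, repeat=len(words)):
        ts = ["<s>", "<s>"] + list(cand)
        p = 1.0
        for i, w in enumerate(words):
            p *= p_tag(ts[i + 2], ts[i], ts[i + 1]) * p_word(w, ts[i + 2])
        if p > best_p:
            best, best_p = cand, p
    return best

print(most_likely_tags(["the", "dog", "sleeps"]))   # ('DET', 'N', 'V')
```

Note that the two simplifying assumptions described above (each word depends only on its own tag, and each tag only on the two previous tags) are precisely what turn p(T|W) into a product of local terms, which is why the exhaustive loop can be replaced by dynamic programming in a realistic system.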
Using the approach summarized above, POST is able to compute sequences of tags with an error rate that fluctuates between 3.30 and 3.87% according to the size of the training corpus. In the presence of "unknown words" -- i.e., words that a system like POST has never seen before -- the values p(wi|ti) computed for a known word wi thanks to the training set cannot be utilized. Given the poor results obtained under these conditions through the use of "blind" statistical methods, POST has included in its strategy the usual morphological clues exploited by the "symbolic" approaches. POST takes into account, in particular, four categories of morphological features: inflectional endings (-ed, -s, -ing), derivational endings (including -ion, -al, -ive, -ly), hyphenation, and capitalization. The probability that a particular word wi will occur given a particular tag now depends on the probability that these features will appear given this tag; it becomes:

p(wi|ti) = p(unknown-word|ti) * p(Capital-feature|ti) * p(endings, hyph|ti)

Making use of this modified statistical model, the error rate for unknown words decreases to 15-18% (a toy illustration of this feature-based lexical probability appears at the end of this subsection). We will end this short description of POST by mentioning the possibility of running this module in an alternative modality, in which POST returns the set of most likely tags for each word rather than a single tag.

We can conclude this subsection about "morphology" and "lexicon" by mentioning a recent trend in these fields: the construction of very large, semantically structured lexical databases including rich information intended to support not only grammatical analysis, but also large sections of syntactic and (mainly) semantic analysis. Two well-known projects in this field are Miller's WordNet and Yokoi's Electronic Dictionary (EDR). In the first system, English nouns, verbs, adjectives, and adverbs are organized into sets of synonyms ("synsets"); the synsets, and the single word forms, can then be linked by simple sorts of semantic relations like antonymy, hyponymy, meronymy, etc.

The Electronic Dictionary is organized around several interrelated dictionaries. For example, the Word Dictionary includes (1) the tools (morphological and syntactic information) necessary for determining the syntactic structure of each sentence, and (2) the tools (semantic information) that, for each word in a given sentence, identify all the possible concepts corresponding to that word. The Concept Dictionary contains information on the 400,000 concepts listed in the Word Dictionary; one of its main functions is supplying the tools needed to produce a semantic (conceptual) representation of a given sentence. The Co-occurrence Dictionary describes collocational information in the form of binary relations, and it is used mainly to select words in the target language when performing machine translation operations, etc.
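Returning briefly to POST's treatment of unknown words, the modified lexical probability given above can be illustrated with the following toy sketch; the feature probabilities, tag names, and example word are invented placeholders, not figures from POST.

```python
# Toy sketch of the feature-based lexical probability for unknown words.
# The feature set follows the text (endings, hyphenation, capitalization);
# all probability values below are invented.
def word_features(word):
    """Morphological clues available even for a never-before-seen word."""
    endings = ("ed", "s", "ing", "ion", "al", "ive", "ly")
    return {
        "capitalized": word[:1].isupper(),
        "hyphenated": "-" in word,
        "ending": next((e for e in endings if word.endswith(e)), None),
    }

# Hypothetical per-tag estimates, as they might come from a tagged corpus.
P_UNKNOWN = {"NOUN": 0.40, "VERB": 0.25, "ADJ": 0.20}
P_CAPITAL = {"NOUN": 0.30, "VERB": 0.05, "ADJ": 0.05}
P_ENDING = {("NOUN", "ion"): 0.20, ("VERB", "ed"): 0.30, ("ADJ", "ive"): 0.25}

def p_unknown_word(word, tag):
    """p(w|t) ~ p(unknown|t) * p(capital-feature|t) * p(endings, hyph|t).
    (The joint endings/hyphenation term is simplified to the ending alone.)"""
    f = word_features(word)
    p = P_UNKNOWN.get(tag, 0.10)
    p *= P_CAPITAL.get(tag, 0.10) if f["capitalized"] else 1 - P_CAPITAL.get(tag, 0.10)
    p *= P_ENDING.get((tag, f["ending"]), 0.05)
    return p

for tag in ("NOUN", "VERB", "ADJ"):
    print(tag, round(p_unknown_word("nationalization", tag), 4))
# The noun reading wins, as the -ion ending suggests.
```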
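As for the lexical databases just mentioned, WordNet's synsets and semantic relations can be browsed directly through, for instance, the WordNet interface distributed with NLTK. The snippet below assumes a Python environment with NLTK and its WordNet data installed; it is only one convenient access path, not part of the original WordNet or EDR systems.

```python
# Browsing WordNet's synsets and semantic relations via NLTK.
from nltk.corpus import wordnet as wn   # requires: nltk.download("wordnet")

# A word form maps to one or more synsets (sets of synonymous word forms).
for syn in wn.synsets("dog")[:3]:
    print(syn.name(), "-", syn.definition())

# Synsets and lemmas are linked by relations such as hyponymy/hypernymy,
# meronymy, and antonymy.
dog = wn.synset("dog.n.01")
print(dog.hypernyms())                           # more general synsets (canine, ...)
print(wn.synset("tree.n.01").part_meronyms())    # parts of a tree (trunk, limb, ...)
print(wn.lemma("good.a.01.good").antonyms())     # antonym of "good" (bad)
```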