5 August 1999. Add DC on n-gram analysis.

3 August 1999. Word "NOT" added to paragraph 6 by DC.

2 August 1999. Thanks to Duncan Campbell.


Date: Thu, 05 Aug 1999 01:33:07 +0100
To: ukcrypto@maillist.ox.ac.uk
From: Duncan Campbell <duncan@gn.apc.org>
Subject: Re: Question for Duncan Campbell re: Word-Spotting Capabilities

The topic spotting methods that NSA is working on are based on n-gram analysis, which in my crude way I understand to be based on a comparison of n-dimensional matrices setting out the relative probability of any text string of length n in two corpora of texts.   One is the surveillance data, which can be massive; the second is the seed corpus, a chosen set of documents which are about the topic of interest.

In other words, you could show the computer the last six months of uk-crypto and then say: find me anybody else talking about this stuff in the world's communications.   The topic spotting system then rank-orders the target communications by how closely their topics match the uk-crypto corpus.
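[Editorial illustration: a minimal sketch, in Python, of this kind of n-gram topic ranking. It is not the patented NSA method; the use of character trigrams, the cosine similarity measure and the toy corpora are assumptions made purely for the example.]

    # Toy illustration of n-gram topic ranking (not the NSA method, just the idea):
    # build character n-gram frequency profiles for a "seed" corpus and for each
    # candidate text, then rank-order the candidates by similarity to the seed.

    from collections import Counter
    from math import sqrt

    def ngram_profile(text, n=3):
        """Relative frequencies of character n-grams in the text."""
        text = " " + text.lower() + " "              # pad so word boundaries count
        grams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
        total = sum(grams.values()) or 1
        return {g: c / total for g, c in grams.items()}

    def cosine(p, q):
        """Cosine similarity between two sparse n-gram profiles."""
        dot = sum(p[g] * q[g] for g in set(p) & set(q))
        norm = sqrt(sum(v * v for v in p.values())) * sqrt(sum(v * v for v in q.values()))
        return dot / norm if norm else 0.0

    # Hypothetical corpora: the seed stands in for "the last six months of uk-crypto".
    seed = ngram_profile("export controls on strong encryption and key escrow policy")
    intercepts = {
        "msg1": "the debate over key escrow and encryption export rules continues",
        "msg2": "boy did the yankees bomb tonight at the ball game",
    }

    for msg_id, text in sorted(intercepts.items(),
                               key=lambda kv: cosine(seed, ngram_profile(kv[1])),
                               reverse=True):
        print(msg_id, round(cosine(seed, ngram_profile(text)), 3))

Because the profiles are built directly from raw character strings, the same comparison can in principle be run on any language, or on phoneme strings, which is the language-independence point made in the following paragraph.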

NSA has patented this method, and claims that it is completely language independent (true, if each corpus is in the same language) and highly effective despite high error rates (which seems very plausible).   It is this latter claim that makes me suspect that it may work when they apply it to phoneme strings in the speech recognition problem.   If you can do that, you don't need to go through the actual transcription phase.

I find the method elegant, as it neatly sidesteps all the well-understood problems of Boolean based searches.

Duncan Campbell

[In response to:]

> >>>Seems to me that topic spotting is a more useful goal anyway.  Even if
> >>>you have 100% accuracy in word spotting, you will generate too many
> >>>false positive hits when that word appears out of context (eg "Boy did
> >>>the Yankees bomb tonight").

> >> Topic spotting where the transcription is so bad it introduced
> >> a 70-80% error rate in the words?  Do tell how........

> >Er, I didn't say it was possible, I said that I thought it was a more
> >useful goal to aim for than spotting isolated words.

>  What I meant was, if you cannot spot the words on a noisy any-voice
>  channel (to such an extent they are 70 or 80% wrong),
>
>  how you gonna spot topics in the transcribed words?


2 August 1999

The first item in this strand asked about the basis for my conclusion in the recent European Parliament report (http://www.gn.apc.org/duncan/stoa_cover.htm) that "word spotting" methods did not exist in any useful deployable form for sigint analysis and message selection when dealing with high capacity voice communications interception - i.e., typically, the form of analysis required to select messages of intelligence interest from a digital or analogue multiplex carrying thousands of simultaneous telephone calls.

The writer appears not to have read the technical annexe to the IC2000 report (http://www.gn.apc.org/duncan/ic2kreport.htm#Annexe), but asserts as fact what some third party thought it said: "I was recently in touch with Guy Polis, who tells me that ..."   The annexe in fact explains the reasons why, although highly trained speech recognisers can be installed on modern desktops and function with a reasonably low error rate, such systems are not translatable into broadband surveillance systems and provide no basis for supposing that such a capability can exist.

There have been many reports of such a capability based on Nicky Hager's book Secret Power (1996).   In fact, Nicky's account of the ECHELON system in New Zealand identifies ECHELON and its critical component, the DICTIONARY computers, as functioning only against machine readable signals - that is to say data, e-mail, (OCR'd) faxes, telex and the like.  Indeed, he points out that New Zealand does not have the Sigint personnel to listen to phone calls.

In the passage quoted, Nicky referred to a different book, Spy World (1994), written by Mike Frost and Mike Grattan.  This does refer to an NSA-designed suitcase called ORATORY, which was used for Sigint collection in hostile city environments.   If it were true that a 1990-era suitcase could contain not just a microwave downconverter and full associated demux equipment, plus multichannel speech recognition equipment with word spotting built in, and all the associated recorders and control computers, that would indeed be a remarkable black box.

Mike Frost is a former employee of the Canadian sigint agency CSE; Mike Grattan, a journalist, wrote the book.   My understanding is that Mike Frost has subsequently made it clear that ORATORY's capacities were mis-stated by Mike Grattan in the book, and that ORATORY functions only to recognise keywords in intercepted telex-type traffic, i.e. machine-readable signals.

The major original investigations of Sigint and Echelon in the last ten years have NOT suggested that word-spotting in voice channels is part of Echelon; the opposite is reported. These include my own reports, the recent Australian Channel 9 documentary, a British TV report which uncovered a DICTIONARY computer in London (targeted only on telex), and the landmark Baltimore Sun series in 1995 by Scott Shane and Tom Bowman (which quite specifically reports that NSA had not achieved this task).  However, secondary reporting by others has commonly added such a claim.

The conclusion that a capability to word-spot does not exist is based on (a) detailed study of the literature, including the work done at NSA's behest in annual DARPA sponsored workshops; (b) sources with direct inside knowledge; and (c) reliable published journalistic sources.

The results in each arm are the same.   No speaker-independent word recognition system can produce anything resembling an acceptable error rate (false positives and false negatives) for Sigint use as a message selection technique.    Inside sources say quite specifically that, much as they would like to have deployed such a system, it has been unachievable. Noting that ORATORY has been misdescribed, no reliable or first-hand journalistic source says that the technique has been achieved.  Several, including Shane and Bowman, indicate the opposite.

I cannot of course identify confidential sources, but there are two I can quote.  One is Rear Admiral Bobby Inman, former NSA director, who told me in a 1993 interview that "I have wasted more US taxpayers' dollars trying to do that (word spotting in speech) than anything else in my intelligence career."

The second is Professor Steve Young, a UK director of the cutting edge speech recognition firm Entropic (mentioned by John Young), who said last month "It is true that word spotting is not effective -- I don't know anybody these days still trying to do it."

Entropic Inc are among the world leaders in speech recognition using Hidden Markov Models (HMMs).  There is nothing surprising about the amount of literature about HMMs; all of today's speech recognition packages use this approach.  The only other speech recognition game in town, using neural networks, does not produce better results.

It is true that all the early work in speech recognition was inspired by the Sigint agencies.  Thirty years later, like the Internet itself, the civilian applications have proved the more usable.  Steve Young also comments : "The better approach, as you suggest, is to do a full transcription and then use the text for topic spotting and/or information retrieval."

In summary: voice message (i.e., phone call) selection can be done on the called or calling telecommunications address (including the so-called "wild card" criteria, selecting all messages from a particular city suburb and/or at a particular time), or by individual speaker recognition.   Word spotting is not available, but as computational power increases, topic spotting by running continuous speech recognition engines on a per-channel basis will become affordable, at first for high value targets.

Duncan Campbell


Date: Tue, 27 Jul 1999 21:00:23 -0400
To: ukcrypto@maillist.ox.ac.uk
From: Paul Wolf <paulwolf@icdc.com>
Subject: Question for Duncan Campbell re: Word-Spotting Capabilities
Cc: duncan@cwcom.net

July 25th, 1999

Dear Mr. Campbell:

When Nicky Hager's book, Secret Power, was printed in 1996, a lot of people, including myself, became frightened by the idea that the NSA might have the capacity to monitor, in particular, all or most of the telephone traffic in the United States, and use whatever methods there are to analyze it.  The legacy of COINTELPRO shows that putting this much power in the hands of intelligence agencies, especially when they are not subject to appropriate oversight, invites disaster.

In the context of other books written about the NSA in the eighties and early nineties, and considering advances in computer technology since then, Hager's description of the ECHELON system and his opinion that NSA could monitor a vast number of simultaneous phone calls, if not the majority of all telephone calls, did not seem all that far-fetched.

In your April 1999 report to the European Parliament, Interception Capabilities 2000, you make a very bold statement: that the system Hager describes may be able to monitor emails and faxes, but that the technology to convert speech to text on a scale massive enough to encompass a significant portion of telephone traffic does not yet exist.  Specifically, you wrote:

"Contrary to reports in the press, effective "word spotting" search systems automatically to select telephone calls of intelligence interest are not yet available, despite 30 years of research." 

When I read this, I wondered how you could categorically make such a statement.  Since you did not back it up in any way in your report, it left me wondering.  I have been interested in this subject since Hager's book was published, and maintain a webpage on it at

http://www.icdc.com/~paulwolf/echelon.htm

I was recently in touch with Guy Polis, who tells me that you base your assessment of NSA's capacity for speech to text translation on the state of the art in commercial software.  To assume that the NSA does not possess anything more powerful than what is available commercially seems to me to be absurd.  Can you please explain your basis for making the statement I've quoted above, in a way that I may quote you? 

Please note that I've blind copied this message to a number of other people who regularly write about this subject, including reporters in the US and in Europe.  As a rule I keep the names on that list confidential, so please consider this to be a question I'm asking you in public.  I'd like to take your reply and forward it to the same list.

I'm sorry to put you on the spot.  I would not do so if the topic were not so serious.  Thanks in advance for your help.

Best regards,

Paul Wolf

919-469-6660


Date: Sat, 31 Jul 1999 08:19:48 -0400
To: ukcrypto@maillist.ox.ac.uk
From: John Young <jya@pipeline.com>
Subject: Re: Question for Duncan Campbell re: Word-Spotting Capabilities

Inspired by Duncan Campbell's exemplary research on voice recognition, we posed on our Web site a similar question to Paul Wolf's, on the state of the technology. There were some quite informative responses:

One person wrote that he knew of a US company which boasted it had a product that could do several dozen simultaneous translations on the fly, available only to the natsec realm.

While this was initially greeted with skepticism (by Duncan, for one), a bit of digging in the US Patent Office showed that there are well over 300 patents utilizing algorithms based on the Hidden Markov Model (HMM). It has great power for pattern analysis and is much in use by the voice and text recognition industry (though not limited to that).

ATT has done a lot of research in voice recognition, primarily for public telephonic purposes, but the literature suggests that classified work was (and is being) done for the mil/gov. A classic primer on HMMs was co-written by an ATT scientist, Rabiner, in 1986:

[Rabiner & Juang, 1986]

    @article(RabinerandJuangASSP-86,
            author = {Rabiner, L. R. and Juang, B. H.},
            title = {An introduction to hidden {M}arkov models},
            journal = {IEEE ASSP Magazine},
            month = {January},
            pages = {4--15},
            year = 1986
    )

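[Editorial illustration: for readers who have not met the model that the Rabiner and Juang primer introduces, the following is a minimal Python sketch of the HMM "forward" likelihood computation. The two hidden states, two output symbols and all of the probabilities are invented for the example; a real speech recogniser works on continuous acoustic features with far larger models.]

    # Toy forward algorithm for a discrete hidden Markov model (illustration only;
    # the states, symbols and probabilities below are invented for the example).

    states = ("S0", "S1")
    start = {"S0": 0.6, "S1": 0.4}                   # initial state probabilities
    trans = {"S0": {"S0": 0.7, "S1": 0.3},           # state transition probabilities
             "S1": {"S0": 0.4, "S1": 0.6}}
    emit = {"S0": {"a": 0.5, "b": 0.5},              # symbol emission probabilities
            "S1": {"a": 0.1, "b": 0.9}}

    def forward(observations):
        """P(observation sequence | model), summed over all hidden state paths."""
        alpha = {s: start[s] * emit[s][observations[0]] for s in states}
        for obs in observations[1:]:
            alpha = {s: sum(alpha[p] * trans[p][s] for p in states) * emit[s][obs]
                     for s in states}
        return sum(alpha.values())

    print(forward(["a", "b", "b"]))                  # likelihood of seeing "a b b"
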
Another person provided the name of a private researcher who did contract work for NSA on voice recognition; this person has impressive credentials in computational linguistics listed on the Web. Sorry to say, the person has not responded to our inquiry about the topic. For more on CL and links to even more see:

http://www.georgetown.edu/compling/home.html

Further, research on computational linguistics is booming, and a lot of it concerns voice recognition, translation and other applications of human interaction with machines. A Web search on the topic will turn up a wide range of endeavors. HMM is a favorite tool. A UK/US company sells a popular program for using it:

http://www.entropic.com/

As ever, commercial products lag those used for intelligence, so we are told. Whether that is true only the spooks know and never ever tell -- or do they? Some now run companies which capitalize on what were once state secrets. As Duncan and others have shown, one must read the Hidden Markov Modelling of their products, body language and gaps in speech -- pattern analysis -- to grasp what NDAs and the OSA forbid them to say openly.


Source: http://www.altavista.com/cgi-bin/query?pg=q&kl=XX&stype=stext&q=n-gram&search.x=16&search.y=4

1. N-Gram/Vector Model Basics

N-Gram/Vector Model Basics. For the purposes of this project, each word in a document or query (including a leading and trailing space) is divided into...
URL: www.cs.wisc.edu/~hasti/cs838-2/basics.html
Last modified 13-Dec-96 - page size 3K - in English

2. N-Gram Prototype

N-Gram Prototype. An N-Gram is a simple mathematical manipulation applied to compare text documents with difference sizes. An N-Gram program counts the...
URL: csgrad.cs.vt.edu/~lzhang/cs6704/ngram.htm
Last modified 30-Apr-98 - page size 958 bytes - in English (Win-1252)

3. IN - N-gram Matching

IN - N-gram Matching. This method is based on counting the number of matching n-grams between words, and can be enhanced through the use of clustering...
URL: ei.cs.vt.edu/~cs5604/f95/cs5604cnIN/IN-ngram.html
Last modified 20-Aug-95 - page size 722 bytes - in English

4. N-gram Experiment

N-gram...
URL: www.cam.sri.com/tr/crc045/paper/node5.html
Last modified 27-Mar-97 - page size 7K - in English

5. N-Gram

J a n ' s & J o h n ' s C D C o l l e c t i o n. N-Gram. Caroline Lavelle. Back to: Label. Index. Artist. Genre. Label. Rating. Classical by Composer....
URL: jan.tile.net/cd/ngram.html
Last modified 26-May-99 - page size 1K - in English

6. Precise n-gram Probabilities from Stochastic Context-free Grammars

Precise n-gram Probabilities from Stochastic Context-free Grammars...
URL: www.icsi.berkeley.edu/~stolcke/papers/acl94/paper-html.html
Last modified 29-Jun-96 - page size 4K - in English

7. N-Gram Prototype

N-Gram Prototype. An N-Gram is a simple mathematical manipulation applied to a text document. To create an N-Gram, count the number of occurences of...
URL: ad1440.net/~devnull/work/ngram/
Last modified 5-Jun-97 - page size 7K - in English

8. Discovery/N-Gram Press Release

WARNER MUSIC GROUP'S DISCOVERY RECORDS TO DISTRIBUTE "AMBIENT" RECORDINGS WITH WILLIAM ØRBIT'S U.K.-BASED N-GRAM LABEL. Three Artists - Caroline Lavelle,..
URL: www.williamorbit.com/discovery/press.html
Last modified 30-Dec-98 - page size 5K - in English

9. N-gram Models

N-gram Models. In n-gram language models, each word depends probabilistically on the n-1 preceding words: P(w1...wn) = ∏i P(wi | wi-n+1...wi-1). Taken as it...
URL: www.ling.gu.se/~nivre/kurser/wwwstat/langmodel/ngram.html
Last modified 11-Sep-98 - page size 1K - in English

10. School of Computer Studies: Mr Eric S Atwell

Mr Eric S Atwell. email: eric@scs.leeds.ac.uk tel: +44 113 233 5761 fax: +44 113 233 5468 room: 9.1c personal home page:...
URL: agora.leeds.ac.uk/scs/public/staff/eric.html
Last modified 3-Nov-98 - page size 5K - in English
