
Natural Language Processing, lecture notes

Lecture notes for the course DIT410/TIN171 Artificial Intelligence

Peter Ljunglöf
16 March 2012

Links to NLP videos and online demos: http://www.clt.gu.se/wiki/nlp-resources

Natural language

A natural language is any language naturally used by humans. (http://en.wikipedia.org/wiki/Language)

It can be written-only:

or spoken-only:

  • most of the approximately 7000 existing languages in the world

or gestured:

or symbol-based:

it can even be constructed:

There is no good distinction between "language" and "dialect":

Natural language processing

NLP has several names:

  • natural language processing (NLP)
  • natural language engineering
  • human language technologies
  • language technology
  • computational linguistics
  • etcetera…

Natural language processing…

Main NLP applications

Automatic translation

Human-computer dialogue

Question answering

Text mining


  • visually impaired:
    • speech synthesis: screen readers, VoiceXML
    • speech recognition: dictation, dialogue systems
    • automatic Braille terminals
  • hearing impaired:
    • speech recognition and synthesis 
    • sign language recognition and synthesis
    • real-time sign language translation of TV programs
  • elderly:
    • can have problems with seeing, hearing, short-term memory, fine motor skills, loneliness
    • possible NLP technologies: speech recognition and synthesis, automatic summarisation, dialogue systems, chatbots
  • communicative disorders:
    • alternative and augmentative communication (AAC)
    • speech and dialogue technologies can help people communicate with society
  • reading:

Difficult problems in NLP: Ambiguity

Ambiguity is one of the most difficult NLP problems. And it is everywhere!

Newspaper headlines

Newspaper headlines are extra prone to ambiguities, since they often lack function words.

  • Infant abducted from hospital safe --- lexical ambiguity (safe)
  • British left waffles on Falklands --- lexical amb. (left, waffles)
  • Jails for women in need of a facelift --- structural amb. (in need)
  • Enraged cow injures farmer with axe --- structural (with axe)
  • Stolen painting found by tree --- word sense (by)
  • Miners refuse to work after death --- reference (after death)
  • Jail releases upset judges --- lexical (releases, upset)
  • Drunk gets nine months in violin case --- word sense (case)
  • Teacher strikes idle kids --- lexical (strikes)
  • Squad helps dog bite victim --- lexical (bite)
  • Prostate cancer more common in men --- reference (more common)
  • Smithsonian may cancel bombing of Japan exhibits --- structural (exhibits)
  • Juvenile court to try shooting defendant --- lexical (try)
  • Two sisters reunited after 18 years in checkout counter --- structural (in counter)
  • Two Soviet ships collide, one dies --- reference (one)
  • Taxiförare dödade man med bil --- structural (med bil)
  • Förbud mot droger utan verkan --- structural (utan verkan)

Phonological ambiguity

  • "Eye halve a spelling checker
    It came with my pea sea
    It plainly marks four my revue
    Miss steaks eye kin knot sea."

Lexical ambiguity

  • one word -- several meanings = word senses
    • "by" is a preposition with 8 senses (New Oxford American Dictionary)
    • "case" is a noun with 4 senses
  • different words -- same spelling (or pronunciation)
    • "safe" is a noun and an adjective
    • "left" is a noun, an adjective and past tense of the verb "leave"
  • there is no general consensus on when we have one word with several senses, and when we have different words
  • most lexical ambiguities automatically lead to structural differences:
    • ((jail) releases (upset judges)) vs. ((jail releases) upset (judges))
    • ((time) flies (like an arrow)) vs. ((fruit flies) like (a banana))

Structural ambiguity

  • Attachment ambiguity
    • adjectives: "Tibetan history teacher"; "old men and women"
    • prepositions: "I once shot an elephant in my pajamas. How he got into my pajamas, I'll never know." (Groucho Marx)
    • "I saw the man with the telescope" / "I saw the man with the dog"
  • Garden path sentences
    • "the horse raced past the barn fell"
    • "the old man the boat"
    • "the complex houses married and single soldiers and their families"

Semantic ambiguity

  • Quantifier scope:
    • "every man loves a woman" / "some woman admires every man"
    • "no news is good news" / "no war is a good war"
    • "too many cooks spoil the soup" / "too many parents spoil their children"
    • "in New York City, a pedestrian is hit by a car every ten minutes."
  • Pronoun scope:
    • "Mary told her mother that she was pregnant."
  • Ellipsis:
    • "Kim noticed two typos before Lee did." --- did Lee notice the same typos?
    • "Eva worked hard and passed the exam. Adam too." --- what did Adam do?

Pragmatic ambiguity

  • Speech-act ambiguity:
    • "Do you know the time?" --- "yes"
    • "Can you close the window?" --- "sure I can, I'm already five years old"
  • Contextual ambiguity:
    • "you have a green light"
    • if you are in a car, then perhaps the traffic light has changed
    • if you are talking to your boss at work, then perhaps you can go ahead with your project
    • or, there could be a green lamp somewhere in your room

Difficult problems in NLP: Sparse data

The second very big problem is lack of data.

Hapax legomena

Hapaxes are words that occur only once within a corpus.

  • about 44% of the words (types, not tokens) in the novel Moby Dick are hapaxes
  • about 55% of the word types in the Swedish Parole corpus (28 Mwords) are hapaxes
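
The proportion of hapaxes among the word types of a text can be estimated with a few lines of code. This is only a minimal sketch: the function name hapax_ratio and the naive lowercase tokenisation are assumptions made for illustration, and a different tokenisation will give somewhat different figures.

```python
import re
from collections import Counter

def hapax_ratio(text):
    """Return the fraction of word types that occur exactly once in text."""
    # Naive tokenisation (an assumption of this sketch): lowercase runs of letters.
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    counts = Counter(tokens)
    if not counts:
        return 0.0
    hapaxes = [word for word, count in counts.items() if count == 1]
    return len(hapaxes) / len(counts)

sample = "the cat sat on the mat and the dog sat too"
print(hapax_ratio(sample))  # 6 of the 8 types occur once: 0.75
```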

Reading: http://en.wikipedia.org/wiki/Hapax_legomenon

Sparse data and ambiguity

Hapaxes are just one aspect of the general problem of sparse data:

  • "bank" has 5 noun meanings and 4 verb meanings, according to New Oxford Dictionary
  • how many occurrences of each sense do we need to get reliable statistics?

Furthermore, the senses are very context-dependent:

  • e.g., in newspaper text, river banks are much less common than financial banks

Hapax n-grams

  • about 75% of the bigrams in Swedish Parole occur only once

Hapax phrase structures

  • about 50% of the syntactical constructions in the Penn Treebank occur only once


To solve the sparse data problem we need statistical smoothing techniques:

  • Laplace smoothing, Witten-Bell, Good-Turing, Kneser-Ney, etc.
  • but this is not enough – we also need more data
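
As an illustration of the simplest of these techniques, Laplace (add-one) smoothing of bigram probabilities adds 1 to every bigram count, so that unseen bigrams get a small non-zero probability. The sketch below assumes a plain list of tokens as training data; the function names are hypothetical.

```python
from collections import Counter

def train_bigrams(tokens):
    """Collect the counts needed for a bigram model."""
    contexts = Counter(tokens[:-1])             # how often each word starts a bigram
    bigrams = Counter(zip(tokens, tokens[1:]))  # how often each word pair occurs
    vocabulary = set(tokens)
    return contexts, bigrams, vocabulary

def laplace_prob(w1, w2, contexts, bigrams, vocabulary):
    """P(w2 | w1) with add-one smoothing: (c(w1,w2) + 1) / (c(w1) + V)."""
    return (bigrams[(w1, w2)] + 1) / (contexts[w1] + len(vocabulary))

tokens = "the cat sat on the mat".split()
contexts, bigrams, vocabulary = train_bigrams(tokens)
print(laplace_prob("the", "cat", contexts, bigrams, vocabulary))  # seen bigram
print(laplace_prob("the", "sat", contexts, bigrams, vocabulary))  # unseen bigram
```

For each context word, the smoothed probabilities still sum to 1 over the vocabulary; the probability mass given to unseen bigrams is taken from the seen ones.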

Main levels of abstraction

Roughly the NLP tasks can be categorised into the following abstraction levels:

  • Phonetics/phonology: Speech sounds, acoustics
  • Morphology: Parts of words; affixes (prefix, infix, suffix)
  • Lexical: Words, parts-of-speech, inflection
  • Syntax: Grammatical structure, parsing
  • Semantics: In-sentence meaning, 1st order logic
  • Discourse: Anaphora resolution, text structure
  • Pragmatics: Context-dependence, presupposition

In general, the lower in the list, the more "AI"-like the problems are.

Main NLP approaches

Symbolic / rule-based approaches

  • uses hand-crafted linguistic knowledge:
    • formal linguistics, logics, formal systems
    • grammars, rules, theorem provers
  • often bad coverage
    • on the other hand, deep analysis
  • works well for limited domains
    • time-table information, weather forecasts, MP3 player, etc.

Data-driven / statistical approaches

  • uses lots of linguistic data (lexica, corpora):
    • statistics, databases, evaluation metrics
    • statistical models, machine learning
  • better coverage
    • on the other hand, shallow analysis
  • better suited for unrestricted domains
    • information retrieval, web searches, chatbots, etc.

Hybrid approaches

  • combining the best of both worlds, or at least trying
  • this is a very hot current research trend

Main NLP tasks: Audiovisual

Speech synthesis / Text-to-speech (TTS)

There are two main techniques; formant synthesis:

  • based on mathematical models
  • often sounds "artificial"
  • easy to modify, e.g., to make it sound like a female/male/child
  • cheap: needs no recorded data

and concatenative synthesis: 

  • concatenation of segments of recorded speech
  • sounds much more "natural" than formant TTS
  • difficult to modify, since it's based on a real human
  • expensive: requires lots of manually annotated recordings
  • variants: diphone (smaller) and unit-selection (better)

There are still lots of interesting problems: 

  • multilingual utterances:
    • "multilingual is called flerspråklig in Swedish"
    • "the president of Georgia is Mikheil Saakashvili"
  • ambiguity (homographs):
    • record: /'rekərd/ or /ri'kôrd/
    • entrance: /'entrəns/ or /en'trans/
    • learned: /'lərnid/ or /lərnd/
    • produce: /prə'd(y)o͞os/ or /'präd(y)o͞os/
  • prosody:
    • intonation, emotion, dialect, gender, age


Automatic speech recognition (ASR)

All successful ASR systems are statistical, based on Hidden Markov Models (HMM):

  • uses an acoustic model and a language model
  • requires huge amounts of annotated recordings

ASR is a very difficult problem:

  • coarticulation: the sounds representing successive letters blend into each other
  • there are often no pauses between successive words
  • word error rates on some tasks are less than 1%; on others they can be as high as 50%
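
Word error rate (WER) is usually computed as the word-level edit distance between the recognised text and a reference transcription, divided by the number of words in the reference. A minimal sketch, assuming whitespace-separated words and a hypothetical function name:

```python
def word_error_rate(reference, hypothesis):
    """(substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("the reference must contain at least one word")
    # prev[j] = edit distance between the reference words read so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)

# One substitution (sat/sit) and one deletion (the): 2 errors in 6 words.
print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

Since insertions are counted as errors, the WER can be higher than 100%.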

Interesting research problems:

  • speaker adaptation
  • multilingual utterances
  • dialects
  • recognising prosodic information:
    • intonation, emotion, dialect, gender, age


Other related areas

Recognition/generation of gestures/facial expressions:

  • useful for sign languages, or augmenting speech synthesis by animated mouth, facial emotions or body posture

Optical character recognition (OCR):

  • character error rate about 1% for type-written text in Latin script
  • much worse for hand-written texts, non-Latin scripts, or historical texts

Main NLP tasks: Segmentation

Token segmentation / Word tokenisation

For English and Swedish, words are often separated by whitespace; but there are still problems:

  • multi-word units:
    • "inter alia", "Kalle Anka"
  • abbreviations:
    • "e. g." (English), "t. ex." or "t ex" (Swedish)
  • compounds:
    • "flaggstångsknoppspolerare", "Kalle Ankatidning"
  • clitics:
    • "doesn't", "you're"
  • split words by line-breaking:
    • "news- / paper" (dash should be removed)
    • "co- / education" (dash should be kept)
  • split compounds:
    • "bil- och båtägare" (Swedish)
  • special tokens:
    • currency, phone numbers, URLs, email addresses, etc.

Some languages (e.g., Chinese, Japanese, Thai) do not mark word boundaries in text, which makes the tokenisation problem much harder.
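
Some of the problems above can be handled by a tokeniser built from an ordered list of regular expressions, where special tokens are tried before ordinary words. The following is a rough sketch, not a complete tokeniser: the abbreviation list and the patterns are illustrative assumptions, and multi-word units, compounds and line-break hyphenation are not handled at all.

```python
import re

# Illustrative abbreviation list (an assumption, far from complete).
ABBREVIATIONS = ["e.g.", "i.e.", "etc.", "t.ex."]

TOKEN_RE = re.compile("|".join([
    r"https?://\S+",                                 # URLs
    r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+",                 # email addresses
    "|".join(re.escape(a) for a in ABBREVIATIONS),   # known abbreviations
    r"\$?\d+(?:[.,]\d+)*",                           # numbers, dollar amounts
    r"\w+(?:['-]\w+)*",                              # words, clitics kept attached
    r"[^\w\s]",                                      # any other single symbol
]))

def tokenise(text):
    """Split text into tokens; earlier patterns take priority over later ones."""
    return TOKEN_RE.findall(text)

print(tokenise("She doesn't pay $3.50, e.g. at http://example.org today."))
```

The order of the patterns matters: if the word pattern came first, "e.g." would be split into four tokens. Note also that the URL pattern is greedy, so a URL directly followed by a full stop would swallow the full stop.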


Morphological segmentation

Separate words into individual morphemes and identify the class of the morphemes

  • this is not a serious problem in English
  • slightly more problematic in Swedish: especially compound words are difficult
  • very problematic in agglutinative languages, such as Turkish:
    • uygar-laş-tır-ama-dık-lar-ımız-dan-mış-sınız-casına
    • "as if you are among those whom we were not able to civilize"
  • …or Finnish:
    • lentokonesuihkuturbiinimoottoriapumekaanikkoaliupseerioppilas
    • "technical warrant officer trainee specialized in aircraft jet engines"


Sentence breaking / splitting

This is not always trivial, especially not in unrestricted text

  • sentences can have recursive structure: "I say 'Hi there!' to her."
  • abbreviations (mr., mrs., e.g.) can interfere, and they can share punctuation with the end-of-sentence
  • sentences can continue between line breaks, e.g., in bullet lists
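
A common heuristic is to split after ".", "!" or "?" when it is followed by whitespace and an uppercase letter, unless the preceding word is a known abbreviation. A minimal sketch of this heuristic, with an illustrative abbreviation list and a hypothetical function name:

```python
import re

# Illustrative abbreviation list (an assumption, far from complete).
ABBREVIATIONS = {"mr.", "mrs.", "e.g.", "i.e.", "etc."}

def split_sentences(text):
    """Split text at sentence-final punctuation, skipping known abbreviations."""
    sentences = []
    start = 0
    # Candidate boundary: . ! or ? followed by whitespace and an uppercase letter.
    for match in re.finditer(r"[.!?](?=\s+[A-Z])", text):
        end = match.end()
        last_word = text[start:end].split()[-1].lower()
        if last_word in ABBREVIATIONS:
            continue  # the full stop belongs to an abbreviation
        sentences.append(text[start:end].strip())
        start = end
    rest = text[start:].strip()
    if rest:
        sentences.append(rest)
    return sentences

print(split_sentences("Mr. Smith arrived. He met Mrs. Jones, e.g. at noon. Was it late?"))
```

The heuristic fails when an abbreviation ends a sentence, since the full stop is then shared between the abbreviation and the end-of-sentence.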

Main NLP tasks: Tagging and chunking

Part-of-speech tagging

Assigning one (or more) part-of-speech tags to each word in a text.

  • many words are ambiguous:
    • "book" can be a noun or verb
    • "set" can be a noun, verb or adjective
    • "out" can be at least five different parts of speech
  • unknown words:
    • we need heuristics to decide their part-of-speech 


  • rule-based tagging:
    • hand-crafted rewrite rules
  • transformation-based tagging:
    • automatically learned rewrite rules
  • statistical tagging:
    • HMM-based n-gram tagging
  • other machine learning approaches:
    • memory-based learning, decision trees, support vector machines (SVM), maximum entropy Markov models (MEMM), conditional random fields (CRF), perceptron, etc.

The part-of-speech tagset: 

  • depends on the language, and your theory of grammar
  • English: between 20 and 500 different tags
  • other languages can have many many more POS tags

Current state-of-the-art for English:

  • baseline: ~90% (if we always assign each word its most probable tag)
  • best: ~97% (with a tagset of ~40 tags)
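
The baseline tagger is easy to implement: count how often each word occurs with each tag in an annotated corpus, and always pick the most frequent one. The sketch below uses a tiny invented corpus and invented tag names; unknown words get the overall most frequent tag, which is one simple heuristic among many.

```python
from collections import Counter, defaultdict

def train_baseline(tagged_sentences):
    """Learn the most frequent tag of each word, and a default tag."""
    word_tags = defaultdict(Counter)
    all_tags = Counter()
    for sentence in tagged_sentences:
        for word, tag in sentence:
            word_tags[word.lower()][tag] += 1
            all_tags[tag] += 1
    most_likely = {w: tags.most_common(1)[0][0] for w, tags in word_tags.items()}
    default_tag = all_tags.most_common(1)[0][0]  # used for unknown words
    return most_likely, default_tag

def tag(words, most_likely, default_tag):
    """Assign each word its most frequent tag from the training data."""
    return [(w, most_likely.get(w.lower(), default_tag)) for w in words]

# Toy training corpus (invented for this example).
corpus = [
    [("I", "PRON"), ("book", "VERB"), ("a", "DET"), ("flight", "NOUN")],
    [("the", "DET"), ("book", "NOUN"), ("is", "VERB"), ("old", "ADJ")],
    [("read", "VERB"), ("the", "DET"), ("book", "NOUN"),
     ("about", "PREP"), ("flights", "NOUN")],
]
most_likely, default_tag = train_baseline(corpus)
print(tag(["Book", "the", "hotel"], most_likely, default_tag))
```

In the example, "Book" is tagged as a noun even though it is a verb here: the baseline ignores the context completely.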


Text chunking

Dividing a text into syntactically correlated groups of words

  • most common is noun phrase (NP) chunking
  • the chunks should be non-recursive; i.e., not full NPs, but instead NPs without prepositional phrases


  • machine learning methods from a manually annotated corpus
  • hand-crafted grammars or rewrite systems 
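
A hand-crafted NP chunker can be as simple as one pattern over part-of-speech tags. The sketch below assumes already tagged input with the invented tags DET, ADJ and NOUN, and uses the pattern "optional determiner, any number of adjectives, one or more nouns".

```python
def np_chunks(tagged):
    """Return the non-recursive NP chunks of a tagged sentence, as word lists."""
    chunks = []
    i = 0
    n = len(tagged)
    while i < n:
        j = i
        if tagged[j][1] == "DET":                 # optional determiner
            j += 1
        while j < n and tagged[j][1] == "ADJ":    # any number of adjectives
            j += 1
        k = j
        while k < n and tagged[k][1] == "NOUN":   # one or more nouns
            k += 1
        if k > j:                                 # at least one noun was found
            chunks.append([word for word, _ in tagged[i:k]])
            i = k
        else:
            i += 1
    return chunks

sentence = [("the", "DET"), ("old", "ADJ"), ("man", "NOUN"), ("saw", "VERB"),
            ("a", "DET"), ("dog", "NOUN"), ("in", "PREP"),
            ("the", "DET"), ("park", "NOUN")]
print(np_chunks(sentence))
```

The prepositional phrase "in the park" is not attached to "a dog"; the chunker only finds the three flat chunks "the old man", "a dog" and "the park".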


Named entity recognition (NER)

Determine which items in a text map to proper names (e.g., people or places), 

  • and what the type of each such name is (e.g., person, location, organization)

Names are often capitalised in English (but not always), and most capitalised words are names (but not always)

  • in German, all nouns are capitalised
  • in French and Spanish, names serving as adjectives are not capitalised
  • many non-Latin scripts don't have capitalisation at all


Main NLP tasks: Syntax


Specify the grammatical structure of the sentences in a language.

Sometimes we use context-free grammars (also known as BNF grammars), but more often higher-order grammar formalisms such as:

  • GF: Grammatical Framework
  • HPSG: Head-Driven Phrase-Structure Grammar
  • TAG: Tree-Adjoining Grammar
  • LFG: Lexical-Functional Grammar
  • DG: Dependency Grammar



Parsing = determining the parse tree of a given sentence. NLP parsing is very different from parsing of programming languages: 

  • most sentences are ambiguous, even massively ambiguous
    • for a typical sentence there may be thousands of potential parses (most of which will seem completely nonsensical to a human)
  • programming languages, on the other hand, are designed to be unambiguous
    • totally different types of parsing algorithms
  • we need an algorithm that returns the most probable tree
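
One standard algorithm of this kind is probabilistic CKY parsing, which works on context-free grammars in Chomsky normal form. The sketch below is a minimal version that returns the most probable tree and also counts the number of parses. The grammar and its probabilities are invented for this example.

```python
from collections import defaultdict

BINARY = [  # (parent, left child, right child, probability), invented values
    ("S", "NP", "VP", 1.0),
    ("VP", "V", "NP", 0.7),
    ("VP", "VP", "PP", 0.3),
    ("NP", "NP", "PP", 0.2),
    ("NP", "Det", "N", 0.5),
    ("PP", "P", "NP", 1.0),
]
LEXICAL = [  # (category, word, probability), invented values
    ("NP", "I", 0.3), ("V", "saw", 1.0), ("Det", "the", 1.0),
    ("N", "man", 0.5), ("N", "telescope", 0.5), ("P", "with", 1.0),
]

def cky(words):
    """Return ((probability, tree) of the best S, number of S parses)."""
    n = len(words)
    # best[i, j][cat] = (probability, tree) of the best cat spanning words[i:j]
    best = defaultdict(dict)
    # count[i, j][cat] = number of different trees for cat spanning words[i:j]
    count = defaultdict(lambda: defaultdict(int))
    for i, word in enumerate(words):
        for cat, w, p in LEXICAL:
            if w == word:
                best[i, i + 1][cat] = (p, (cat, word))
                count[i, i + 1][cat] += 1
    for width in range(2, n + 1):            # shorter spans are filled first
        for i in range(n - width + 1):
            j = i + width
            for k in range(i + 1, j):        # split point
                for parent, left, right, p in BINARY:
                    if left in best[i, k] and right in best[k, j]:
                        p_left, t_left = best[i, k][left]
                        p_right, t_right = best[k, j][right]
                        prob = p * p_left * p_right
                        count[i, j][parent] += count[i, k][left] * count[k, j][right]
                        if prob > best[i, j].get(parent, (0.0, None))[0]:
                            best[i, j][parent] = (prob, (parent, t_left, t_right))
    return best[0, n].get("S"), count[0, n]["S"]

result, parses = cky("I saw the man with the telescope".split())
print(parses)   # the sentence has two parses with this grammar
print(result)   # probability and tree of the most probable one
```

With these probabilities, the preferred tree attaches "with the telescope" to the verb phrase, i.e., the telescope is the instrument of seeing.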


  • hand-written grammars:
    • sometimes context-free, but more often higher-order grammar formalisms like the ones above
  • grammars trained from annotated corpora (treebanks):
    • automatically or semi-automatically
    • sometimes the grammar is skipped – the parser is trained directly from the corpus
  • hybrid methods:
    • hand-written backbone in some grammar formalism
    • probabilities and parsing heuristics trained from a corpus


Text generation

The opposite of parsing; generating utterances from syntactic (and semantic) structure.

  • sometimes we use higher-order grammar formalisms, but there are lots of other approaches
  • we need heuristics for deciding which surface structure is the "best"
    • classical AI techniques can be used, such as planning
  • the "best" generated sentence depends on the application, the audience and the context


Main NLP tasks: Meaning: semantics, pragmatics

Computational semantics

Specify the formal meaning of sentences/utterances, and of longer texts

  • the classic example (by Richard Montague, 1970) is first-order logic with lambda calculus
    • type theory can be used as an alternative
  • discourse representation theory (DRT) can handle meaning across sentence boundaries
    • type theory with records is another alternative
  • latest trend is minimal recursion semantics (MRS)
    • MRS uses underspecification to reduce the number of ambiguities
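
The lambda calculus approach can be illustrated directly in a programming language with higher-order functions. In the sketch below, quantifiers are functions taking a set and a predicate, and the two readings of "every man loves a woman" are evaluated in a small invented model. All names and facts in the model are assumptions made for this example.

```python
# A tiny invented model of the world.
MEN = {"adam", "bo"}
WOMEN = {"eva", "kim"}
LOVES = {("adam", "eva"), ("bo", "kim")}

# Quantifiers take a restrictor set and then a scope predicate.
every = lambda restrictor: lambda scope: all(scope(x) for x in restrictor)
some = lambda restrictor: lambda scope: any(scope(x) for x in restrictor)

# loves(x)(y) is true if x loves y in the model.
loves = lambda x: lambda y: (x, y) in LOVES

# Reading 1: for every man there is some woman that he loves.
surface = every(MEN)(lambda m: some(WOMEN)(lambda w: loves(m)(w)))

# Reading 2: there is some woman that every man loves.
inverse = some(WOMEN)(lambda w: every(MEN)(lambda m: loves(m)(w)))

print(surface)  # True in this model
print(inverse)  # False in this model
```

The two readings differ only in which quantifier is applied outermost, and the model shows that they are not equivalent.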


Word sense disambiguation

To select the word meaning which makes the most sense in context

  • typically given a list of words and associated word senses:
    • e.g. from a dictionary or from an online resource such as WordNet.
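
One simple dictionary-based method is the simplified Lesk algorithm: choose the sense whose dictionary definition shares the most words with the context. The sketch below uses two invented definitions of "bank" and a small invented stopword list; a real system would take the definitions from a dictionary or a similar resource.

```python
# Invented sense definitions, for illustration only.
SENSES = {
    "bank": {
        "financial": "an institution that accepts money deposits and gives loans",
        "river": "the sloping land beside a river or lake",
    }
}
STOPWORDS = {"a", "an", "the", "of", "or", "and", "that", "to", "in", "on", "at"}

def content_words(text):
    """Lowercased words of text, without stopwords."""
    return {w for w in text.lower().split() if w not in STOPWORDS}

def simplified_lesk(word, context):
    """Return the sense of word whose definition overlaps most with context."""
    context_words = content_words(context)
    best_sense, best_overlap = None, -1
    for sense, definition in SENSES[word].items():
        overlap = len(content_words(definition) & context_words)
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense

print(simplified_lesk("bank", "she deposits money at the bank"))
print(simplified_lesk("bank", "they sat on the bank of the river"))
```

The method depends heavily on the wording of the definitions, so it is best seen as a baseline.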


Coreference resolution

To determine which words refer to the same objects in a text

  • anaphora resolution is the most common example; matching pronouns with the nouns or names that they refer to 


Relationship extraction

To identify the relationships among named entities in a text

  • example: "Albert's niece Ann got engaged to John."
  • inferred relations: DaughterOfSibling(Albert,Ann); Engaged(Ann, John)


Speech act classification

To classify the speech act of utterances in a discourse 

  • possible speech acts: yes-no question, content question, statement, assertion, etc.


Main NLP tasks: Text-level analysis

Automatic summarisation

Produce a readable summary of a text, or of several texts; such as newspaper articles or patient journals


Document classification / text categorisation

To assign documents to one or more categories, based on their content
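
A classic statistical method for this task is the multinomial Naive Bayes classifier. The following is a minimal sketch with add-one (Laplace) smoothing, trained on four invented one-line documents; the function names and the whitespace tokenisation are assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict

def train_nb(documents):
    """documents: a list of (text, category) pairs."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, category in documents:
        class_counts[category] += 1
        for word in text.lower().split():
            word_counts[category][word] += 1
            vocabulary.add(word)
    return class_counts, word_counts, vocabulary

def classify(text, model):
    """Return the category with the highest log-probability for text."""
    class_counts, word_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    scores = {}
    for category in class_counts:
        total_words = sum(word_counts[category].values())
        score = math.log(class_counts[category] / total_docs)  # prior
        for word in text.lower().split():
            if word not in vocabulary:
                continue  # ignore words never seen in training
            # add-one smoothing of the word likelihood
            score += math.log((word_counts[category][word] + 1)
                              / (total_words + len(vocabulary)))
        scores[category] = score
    return max(scores, key=scores.get)

# Toy training data (invented for this example).
model = train_nb([
    ("the team won the match", "sports"),
    ("a late goal decided the match", "sports"),
    ("the bank raised interest rates", "finance"),
    ("shares fell as the bank reported losses", "finance"),
])
print(classify("the team scored a goal", model))
print(classify("interest rates fell", model))
```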


Information retrieval (IR)

Storing, searching and retrieving information from texts or databases
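
The basic data structure behind most text search is the inverted index, which maps each word to the documents that contain it. A minimal sketch with Boolean AND queries, assuming whitespace tokenisation and three invented documents:

```python
from collections import defaultdict

def build_index(documents):
    """Map each lowercased word to the set of ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

def search(index, query):
    """Return the ids of the documents that contain every word of the query."""
    words = query.lower().split()
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result

# Toy document collection (invented for this example).
documents = {
    1: "speech recognition with hidden markov models",
    2: "statistical machine translation",
    3: "speech synthesis and speech recognition",
}
index = build_index(documents)
print(search(index, "speech recognition"))
print(search(index, "machine translation"))
```

Real IR systems also rank the retrieved documents, e.g., by how often and how specifically the query words occur in them.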


Information extraction (IE)

Extracting semantic information from text; this covers tasks such as named entity recognition, coreference resolution, relationship extraction, etc.




Page updated: 2012-03-13 11:35
