Natural Language Processing Fundamentals

Natural Language Processing (NLP) is a branch of artificial intelligence that focuses on enabling computers to understand, interpret, and generate human language. It bridges the gap between human communication and machine comprehension, combining insights from linguistics, computer science, and machine learning.[1]


Core Components of NLP

NLP can be divided into two complementary areas:

Natural Language Understanding (NLU)

The ability of machines to comprehend text or speech:

  • Extract meaning and intent
  • Identify entities and relationships
  • Understand context and nuance
  • Handle ambiguity and implicit meaning

Natural Language Generation (NLG)

The ability of machines to produce human-like text:

  • Generate coherent sentences and paragraphs
  • Maintain context across long outputs
  • Adapt style and tone appropriately
  • Create summaries, translations, and responses

Text Preprocessing Pipeline

Before analysis, raw text must be transformed into a structured format:

1. Text Normalization

| Step | Description | Example |
| --- | --- | --- |
| Lowercasing | Convert to uniform case | “Hello World” → “hello world” |
| Unicode normalization | Standardize character encoding | “café” → “cafe” |
| Noise removal | Strip irrelevant characters | Remove HTML tags, special symbols |
| Whitespace normalization | Standardize spacing | Multiple spaces → single space |
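The normalization steps above can be sketched in a few lines of Python; the exact regexes and the decision to strip diacritics are illustrative assumptions, not a fixed standard:

import re
import unicodedata

def normalize(text: str) -> str:
    # Lowercase for uniform case
    text = text.lower()
    # Decompose accented characters and drop combining marks ("café" -> "cafe")
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Strip HTML tags and other noise (illustrative regexes)
    text = re.sub(r"<[^>]+>", " ", text)
    text = re.sub(r"[^a-z0-9\s.,!?']", " ", text)
    # Collapse repeated whitespace
    return re.sub(r"\s+", " ", text).strip()

print(normalize("  <b>Hello   World!</b> Café  "))  # hello world! cafe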

2. Tokenization

Breaking text into discrete units (tokens):

Input:  "Natural language processing is fascinating."
Output: ["Natural", "language", "processing", "is", "fascinating", "."]

Tokenization approaches:

  • Word tokenization: Split on whitespace and punctuation
  • Subword tokenization: Break words into meaningful subunits (BPE, WordPiece)
  • Character tokenization: Individual characters as tokens
  • Sentence tokenization: Split document into sentences
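A whitespace-and-punctuation word tokenizer can be approximated with a single regular expression; this is a simplified sketch, and production systems usually rely on trained subword tokenizers such as BPE or WordPiece:

import re

def word_tokenize(text: str) -> list[str]:
    # Match words (allowing internal apostrophes) or single punctuation marks
    return re.findall(r"\w+(?:'\w+)?|[^\w\s]", text)

print(word_tokenize("Natural language processing is fascinating."))
# ['Natural', 'language', 'processing', 'is', 'fascinating', '.']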

3. Stopword Removal

Filtering common words with low semantic value:

Before: ["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"]
After:  ["quick", "brown", "fox", "jumps", "lazy", "dog"]

Note: Modern deep learning approaches often retain stopwords as they provide contextual information.
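A minimal sketch with a hand-picked stopword list; in practice the list usually comes from a library such as NLTK or spaCy:

STOPWORDS = {"the", "a", "an", "is", "are", "over", "of", "and", "to", "in"}

def remove_stopwords(tokens: list[str]) -> list[str]:
    return [t for t in tokens if t.lower() not in STOPWORDS]

tokens = ["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"]
print(remove_stopwords(tokens))
# ['quick', 'brown', 'fox', 'jumps', 'lazy', 'dog']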

4. Stemming and Lemmatization

Reducing words to their base forms:

| Method | Approach | “running” → | “better” → |
| --- | --- | --- | --- |
| Stemming | Rule-based suffix removal | “run” | “better” |
| Lemmatization | Dictionary-based with POS | “run” | “good” |

Stemming algorithms:

  • Porter Stemmer (most common)
  • Snowball Stemmer
  • Lancaster Stemmer (aggressive)
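A sketch comparing the two with NLTK's Porter stemmer and WordNet lemmatizer (assumes NLTK is installed and the WordNet data has been downloaded; the outputs shown are the typical results):

from nltk.stem import PorterStemmer, WordNetLemmatizer
# Requires: nltk.download("wordnet") before first use of the lemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

print(stemmer.stem("running"))                   # run
print(stemmer.stem("better"))                    # better (no dictionary knowledge)
print(lemmatizer.lemmatize("running", pos="v"))  # run
print(lemmatizer.lemmatize("better", pos="a"))   # good (uses WordNet plus POS)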

Linguistic Concepts

Part-of-Speech (POS) Tagging

Assigning grammatical categories to words:

The/DET quick/ADJ brown/ADJ fox/NOUN jumps/VERB over/ADP the/DET lazy/ADJ dog/NOUN

Common POS tags:

  • NOUN: Nouns
  • VERB: Verbs
  • ADJ: Adjectives
  • ADV: Adverbs
  • DET: Determiners
  • ADP: Adpositions (prepositions and postpositions)
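A POS-tagging sketch with spaCy, assuming the small English pipeline en_core_web_sm has been installed; tags may vary slightly between model versions:

import spacy

nlp = spacy.load("en_core_web_sm")  # pretrained English pipeline
doc = nlp("The quick brown fox jumps over the lazy dog")

print(" ".join(f"{token.text}/{token.pos_}" for token in doc))
# Typical output:
# The/DET quick/ADJ brown/ADJ fox/NOUN jumps/VERB over/ADP the/DET lazy/ADJ dog/NOUN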

Named Entity Recognition (NER)

Identifying and classifying named entities:

[Apple]ORG announced that [Tim Cook]PERSON will visit [Tokyo]LOC on [Monday]DATE.

Entity types:

  • PERSON: People, characters
  • ORG: Organizations, companies
  • LOC: Locations, places
  • DATE/TIME: Temporal expressions
  • MONEY: Monetary values
  • PRODUCT: Products
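The same spaCy pipeline exposes named entities through doc.ents; a sketch, again assuming en_core_web_sm (note that spaCy labels cities and countries GPE rather than LOC):

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple announced that Tim Cook will visit Tokyo on Monday.")

for ent in doc.ents:
    print(ent.text, ent.label_)
# Typical output:
# Apple ORG
# Tim Cook PERSON
# Tokyo GPE
# Monday DATE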

Dependency Parsing

Analyzing grammatical structure and word relationships:

jumps (ROOT)
├── fox (nsubj)
│   ├── The (det)
│   ├── quick (amod)
│   └── brown (amod)
├── over (prep)
│   └── dog (pobj)
│       ├── the (det)
│       └── lazy (amod)
└── . (punct)
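spaCy's parser exposes the same structure programmatically through each token's head and dependency label; a sketch assuming en_core_web_sm (exact labels depend on the model):

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The quick brown fox jumps over the lazy dog.")

for token in doc:
    print(f"{token.text:<6} --{token.dep_}--> {token.head.text}")
# e.g. fox --nsubj--> jumps, over --prep--> jumps, dog --pobj--> over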

Coreference Resolution

Identifying when different expressions refer to the same entity:

"John went to the store. He bought milk."
John ← He (coreference)

Classical NLP Techniques

Bag of Words (BoW)

Represents text as word frequency vectors, ignoring order:

Document 1: "I love NLP"
Document 2: "I love machine learning"

Vocabulary: [I, love, NLP, machine, learning]

BoW(Doc 1): [1, 1, 1, 0, 0]
BoW(Doc 2): [1, 1, 0, 1, 1]
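A minimal bag-of-words implementation over the two example documents (pure Python; scikit-learn's CountVectorizer does the same thing at scale). The vocabulary is sorted here, so the column order differs from the listing above:

docs = ["I love NLP", "I love machine learning"]

# Build a fixed vocabulary from all documents
vocab = sorted({word for doc in docs for word in doc.split()})

def bow_vector(doc: str) -> list[int]:
    counts = {}
    for word in doc.split():
        counts[word] = counts.get(word, 0) + 1
    return [counts.get(word, 0) for word in vocab]

print(vocab)               # ['I', 'NLP', 'learning', 'love', 'machine']
print(bow_vector(docs[0])) # [1, 1, 0, 1, 0]
print(bow_vector(docs[1])) # [1, 0, 1, 1, 1]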

TF-IDF (Term Frequency-Inverse Document Frequency)

Weights words by importance within and across documents:

$$ \text{TF-IDF}(t, d) = \text{TF}(t, d) \times \log\left(\frac{N}{\text{DF}(t)}\right) $$

Where:

  • $\text{TF}(t, d)$ = frequency of term $t$ in document $d$
  • $N$ = total number of documents
  • $\text{DF}(t)$ = number of documents containing term $t$
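A direct implementation of the formula, using raw counts for TF and the natural logarithm (libraries such as scikit-learn add smoothing and normalization, so their values differ slightly):

import math

docs = [["I", "love", "NLP"], ["I", "love", "machine", "learning"]]
N = len(docs)

def tf(term, doc):
    return doc.count(term)

def df(term):
    return sum(1 for doc in docs if term in doc)

def tf_idf(term, doc):
    return tf(term, doc) * math.log(N / df(term))

print(tf_idf("NLP", docs[0]))   # 0.693... (term appears in 1 of 2 documents)
print(tf_idf("love", docs[0]))  # 0.0 (term appears in every document)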

N-grams

Contiguous sequences of n items:

Text: "natural language processing"

Unigrams (n=1): ["natural", "language", "processing"]
Bigrams (n=2):  ["natural language", "language processing"]
Trigrams (n=3): ["natural language processing"]
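Generating word-level n-grams is a simple sliding window over the token list:

def ngrams(tokens, n):
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "natural language processing".split()
print(ngrams(tokens, 1))  # ['natural', 'language', 'processing']
print(ngrams(tokens, 2))  # ['natural language', 'language processing']
print(ngrams(tokens, 3))  # ['natural language processing']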

Core NLP Tasks

Text Classification

Assigning predefined categories to text:

  • Spam detection
  • Sentiment analysis
  • Topic categorization
  • Intent classification
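A compact classical-ML sketch: TF-IDF features feeding a linear classifier with scikit-learn. The tiny spam/ham dataset below is invented purely for illustration:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = ["win a free prize now", "meeting at 10am tomorrow",
         "claim your free reward", "lunch with the team today"]
labels = ["spam", "ham", "spam", "ham"]

model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(texts, labels)

print(model.predict(["free prize waiting for you"]))  # likely ['spam']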

Sentiment Analysis

Determining emotional tone:

| Aspect | Description |
| --- | --- |
| Polarity | Positive, negative, neutral |
| Subjectivity | Opinion vs. fact |
| Emotion | Joy, anger, fear, etc. |
| Aspect-based | Sentiment per feature |

Machine Translation

Converting text between languages:

  • Statistical Machine Translation (SMT)
  • Neural Machine Translation (NMT)
  • Transformer-based models (current state-of-the-art)

Question Answering

Extracting answers from text:

Context: "The Eiffel Tower was built in 1889."
Question: "When was the Eiffel Tower built?"
Answer: "1889"

Types:

  • Extractive QA: Extract span from context
  • Generative QA: Generate answer from understanding
  • Open-domain QA: Search and answer from large corpora
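A sketch of extractive QA with the Hugging Face transformers pipeline; it downloads a pretrained model on first use, and the default model can change between library versions:

from transformers import pipeline

qa = pipeline("question-answering")
result = qa(question="When was the Eiffel Tower built?",
            context="The Eiffel Tower was built in 1889.")
print(result["answer"])  # expected: 1889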

Text Summarization

Condensing documents:

| Type | Method |
| --- | --- |
| Extractive | Select important sentences |
| Abstractive | Generate new summary text |

Modern Deep Learning Approaches

Evolution of NLP Models

| Era | Approach | Key Models |
| --- | --- | --- |
| Pre-2013 | Rule-based, statistical | N-grams, HMMs |
| 2013-2017 | Word embeddings | Word2Vec, GloVe, FastText |
| 2017-2018 | Contextualized embeddings | ELMo, ULMFiT |
| 2018-present | Transformer-based | BERT, GPT, T5, LLaMA |

Word Embeddings

Dense vector representations capturing semantic meaning:

  • Word2Vec - Prediction-based embeddings
  • GloVe - Count-based embeddings
  • FastText - Subword-aware embeddings
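A sketch of training Word2Vec embeddings on a toy corpus with gensim; real embeddings need far more text, and the parameters below are illustrative:

from gensim.models import Word2Vec

# Toy corpus: a list of tokenized sentences
sentences = [["natural", "language", "processing"],
             ["machine", "learning", "for", "language"],
             ["deep", "learning", "for", "natural", "language"]]

model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, epochs=50)

print(model.wv["language"].shape)                  # (50,) dense vector
print(model.wv.similarity("language", "natural"))  # cosine similarity score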

Transformer Architecture

The foundation of modern NLP:

  • Attention Mechanisms - Focus on relevant parts of input
  • Transformers - Self-attention based architecture
  • BERT - Bidirectional understanding
  • GPT - Autoregressive generation
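At the core of the architecture is scaled dot-product attention; a minimal NumPy sketch of the computation, with a single head and no masking or learned projections:

import numpy as np

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    # Similarity of every query with every key, scaled by sqrt(d_k)
    scores = Q @ K.T / np.sqrt(d_k)
    # Softmax over keys: attention weights sum to 1 for each query
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # Each output position is a weighted average of the value vectors
    return weights @ V

Q = np.random.randn(4, 8)  # 4 positions, dimension 8
K = np.random.randn(4, 8)
V = np.random.randn(4, 8)
print(scaled_dot_product_attention(Q, K, V).shape)  # (4, 8)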

NLP Pipeline Architecture

┌─────────────────────────────────────────────────────────────┐
│                        Raw Text                              │
└──────────────────────────┬──────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│              Text Preprocessing                              │
│   (Normalization → Tokenization → Stopwords → Stemming)     │
└──────────────────────────┬──────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│              Feature Extraction                              │
│   (BoW, TF-IDF, Word Embeddings, Contextual Embeddings)     │
└──────────────────────────┬──────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│                    Model / Task                              │
│   (Classification, NER, QA, Summarization, Generation)      │
└──────────────────────────┬──────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│                    Output / Action                           │
└─────────────────────────────────────────────────────────────┘

Challenges in NLP

  1. Ambiguity: Words and sentences with multiple meanings
  2. Context dependence: Meaning varies with surrounding text
  3. Implicit meaning: Sarcasm, idioms, cultural references
  4. Low-resource languages: Limited training data
  5. Domain specificity: Technical jargon, specialized vocabulary
  6. Evolving language: New words, changing usage

Related Topics

  • Vector Embeddings - Foundation of modern NLP representations
  • Word2Vec - Pioneering word embedding technique
  • BERT - State-of-the-art language understanding
  • Transformers - Architecture behind modern NLP
  • Semantic Search - NLP application for information retrieval

References


  1. Jurafsky, D., & Martin, J. H. (2024). Speech and Language Processing (3rd ed. draft). Stanford University. https://web.stanford.edu/~jurafsky/slp3/