Natural Language Processing Fundamentals

Natural Language Processing (NLP) is a branch of artificial intelligence that focuses on enabling computers to understand, interpret, and generate human language. It bridges the gap between human communication and machine comprehension, combining insights from linguistics, computer science, and machine learning.[1]


Core Components of NLP

NLP can be divided into two complementary areas:

Natural Language Understanding (NLU)

The ability of machines to comprehend text or speech:

  • Extract meaning and intent
  • Identify entities and relationships
  • Understand context and nuance
  • Handle ambiguity and implicit meaning

Natural Language Generation (NLG)

The ability of machines to produce human-like text:

  • Generate coherent sentences and paragraphs
  • Maintain context across long outputs
  • Adapt style and tone appropriately
  • Create summaries, translations, and responses

Text Preprocessing Pipeline

Before analysis, raw text must be transformed into a structured format:

1. Text Normalization

| Step | Description | Example |
| --- | --- | --- |
| Lowercasing | Convert to uniform case | “Hello World” → “hello world” |
| Unicode normalization | Standardize character encoding | “café” → “cafe” |
| Noise removal | Strip irrelevant characters | Remove HTML tags, special symbols |
| Whitespace normalization | Standardize spacing | Multiple spaces → single space |
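The normalization steps above can be sketched in a few lines of Python; the exact regexes and the decision to strip diacritics are illustrative assumptions, not a fixed standard:

import re
import unicodedata

def normalize(text: str) -> str:
    # Lowercase for uniform case
    text = text.lower()
    # Decompose accented characters and drop combining marks ("café" -> "cafe")
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Strip HTML tags and other noise (illustrative regexes)
    text = re.sub(r"<[^>]+>", " ", text)
    text = re.sub(r"[^a-z0-9\s.,!?']", " ", text)
    # Collapse repeated whitespace
    return re.sub(r"\s+", " ", text).strip()

print(normalize("  <b>Hello   World!</b> Café  "))  # hello world! cafe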

2. Tokenization

Breaking text into discrete units (tokens):

Input:  "Natural language processing is fascinating."
Output: ["Natural", "language", "processing", "is", "fascinating", "."]

Tokenization approaches:

  • Word tokenization: Split on whitespace and punctuation
  • Subword tokenization: Break words into meaningful subunits (BPE, WordPiece)
  • Character tokenization: Individual characters as tokens
  • Sentence tokenization: Split document into sentences
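A whitespace-and-punctuation word tokenizer can be approximated with a single regular expression; this is a simplified sketch, and production systems usually rely on trained subword tokenizers such as BPE or WordPiece:

import re

def word_tokenize(text: str) -> list[str]:
    # Match words (allowing internal apostrophes) or single punctuation marks
    return re.findall(r"\w+(?:'\w+)?|[^\w\s]", text)

print(word_tokenize("Natural language processing is fascinating."))
# ['Natural', 'language', 'processing', 'is', 'fascinating', '.']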

3. Stopword Removal

Filtering common words with low semantic value:

Before: ["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"]
After:  ["quick", "brown", "fox", "jumps", "lazy", "dog"]

Note: Modern deep learning approaches often retain stopwords as they provide contextual information.
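A minimal sketch with a hand-picked stopword list; in practice the list usually comes from a library such as NLTK or spaCy:

STOPWORDS = {"the", "a", "an", "is", "are", "over", "of", "and", "to", "in"}

def remove_stopwords(tokens: list[str]) -> list[str]:
    return [t for t in tokens if t.lower() not in STOPWORDS]

tokens = ["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"]
print(remove_stopwords(tokens))
# ['quick', 'brown', 'fox', 'jumps', 'lazy', 'dog']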

4. Stemming and Lemmatization

Reducing words to their base forms:

| Method | Approach | “running” → | “better” → |
| --- | --- | --- | --- |
| Stemming | Rule-based suffix removal | “run” | “better” |
| Lemmatization | Dictionary-based with POS | “run” | “good” |

Stemming algorithms:

  • Porter Stemmer (most common)
  • Snowball Stemmer
  • Lancaster Stemmer (aggressive)
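A sketch comparing the two with NLTK's Porter stemmer and WordNet lemmatizer (assumes NLTK is installed and the WordNet data has been downloaded; the outputs shown are the typical results):

from nltk.stem import PorterStemmer, WordNetLemmatizer
# Requires: nltk.download("wordnet") before first use of the lemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

print(stemmer.stem("running"))                   # run
print(stemmer.stem("better"))                    # better (no dictionary knowledge)
print(lemmatizer.lemmatize("running", pos="v"))  # run
print(lemmatizer.lemmatize("better", pos="a"))   # good (uses WordNet plus POS)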

Linguistic Concepts

Part-of-Speech (POS) Tagging

Assigning grammatical categories to words:

The/DET quick/ADJ brown/ADJ fox/NOUN jumps/VERB over/ADP the/DET lazy/ADJ dog/NOUN

Common POS tags:

  • NOUN: Nouns
  • VERB: Verbs
  • ADJ: Adjectives
  • ADV: Adverbs
  • DET: Determiners
  • ADP: Adpositions (prepositions and postpositions)
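A POS-tagging sketch with spaCy, assuming the small English pipeline en_core_web_sm has been installed; tags may vary slightly between model versions:

import spacy

nlp = spacy.load("en_core_web_sm")  # pretrained English pipeline
doc = nlp("The quick brown fox jumps over the lazy dog")

print(" ".join(f"{token.text}/{token.pos_}" for token in doc))
# Typical output:
# The/DET quick/ADJ brown/ADJ fox/NOUN jumps/VERB over/ADP the/DET lazy/ADJ dog/NOUN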

Named Entity Recognition (NER)

Identifying and classifying named entities:

[Apple]ORG announced that [Tim Cook]PERSON will visit [Tokyo]LOC on [Monday]DATE.

Entity types:

  • PERSON: People, characters
  • ORG: Organizations, companies
  • LOC: Locations, places
  • DATE/TIME: Temporal expressions
  • MONEY: Monetary values
  • PRODUCT: Products
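The same spaCy pipeline exposes named entities through doc.ents; a sketch, again assuming en_core_web_sm (note that spaCy labels cities and countries GPE rather than LOC):

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple announced that Tim Cook will visit Tokyo on Monday.")

for ent in doc.ents:
    print(ent.text, ent.label_)
# Typical output:
# Apple ORG
# Tim Cook PERSON
# Tokyo GPE
# Monday DATE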

Dependency Parsing

Analyzing grammatical structure and word relationships:

jumps (ROOT)
├── fox (nsubj)
│   ├── The (det)
│   ├── quick (amod)
│   └── brown (amod)
├── over (prep)
│   └── dog (pobj)
│       ├── the (det)
│       └── lazy (amod)
└── . (punct)
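spaCy's parser exposes the same structure programmatically through each token's head and dependency label; a sketch assuming en_core_web_sm (exact labels depend on the model):

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The quick brown fox jumps over the lazy dog.")

for token in doc:
    print(f"{token.text:<6} --{token.dep_}--> {token.head.text}")
# e.g. fox --nsubj--> jumps, over --prep--> jumps, dog --pobj--> over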

Coreference Resolution

Identifying when different expressions refer to the same entity:

"John went to the store. He bought milk."
John ← He (coreference)

Classical NLP Techniques

Bag of Words (BoW)

Represents text as word frequency vectors, ignoring order:

Document 1: "I love NLP"
Document 2: "I love machine learning"

Vocabulary: [I, love, NLP, machine, learning]

BoW(Doc 1): [1, 1, 1, 0, 0]
BoW(Doc 2): [1, 1, 0, 1, 1]
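A minimal bag-of-words implementation over the two example documents (pure Python; scikit-learn's CountVectorizer does the same thing at scale). The vocabulary is sorted here, so the column order differs from the listing above:

docs = ["I love NLP", "I love machine learning"]

# Build a fixed vocabulary from all documents
vocab = sorted({word for doc in docs for word in doc.split()})

def bow_vector(doc: str) -> list[int]:
    counts = {}
    for word in doc.split():
        counts[word] = counts.get(word, 0) + 1
    return [counts.get(word, 0) for word in vocab]

print(vocab)               # ['I', 'NLP', 'learning', 'love', 'machine']
print(bow_vector(docs[0])) # [1, 1, 0, 1, 0]
print(bow_vector(docs[1])) # [1, 0, 1, 1, 1]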

TF-IDF (Term Frequency-Inverse Document Frequency)

Weights words by importance within and across documents:

$$ \text{TF-IDF}(t, d) = \text{TF}(t, d) \times \log\left(\frac{N}{\text{DF}(t)}\right) $$

Where:

  • $\text{TF}(t, d)$ = frequency of term $t$ in document $d$
  • $N$ = total number of documents
  • $\text{DF}(t)$ = number of documents containing term $t$
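A direct implementation of the formula, using raw counts for TF and the natural logarithm (libraries such as scikit-learn add smoothing and normalization, so their values differ slightly):

import math

docs = [["I", "love", "NLP"], ["I", "love", "machine", "learning"]]
N = len(docs)

def tf(term, doc):
    return doc.count(term)

def df(term):
    return sum(1 for doc in docs if term in doc)

def tf_idf(term, doc):
    return tf(term, doc) * math.log(N / df(term))

print(tf_idf("NLP", docs[0]))   # 0.693... (term appears in 1 of 2 documents)
print(tf_idf("love", docs[0]))  # 0.0 (term appears in every document)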

N-grams

Contiguous sequences of n items:

Text: "natural language processing"

Unigrams (n=1): ["natural", "language", "processing"]
Bigrams (n=2):  ["natural language", "language processing"]
Trigrams (n=3): ["natural language processing"]
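Generating word-level n-grams is a simple sliding window over the token list:

def ngrams(tokens, n):
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "natural language processing".split()
print(ngrams(tokens, 1))  # ['natural', 'language', 'processing']
print(ngrams(tokens, 2))  # ['natural language', 'language processing']
print(ngrams(tokens, 3))  # ['natural language processing']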

Core NLP Tasks

Text Classification

Assigning predefined categories to text:

  • Spam detection
  • Sentiment analysis
  • Topic categorization
  • Intent classification
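A compact classical-ML sketch: TF-IDF features feeding a linear classifier with scikit-learn. The tiny spam/ham dataset below is invented purely for illustration:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = ["win a free prize now", "meeting at 10am tomorrow",
         "claim your free reward", "lunch with the team today"]
labels = ["spam", "ham", "spam", "ham"]

model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(texts, labels)

print(model.predict(["free prize waiting for you"]))  # likely ['spam']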

Sentiment Analysis

Determining emotional tone:

| Aspect | Description |
| --- | --- |
| Polarity | Positive, negative, neutral |
| Subjectivity | Opinion vs. fact |
| Emotion | Joy, anger, fear, etc. |
| Aspect-based | Sentiment per feature |

Machine Translation

Converting text between languages:

  • Statistical Machine Translation (SMT)
  • Neural Machine Translation (NMT)
  • Transformer-based models (current state-of-the-art)

Question Answering

Extracting answers from text:

Context: "The Eiffel Tower was built in 1889."
Question: "When was the Eiffel Tower built?"
Answer: "1889"

Types:

  • Extractive QA: Extract span from context
  • Generative QA: Generate answer from understanding
  • Open-domain QA: Search and answer from large corpora
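A sketch of extractive QA with the Hugging Face transformers pipeline; it downloads a pretrained model on first use, and the default model can change between library versions:

from transformers import pipeline

qa = pipeline("question-answering")
result = qa(question="When was the Eiffel Tower built?",
            context="The Eiffel Tower was built in 1889.")
print(result["answer"])  # expected: 1889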

Text Summarization

Condensing documents:

| Type | Method |
| --- | --- |
| Extractive | Select important sentences |
| Abstractive | Generate new summary text |

Modern Deep Learning Approaches

Evolution of NLP Models

| Era | Approach | Key Models |
| --- | --- | --- |
| Pre-2013 | Rule-based, statistical | N-grams, HMMs |
| 2013-2017 | Word embeddings | Word2Vec, GloVe, FastText |
| 2017-2018 | Contextualized embeddings | ELMo, ULMFiT |
| 2018-present | Transformer-based | BERT, GPT, T5, LLaMA |

Word Embeddings

Dense vector representations capturing semantic meaning:

  • Word2Vec - Prediction-based embeddings
  • GloVe - Count-based embeddings
  • FastText - Subword-aware embeddings
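A sketch of training Word2Vec embeddings on a toy corpus with gensim; real embeddings need far more text, and the parameters below are illustrative:

from gensim.models import Word2Vec

# Toy corpus: a list of tokenized sentences
sentences = [["natural", "language", "processing"],
             ["machine", "learning", "for", "language"],
             ["deep", "learning", "for", "natural", "language"]]

model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, epochs=50)

print(model.wv["language"].shape)                  # (50,) dense vector
print(model.wv.similarity("language", "natural"))  # cosine similarity score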

Transformer Architecture

The foundation of modern NLP:

  • Attention Mechanisms - Focus on relevant parts of input
  • Transformers - Self-attention based architecture
  • BERT - Bidirectional understanding
  • GPT - Autoregressive generation
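At the core of the architecture is scaled dot-product attention; a minimal NumPy sketch of the computation, with a single head and no masking or learned projections:

import numpy as np

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    # Similarity of every query with every key, scaled by sqrt(d_k)
    scores = Q @ K.T / np.sqrt(d_k)
    # Softmax over keys: attention weights sum to 1 for each query
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # Each output position is a weighted average of the value vectors
    return weights @ V

Q = np.random.randn(4, 8)  # 4 positions, dimension 8
K = np.random.randn(4, 8)
V = np.random.randn(4, 8)
print(scaled_dot_product_attention(Q, K, V).shape)  # (4, 8)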

NLP Pipeline Architecture

┌─────────────────────────────────────────────────────────────┐
│                        Raw Text                              │
└──────────────────────────┬──────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│              Text Preprocessing                              │
│   (Normalization → Tokenization → Stopwords → Stemming)     │
└──────────────────────────┬──────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│              Feature Extraction                              │
│   (BoW, TF-IDF, Word Embeddings, Contextual Embeddings)     │
└──────────────────────────┬──────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│                    Model / Task                              │
│   (Classification, NER, QA, Summarization, Generation)      │
└──────────────────────────┬──────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│                    Output / Action                           │
└─────────────────────────────────────────────────────────────┘

Challenges in NLP

  1. Ambiguity: Words and sentences with multiple meanings
  2. Context dependence: Meaning varies with surrounding text
  3. Implicit meaning: Sarcasm, idioms, cultural references
  4. Low-resource languages: Limited training data
  5. Domain specificity: Technical jargon, specialized vocabulary
  6. Evolving language: New words, changing usage

Related Topics

  • Vector Embeddings - Foundation of modern NLP representations
  • Word2Vec - Pioneering word embedding technique
  • BERT - State-of-the-art language understanding
  • Transformers - Architecture behind modern NLP
  • Semantic Search - NLP application for information retrieval

References


  1. Jurafsky, D., & Martin, J. H. (2024). Speech and Language Processing (3rd ed. draft). Stanford University. https://web.stanford.edu/~jurafsky/slp3/