Natural Language Processing Fundamentals
Natural Language Processing (NLP) is a branch of artificial intelligence that focuses on enabling computers to understand, interpret, and generate human language. It bridges the gap between human communication and machine comprehension, combining insights from linguistics, computer science, and machine learning.[^1]
Core Components of NLP
NLP can be divided into two complementary areas:
Natural Language Understanding (NLU)
The ability of machines to comprehend text or speech:
- Extract meaning and intent
- Identify entities and relationships
- Understand context and nuance
- Handle ambiguity and implicit meaning
Natural Language Generation (NLG)
The ability of machines to produce human-like text:
- Generate coherent sentences and paragraphs
- Maintain context across long outputs
- Adapt style and tone appropriately
- Create summaries, translations, and responses
Text Preprocessing Pipeline
Before analysis, raw text must be transformed into a structured format:
1. Text Normalization
| Step | Description | Example |
|---|---|---|
| Lowercasing | Convert to uniform case | “Hello World” → “hello world” |
| Unicode normalization | Standardize character forms (e.g., NFC/NFKD), often combined with accent stripping | “café” → “cafe” |
| Noise removal | Strip irrelevant characters | Remove HTML tags, special symbols |
| Whitespace normalization | Standardize spacing | Multiple spaces → single space |
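A minimal normalization sketch in Python that applies the steps from the table above; the regexes and the choice to strip diacritics are illustrative, not prescriptive:

```python
import re
import unicodedata

def normalize(text: str) -> str:
    """Lowercase, normalize Unicode, strip diacritics, remove HTML tags, collapse whitespace."""
    text = text.lower()                                                # lowercasing
    text = unicodedata.normalize("NFKD", text)                         # Unicode normalization
    text = "".join(c for c in text if not unicodedata.combining(c))    # strip diacritics
    text = re.sub(r"<[^>]+>", " ", text)                               # noise removal (HTML tags)
    text = re.sub(r"\s+", " ", text).strip()                           # whitespace normalization
    return text

print(normalize("Caf\u00e9  <b>Hello   World</b>"))  # -> "cafe hello world"
```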
2. Tokenization
Breaking text into discrete units (tokens):
Input: "Natural language processing is fascinating."
Output: ["Natural", "language", "processing", "is", "fascinating", "."]
Tokenization approaches:
- Word tokenization: Split on whitespace and punctuation
- Subword tokenization: Break words into meaningful subunits (BPE, WordPiece)
- Character tokenization: Individual characters as tokens
- Sentence tokenization: Split document into sentences
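A simple word tokenizer can be sketched with a regular expression; this is an illustrative approximation, and production systems typically rely on library tokenizers (NLTK, spaCy, or subword tokenizers such as BPE):

```python
import re

def word_tokenize(text: str) -> list[str]:
    # Match runs of word characters, or single punctuation marks as separate tokens.
    return re.findall(r"\w+|[^\w\s]", text)

print(word_tokenize("Natural language processing is fascinating."))
# ['Natural', 'language', 'processing', 'is', 'fascinating', '.']
```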
3. Stopword Removal
Filtering common words with low semantic value:
Before: ["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"]
After: ["quick", "brown", "fox", "jumps", "lazy", "dog"]
Note: Modern deep learning approaches often retain stopwords as they provide contextual information.
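A minimal stopword-filtering sketch; the stopword set below is a tiny illustrative subset, whereas libraries such as NLTK ship full lists:

```python
# Illustrative stopword set; real lists (e.g., NLTK's English list) contain ~180 words.
STOPWORDS = {"the", "a", "an", "is", "over", "and", "of", "to", "in"}

tokens = ["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"]
filtered = [t for t in tokens if t not in STOPWORDS]
print(filtered)  # ['quick', 'brown', 'fox', 'jumps', 'lazy', 'dog']
```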
4. Stemming and Lemmatization
Reducing words to their base forms:
| Method | Approach | “running” → | “better” → |
|---|---|---|---|
| Stemming | Rule-based suffix removal | “run” | “better” |
| Lemmatization | Dictionary-based with POS | “run” | “good” |
Stemming algorithms:
- Porter Stemmer (most common)
- Snowball Stemmer
- Lancaster Stemmer (aggressive)
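Using NLTK (assuming its WordNet data has been downloaded), the mappings in the table above can be reproduced roughly as follows:

```python
from nltk.stem import PorterStemmer, WordNetLemmatizer
# Requires the WordNet data: nltk.download("wordnet")

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

print(stemmer.stem("running"))                    # 'run'
print(stemmer.stem("better"))                     # 'better' (no suffix rule applies)
print(lemmatizer.lemmatize("running", pos="v"))   # 'run'
print(lemmatizer.lemmatize("better", pos="a"))    # 'good' (adjective lemma)
```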
Linguistic Concepts
Part-of-Speech (POS) Tagging
Assigning grammatical categories to words:
The/DET quick/ADJ brown/ADJ fox/NOUN jumps/VERB over/ADP the/DET lazy/ADJ dog/NOUN
Common POS tags:
- NOUN: Nouns
- VERB: Verbs
- ADJ: Adjectives
- ADV: Adverbs
- DET: Determiners
- ADP: Adpositions (prepositions and postpositions; PREP in some tagsets)
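With spaCy, assuming the small English model en_core_web_sm is installed, coarse POS tags like those in the example above can be obtained as a quick sketch (exact tags depend on the model):

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # assumes the model has been downloaded
doc = nlp("The quick brown fox jumps over the lazy dog")
print([(token.text, token.pos_) for token in doc])
# e.g. [('The', 'DET'), ('quick', 'ADJ'), ('brown', 'ADJ'), ('fox', 'NOUN'), ('jumps', 'VERB'), ...]
```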
Named Entity Recognition (NER)
Identifying and classifying named entities:
[Apple]ORG announced that [Tim Cook]PERSON will visit [Tokyo]LOC on [Monday]DATE.
Entity types:
- PERSON: People, characters
- ORG: Organizations, companies
- LOC: Locations, places
- DATE/TIME: Temporal expressions
- MONEY: Monetary values
- PRODUCT: Products
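A matching NER sketch with the same spaCy model; note that label inventories vary by model (spaCy, for instance, tags cities as GPE rather than LOC):

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # assumes the model has been downloaded
doc = nlp("Apple announced that Tim Cook will visit Tokyo on Monday.")
print([(ent.text, ent.label_) for ent in doc.ents])
# e.g. [('Apple', 'ORG'), ('Tim Cook', 'PERSON'), ('Tokyo', 'GPE'), ('Monday', 'DATE')]
```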
Dependency Parsing
Analyzing grammatical structure and word relationships:
jumps (ROOT)
├── fox (nsubj)
│   ├── The (det)
│   ├── quick (amod)
│   └── brown (amod)
├── over (prep)
│   └── dog (pobj)
│       ├── the (det)
│       └── lazy (amod)
└── . (punct)
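The same parse can be inspected programmatically; a short sketch using the spaCy model assumed above (dependency labels vary by parser and scheme):

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # assumes the model has been downloaded
doc = nlp("The quick brown fox jumps over the lazy dog.")
for token in doc:
    print(f"{token.text:<6} --{token.dep_:>6}--> {token.head.text}")
# e.g. fox --nsubj--> jumps, over --prep--> jumps, dog --pobj--> over, ...
```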
Coreference Resolution
Identifying when different expressions refer to the same entity:
"John went to the store. He bought milk."
John ← He (coreference)
Classical NLP Techniques
Bag of Words (BoW)
Represents text as word frequency vectors, ignoring order:
Document 1: "I love NLP"
Document 2: "I love machine learning"
Vocabulary: [I, love, NLP, machine, learning]
BoW(Doc 1): [1, 1, 1, 0, 0]
BoW(Doc 2): [1, 1, 0, 1, 1]
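A BoW sketch with scikit-learn's CountVectorizer; the token pattern and casing are overridden here so the single-character token "I" survives, and the vocabulary comes out in alphabetical order rather than the order shown above:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["I love NLP", "I love machine learning"]
# Keep single-character tokens and original case for this toy example.
vectorizer = CountVectorizer(lowercase=False, token_pattern=r"(?u)\b\w+\b")
bow = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())  # vocabulary, alphabetically ordered
print(bow.toarray())                       # one count vector per document
```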
TF-IDF (Term Frequency-Inverse Document Frequency)
Weights words by importance within and across documents:
$$ \text{TF-IDF}(t, d) = \text{TF}(t, d) \times \log\left(\frac{N}{\text{DF}(t)}\right) $$
Where:
- $\text{TF}(t, d)$ = frequency of term $t$ in document $d$
- $N$ = total number of documents
- $\text{DF}(t)$ = number of documents containing term $t$
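A direct, minimal implementation of the formula above (libraries such as scikit-learn use a smoothed variant, so their scores differ slightly):

```python
import math

docs = [["I", "love", "NLP"], ["I", "love", "machine", "learning"]]

def tf_idf(term: str, doc: list[str], corpus: list[list[str]]) -> float:
    tf = doc.count(term)                              # TF(t, d)
    df = sum(1 for d in corpus if term in d)          # DF(t)
    return tf * math.log(len(corpus) / df)            # TF * log(N / DF)

print(tf_idf("NLP", docs[0], docs))   # in 1 of 2 docs -> 1 * log(2) ≈ 0.693
print(tf_idf("love", docs[0], docs))  # in both docs   -> 1 * log(1) = 0.0
```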
N-grams
Contiguous sequences of n items:
Text: "natural language processing"
Unigrams (n=1): ["natural", "language", "processing"]
Bigrams (n=2): ["natural language", "language processing"]
Trigrams (n=3): ["natural language processing"]
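Extracting n-grams is a one-liner over a token list; a small sketch:

```python
def ngrams(tokens: list[str], n: int) -> list[str]:
    """Return the contiguous n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["natural", "language", "processing"]
print(ngrams(tokens, 1))  # ['natural', 'language', 'processing']
print(ngrams(tokens, 2))  # ['natural language', 'language processing']
print(ngrams(tokens, 3))  # ['natural language processing']
```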
Core NLP Tasks
Text Classification
Assigning predefined categories to text:
- Spam detection
- Sentiment analysis
- Topic categorization
- Intent classification
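A minimal classification sketch combining TF-IDF features with logistic regression in scikit-learn; the training texts and labels below are toy data invented purely for illustration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy sentiment data, invented for illustration only.
texts = ["I love this movie", "Great acting and plot", "Terrible film", "I hated every minute"]
labels = ["positive", "positive", "negative", "negative"]

clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(texts, labels)
print(clf.predict(["What a great film"]))  # likely ['positive'] on this toy data
```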
Sentiment Analysis
Determining emotional tone:
| Aspect | Description |
|---|---|
| Polarity | Positive, negative, neutral |
| Subjectivity | Opinion vs. fact |
| Emotion | Joy, anger, fear, etc. |
| Aspect-based | Sentiment per feature |
Machine Translation
Converting text between languages:
- Statistical Machine Translation (SMT)
- Neural Machine Translation (NMT)
- Transformer-based models (current state-of-the-art)
Question Answering
Extracting answers from text:
Context: "The Eiffel Tower was built in 1889."
Question: "When was the Eiffel Tower built?"
Answer: "1889"
Types:
- Extractive QA: Extract span from context
- Generative QA: Generate answer from understanding
- Open-domain QA: Search and answer from large corpora
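An extractive QA sketch using the Hugging Face transformers pipeline; this downloads a default QA model on first use and is shown only to illustrate the task, not to recommend a particular model:

```python
from transformers import pipeline

qa = pipeline("question-answering")  # downloads a default extractive QA model
result = qa(
    question="When was the Eiffel Tower built?",
    context="The Eiffel Tower was built in 1889.",
)
print(result["answer"])  # expected: '1889'
```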
Text Summarization
Condensing documents:
| Type | Method |
|---|---|
| Extractive | Select important sentences |
| Abstractive | Generate new summary text |
Modern Deep Learning Approaches
Evolution of NLP Models
| Era | Approach | Key Models |
|---|---|---|
| Pre-2013 | Rule-based, statistical | N-grams, HMMs |
| 2013-2017 | Word embeddings | Word2Vec, GloVe, FastText |
| 2017-2018 | Contextualized embeddings | ELMo, ULMFiT |
| 2018-present | Transformer-based | BERT, GPT, T5, LLaMA |
Word Embeddings
Dense vector representations capturing semantic meaning:
- Word2Vec - Prediction-based embeddings
- GloVe - Count-based embeddings
- FastText - Subword-aware embeddings
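A small Word2Vec training sketch with gensim; the corpus below is far too small to produce meaningful vectors, and the parameters are illustrative:

```python
from gensim.models import Word2Vec

# Toy corpus: a useful model needs millions of sentences.
sentences = [
    ["natural", "language", "processing", "is", "fascinating"],
    ["machine", "learning", "powers", "modern", "nlp"],
    ["word", "embeddings", "capture", "semantic", "meaning"],
]

model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, epochs=20)
print(model.wv["nlp"].shape)                  # (50,) dense vector for the token 'nlp'
print(model.wv.most_similar("nlp", topn=3))   # nearest neighbours (noisy on toy data)
```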
Transformer Architecture
The foundation of modern NLP:
- Attention Mechanisms - Focus on relevant parts of input
- Transformers - Self-attention based architecture
- BERT - Bidirectional understanding
- GPT - Autoregressive generation
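At the core of the Transformer is scaled dot-product attention; a minimal NumPy sketch of the computation (single head, no masking, no learned projections):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                        # query-key similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)          # row-wise softmax
    return weights @ V                                      # weighted sum of values

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))                 # 4 tokens, 8-dimensional representations
out = scaled_dot_product_attention(x, x, x) # self-attention: Q = K = V = x
print(out.shape)                            # (4, 8)
```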
NLP Pipeline Architecture
┌─────────────────────────────────────────────────────────────┐
│ Raw Text │
└──────────────────────────┬──────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ Text Preprocessing │
│ (Normalization → Tokenization → Stopwords → Stemming) │
└──────────────────────────┬──────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ Feature Extraction │
│ (BoW, TF-IDF, Word Embeddings, Contextual Embeddings) │
└──────────────────────────┬──────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ Model / Task │
│ (Classification, NER, QA, Summarization, Generation) │
└──────────────────────────┬──────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ Output / Action │
└─────────────────────────────────────────────────────────────┘
Challenges in NLP
- Ambiguity: Words and sentences with multiple meanings
- Context dependence: Meaning varies with surrounding text
- Implicit meaning: Sarcasm, idioms, cultural references
- Low-resource languages: Limited training data
- Domain specificity: Technical jargon, specialized vocabulary
- Evolving language: New words, changing usage
Related Topics
- Vector Embeddings - Foundation of modern NLP representations
- Word2Vec - Pioneering word embedding technique
- BERT - State-of-the-art language understanding
- Transformers - Architecture behind modern NLP
- Semantic Search - NLP application for information retrieval
References
[^1]: Jurafsky, D., & Martin, J. H. (2024). *Speech and Language Processing* (3rd ed. draft). Stanford University. https://web.stanford.edu/~jurafsky/slp3/