Community Detection

Community detection is a fundamental problem in network analysis that aims to identify groups of nodes (communities) that are more densely connected internally than with the rest of the network. These methods reveal the hidden structure within complex systems, from social networks to biological pathways.¹

What is a Community?

A community (also called cluster, module, or group) is a subset of nodes in a network that share more connections among themselves than with nodes outside the subset.

Formal definition:

For a community $C$ in graph $G = (V, E)$:

Internal degree $k_i^{in}$: edges from node $i$ to other nodes in $C$
External degree $k_i^{out}$: edges from node $i$ to nodes outside $C$

A node $i$ belongs to community $C$ if:

$$ k_i^{in} > k_i^{out} $$

Visual representation:

    ┌─────────────────┐         ┌─────────────────┐
    │   Community A   │         │   Community B   │
    │                 │         │                 │
    │   ○───○───○     │         │     ○───○       │
    │    \ / \ /      │───weak──│    / \ / \      │
    │     ○───○       │  links  │   ○───○───○     │
    │                 │         │                 │
    └─────────────────┘         └─────────────────┘
         Dense                       Dense
      connections                 connections

Community Detection vs. Clustering

While related, these approaches differ fundamentally:

Aspect	Clustering	Community Detection
Input	Feature vectors (data points)	Graph structure (nodes + edges)
Similarity	Distance metrics (Euclidean, cosine)	Structural connectivity patterns
Goal	Group similar objects	Find densely connected subgraphs
Context	General ML, data analysis	Network analysis, graph theory
Algorithms	K-means, DBSCAN, hierarchical	Louvain, Leiden, Label Propagation

Key distinction: Clustering operates on attribute similarity, while community detection operates on topological structure.

Modularity: The Optimization Target

Modularity ($Q$) is the most widely used quality metric for community structure, measuring how well a network is partitioned.²

Mathematical Definition

$$ Q = \frac{1}{2m} \sum_{ij} \left[ A_{ij} - \frac{k_i k_j}{2m} \right] \delta(c_i, c_j) $$

Where:

$A_{ij}$ = adjacency matrix element (1 if edge exists, 0 otherwise)
$k_i, k_j$ = degrees of nodes $i$ and $j$
$m$ = total number of edges
$c_i, c_j$ = communities of nodes $i$ and $j$
$\delta(c_i, c_j)$ = 1 if $c_i = c_j$, 0 otherwise

Interpretation

Q Value	Interpretation
Q ≈ 0	Random partitioning, no community structure
Q > 0.3	Significant community structure
Q > 0.7	Strong community structure
Q = 1	Theoretical maximum (rarely achieved)

Modularity Resolution Limit

A known limitation: modularity optimization may fail to detect communities smaller than:

$$ \sqrt{2m} $$

This is called the resolution limit problem, where small communities get merged in large networks.

Community Detection Algorithms

Optimization-Based Methods

Louvain Algorithm

A greedy modularity optimization approach with two phases:³

Local optimization: Each node moves to the community that maximizes modularity gain
Aggregation: Communities become super-nodes, repeat until no improvement

Complexity: $O(n \log n)$ for sparse networks

See: Louvain Algorithm

Leiden Algorithm

Improvement over Louvain addressing the disconnected communities problem:

Local moving: Similar to Louvain
Refinement: Merge nodes within communities for better quality
Aggregation: Create network of communities

Key advantage: Guarantees well-connected communities

See: Leiden Algorithm

Propagation-Based Methods

Label Propagation Algorithm (LPA)

Simple, fast algorithm based on label spreading:

Initialize each node with a unique label
Iteratively update each node’s label to the most frequent among neighbors
Stop when labels stabilize

Complexity: $O(m)$ - near-linear time Limitation: Non-deterministic results

Semi-Supervised Label Propagation

Extension that incorporates known community memberships as constraints.

Spectral Methods

Spectral Clustering

Uses eigenvectors of the graph Laplacian:

Compute graph Laplacian: $L = D - A$
Find first $k$ eigenvectors of $L$
Cluster nodes using K-means on eigenvector coordinates

Laplacian matrices:

Unnormalized: $L = D - A$
Symmetric normalized: $L_{sym} = I - D^{-1/2}AD^{-1/2}$
Random walk: $L_{rw} = I - D^{-1}A$

Hierarchical Methods

Girvan-Newman Algorithm

Divisive hierarchical approach based on edge betweenness:

Calculate betweenness centrality for all edges
Remove edge with highest betweenness
Recalculate betweenness and repeat
Build dendrogram of communities

Complexity: $O(m^2 n)$ - expensive for large networks

Overlapping Community Detection

Real networks often have overlapping communities (a person belongs to multiple social groups).

Methods for Overlapping Communities

Algorithm	Approach
COPRA	Label propagation with multiple labels
BigCLAM	Matrix factorization
DEMON	Local community expansion
SLPA	Speaker-listener label propagation

Evaluation Metrics

Internal Metrics (No Ground Truth)

Metric	Description
Modularity	Fraction of edges within communities vs. expected
Conductance	Ratio of external to internal edges
Coverage	Fraction of intra-community edges
Performance	Fraction of correctly classified node pairs

External Metrics (With Ground Truth)

Metric	Description
NMI (Normalized Mutual Information)	Information-theoretic similarity
ARI (Adjusted Rand Index)	Pair-counting measure adjusted for chance
Purity	Fraction of nodes in majority class per cluster
F1 Score	Harmonic mean of precision and recall

Applications

Domain	Application
Social Networks	Friend group detection, influence analysis
Biology	Protein interaction modules, gene regulatory networks
Citation Networks	Research topic clustering, collaboration analysis
E-commerce	Customer segmentation, product recommendation
Cybersecurity	Fraud ring detection, botnet identification
Knowledge Graphs	Entity grouping, ontology discovery

Implementation with NetworkX

import networkx as nx
from networkx.algorithms import community

# Create a graph
G = nx.karate_club_graph()

# Louvain community detection
communities = community.louvain_communities(G, seed=42)

# Girvan-Newman (hierarchical)
comp = community.girvan_newman(G)
first_level = next(comp)

# Label Propagation
lpa_communities = community.label_propagation_communities(G)

# Calculate modularity
modularity = community.modularity(G, communities)
print(f"Modularity: {modularity:.4f}")

Algorithm Comparison

Algorithm	Time Complexity	Memory	Deterministic	Overlapping
Louvain	O(n log n)	Low	No	No
Leiden	O(n log n)	Low	No	No
Label Propagation	O(m)	Very Low	No	No
Spectral	O(n³)	High	Yes	No
Girvan-Newman	O(m²n)	Medium	Yes	No
COPRA	O(km)	Low	No	Yes

Best Practices

Start with Louvain/Leiden: Fast, effective baseline for most networks
Validate with multiple metrics: Don’t rely on modularity alone
Run algorithms multiple times: Non-deterministic methods need averaging
Consider resolution: Use multi-resolution analysis for hierarchical structure
Check connectivity: Ensure detected communities are connected
Visualize results: Inspect communities for sanity check

Louvain Algorithm - Fast modularity optimization
Leiden Algorithm - Improved community detection
Graph Neural Networks - Deep learning on graphs
Vector Embeddings - Node embeddings for similarity

References

Fortunato, S. (2010). Community detection in graphs. Physics Reports, 486(3-5), 75-174. https://arxiv.org/abs/0906.0612 ↩︎
Newman, M. E. J. (2006). Modularity and community structure in networks. PNAS, 103(23), 8577-8582. https://www.pnas.org/doi/10.1073/pnas.0601602103 ↩︎
Blondel, V. D., Guillaume, J. L., Lambiotte, R., & Lefebvre, E. (2008). Fast unfolding of communities in large networks. Journal of Statistical Mechanics, 2008(10), P10008. https://arxiv.org/abs/0803.0476 ↩︎