Community Detection

Community detection is a fundamental problem in network analysis that aims to identify groups of nodes (communities) that are more densely connected internally than with the rest of the network. These methods reveal the hidden structure within complex systems, from social networks to biological pathways.1


What is a Community?

A community (also called cluster, module, or group) is a subset of nodes in a network that share more connections among themselves than with nodes outside the subset.

Formal definition:

For a community $C$ in graph $G = (V, E)$:

  • Internal degree $k_i^{in}$: edges from node $i$ to other nodes in $C$
  • External degree $k_i^{out}$: edges from node $i$ to nodes outside $C$

A node $i$ belongs to community $C$ if:

$$ k_i^{in} > k_i^{out} $$

Visual representation:

    ┌─────────────────┐         ┌─────────────────┐
    │   Community A   │         │   Community B   │
    │                 │         │                 │
    │   ○───○───○     │         │     ○───○       │
    │    \ / \ /      │───weak──│    / \ / \      │
    │     ○───○       │  links  │   ○───○───○     │
    │                 │         │                 │
    └─────────────────┘         └─────────────────┘
         Dense                       Dense
      connections                 connections

Community Detection vs. Clustering

While related, these approaches differ fundamentally:

AspectClusteringCommunity Detection
InputFeature vectors (data points)Graph structure (nodes + edges)
SimilarityDistance metrics (Euclidean, cosine)Structural connectivity patterns
GoalGroup similar objectsFind densely connected subgraphs
ContextGeneral ML, data analysisNetwork analysis, graph theory
AlgorithmsK-means, DBSCAN, hierarchicalLouvain, Leiden, Label Propagation

Key distinction: Clustering operates on attribute similarity, while community detection operates on topological structure.


Modularity: The Optimization Target

Modularity ($Q$) is the most widely used quality metric for community structure, measuring how well a network is partitioned.2

Mathematical Definition

$$ Q = \frac{1}{2m} \sum_{ij} \left[ A_{ij} - \frac{k_i k_j}{2m} \right] \delta(c_i, c_j) $$

Where:

  • $A_{ij}$ = adjacency matrix element (1 if edge exists, 0 otherwise)
  • $k_i, k_j$ = degrees of nodes $i$ and $j$
  • $m$ = total number of edges
  • $c_i, c_j$ = communities of nodes $i$ and $j$
  • $\delta(c_i, c_j)$ = 1 if $c_i = c_j$, 0 otherwise

Interpretation

Q ValueInterpretation
Q ≈ 0Random partitioning, no community structure
Q > 0.3Significant community structure
Q > 0.7Strong community structure
Q = 1Theoretical maximum (rarely achieved)

Modularity Resolution Limit

A known limitation: modularity optimization may fail to detect communities smaller than:

$$ \sqrt{2m} $$

This is called the resolution limit problem, where small communities get merged in large networks.


Community Detection Algorithms

Optimization-Based Methods

Louvain Algorithm

A greedy modularity optimization approach with two phases:3

  1. Local optimization: Each node moves to the community that maximizes modularity gain
  2. Aggregation: Communities become super-nodes, repeat until no improvement

Complexity: $O(n \log n)$ for sparse networks

See: Louvain Algorithm

Leiden Algorithm

Improvement over Louvain addressing the disconnected communities problem:

  1. Local moving: Similar to Louvain
  2. Refinement: Merge nodes within communities for better quality
  3. Aggregation: Create network of communities

Key advantage: Guarantees well-connected communities

See: Leiden Algorithm


Propagation-Based Methods

Label Propagation Algorithm (LPA)

Simple, fast algorithm based on label spreading:

  1. Initialize each node with a unique label
  2. Iteratively update each node’s label to the most frequent among neighbors
  3. Stop when labels stabilize

Complexity: $O(m)$ - near-linear time Limitation: Non-deterministic results

Semi-Supervised Label Propagation

Extension that incorporates known community memberships as constraints.


Spectral Methods

Spectral Clustering

Uses eigenvectors of the graph Laplacian:

  1. Compute graph Laplacian: $L = D - A$
  2. Find first $k$ eigenvectors of $L$
  3. Cluster nodes using K-means on eigenvector coordinates

Laplacian matrices:

  • Unnormalized: $L = D - A$
  • Symmetric normalized: $L_{sym} = I - D^{-1/2}AD^{-1/2}$
  • Random walk: $L_{rw} = I - D^{-1}A$

Hierarchical Methods

Girvan-Newman Algorithm

Divisive hierarchical approach based on edge betweenness:

  1. Calculate betweenness centrality for all edges
  2. Remove edge with highest betweenness
  3. Recalculate betweenness and repeat
  4. Build dendrogram of communities

Complexity: $O(m^2 n)$ - expensive for large networks


Overlapping Community Detection

Real networks often have overlapping communities (a person belongs to multiple social groups).

Methods for Overlapping Communities

AlgorithmApproach
COPRALabel propagation with multiple labels
BigCLAMMatrix factorization
DEMONLocal community expansion
SLPASpeaker-listener label propagation

Evaluation Metrics

Internal Metrics (No Ground Truth)

MetricDescription
ModularityFraction of edges within communities vs. expected
ConductanceRatio of external to internal edges
CoverageFraction of intra-community edges
PerformanceFraction of correctly classified node pairs

External Metrics (With Ground Truth)

MetricDescription
NMI (Normalized Mutual Information)Information-theoretic similarity
ARI (Adjusted Rand Index)Pair-counting measure adjusted for chance
PurityFraction of nodes in majority class per cluster
F1 ScoreHarmonic mean of precision and recall

Applications

DomainApplication
Social NetworksFriend group detection, influence analysis
BiologyProtein interaction modules, gene regulatory networks
Citation NetworksResearch topic clustering, collaboration analysis
E-commerceCustomer segmentation, product recommendation
CybersecurityFraud ring detection, botnet identification
Knowledge GraphsEntity grouping, ontology discovery

Implementation with NetworkX

import networkx as nx
from networkx.algorithms import community

# Create a graph
G = nx.karate_club_graph()

# Louvain community detection
communities = community.louvain_communities(G, seed=42)

# Girvan-Newman (hierarchical)
comp = community.girvan_newman(G)
first_level = next(comp)

# Label Propagation
lpa_communities = community.label_propagation_communities(G)

# Calculate modularity
modularity = community.modularity(G, communities)
print(f"Modularity: {modularity:.4f}")

Algorithm Comparison

AlgorithmTime ComplexityMemoryDeterministicOverlapping
LouvainO(n log n)LowNoNo
LeidenO(n log n)LowNoNo
Label PropagationO(m)Very LowNoNo
SpectralO(n³)HighYesNo
Girvan-NewmanO(m²n)MediumYesNo
COPRAO(km)LowNoYes

Best Practices

  1. Start with Louvain/Leiden: Fast, effective baseline for most networks
  2. Validate with multiple metrics: Don’t rely on modularity alone
  3. Run algorithms multiple times: Non-deterministic methods need averaging
  4. Consider resolution: Use multi-resolution analysis for hierarchical structure
  5. Check connectivity: Ensure detected communities are connected
  6. Visualize results: Inspect communities for sanity check

  • Louvain Algorithm - Fast modularity optimization
  • Leiden Algorithm - Improved community detection
  • Graph Neural Networks - Deep learning on graphs
  • Vector Embeddings - Node embeddings for similarity

References


  1. Fortunato, S. (2010). Community detection in graphs. Physics Reports, 486(3-5), 75-174. https://arxiv.org/abs/0906.0612 ↩︎

  2. Newman, M. E. J. (2006). Modularity and community structure in networks. PNAS, 103(23), 8577-8582. https://www.pnas.org/doi/10.1073/pnas.0601602103 ↩︎

  3. Blondel, V. D., Guillaume, J. L., Lambiotte, R., & Lefebvre, E. (2008). Fast unfolding of communities in large networks. Journal of Statistical Mechanics, 2008(10), P10008. https://arxiv.org/abs/0803.0476 ↩︎