Quiz: Theory of GNNs: Expressiveness and the WL Test¶
Test your understanding of GNN expressiveness, the WL hierarchy, over-squashing, and permutation equivariance.
Question 1¶
The 1-WL test (Weisfeiler-Lehman test) is used to assess graph isomorphism. What does it mean when two graphs are declared "non-distinguishable" by 1-WL?
- The graphs have the same number of nodes and edges but different topologies
- The 1-WL color histograms are identical at convergence — the test cannot rule out isomorphism, though the graphs may still be non-isomorphic
- The graphs are provably isomorphic and any GNN will assign them identical representations
- The graphs both contain regular subgraphs of the same degree
Show Answer
The correct answer is B. The 1-WL test iteratively refines node color labels based on neighbor multisets. If the final histograms differ, the graphs are definitively non-isomorphic. If they match, the test is inconclusive — it cannot prove isomorphism. GNNs are bounded by 1-WL: if 1-WL cannot distinguish two graphs, no message-passing GNN can assign them different representations. But 1-WL matching does not mean the graphs are isomorphic — there exist non-isomorphic graphs that fool 1-WL.
Concept Tested: Weisfeiler-Leman Test (1-WL), GNN Expressiveness
Question 2¶
GIN (Graph Isomorphism Network) achieves the maximum expressiveness of any message-passing GNN bounded by 1-WL. What two properties must its aggregation function satisfy?
- Lipschitz continuity and differentiability everywhere
- Commutativity and associativity, ensuring the aggregation order does not matter
- Injectivity over multisets (distinct multisets map to distinct outputs) and use of sum aggregation
- Permutation invariance and output normalization to the unit sphere
Show Answer
The correct answer is C. Xu et al. (2019) prove that a GNN is as expressive as the 1-WL test if and only if its aggregation is injective over multisets — different multisets of neighbor representations map to different aggregate values. Sum aggregation over injective functions achieves this; mean and max do not (they can conflate distinct multisets). GIN implements this as \(\mathbf{h}_v^{(l+1)} = \text{MLP}\!\left((1+\varepsilon)\,\mathbf{h}_v^{(l)} + \sum_{u \in \mathcal{N}(v)} \mathbf{h}_u^{(l)}\right)\), where the \((1+\varepsilon)\) term ensures the center node's contribution is distinguishable.
Concept Tested: Graph Isomorphism Network (GIN), Sum Aggregation Injectivity
Question 3¶
Regular graphs are a known failure case for 1-WL and all message-passing GNNs. Why can't a GNN distinguish two non-isomorphic k-regular graphs?
- Regular graphs are always isomorphic to each other, so no distinction is possible or needed
- The adjacency matrix of a regular graph is always a permutation matrix, preventing eigenvalue analysis
- Regular graphs have no cycles, making the WL color refinement trivially converge in one step
- All nodes in a k-regular graph have the same degree k, so every node receives identical messages at each layer — no structural information differentiates nodes or distinguishes the graphs
Show Answer
The correct answer is D. In a k-regular graph, every node has exactly k neighbors. At layer 1 of WL, every node receives k copies of the initial color — the same aggregate for all nodes. After any number of layers, all nodes in any k-regular graph on the same number of nodes receive identical aggregate messages. Therefore, no message-passing GNN can assign different representations to nodes across two non-isomorphic k-regular graphs, even though the graphs have completely different structures.
Concept Tested: WL Test Limitations (Regular Graphs), Aggregator Expressiveness (Sum vs. Mean vs. Max)
Question 4¶
Over-squashing in deep GNNs refers to information loss in long-range propagation. What causes it?
- The gradient of the loss vanishing as it flows backward through more than three GNN layers
- Applying too many non-linear activations, which progressively remove negative values from representations
- Exponentially many nodes' information being compressed into a fixed-size vector as the receptive field grows exponentially with depth
- Edge weights being normalized below a threshold, effectively removing edges from the computation
Show Answer
The correct answer is C. With \(L\) message passing layers and average degree \(d\), the \(L\)-hop neighborhood contains \(O(d^L)\) nodes — exponentially many. All their information must be compressed into a single fixed-dimensional vector at the center node. The "squashing" occurs because a \(d\)-dimensional vector has a fixed information capacity, yet it must summarize exponentially many inputs. Long-range signals from distant nodes are diluted to near-zero, making deep GNNs effectively blind to distant structural information.
Concept Tested: Over-Squashing, Cheeger Constant
Question 5¶
The Cheeger constant h of a graph measures the quality of the minimum edge cut relative to volume. How does h relate to over-squashing?
- Graphs with h = 0 (disconnected graphs) have no over-squashing because messages never cross component boundaries
- Graphs with high Cheeger constant suffer more over-squashing because the graph is too well-connected for message passing to converge
- The Cheeger constant measures the rate of gradient vanishing, not information squashing
- Graphs with low Cheeger constant (bottleneck edges) suffer more over-squashing because information must pass through narrow cuts with exponentially diluted signals
Show Answer
The correct answer is D. The Cheeger constant h quantifies how "bottle-necked" a graph is. Low h means there is a small edge cut separating two large portions of the graph — information flowing between those portions must pass through few edges. In message passing, this creates over-squashing: a node on one side of the bottleneck must compress information from the entire other side through those few edges. Alon et al. (2021) formalized this connection, showing over-squashing is worst when the graph has low expansion (low h).
Concept Tested: Cheeger Constant, Over-Squashing
Question 6¶
A GNN that is permutation equivariant satisfies \(f(P_\sigma \mathbf{X},\, P_\sigma \mathbf{A} P_\sigma^\top) = P_\sigma f(\mathbf{X}, \mathbf{A})\). What does this mean in plain terms?
- The GNN produces the same output regardless of the input graph's topology
- If you permute the node ordering, the GNN's output is permuted in the same way — the model is sensitive to node identity, not node index
- The GNN produces the same scalar output for all permutations of the node ordering (permutation invariant, not equivariant)
- The GNN requires node features to be sorted before processing
Show Answer
The correct answer is B. Permutation equivariance means that relabeling nodes (applying permutation \(\sigma\) to row/column indices) changes the output in the same consistent way. If node 3 and node 7 swap labels, the output representation for those nodes swaps accordingly — no information is created or destroyed. This is the correct property for node-level tasks. Graph-level tasks need permutation invariance (the output is unchanged under any relabeling), which can be achieved by applying a permutation-invariant readout after equivariant layers.
Concept Tested: Permutation Equivariance, Permutation Invariance
Question 7¶
The k-WL hierarchy provides GNNs with greater expressiveness than 1-WL by using higher-order color refinement. What is the fundamental cost of moving from 1-WL to 2-WL?
- 2-WL requires computing eigenvalues of the adjacency matrix, adding \(O(N^3)\) preprocessing
- 2-WL can only process directed graphs, limiting its applicability
- 2-WL assigns colors to pairs of nodes instead of individual nodes, increasing complexity from \(O(N)\) to \(O(N^2)\) per refinement step
- 2-WL is not implementable as a neural network layer
Show Answer
The correct answer is C. 1-WL operates on individual nodes: the state space is \(N\)-dimensional. 2-WL operates on pairs of nodes: the state space is \(N^2\)-dimensional, requiring \(O(N^2)\) memory and \(O(N^3)\) or \(O(N^2 E)\) computation per step. Higher-order \(k\)-WL uses \(k\)-tuples of nodes with \(O(N^k)\) cost. This exponential blowup makes \(k\)-WL directly impractical for large graphs, motivating approximations like ring graph neural networks and subgraph GNNs that trade expressiveness for scalability.
Concept Tested: k-WL Hierarchy
Question 8¶
Mean aggregation cannot distinguish the multiset {1, 1, 2} from {1, 2, 2} when the neighborhood sizes are different. Which specific counter-example demonstrates max aggregation's failure?
- Max aggregation cannot distinguish a node with neighbors {1, 2, 3} from one with neighbors {3, 3, 3}
- Max aggregation cannot distinguish a node with neighbors {5} from one with neighbors {5, 5, 5, 5}
- Max aggregation cannot distinguish a node with neighbors {1, 1, 2} from one with neighbors {2, 2, 1}
- Max aggregation cannot distinguish a node with neighbors {2, 3} from one with no neighbors
Show Answer
The correct answer is B. Max aggregation takes the element-wise maximum and ignores how many times a value appears. A node with one neighbor whose representation is 5 and a node with four neighbors all with representation 5 both produce max output 5 — the counts are erased. This means max aggregation cannot represent "how many neighbors have feature value X" — a critical piece of structural information. Sum preserves these counts; mean and max both lose them in different ways.
Concept Tested: Aggregator Expressiveness (Sum vs. Mean vs. Max)
Question 9¶
GIN's update rule is \(\mathbf{h}_v^{(l+1)} = \text{MLP}\!\left((1+\varepsilon)\,\mathbf{h}_v^{(l)} + \sum_{u \in \mathcal{N}(v)} \mathbf{h}_u^{(l)}\right)\). Why is the \((1+\varepsilon) \cdot \mathbf{h}_v^{(l)}\) term necessary?
- It ensures the center node's own representation is distinguishable from its neighbors' contributions, preventing conflation of the center with an additional identical neighbor
- It adds a residual connection that prevents over-smoothing in deep networks
- It normalizes the self-contribution by the node's degree to prevent high-degree nodes from dominating
- It scales the learning rate per node based on the local graph structure
Show Answer
The correct answer is A. Without the \((1+\varepsilon)\) term, GIN's aggregation would be \(\sum_{u \in \mathcal{N}(v) \cup \{v\}} \mathbf{h}_u\) — treating the center node as just another neighbor. If a node \(v\) has a neighbor \(u\) with \(\mathbf{h}_u = \mathbf{h}_v\), the center's contribution is indistinguishable from adding one more copy of that neighbor's features. The \((1+\varepsilon)\) scalar (\(\varepsilon\) can be learned or set to 0) ensures \(v\)'s self-embedding has a different weight than its neighbors', preserving the distinction. This is crucial for achieving the 1-WL bound.
Concept Tested: Graph Isomorphism Network (GIN), Sum Aggregation Injectivity
Question 10¶
Over-smoothing (spectral view) can be understood through the graph Laplacian's eigenvalues. What happens to the frequency components of node signals as GCN layers increase?
- High-frequency components (sharp differences between neighbors) are progressively suppressed, leaving only the low-frequency global signal
- All frequency components are amplified equally, causing representation norms to explode
- High-frequency components are amplified while low-frequency components are suppressed
- The signal alternates between high and low frequency at every other layer, causing oscillations
Show Answer
The correct answer is A. The GCN propagation matrix \(\tilde{D}^{-1/2} \tilde{A}\, \tilde{D}^{-1/2}\) has eigenvalues in \([-1, 1]\), with eigenvalue 1 for the smoothest (constant) signal and eigenvalue close to \(-1\) for the most oscillatory (high-frequency) signal. After \(L\) applications, eigenvalue \(\lambda_k\) is raised to the \(L\)-th power: \(\lambda_k^L \to 0\) for \(|\lambda_k| < 1\). Only the DC component (eigenvalue 1, constant signal) survives — all other frequency components decay exponentially. This is the spectral explanation for why deep GCNs make all node representations identical.
Concept Tested: Over-Smoothing (Spectral View), Weisfeiler-Leman Test (1-WL)