CNN for Graph: Notes on IPAM UCLA talk- Part I

This post is about my understanding of Xavier Bresson's talk at IPAM UCLA on "Convolutional Neural Networks on Graphs".

This post has 3 section: the first section talks about how convolution and pooling operations are defined on graphs, the second section talks about different graph CNN architectures proposed so far and the last section talks about Graph Neural Network - a framework to deal with arbitrary shape graphs.

Institute for Pure & Applied Mathematics(IPAM), UCLA recently organized a workshop on New Deep Learning Techniques with many great talks on emerging methods in deep learning. The concept of applying convolution on non-Euclidean data is interesting in its in own way. Xavier Bresson presented a nice overview of it in his talk at the workshop. In this post, I have discussed some of the concepts, that I found worth noting down.
CNNs are successful in solving many problems in computer vision, NLP, speech, that seemed impossible to solve a few years back. The data in these domains have compositional structure and CNNs are very good at leveraging it for learning features.
CNNs (Deep Learning, in general) is like a Swiss Knife of Data Science. You can use it to solve so many problems -Xavier Bresson
Another important reason CNN became so popular and are being used to solve many problems is the well-defined convolution and pooling operations for images and speech. It was possible because this data can be nicely represented on the Euclidean grid, e.g. 2D grid for images, 1D grid for speech and NLP. However, a lot of data that is available in real-world does not lie on the Euclidean space. For example, communication network, 3D shapes in computer graphics, brain functional network, social network, etc. These networks do not have any specific structure. Hence, applying the convolution operation on this irregular manifolds/graphs is not straightforward. The rest of the post will discuss on how we can achieve this.

Convolution on the non-Euclidean domain:

This section describes how convolution and pooling operation can be defined on non-Euclidean space.

1. Convolution operation in the spectral domain:

To apply convolution on any data, it requires the data to have compositionality property. Hence, we make an assumption that "The non-Euclidean data also follow compositionality property" and we re-define convolution operation on the graph using the concepts from spectral graph theory - the study of properties of the Laplacian matrix or adjacency matrix associated with a graph.
For using spectral methods, the graphs need to be of fixed size i.e. the function defined on the graph can vary but topology remains the same. For example, brain functional graph. Depending on the task a person does, a different part of the brain gets activated and hence fMRI shows different signal on the fixed graph. We can study these signals, say, to classify the activity of a person.
The important operator that is used in Spectral Graph Theory is Graph Laplacian operator. It is a linear heat diffusion operator on the graph and its value will be small if the function on which we apply is a smooth function. Otherwise, it has very high values if the function is wavy (oscillates more).
In graph domain, there are two important representation of graph Laplacian:

Normalized Laplacian: \(\Delta=I-D^{-1/2}AD^{-1/2}\)
Random walk Laplacian: \(\Delta=I-D^{-1}A\), where A is an adjacency matrix and D is a diagonal degree matrix.

The eigendecomposition of graph Laplacian is given by \(\Delta=\Phi^T\Lambda\Phi\), where \(\Phi = \{\phi_1,\phi_2,..\phi_n\}\) is \(n\) x \(n\) Fourier matrix, \(\phi_i\) represents a eigenvector corresponding to eigenvalue \(\lambda_i\), \(\Lambda=diag(\lambda_1,\lambda_2,...\lambda_n)\). Laplacian eigenvectors are also called as Fourier basis function. [Eigenvectors of graph Laplacian are related to the topology of the graph. Therefore first few eigenvector are useful in finding the communities in the graph (this is core concept in spectral clustering)]

Extending concepts of convolution on discrete Euclidean space to spectral domain:

Convolution of two vectors \(f\) and \(g\) in the discrete Euclidean domain is given by a multiplication of Circulant matrix of \(g\) and vector \(f\). Circulant matrix can be diagonalized by Fourier basis of \(g\) and hence convolution can be written as \(f*g = \Phi(\Phi^Tg\circ\Phi^Tf)\)
A similar concept can be applied for graphs:

\(f*g = \Phi(\Phi^Tg\circ\Phi^Tf)\)

\(= \Phi diag(\hat{g_1}\cdots\hat{g_n})\Phi^Tf\)

\(= \Phi \hat{g(\Lambda)}\Phi^Tf\)

\(= \hat{g}(\Phi\Lambda\Phi^T)f\)

\(= \hat{g}(\Delta)f\)

Convolution on graph is nothing but applying Laplacian to \(f\) with some spectral function \(\hat{g}\).

2. Pooling operation on graph:

The pooling operation on graph is also called as graph coarsening or graph subsampling. Here, we want to merge the nodes with similar properties into a sigle super-node. We do this multiple times to get multiple levels of abstract graphs. Here is a nice video on graph coarsening by Udacity.

This problem of graph coarsening is equivalent to graph partitioning. A simple method of graph partitioning would be to use normalized graph cut algorithm. However, we could use any graph partitioning algorithm for pooling operation. The method used in a paper by Defferrard et al, 2016 is heavy-edge matching.

Different graph CNN architectures

The previous section talked about how convolution and pooling operations are defined for graph data. Here, I am briefly mentioning about different graph CNN architectures proposed so far in the literature.

Vanilla spectral graph ConvNet (Bruna et al, 2013):

This algorithm actually computes the eigendecomposition of graph Laplacian and learns filters in the spectral domain.
However, the learned filters are not spatially localized and also the filter parameters are dependent on input size.
As FFT is not defined for graphs, the convolution operation has \(O(n^2)\) computation cost.
Filters learned by this method are basis-dependent i.e. filters learned on one topology won't work on other topology.

SplineNet (Henaff et al, 2015):

This paper uses Parseval's identity to achieve the spatial locality.
To make the filter parameter independent of input size, the paper proposed parameterization of filters by smooth basis function such as Spline. The spline coefficients are also learned by backprop. Hence the \(O(1)\) number of parameters per layer (Precisely saying it is equal to the number of spline coefficients).
However, the computation cost is high and filters are still basis dependent-creating problem in generalization.

ChebNet (Defferrard et al, 2016):

In this paper, spectral functions are represented by polynomials of Laplacian eigenvalues, because of this, we can control the support of the filters.
Another important advantage of using spectral polynomial filters is linear time computation complexity, as we don't have to compute eigenvalues and eigenvectors. (It is assumed that the graph is sparsely connected).
The problem of basis-dependency still persists

ConvNets for Variable Graphs

So far we were discussing how to apply convolution on the fixed graph. The methods proposed for it has the main problem of lack of generalizability of filters. This is because the filters are learned in the specral domain. It is achieved using eigendecomposition of graph Laplacian and it is dependent on the topology of the graph. This means the spectral filters learned from one graph cannot be used on different topology graph. Even if we want to use methods mentioned in the previous section, we have to align Fourier basis of two graphs, which is equivalent to graph matching problem.

In this section, we will see how to learn generalizable filters on arbitrary graphs.

Here the problem setting is different, we have a bunch of graphs of different size and shape, each of these graphs have signal on it. The task is to analyze these graphs for say, classification or regression.
The framework used to solve such problems is Graph Neural Network(GNN) introduced by Scarselli et al, 2009.
The minimal inner structure needed to design a neural network for a graph is:

Invariant to vertex indexing: because we don't want to solve graph matching problem when we get a new graph.
Locality
Weight sharing: applying convolution operation
Independent of input size.

So we want to learn \(h_i\) the representation for node \(i\) as a function of its direct neighbors (for directed graph it would be nodes with incoming edges).

\(h_i=f_{GNN}(\{h_j:j\rightarrow i\})\)

Different authors define the function \(f_{GNN}\) differently. Scarselli et al used multi layer perceptron as function. Li et al, 2015 used GRU cells to define GNN function.
A recent paper by Sukhbaatar et al, 2016 define multilayer convolution operation. Here convolution is not any special operation, but as the algorithm considers each node as the center and uses the representation of the same node in previous layer and its direct neighbors to learn new representation for current layer, this concept is similar to convolution, hence the name.

Link to the video: Xavier Bresson: "Convolutional Neural Networks on Graphs"

Summary:

Convolution on images: O(1) number of parameters per filter i.e. independent of input image size.
Convolution operation has O(n) computation complexity.
In Spectral Graph Theory, graph Laplacian is a good measure of smoothness of function.
The main goal in spectral graph theory is to find the smoothest orthonormal basis on the graph. Hence we use graph Laplacian and find a function that minimizes the Laplacian. The solution is eigendecomposition of Laplacian.
Fourier basis/modes of a graph Laplacian captures the graph structure. So if we use first few eigenvectors, it can give us communities in the graph.
Convolution in the discrete Euclidean domain is the multiplication of circulant matrix and vector. It can be made easy by diagonalizing the matrix. We use similar analogy for convolution on graph in spectral domain
Convolution is just applying Laplacian to function on the graph. Problem with this method: not shift invariant, learned filters are not generalizable and FFT is not defined for the graph.
Graph downsampling is equivalent to graph partitioning.
Concepts like Heavy-edge matching algorithm for graph coarsening, Parseval's identity for spatial localization, parameterizing filters by spline to make the number of filter parameters independent of input size.
Graph Neural Network framework for dealing with a set of arbitrary size graphs. Most commonly used architecture is RNN(or variant) based.

Research Diary

Search This Blog