#### Note

To create a toy binary-community dataset from CORA, we first extract\n all two-class pairs from the 7 original CORA classes. For each pair, we\n treat each class as one community and find the largest subgraph that\n contains at least one cross-community edge as the training example. As\n a result, there are 21 training samples in this mini-dataset.

\n\nHere we visualize one of the training samples and its community structure:\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import dgl\nimport networkx as nx\nimport matplotlib.pyplot as plt\n\ntrain_set = dgl.data.CoraBinary()\n# Each sample is a triple: a DGLGraph, its incidence matrix, and node labels\nG1, pmpd1, label1 = train_set[1]\nnx_G1 = G1.to_networkx()\n\ndef visualize(labels, g):\n pos = nx.spring_layout(g, seed=1)\n plt.figure(figsize=(8, 8))\n plt.axis('off')\n nx.draw_networkx(g, pos=pos, node_size=50, cmap=plt.get_cmap('coolwarm'),\n node_color=labels, edge_color='k',\n arrows=False, width=0.5, style='dotted', with_labels=False)\nvisualize(label1, nx_G1)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Interested readers can refer to the original paper to see how to generalize\nto the multi-community case.\n\nCommunity Detection in a Supervised Setting\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\nThe community detection problem can be tackled with both supervised and\nunsupervised approaches. As in the original paper, we formulate\ncommunity detection in a supervised setting as follows:\n\n- Each training example consists of $(G, L)$, where $G$ is a\n directed graph $(V, E)$. For each node $v$ in $V$, we\n assign a ground truth community label $z_v \\in \\{0,1\\}$.\n- The parameterized model $f(G, \\theta)$ predicts a label set\n $\\tilde{Z} = f(G)$ for the nodes $V$.\n- For each example $(G, L)$, the model learns to minimize a specially\n designed loss function, the equivariant loss\n $L_{equivariant}(\\tilde{Z}, Z)$.\n\n#### Note

In this supervised setting, the model naturally predicts a \"label\" for\n each community. However, community assignment should be equivariant to\n label permutations. To achieve this, in each forward pass, we take\n the minimum among the losses calculated from all possible permutations of\n labels.\n\n Mathematically, this means\n $L_{equivariant} = \\min_{\\pi \\in S_c} -\\log(\\hat{\\pi}, \\pi)$,\n where $S_c$ is the set of all permutations of labels,\n $\\hat{\\pi}$ is the set of predicted labels, and\n $-\\log(\\hat{\\pi}, \\pi)$ denotes the negative log-likelihood.\n\n For instance, for a toy graph with nodes $\\{1,2,3,4\\}$ and\n community assignment $\\{A, A, A, B\\}$, with each node's label\n $l \\in \\{0,1\\}$, the set of all possible permutations is\n $S_c = \\{\\{0,0,0,1\\}, \\{1,1,1,0\\}\\}$.
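The permutation trick above can be sketched in a few lines of NumPy. This is an illustrative snippet, not the tutorial's actual implementation; `perm_invariant_nll` is a hypothetical helper name, and the model's outputs are assumed to be per-class log-probabilities:

```python
import itertools
import numpy as np

def perm_invariant_nll(log_probs, labels, n_classes=2):
    """Minimum negative log-likelihood over all label permutations.

    log_probs: (n_nodes, n_classes) array of per-class log-probabilities.
    labels:    (n_nodes,) array of ground-truth community labels.
    """
    losses = []
    for perm in itertools.permutations(range(n_classes)):
        permuted = np.array(perm)[labels]  # relabel the communities
        nll = -log_probs[np.arange(len(labels)), permuted].mean()
        losses.append(nll)
    return min(losses)

# Toy graph with nodes {1,2,3,4} and community assignment {A, A, A, B}:
labels = np.array([0, 0, 0, 1])
# Predictions that match the *flipped* labeling {1,1,1,0}:
log_probs = np.log(np.array([[0.1, 0.9],
                             [0.1, 0.9],
                             [0.2, 0.8],
                             [0.9, 0.1]]))
loss = perm_invariant_nll(log_probs, labels)
```

Because the minimum runs over both permutations in $S_c$, predicting the flipped labeling incurs the same (small) loss as predicting the ground-truth labeling directly.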

\n\nLine Graph Neural Network: key ideas\n------------------------------------\nA key innovation in this paper is the use of the line graph.\nUnlike models in previous tutorials, message passing happens not only on the\noriginal graph, e.g. the binary community subgraph from CORA, but also on the\nline graph associated with the original graph.\n\nWhat is a line graph?\n~~~~~~~~~~~~~~~~~~~~~\nIn graph theory, a line graph is a graph representation that encodes the\nedge adjacency structure of the original graph.\n\nSpecifically, a line graph $L(G)$ turns an edge of the original graph `G`\ninto a node. This is illustrated with the graph below (taken from the\npaper).\n\n.. figure:: https://i.imgur.com/4WO5jEm.png\n :alt: lg\n :align: center\n\nHere, $e_{A}:= (i\\rightarrow j)$ and $e_{B}:= (j\\rightarrow k)$\nare two edges in the original graph $G$. In the line graph $G_L$,\nthey correspond to nodes $v^{l}_{A}, v^{l}_{B}$.\n\nThe next natural question is, how do we connect nodes in the line graph? How do we\nconnect two \"edges\"? Here, we use the following connection rule:\n\nTwo nodes $v^{l}_{A}$, $v^{l}_{B}$ in `lg` are connected if\nthe corresponding two edges $e_{A}, e_{B}$ in `g` share one and only \none node:\n$e_{A}$'s destination node is $e_{B}$'s source node\n($j$).\n\n#### Note

Mathematically, this definition corresponds to a notion called the non-backtracking\n operator:\n $B_{(i \\rightarrow j), (\\hat{i} \\rightarrow \\hat{j})}$\n $= \\begin{cases}\n 1 \\text{ if } j = \\hat{i}, \\hat{j} \\neq i\\\\\n 0 \\text{ otherwise} \\end{cases}$\n where an edge is formed in the line graph if $B_{node1, node2} = 1$.
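The non-backtracking rule can be made concrete with NetworkX. This is a sketch for illustration only (DGL also ships a line-graph transform); the function name `non_backtracking_line_graph` is our own:

```python
import networkx as nx

def non_backtracking_line_graph(g):
    """Build L(G) from a directed graph g: each directed edge becomes a node;
    connect (i -> j) to (j -> k) only when k != i (the non-backtracking rule)."""
    lg = nx.DiGraph()
    lg.add_nodes_from(g.edges())
    for (i, j) in g.edges():
        for (_, k) in g.out_edges(j):
            if k != i:  # forbid the immediate reversal back to i
                lg.add_edge((i, j), (j, k))
    return lg

# The paper's example: e_A = (i -> j), e_B = (j -> k), plus a backtracking edge j -> i
g = nx.DiGraph([("i", "j"), ("j", "k"), ("j", "i")])
lg = non_backtracking_line_graph(g)
```

Here `lg` has one node per directed edge of `g`, a single connection from `("i", "j")` to `("j", "k")`, and no connection to the backtracking edge `("j", "i")`.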

\n\n\nOne layer in LGNN -- algorithm structure\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n\nLGNN chains up a series of line-graph neural network layers. The graph\nrepresentation $x$ and its line-graph companion $y$ evolve with\nthe dataflow as follows.\n\n.. figure:: https://i.imgur.com/bZGGIGp.png\n :alt: alg\n :align: center\n\nAt the $k$-th layer, the $i$-th neuron of the $l$-th\nchannel updates its embedding $x^{(k+1)}_{i,l}$ with:\n\n\\begin{align}\\begin{split}\n x^{(k+1)}_{i,l} ={}&\\rho[x^{(k)}_{i}\\theta^{(k)}_{1,l}\n +(Dx^{(k)})_{i}\\theta^{(k)}_{2,l} \\\\\n &+\\sum^{J-1}_{j=0}(A^{2^{j}}x^{(k)})_{i}\\theta^{(k)}_{3+j,l}\\\\\n &+[\\{\\text{Pm},\\text{Pd}\\}y^{(k)}]_{i}\\theta^{(k)}_{3+J,l}] \\\\\n &+\\text{skip-connection}\n \\qquad i \\in V, l = 1,2,3, ... b_{k+1}/2\n \\end{split}\\end{align}\n\nThe line-graph representation $y^{(k+1)}_{i',l'}$ is then updated similarly:\n\n\\begin{align}\\begin{split}\n y^{(k+1)}_{i',l^{'}} = {}&\\rho[y^{(k)}_{i^{'}}\\gamma^{(k)}_{1,l^{'}}+\n (D_{L(G)}y^{(k)})_{i^{'}}\\gamma^{(k)}_{2,l^{'}}\\\\\n &+\\sum^{J-1}_{j=0}(A_{L(G)}^{2^{j}}y^{(k)})_{i^{'}}\\gamma^{(k)}_{3+j,l^{'}}\\\\\n &+[\\{\\text{Pm},\\text{Pd}\\}^{T}x^{(k+1)}]_{i^{'}}\\gamma^{(k)}_{3+J,l^{'}}]\\\\\n &+\\text{skip-connection}\n \\qquad i^{'} \\in V_{l}, l^{'} = 1,2,3, ... b^{'}_{k+1}/2\n \\end{split}\\end{align}\n\nwhere $\\text{skip-connection}$ refers to performing the same operation without the non-linearity\n$\\rho$, and with the linear projections $\\theta_{\\{\\frac{b_{k+1}}{2} + 1, ..., b_{k+1}-1, b_{k+1}\\}}$\nand $\\gamma_{\\{\\frac{b_{k+1}}{2} + 1, ..., b_{k+1}-1, b_{k+1}\\}}$.\n\nImplement LGNN in DGL\n---------------------\nGeneral idea\n~~~~~~~~~~~~\nThe above equations look intimidating. However, we observe the following:\n\n- The two equations are symmetric and can be implemented as two instances\n of the same class with different parameters.\n Mainly, the first equation operates on the graph representation $x$,\n whereas the second operates on the line-graph\n representation $y$. 
Let us denote this abstraction as $f$. Then\n the first is $f(x, y; \\theta_x)$, and the second\n is $f(y, x; \\theta_y)$. That is, they are parameterized to compute\n representations of the original graph and its\n companion line graph, respectively.\n\n- Each equation consists of 4 terms (take the first one as an example):\n\n - $x^{(k)}\\theta^{(k)}_{1,l}$, a linear projection of the previous\n layer's output $x^{(k)}$, denoted as $\\text{prev}(x)$.\n - $(Dx^{(k)})\\theta^{(k)}_{2,l}$, a linear projection of the degree\n operator applied to $x^{(k)}$, denoted as $\\text{deg}(x)$.\n - $\\sum^{J-1}_{j=0}(A^{2^{j}}x^{(k)})\\theta^{(k)}_{3+j,l}$,\n a summation of the $2^{j}$-th power adjacency operators applied to $x^{(k)}$,\n denoted as $\\text{radius}(x)$.\n - $[\\{Pm,Pd\\}y^{(k)}]\\theta^{(k)}_{3+J,l}$, fusing the other\n graph's embedding information using the incidence matrices\n $\\{Pm, Pd\\}$, followed by a linear projection,\n denoted as $\\text{fuse}(y)$.\n\n- In addition, each of these terms is computed again with different\n parameters, and without the nonlinearity after the sum.\n Therefore, $f$ could be written as:\n\n .. math::\n \\begin{split}\n f(x^{(k)},y^{(k)}) = {}\\rho[&\\text{prev}(x^{(k)}) + \\text{deg}(x^{(k)}) + \\text{radius}(x^{(k)})\n + \\text{fuse}(y^{(k)})]\\\\\n +&\\text{prev}(x^{(k)}) + \\text{deg}(x^{(k)}) + \\text{radius}(x^{(k)}) + \\text{fuse}(y^{(k)})\n \\end{split}\n\n- The two equations are chained in the following order:\n\n .. math::\n \\begin{split}\n x^{(k+1)} = {}& f(x^{(k)}, y^{(k)})\\\\\n y^{(k+1)} = {}& f(y^{(k)}, x^{(k+1)})\n \\end{split}\n\nWith these observations, we proceed to the implementation.\nThe important point is that we use different strategies for each of these terms.\n\n
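As a shape-level sanity check on the four terms (prev, deg, radius, fuse), here is a minimal NumPy sketch of $f$ under stated assumptions: the skip connection is omitted, $\rho$ is taken to be a ReLU, a single incidence matrix stands in for $\{Pm, Pd\}$, and all sizes and parameters are illustrative rather than the tutorial's actual DGL implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

n, m = 5, 8          # nodes of G, nodes of L(G) (i.e. edges of G) -- illustrative
d_in, d_out = 4, 6   # feature sizes -- illustrative
J = 2                # powers of the adjacency used: A^1, A^2

A = (rng.random((n, n)) < 0.4).astype(float)   # toy adjacency of G
D = np.diag(A.sum(axis=1))                     # degree operator
Pm = (rng.random((n, m)) < 0.3).astype(float)  # toy incidence matrix
x = rng.standard_normal((n, d_in))             # graph representation
y = rng.standard_normal((m, d_in))             # line-graph representation

def f(x, y, A, D, Pm, thetas):
    """One update rho[prev + deg + radius + fuse]; thetas holds 3 + J projections."""
    prev = x @ thetas[0]                                   # prev(x)
    deg = (D @ x) @ thetas[1]                              # deg(x)
    radius = sum((np.linalg.matrix_power(A, 2 ** j) @ x) @ thetas[2 + j]
                 for j in range(J))                        # radius(x)
    fuse = (Pm @ y) @ thetas[2 + J]                        # fuse(y)
    return np.maximum(prev + deg + radius + fuse, 0)       # rho = ReLU

thetas = [rng.standard_normal((d_in, d_out)) for _ in range(3 + J)]
x_next = f(x, y, A, D, Pm, thetas)   # shape (n, d_out)
```

The same function, called with the line graph's adjacency, degree matrix, and the transposed incidence matrix, would produce the $y^{(k+1)}$ update, which is exactly the symmetry observation above.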