PhD

The LaTeX sources of my Ph.D. thesis
git clone https://esimon.eu/repos/PhD.git

analysis.tex (17170B)


      1 \section{Preliminary Analysis and Proof of Principle}
      2 \label{sec:graph:analysis}
      3 In this section, we want to verify the soundness of graph-based approaches by computing some statistics on a large relation extraction dataset.
      4 In particular, we start by building an attributed multigraph as described in Section~\ref{sec:graph:encoding}.
      5 We focus on \textsc{t-re}x (Section~\ref{sec:datasets:trex}, \cite{trex}), an alignment (Section~\ref{sec:relation extraction:distant supervision}) of Wikipedia with Wikidata.
      6 This dataset has the advantage of being both large and publicly available.
      7 Note that the graph we analyze in this section is not a knowledge base.
      8 Each arc is both labeled with a relation and attributed with a sentence.
      9 The fact that several arcs are incident to a vertex does not necessarily imply that the corresponding entity is linked by several relations, only that it was mentioned multiple times.
     10 
     11 \begin{marginfigure}[-97mm]
     12 	\centering
     13 	\renderDegrees{mainmatter/graph/T-REx degrees.xml}
     14 	\vspace{-5mm}
     15 	\scaption[Degree distribution of \textsc{t-re}x vertices.]{
     16 		Degree distribution of \textsc{t-re}x vertices.
     17 		The lines give the frequency of vertices with the given in- and out-degree in the dataset.
     18 		Note that both axes are log-scaled.
     19 		This plot was cut at a degree of \maxdisplayeddegree, which corresponds to a minimum frequency of \(10^{-5}\) out of a total of \numberarcs{} arcs.
     20 		In reality, the vertex with the maximum degree is ``United States of America'' \wdent{30} with an \maxdegreetype-degree of \maxdegree.
     21 		The asymmetry between the distribution of in-degrees and out-degrees can be explained by the fact that knowledge bases prefer to encode many-to-one relations instead of their one-to-many converses.
     22 		\label{fig:graph:degree distribution}
     23 	}
     24 \end{marginfigure}
     25 
     26 Figure~\ref{fig:graph:degree distribution} shows the distribution of vertices' degrees in the graph associated with \textsc{t-re}x.
     27 The first thing we can notice about this graph is that it is \emph{scale-free}.
     28 This means that a random vertex \(v\in\entitySet\) has degree \(\gfdegree(v)=k\) with probability \(P(k) \propto k^{-\gamma}\) for a parameter \(\gamma\) which depends on the graph.
     29 In other words, the distribution of degrees follows a power law.
     30 In a scale-free graph, most vertices have few neighbors, while a few hub vertices have a very large number of them.
     31 In contrast, the distribution of degrees in a random Erdős--Rényi graph%
     32 \sidenote[][-22mm]{
     33 	There are several different ways to sample random graphs; the Erdős--Rényi model is one of them.
     34 	In this model, arcs are incrementally added between two uniformly chosen vertices.
     35 	In contrast, if vertices with already high degrees are selected more often (the Barabási--Albert model), the resulting graph is scale-free.
     36 }
     37 is expected to follow a binomial distribution.
     38 Scale-free graphs occur in a number of contexts, such as social networks and graphs of linked web pages.
     39 Most unsupervised relation extraction datasets and knowledge bases should be expected to be scale-free.
     40 This needs to be kept in mind when designing graph-processing algorithms for relation extraction.
     41 Indeed, most vertices have a small neighborhood, so we might be tempted to carelessly consider neighbors of neighbors.
     42 However, scale-free graphs have a very small diameter%
     43 \sidenote{
     44 	The diameter of a graph is the length of the longest shortest-path:
     45 	\begin{equation*}
     46 		D = \max_{u,v\in \entitySet} \delta(u, v),
     47 	\end{equation*}
     48 	where \(\delta(u, v)\) is the length of the shortest path from \(u\) to \(v\).
     49 }
     50 \(D\in O(\log\log n)\).
     51 This means that we can quickly reach most vertices following a small number of arcs.
     52 This is in part due to the fact that some vertices have a very high degree; for example, in \textsc{t-re}x, the vertex ``United States of America'' \wdent{30} is highly connected, with \(\gfdegree(\wdent{30})=1\,697\,334\).
     53 In particular, this implies that by considering neighbors of neighbors, we quickly need to consider the whole graph; this is especially problematic for the graph convolutional networks described in Section~\ref{sec:graph:related work}.
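
The shape parameter \(\gamma\) mentioned above can be estimated directly from the list of vertex degrees.
As a minimal Python sketch (the function name and the cut-off \(k_{\min}\) are illustrative, not part of the implementation used here), the usual continuous-approximation maximum-likelihood estimator \(\hat{\gamma} = 1 + n \big/ \sum_i \ln(k_i / k_{\min})\) can be computed as follows; the continuous approximation is only reasonable for a sufficiently large cut-off:
\begin{verbatim}
import math

def estimate_gamma(degrees, k_min=6):
    """Continuous-approximation MLE of the power-law exponent gamma.

    `degrees` is an iterable of vertex degrees; only the tail k >= k_min
    is used.  This is a rough diagnostic, not a rigorous discrete fit.
    """
    tail = [k for k in degrees if k >= k_min]
    log_sum = sum(math.log(k / k_min) for k in tail)
    return 1.0 + len(tail) / log_sum
\end{verbatim}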
     54 
     55 We now come to the main motivation for taking a graph-based approach to the unsupervised relation extraction task:
     56 \begin{spacedblock}
     57 	\strong{Hypothesis:}
     58 	\emph{In the relation extraction problem, we can get additional information from the neighborhood of a sample.}
     59 \end{spacedblock}
     60 To test this hypothesis, we compute statistics on the distribution of neighbors.
     61 However, as we just saw, the support of this distribution is of high dimension.
     62 Hence, we look at the statistics of paths in our multigraph.%
     63 \sidenote{
     64 	Paths of length \(k\) are in a domain of size \(|\relationSet|^k\), whereas neighbors are in a domain of size \(|\relationSet|^{\upDelta(G)}\) with \(\upDelta(G)\) designating the maximum degree in \(G\).
     65 	By studying paths of length 3, we are effectively studying a subsampled neighborhood of the central arc.
     66 }
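\begin{marginparagraph}
	To give a rough order of magnitude using values reported elsewhere in this section (\(|\relationSet|=1\,316\) relations and a maximum degree of \(1\,697\,334\)): paths of length 3 live in a domain of size \(1\,316^3 \approx 2.3\times10^{9}\), which is already large but incomparably smaller than \(|\relationSet|^{\upDelta(G)}\).
\end{marginparagraph}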
     67 As a graph theory reminder, we can formally define a path as follows:
     68 \begin{itemize}[nosep]
     69 	\item A \emph{walk} of length \(n\) is a sequence of arcs \(a_1, a_2, \ldots, a_n \in \arcSet\) such that \(\gftarget(a_{i-1}) = \gfsource(a_i)\) for all \(i=2, \dotsc, n\).
     70 	\item A \emph{trail} is a walk with \(a_i \neq a_j\) for all \(1\leq i < j \leq n\) (arcs do not repeat).
     71 		In practice, this means that the samples \((s, \vctr{e})\) do not repeat.
     72 		It is not a statement about relations conveyed by these arcs; it is entirely possible that for some \(i\), \(j\) we have \(\gfrelation(a_i)=\gfrelation(a_j)\).
     73     \item A \emph{path} is a trail with \(\gfsource(a_i) \neq \gfsource(a_j)\) for all \(1\leq i < j \leq n\) (vertices do not repeat).
     74 \end{itemize}
     75 It is also possible to base these definitions on \emph{open walks}, which are walks where \(\gfsource(a_1) \neq \gftarget(a_n)\) (the walk does not end where it started).
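
These definitions translate directly into code.
The following minimal Python sketch is purely illustrative: arcs are modelled as \texttt{(source, target, arc\_id)} triples, where \texttt{arc\_id} stands in for the attached sample so that parallel arcs remain distinct.
\begin{verbatim}
def is_walk(arcs):
    """Consecutive arcs chain together: target of a[i-1] is source of a[i]."""
    return all(arcs[i - 1][1] == arcs[i][0] for i in range(1, len(arcs)))

def is_trail(arcs):
    """A walk in which no arc (i.e. no sample) is used twice."""
    return is_walk(arcs) and len(set(arcs)) == len(arcs)

def is_path(arcs):
    """A trail in which no vertex is the source of two different arcs."""
    sources = [a[0] for a in arcs]
    return is_trail(arcs) and len(set(sources)) == len(sources)
\end{verbatim}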
     76 We base the discussion of this section around the following random path:
     77 \begin{center}
     78 	\input{mainmatter/graph/3-path.tex}
     79 \end{center}
     80 Using these definitions, we can restate our hypothesis.
     81 \begin{marginparagraph}
     82 	The symbol \(\notindependent\) is used to mean ``not independent'':
     83 	\begin{equation*}
     84 		\rndm{a}\notindependent\rndm{b} \iff P(\rndm{a}, \rndm{b})\neq P(\rndm{a})P(\rndm{b})
     85 	\end{equation*}
     86 \end{marginparagraph}
     87 In this path, we expect \(\rndm{r}_2\notindependent\rndm{r}_1\) and \(\rndm{r}_2\notindependent\rndm{r}_3\).
     88 Testing these dependencies requires estimating the joint distribution of relations along paths; however, enumerating all possible paths in a graph with \(n=2\,819\,966\) vertices and \(m=19\,392\,185\) arcs is not practical.
     89 
     90 To approximate path statistics, we turn to sampling.
     91 However, uniformly sampling paths is not straightforward.
     92 As a first intuition, to uniformly sample a path of length 1---that is, an arc---we can use the following procedure:
     93 \begin{marginparagraph}
     94 	\(\operatorname{Cat}(\entitySet, f)\) refers to the Categorical distribution over the set \(\entitySet\) where the probability of picking \(e\in \entitySet\) is \(f(e)\).
     95 	The \(2m\) appears from the normalization factor \(\sum_{e\in \entitySet} \gfdegree(e)=2m\).
     96 \end{marginparagraph}
     97 \begin{enumerate}[nosep]
     98 	\item Sample an entity \(e_1\) weighted by its degree:\\
     99 		\(e_1\sim \operatorname{Cat}\left(\entitySet, e\mapsto \gfdegree(e) \divslash 2m\right)\)
    100 	\item Uniformly sample an arc incident to the entity \(e_1\):\\\(a\sim \uniformDistribution(\gfincidents(e_1))\)
    101 \end{enumerate}
    102 The first vertex we select must be weighted by how many paths start there, and since paths of length 1 are arcs, we weight each vertex by its degree.%
    103 \sidenote[][-11mm]{
    104 	To give an intuition, we can also think of what would happen if we chose both the entity and incident arc uniformly.
    105 	An arc that links two entities otherwise unrelated to any other entities is likely to be sampled since sampling any of its two endpoints as \(e_1\) would guarantee we select this arc.
    106 	On% XXX manual page break
    107 }
    108 If we want to sample paths of length 2, the first vertex must be selected according to the number of paths of length 2 starting there.
    109 Then the second vertex is selected among the neighbors of the first weighted by the number of paths of length 1 starting there, etc.
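
As an illustration of the length-1 case, the following Python sketch (with a hypothetical \texttt{incidents} dictionary mapping each entity to the list of its incident arcs) implements the two-step procedure above; each arc ends up with probability \(\gfdegree(e)\divslash 2m \times 1\divslash\gfdegree(e)\) summed over its two endpoints, that is \(1\divslash m\):
\begin{verbatim}
import random

def sample_arc_uniformly(incidents):
    """Uniformly sample an arc, i.e. a path of length 1.

    `incidents` maps each entity to the list of arcs incident to it;
    this representation of the multigraph is hypothetical.
    """
    entities = list(incidents)
    degrees = [len(incidents[e]) for e in entities]
    e1 = random.choices(entities, weights=degrees, k=1)[0]  # degree-weighted
    return random.choice(incidents[e1])                     # uniform incident arc
\end{verbatim}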
    110 
    111 \leavevmode
    112 \begin{marginparagraph}[-115mm]% XXX manual page break
    113 	the other hand, an arc both of whose endpoints have high degrees has little chance of being sampled, since even if one of its endpoints is selected as \(e_1\) in the first step, the arc is unlikely to be selected in the second step.
    114 \end{marginparagraph}
    115 Sadly enough, counting paths is \#P-complete%
    116 \sidenote{A class of counting problems; \#P-complete problems are at least as hard as NP-complete ones.}
    117 \parencite{path_counting_sharp_p}, so we must rely on the regularity of our graph and turn to approximate algorithms.
    118 We propose to use the number of walks as an approximation of the number of paths.%
    119 \sidenotemark%No room here, moved to next page.
    120 A classical result on simple graphs \(G=(V, E)\) is that the powers of the adjacency matrix \(\mtrx{M}\) count the number of walks between pairs of vertices.
    121 For any two vertices \(u, v\in V\), the value \(m^k_{uv}\)---to be interpreted as \((\mtrx{M}^k)_{uv}\)---is the number of walks of length \(k\) from \(u\) to \(v\).
    122 In the case of our multigraph, if we wish to count walks, the adjacency matrix should contain the number of arcs---that is, the number of walks of length 1---between vertices.
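
As a small sanity check of this classical result, the following Python sketch builds the adjacency matrix of a toy multigraph (purely illustrative, not taken from \textsc{t-re}x) and counts walks of length 3 by taking its third power:
\begin{verbatim}
import numpy as np

# Toy directed multigraph on 4 vertices; m[u, v] is the number of arcs
# from u to v (values above 1 encode parallel arcs).
m = np.array([[0, 2, 0, 0],
              [0, 0, 1, 1],
              [1, 0, 0, 3],
              [0, 0, 0, 0]])

mk = np.linalg.matrix_power(m, 3)
print(mk)        # mk[u, v] is the number of walks of length 3 from u to v
print(mk.sum())  # total number of walks of length 3
\end{verbatim}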
    123 
    124 \sidenotetext{
    125 	Other approximations of path counting exist \parencite{path_counting_estimation}, but the approach we propose is particularly suited to our multigraph.
    126 	In particular, the shape parameter \(\gamma\) of our degree distribution is relatively small, which produces a large number of outliers.
    127 	Our importance-sampling-based approach allows us to reduce the variance of the frequency estimations.
    128 }
    129 \begin{algorithm}
    130 	\centering
    131 	\begin{minipage}[b]{9cm}
    132 		\input{mainmatter/graph/path counting.tex}
    133 	\end{minipage}
    134 	\scaption[Path counting algorithm]{
    135 		Path counting algorithm.
    136 		The higher the number of iterations of the main loop, the more precise the results will be.
    137 		In our experiments, we used one billion iterations.
    138 		The inner for loop builds the walk \(\vctr{a}\).
    139 		If it is a correct path, the relation type of the path is added to the counter with importance weight \(w\).
    140 		For numerical stability, we actually compute \(w\) in log-space.
    141 		The initial factor \(n=|\entitySet|\) in \(w\) comes from the preceding uniform sampling of \(v\) from \(\entitySet\), which is part of the computation of \(\symcal{F}^k\).
    142 		\label{alg:graph:path counting}
    143 	}
    144 	\vspace{-5mm}
    145 \end{algorithm}
    146 
    147 We could then build a Monte Carlo estimate by following the naive procedure above, sampling vertices one by one according to the number of walks starting from them.
    148 Let us call \(\symcal{W}^k\) the resulting distribution over walks of length \(k\).
    149 Sampling from \(\symcal{W}^k\) is particularly slow since it involves sampling from a categorical distribution over thousands of elements.
    150 Since we only want to evaluate the expectation \(\expectation_{\vctr{a}\sim\symcal{W}^k}\) of a (counting) function, we can instead perform importance sampling.
    151 We use the substitute distribution \(\symcal{F}^k\) that uniformly selects a random neighbor at each step.
    152 To make this trick work, we only need to compute the importance weights \(\frac{\symcal{W}^k(\vctr{a})}{\symcal{F}^k(\vctr{a})}\) for all walks \(\vctr{a}\in\arcSet^k\).
    153 Since \(\symcal{W}^k\) is the uniform distribution over all walks, it is constant: \(\symcal{W}^k(\vctr{a}) = (\symbf{1}\transpose\mtrx{M}^k\symbf{1})^{-1}\).
    154 On the other hand, \(\symcal{F}^k(\vctr{a})\) can be computed trivially as the product of the inverse degrees of the vertices from which the arcs \(a_i\) are sampled (together with the initial uniform choice of the starting vertex).
    155 The resulting counting procedure is listed as Algorithm~\ref{alg:graph:path counting}.
    156 We still need to reject non-paths at the end of the main loop.
    157 Note that this algorithm is not exact since the importance weights \(w\) are computed from the number of walks, not paths.
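
The following Python sketch mirrors this procedure (and Algorithm~\ref{alg:graph:path counting}) with hypothetical accessors for the multigraph; unlike the actual implementation, it keeps the weights in linear space and glosses over the distinction between a relation and its converse when an arc is traversed backwards:
\begin{verbatim}
import random
from collections import Counter

def count_paths(entities, incidents, relation_of, other_end, k=3,
                iterations=10**6):
    """Estimate the relative frequencies of path types of length k.

    `incidents[e]` lists the arcs incident to entity e, `relation_of(a)`
    gives the relation labelling arc a, and `other_end(a, e)` gives the
    endpoint of a other than e (all hypothetical accessors).
    """
    counter = Counter()
    n = len(entities)
    for _ in range(iterations):
        v = random.choice(entities)  # uniform start, hence the factor n in w
        w = float(n)
        walk, visited = [], [v]
        for _ in range(k):
            arcs = incidents[v]
            if not arcs:             # dead end, the walk cannot be extended
                break
            w *= len(arcs)           # inverse probability of the uniform choice
            a = random.choice(arcs)
            walk.append(a)
            v = other_end(a, v)
            visited.append(v)
        # Reject anything that is not a path: wrong length, repeated arc,
        # or repeated source vertex.
        if (len(walk) == k and len(set(walk)) == k
                and len(set(visited[:-1])) == k):
            counter[tuple(relation_of(a) for a in walk)] += w
    total = sum(counter.values())
    return {path: weight / total for path, weight in counter.items()}
\end{verbatim}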
    158 
    159 \begin{table}[t]
    160 	\centering
    161 	\input{mainmatter/graph/paths frequencies.tex}%
    162 	\scaption*[Frequencies of some paths of length 3 in \textsc{t-re}x.]{
    163 		Frequencies of some paths of length 3 in \textsc{t-re}x.
    164 		The first column gives the approximate per mille frequency of paths with the given type.
    165 		It is computed as the importance weight attributed to the path by the counter \(C\) in Algorithm~\ref{alg:graph:path counting} divided by the sum of all importance weights in \(C\).
    166 		We use \({}_\textsc{st}\) as an abbreviation of ``sport team.''
    167 		The path in the first row is the most frequent one in the dataset; other paths were selected for illustrative purposes.
    168 		The last path was sampled a single time with an importance weight of 0.89.
    169 		\label{tab:graph:paths frequencies}
    170 	}
    171 \end{table}
    172 
    173 Using this algorithm on one billion samples from \textsc{t-re}x, we find that the most common paths of length three are related to geopolitical relations%
    174 \sidenote{This is not surprising as most general knowledge datasets are dominated by geopolitical entities and relations.}
    175 (see Table~\ref{tab:graph:paths frequencies}).
    176 Let us now turn to statistics that could help relation extraction models.
    177 To showcase the dependency between a sample's relation \(\rndm{r}_2\) and its neighbors' relations \(\rndm{r}_1\) and \(\rndm{r}_3\), we investigate the distribution \(P(\rndm{r}_2\mid\rndm{r}_1, \rndm{r}_3)\).
    178 In other words, given a sample, we want to see how its relation is influenced by the relations of two neighboring samples.
    179 
    180 The first value we can look at is the entropy%
    181 \sidenote{
    182 	This is not a conditional entropy.
    183 	The context relations \(r_1\), \(r_3\) are fixed; they correspond to elementary events, not random variables (as shown by the fact that they are italicized, not upshape).
    184 }
    185 \(\entropy(\rndm{r}_2\mid r_1, r_3)\).
    186 For example, in the case of \(r_1=\textsl{sport}\) and \(r_3=\widebreve{\textsl{member of}_\textsc{st}}\), all observed values of \(\rndm{r}_2\) are given in Table~\ref{tab:graph:paths frequencies}.
    187 All of them were \(\widebreve{\textsl{sport}}\) with the exception of a single path, which means that \(\entropy(\rndm{r}_2\mid r_1, r_3)\approx 0\).
    188 In other words, if we are given a sample \((s, \vctr{e})\in\dataSet\) and we suspect another sentence containing \(e_1\) to convey \textsl{sport} and another containing \(e_2\) to convey \(\widebreve{\textsl{member of}_\textsc{st}}\), we can be almost certain that the sample \((s, \vctr{e})\) conveys \(\widebreve{\textsl{sport}}\).
    189 
    190 \begin{marginparagraph}
    191 	As a reference for the remainder of this section, the distribution of relations in \textsc{t-re}x has an entropy of \(\entropy(\rndm{r})\approx 6.26\ \text{bits}\).
    192 	This is for a domain of \(|\relationSet|=1\,316\) relations.
    193 \end{marginparagraph}
    194 To measure this type of dependency at the level of the dataset, we can look at the following value:
    195 \begin{equation*}
    196 	\kl\left(P(\rndm{r}_2\mid r_1, r_3) \middlerel{\|} P(\rndm{r}_2)\right)
    197 \end{equation*}
    198 \begin{marginparagraph}
    199 	To give a first intuition of what this value represents, we take once again the trivial example of \(r_1=\textsl{sport}\) and \(r_3=\widebreve{\textsl{member of}_\textsc{st}}\).
    200 	In this case, \(\kl\left(P(\rndm{r}_2\mid r_1, r_3) \middlerel{\|} P(\rndm{r}_2)\right) \approx 5.47\ \text{bits}\).
    201 	This is due to the fact that encoding \(\rndm{r}_2\) given its neighbors necessitates close to 0~bits (as shown in Table~\ref{tab:graph:paths frequencies}, \(\rndm{r}_2\) almost always takes the value \(\widebreve{\textsl{sport}}\)) but encoding \(\widebreve{\textsl{sport}}\) among all possible relations in \(\relationSet\) necessitates 5.47~bits (which is a bit less than most relations since \(\widebreve{\textsl{sport}}\) commonly appears in \textsc{t-re}x).
    202 \end{marginparagraph}
    203 The Kullback--Leibler divergence is also called the \emph{relative entropy}.
    204 Indeed, \(\kl(P\mathrel{\|}Q)\) can be interpreted as the additional quantity of information needed to encode \(P\) using the (suboptimal) entropy encoding given by \(Q\).
    205 If this value is 0, it means that no additional information was provided by \(r_1\) and \(r_3\).
    206 When taking the expectation over all possible contexts \(r_1\), \(r_3\), we obtain the mutual information between the relation of a sample \(r_2\) and the relations of two of its neighbors.
    207 On \textsc{t-re}x, we observe:
    208 \begin{equation*}
    209 	\operatorname{I}(\rndm{r}_2; \rndm{r}_1, \rndm{r}_3) \approx 6.95\ \text{bits}
    210 \end{equation*}
    211 In other words, we can gain 6.95 bits of information simply by modeling two neighbors (one per entity).
    212 These 6.95 bits can be interpreted as the number of bits needed to perfectly encode \(\rndm{r}_2\) given \(\rndm{r}_1\), \(\rndm{r}_3\) (the conditional entropy \(\entropy(\rndm{r}_2\mid\rndm{r}_1,\rndm{r}_3)\approx 1.06\ \text{bits}\)) subtracted from the number of bits needed to encode \(\rndm{r}_2\) without looking at its neighbors (the cross-entropy \(\expectation_{r_1, r_3}[\entropy_{P(\rndm{r}_2)}(\rndm{r}_2\mid r_1, r_3)]\approx 8.01\ \text{bits}\)).%
    213 \sidenote{
    214 	We denote the cross-entropy by \(\entropy_Q(P) = -\expectation_P[\log Q]\).
    215 }
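In symbols:
\begin{equation*}
	\operatorname{I}(\rndm{r}_2; \rndm{r}_1, \rndm{r}_3)
	= \expectation_{r_1, r_3}\left[\entropy_{P(\rndm{r}_2)}(\rndm{r}_2\mid r_1, r_3)\right] - \entropy(\rndm{r}_2\mid\rndm{r}_1,\rndm{r}_3)
	\approx 8.01 - 1.06 = 6.95\ \text{bits}.
\end{equation*}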
    216 Put differently, most of the uncertainty about the relation of a sample can be removed by looking at the relations of two of its neighbors.
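
These quantities can be estimated directly from the importance-weighted counter \(C\) produced by Algorithm~\ref{alg:graph:path counting}.
As a final sketch (in Python, with a hypothetical \texttt{counter} mapping relation triples \((r_1, r_2, r_3)\) to their accumulated importance weights), the mutual information reported above can be computed as follows; the conditional entropy and the per-context Kullback--Leibler divergences follow the same pattern:
\begin{verbatim}
import math
from collections import defaultdict

def mutual_information(counter):
    """Estimate I(r2; r1, r3) in bits from importance-weighted path counts."""
    total = sum(counter.values())
    p_joint = {triple: w / total for triple, w in counter.items()}
    p_context = defaultdict(float)  # marginal P(r1, r3)
    p_middle = defaultdict(float)   # marginal P(r2)
    for (r1, r2, r3), p in p_joint.items():
        p_context[r1, r3] += p
        p_middle[r2] += p
    # I(r2; r1, r3) = sum over triples of P(triple) log2 P(triple)/(P(r1, r3) P(r2))
    return sum(p * math.log2(p / (p_context[r1, r3] * p_middle[r2]))
               for (r1, r2, r3), p in p_joint.items())
\end{verbatim}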