\section{Proposed Approaches}
\label{sec:graph:approach}
We now turn to the graph-based models we propose in order to leverage information from the structure of the dataset.
Let us quickly summarize the context in which our work is situated.
We have access to two kinds of features: linguistic---from the sentence---and topological---from the graph.
Unsupervised relation extraction methods do not fully exploit graph neighborhoods.%
\sidenote{As explained in Section~\ref{sec:graph:encoding}, \textsc{mtb} does use close neighborhoods as contrast during training, but not for inference.}
Supervised methods such as \textsc{epgnn} and \textsc{gp-gnn} do, even though the information present in the graph is more important in the unsupervised setting.
Indeed, in the supervised setting, the relational information is mostly extractable from the sentences and entities alone.
While extra information from topological features can still be used by supervised models, it is not essential.
On the other hand, in the unsupervised setting, the main issue is to identify the relational information in the sentence and to distinguish it from other semantic content.
As we show in Section~\ref{sec:graph:analysis}, this relational information is also present in the topological features (the neighborhood of a sample).
This can be useful in two ways:
\begin{enumerate}
	\item Use both pieces of information jointly, linguistic and topological: ``the more features, the better.''
		This is what supervised models do.
	\item Use the topological features to identify the relational information in the linguistic features.
\end{enumerate}

In Section~\ref{sec:graph:topological features}, we exploit the first point by adding a \textsc{gcn} to the matching the blanks model (\textsc{mtb}, Section~\ref{sec:relation extraction:mtb}).
In Section~\ref{sec:graph:nonparametric wl}, we show that topological features can be used without training a \textsc{gcn}.
This also serves as an introduction to Section~\ref{sec:graph:refining}, which proposes an unsupervised loss following the second point above; it exploits the fact that relational information is present in both linguistic and topological features.

\subsection{Using Topological Features}
\label{sec:graph:topological features}
In this section, we seek to use topological information as additional features for an existing unsupervised model: matching the blanks (\textsc{mtb}).
The usefulness of these features lies in the fact that many relations are ``typed'': e.g.~they only accept geographical locations as objects and only people as subjects (such as \emph{born in}).
This can be captured by looking at the neighborhood of each entity, which can be seen as a ``soft'' version of \hypothesis{type} (``relations are typed,'' Section~\refAssumptionSection{type}).

A straightforward approach is to parallel the construction of \textsc{r-gcn} (Section~\ref{sec:graph:r-gcn}): use a \textsc{gcn}-like encoder followed by a relation classifier---in the case of \textsc{r-gcn}, DistMult.
In effect, this corresponds to taking \textsc{mtb} and augmenting it with a \textsc{gcn} to process neighboring samples.
As a reminder, \textsc{mtb} uses a similarity-based loss where each unsupervised sample \((s, \vctr{e})\in\dataSet\) is represented by \(\bertcoder(s)\).
In this model, the information lies on the arcs.
In order to use a \textsc{gcn} model, we transform our graph \(G=(\entitySet, \arcSet, \gfendpoints, \gfrelation, \gfsentence)\) such that the information lies on the vertices instead.
This transformed graph is called the \emph{line graph} of \(G\) and is denoted \(L(G)\).
An illustration for simple undirected graphs is provided in Figure~\ref{fig:graph:line graph}.
For a directed (multi)graph, it is defined as follows:
\begin{marginfigure}
	\centering
	\input{mainmatter/graph/line graph.tex}
	\scaption[Example of line graph construction.]{
		Example of line graph construction.
		Each edge \(x\,\text{---}\,y\) in the simple undirected graph \(G\) corresponds to the vertex \(xy\) with the same color in the graph \(L(G)\).
		Two vertices in \(L(G)\) are connected iff the corresponding edges share an endpoint in \(G\).
		In directed graphs, the two arcs further need to be in the same direction in \(G\) for an arc to exist in \(L(G)\).
		\label{fig:graph:line graph}
	}
\end{marginfigure}
\begin{align*}
	L(G) & = (\arcSet, \symfrak{A}, \gfendpoints, \gfsentence) \\
	\symfrak{A} & = \left\{\,(a_1, a_2)\in\arcSet^2 \middlerel{|} \gftarget(a_1) = \gfsource(a_2) \,\right\}.
\end{align*}
In other words, each arc becomes a vertex, and an arc \(a_1\to a_2\) is present if and only if \(a_1\) and \(a_2\) form a directed path of length 2.
The neighborhood of each sample (arc in the original \(G\)) is still defined as all other samples with at least one entity in common since, by construction, for all \textsc{v}-structures \inlineDoubleArc*[a_1][a_2]{e_1}{e_2}{e_3}, there exists a directed path \inlineDoubleArc[a_1][\breve{a}_2]{e_1}{e_2}{e_3} in the original graph \(G\).
This construction is actually similar to the one of \textsc{epgnn} introduced in Section~\ref{sec:relation extraction:epgnn}.
The main difference is that each vertex in \(L(G)\) corresponds to a sample in \(\dataSet\), while an \textsc{epgnn} graph groups samples by entity pairs into a single vertex.

The standard loss and training algorithm of \textsc{mtb} as defined by Equation~\ref{eq:relation extraction:mtb loss} can be reused as is; we only need to redefine the similarity function (Equation~\ref{eq:relation extraction:mtb similarity}):
\begin{equation}
	\operatorname{sim}(a, a', G) = \sigmoid\left(
	\begin{aligned}
		\bertcoder(\gfsentence(a))\transpose\bertcoder(\gfsentence(a'))\hspace{2cm} \\
		+ \lambda \operatorname{\textsc{gcn}}(L(G))_a\transpose\operatorname{\textsc{gcn}}(L(G))_{a'}
	\end{aligned}
	\right),
	\label{eq:graph:mtb-gcn}
\end{equation}
where \(\lambda\) is a hyperparameter weighting the topology-based prediction against the sentence-based one.
At the input of the \textsc{gcn}, the vertices are labeled using the same sentence encoder: \(\vctr{x}_a = \bertcoder(\gfsentence(a))\).

The only difference between \textsc{mtb} and the \textsc{mtb}--\textsc{gcn} hybrid we propose is the additional \(\lambda\)-weighted term in Equation~\ref{eq:graph:mtb-gcn}.
We use this model to evaluate whether topological features can be exploited by an existing unsupervised relation extraction loss.
It tells us how much can be gained from the ``adding more features'' aspect of graph-based methods and lets us contrast it with the new topology-aware loss design we propose in Section~\ref{sec:graph:refining}.
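
To make the construction concrete, the sketch below shows how the arcs of \(L(G)\) can be enumerated from the arc list of \(G\) and how the similarity of Equation~\ref{eq:graph:mtb-gcn} combines the sentence-based and topology-based terms.
This is only an illustrative Python sketch: \texttt{sentence\_emb} stands for precomputed \bertcoder{} representations and \texttt{gcn} for a generic graph convolution stack; both names are placeholders rather than parts of our actual implementation.
\begin{verbatim}
# Illustrative sketch only: `sentence_emb` holds precomputed sentence
# embeddings (one row per arc of G) and `gcn` is a placeholder for any
# graph convolution stack operating on the line graph L(G).
import torch

def line_graph_adjacency(sources, targets):
    """Arc a1 -> a2 exists in L(G) iff target(a1) = source(a2) in G."""
    m = len(sources)
    adj = torch.zeros(m, m)
    for a1 in range(m):          # quadratic scan, fine for a sketch
        for a2 in range(m):
            if a1 != a2 and targets[a1] == sources[a2]:
                adj[a1, a2] = 1.0
    return adj

def mtb_gcn_similarity(a, a_prime, sentence_emb, adj, gcn, lam):
    """sim(a, a') = sigmoid(<h_a, h_a'> + lambda * <g_a, g_a'>)."""
    # The vertices of L(G) are labeled with x_a = bertcoder(sentence of a).
    g = gcn(sentence_emb, adj)
    linguistic = sentence_emb[a] @ sentence_emb[a_prime]
    topological = g[a] @ g[a_prime]
    return torch.sigmoid(linguistic + lam * topological)
\end{verbatim}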

\subsection{Nonparametric Weisfeiler--Leman Iterations}
\label{sec:graph:nonparametric wl}
The losses used to train unsupervised \textsc{gnn}s usually make the hypothesis that linked vertices should have similar representations.
This can be seen in \loss{gs} (Equation~\ref{eq:graph:graphsage loss}), which seeks to maximize the dot product between the representations of adjacent vertices.
While this hypothesis might be helpful for most problems on which \textsc{gnn}s are applied, this is clearly not the case for relation extraction.
In Section~\ref{sec:graph:topological features}, we introduced a first simple solution to this problem: replacing the loss used by the \textsc{gnn} with a standard unsupervised relation extraction loss.
However, it is also possible to design an unsupervised loss from the theoretical foundation of \textsc{gcn}: the Weisfeiler--Leman isomorphism test.
To this end, we propose to build a model relying on the following hypothesis:
\begin{spacedblock}
	\strong{Weak Distributional Hypothesis on the Relation Extraction Graph:}
	\emph{Two arcs conveying similar relations have similar neighborhoods.}
\end{spacedblock}
Note that we dubbed this version of the distributional hypothesis \emph{weak} since we only state it in one direction, the converse having several counterexamples.
For example, sentences about the place of birth and the place of death of a person tend to have similar neighborhoods despite conveying different relations.%
\sidenote{
	The neighborhoods are somewhat dissimilar in that ``notable'' people tend to die in places with more population than their birthplace.
	However, whether current models can pick this up from other kinds of regularity in a dataset is dubious.
}
To distinguish these kinds of relations with similar neighborhoods, we have to rely on sentence representations.%
\sidenote[][21mm]{
	This can partly explain the conditional entropy \(\entropy(\rndm{r}_2\mid\rndm{r}_1,\rndm{r}_3)\approx 1.06\ \text{bits}\) given in Section~\ref{sec:graph:analysis}.
}

Following this hypothesis, we first propose a simple parameter-less approach based on the Weisfeiler--Leman isomorphism test (Section~\ref{sec:graph:weisfeiler-leman}).
\emph{We can say that two neighborhoods are similar if they are isomorphic.}
Therefore, we can enforce the hypothesis above by ensuring that if two neighborhoods are assigned similar colorings by the \textsc{wl} algorithm, they convey similar relations.
In the relation extraction problem, contrary to much of the related work presented in Section~\ref{sec:graph:related work}, we have data on the arcs of the graph, not on the vertices.
This means that instead of using the 1-dimensional Weisfeiler--Leman algorithm, we use the 2-dimensional version.
In other words, instead of coloring the vertices, we color the arcs since our problem is to label them with a relation.

The initial coloring \(\chi_0(a)\) is defined as the isomorphism class of a sample \(a\in\arcSet\).
We can define this isomorphism class using \(\bertcoder(\gfsentence(a))\), which means that the initial representation of a sample is simply its sentential representation.
The difficult task is to define the re-indexing of colors as performed by \(\symfrak{I}\) in Algorithm~\ref{alg:graph:weisfeiler-leman}.
This is difficult since the original \textsc{wl} algorithm is defined on a discrete set of colors, while we need to manipulate distributed representations of sentences.

\begin{marginparagraph}
	The astute reader might have noticed that the 2-dimensional \textsc{wl} isomorphism test as described in Algorithm~\ref{alg:graph:weisfeiler-leman} loops over pairs of vertices, not arcs.
	This is impractical in our relation extraction graph, which is particularly sparse---the number of arcs \(m\) is far smaller than the number of vertex pairs \(n^2\).
	The extra (unlinked) entity pairs considered by Algorithm~\ref{alg:graph:weisfeiler-leman} are usually referred to as \emph{anti-arcs}.
	Ignoring anti-arcs leads to the local Weisfeiler--Leman isomorphism test since only the ``local neighborhood'' is considered.
	Other intermediate approaches are possible, sometimes referred to as the \emph{glocalized} variants of Weisfeiler--Leman.
	See \textcite{weisfeiler-leman_sparse} for an example of application to graph embeddings.
	Alternatively, our proposed approach can be seen as a 1-dimensional Weisfeiler--Leman isomorphism test applied to the line graph.
\end{marginparagraph}

If we want to produce clear-cut relation classes, we can use a hashing algorithm on sentence representations such as the one proposed for graph kernels by \textcite{graph_continuous_hashing}.
However, we focus on a few-shot evaluation in order to compare with \textsc{mtb} and to avoid errors related to knowledge base design as described in Section~\ref{sec:relation extraction:few-shot}.
In this case, we only need to be able to compare the colors of two different samples, measuring how close they are to each other.
Let us define \(\gfeneighbors\colon \arcSet\to2^\arcSet\), the function mapping an arc to the set of its neighbors.
Formally, for \(a\in\arcSet\), \(\gfeneighbors(a) = \{a'\in\arcSet\mid \gfendpoints(a)\cap\gfendpoints(a')\neq\emptyset\}\).
In other words, \(\gfeneighbors\) in \(G\) corresponds to the neighbors function \(\gfneighbors\) in the line graph \(L(G)\).
Since \(\arcSet\) can be seen as the set of samples, \(\gfeneighbors(a)\) can be seen as the set of samples with at least one entity in common with \(a\).
To enforce the weak distributional hypothesis stated above, we take two first-order neighborhoods \(\gfeneighbors(a), \gfeneighbors(a')\subseteq\arcSet\) and define a distance between them.
This corresponds to comparing two empirical distributions of sentence representations%
\sidenote{
	We are comparing sentence representations and not directly sentences since the initial coloring \(\chi_0\) has been defined using \bertcoder.
}
of samples that have an entity in common with \(a\) and \(a'\), respectively.
This can be done using the 1-Wasserstein distance between the two neighborhoods since they can be seen as two distributions of Dirac deltas in \bertcoder{} representation space.%
\sidenote{
	The Wasserstein distance has the advantage of working on distributions with disjoint supports.
}
This needs to be done for the two entities, which correspond to the in-arc-neighbors \(\gfeneighbors_\gfnleft\) and the out-arc-neighbors \(\gfeneighbors_\gfnright\).
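As an illustration, the following Python sketch (with a hypothetical data layout: two parallel lists giving the source and target entity of each arc) builds the split neighborhoods \(\gfeneighbors_\gfnleft\) and \(\gfeneighbors_\gfnright\) by grouping, for each arc, the other arcs incident to its source entity and to its target entity.
\begin{verbatim}
# Illustrative sketch only; the data layout (parallel `sources` and
# `targets` lists indexed by arc) is a simplifying assumption.
from collections import defaultdict

def build_arc_neighborhoods(sources, targets):
    incident = defaultdict(set)          # entity -> arcs touching it
    for a, (e1, e2) in enumerate(zip(sources, targets)):
        incident[e1].add(a)
        incident[e2].add(a)
    m = len(sources)
    # eps_left(a): neighbors through the source entity of a;
    # eps_right(a): neighbors through the target entity of a.
    eps_left = [incident[sources[a]] - {a} for a in range(m)]
    eps_right = [incident[targets[a]] - {a} for a in range(m)]
    return eps_left, eps_right
\end{verbatim}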
While this is 1-localized, we can generalize this encoding to be \(K\)-localized by defining the \(k\)-sphere centered on an arc \(a\), where the 1-sphere corresponds to \(\gfeneighbors\):
\begin{align*}
	S_\gfnright(a, 0) & = \{\,a\,\} \\
	S_\gfnright(a, k) & = \{\,x\in\arcSet \mid \exists y\in S_\gfnright(a, k-1): \gfsource(x)=\gftarget(y)\,\}.
\end{align*}
This sphere can be embedded using \bertcoder{}, which corresponds to retrieving its initial coloring:
\begin{equation*}
	\symfrak{S}_\gfnright(a, k) = \{\,\bertcoder(\gfsentence(x))\in\symbb{R}^d \mid x\in S_\gfnright(a, k)\,\}.
\end{equation*}
We can thereafter define the \(K\)-localized out-neighborhood of \(a\in\arcSet\) as the sequence of \(\symfrak{S}_\gfnright(a, k)\) for all \(k=1,\dotsc,K\).
The in-neighborhood is defined similarly.
Finally, the distance between two samples \(a, a'\in\arcSet\) can be defined as:
\begin{marginparagraph}
	To be precise, Equation~\ref{eq:graph:topological distance} defines a distance between samples from the Euclidean distances between neighboring samples---that is, samples with an entity in common.
	The distance \(W_1\) is the cost of the optimal transport plan between two sets of Dirac deltas corresponding to the neighborhoods of the samples.
\end{marginparagraph}
\begin{equation}
	d(a, a'; \vctr{\lambda}) =
	\sum_{k=0}^K \frac{\lambda_k}{2}
	\sum_{o\in\{\gfoleft, \gforight\}}
	W_1\left(\symfrak{S}_o(a, k), \symfrak{S}_o(a', k)\right),
	\label{eq:graph:topological distance}
\end{equation}
where \(W_1\) designates the 1-Wasserstein distance, and \(\vctr{\lambda}\in\symbb{R}^{K+1}\) weights the contribution of each sphere to the final distance value.
In particular, \(\lambda_0\) parametrizes how much the linguistic features weigh compared to the topological ones.%
\sidenote{
	The 1-Wasserstein distance is defined on top of a metric space; therefore, the difference between two neighbors must be defined using the Euclidean distance.
	We cannot use the dot product as is usually done with \textsc{bert} representations (see for example Equation~\ref{eq:relation extraction:mtb similarity}).
	However, we can slightly change Equation~\ref{eq:graph:topological distance} to use the dot product for the computation of the linguistic similarity (the term \(k=0\)).
	In this case, however, \(d\) would no longer satisfy the properties of a metric.
}

To relate this function back to our original re-coloring problem, the distance \(d\) up to \(K\) can be seen as a distance on \(\chi_K\), the coloring assigned at step \(K\).
Indeed, if \(d(a, a'; \vctr{\lambda}) = 0\), then \(\chi_K(a)=\chi_K(a')\).
However, while two colors are either equal or not in the original algorithm, the distance \(d\) gives a topology to the set of arcs.
We do not directly compute a hard coloring of 2-tuples.
The closest thing to a coloring \(\chi\) in our algorithm is the sphere embedding \(\symfrak{S}\), which, in fact, is more akin to \(c\) in Algorithm~\ref{alg:graph:weisfeiler-leman}.
In other words, we skip the re-indexing step of the Weisfeiler--Leman algorithm to deal with the continuous nature of sentence embeddings, at the price of increased computation.
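
The sketch below illustrates how Equation~\ref{eq:graph:topological distance} can be evaluated once the sphere embeddings \(\symfrak{S}_o(a, k)\) have been collected: each \(W_1\) term is approximated here with a plain entropy-regularized Sinkhorn solver (anticipating the discussion below); the function names and the choice of regularization strength are illustrative, not those of our implementation.
\begin{verbatim}
# Illustrative sketch only: entropy-regularized approximation of W_1
# between two embedded spheres (numpy arrays of shape (n, d)), plugged
# into the distance d(a, a'; lambda) of Equation (topological distance).
import numpy as np

def sinkhorn_w1(X, Y, eps=0.1, n_iter=100):
    """Approximate 1-Wasserstein distance between two point clouds."""
    C = np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=-1)
    a = np.full(len(X), 1.0 / len(X))    # uniform weights on X
    b = np.full(len(Y), 1.0 / len(Y))    # uniform weights on Y
    K = np.exp(-C / eps)
    u = np.ones_like(a)
    for _ in range(n_iter):              # Sinkhorn iterations
        v = b / (K.T @ u)
        u = a / (K @ v)
    P = u[:, None] * K * v[None, :]      # (approximate) transport plan
    return float((P * C).sum())

def topological_distance(spheres_a, spheres_b, lambdas):
    """spheres_x[o][k]: embedded k-sphere of an arc on side o (left/right)."""
    total = 0.0
    for k, lam in enumerate(lambdas):    # lambdas has K + 1 entries
        for o in (0, 1):                 # in- and out-neighborhoods
            total += lam / 2 * sinkhorn_w1(spheres_a[o][k], spheres_b[o][k])
    return total
\end{verbatim}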

Combining a Wasserstein distance with Weisfeiler--Leman was already proposed for graph kernels \parencite{weisfeiler-leman_wasserstein}.
However, this was applied to a simple graph without attributed edges, and it was unrelated to any information extraction task.
For unsupervised relation extraction, the distance function \(d\) can directly be used to compute the similarity between query and candidate samples in a few-shot problem (Section~\ref{sec:relation extraction:few-shot}).
Since the number of arcs at distance \(k\) grows quickly in a scale-free graph,%
\sidenote{
	Remember that the diameter of the (scale-free) graph is on the order of \(\log\log n\).
}
we either need to keep \(K\) low or employ sampling strategies similar to Graph\textsc{sage} (Section~\ref{sec:graph:spatial gcn}).
Furthermore, the Wasserstein distance is hard to compute exactly; to address this, entropic regularization of the objective has been proposed.
In particular, \(W_1\) can be efficiently approximated with Sinkhorn iterations \parencite{sinkhorn}.

\subsection{Refining Linguistic and Topological Features}
\label{sec:graph:refining}
While the nonparametric method presented in the \hyperref[sec:graph:nonparametric wl]{previous} section manages to consider both the linguistic and topological features, it processes them in isolation.
In this section, we propose a scheme that allows the encoders of linguistic and topological features to adapt to each other during training.
Conceptually, this is somewhat similar to Self\textsc{ore} (Section~\ref{sec:relation extraction:selfore}).
As a reminder, Self\textsc{ore} is a clustering method that purifies relation clusters by optimizing \bertcoder{} such that samples with close linguistic forms are pushed closer together.
In our scheme, we propose to refine both linguistic and topological features with respect to each other.
\begin{marginparagraph}
	As a reminder, \hypothesis{\ctxoneadj} states that two samples with similar contextualized embeddings convey similar relations.
	See Appendix~\ref{chap:assumptions}.
\end{marginparagraph}
In this way, we hope to enforce both \hypothesis{\ctxoneadj} and the following assumption:%
\begin{assumption}[oneneighborhood]{1-neighborhood}
	Two samples with the same neighborhood in the relation extraction graph convey the same relation.

	\smallskip
	\noindent
	\( \forall a, a'\in\arcSet\colon \gfeneighbors(a) = \gfeneighbors(a') \implies \gfrelation(a)=\gfrelation(a') \)
\end{assumption}
Note that this is the converse of the weak distributional hypothesis on the relation extraction graph stated in Section~\ref{sec:graph:nonparametric wl}.
We need to make the modeling hypothesis in this direction since, in the unsupervised relation extraction problem, we do not have access to relations and therefore cannot enforce a hypothesis between samples conveying the same relation.
We posit that by balancing \hypothesis{\ctxoneadj} and \hypothesis{1-neighborhood}, we are able to exploit the structure induced by both sources of information in an unsupervised sample \((s, \vctr{e})\in\dataSet\): the sentence \(s\) and the entities \(\vctr{e}\), whereas Self\textsc{ore} only relies on the sentence \(s\).

To define the topological and linguistic distance between two samples, we use the distance function defined by Equation~\ref{eq:graph:topological distance}.
For computational reasons, we set \(K=1\), which means that our model is 1-localized.
The linguistic distance is simply the distance between the \bertcoder{} representations of the samples' sentences.
In other words, it is \(d(a, a'; [1, 0]\transpose)\).
On the other hand, the topological distance can be defined as the distance between the two neighborhoods, in other words, \(d(a, a'; [0, 1]\transpose)\).
We propose to train \bertcoder{} such that these two distances coincide more closely.
In practice, this can be achieved with a triplet loss similar to the one used by TransE (Section~\ref{sec:context:transe}).
Given three arcs \(\vctr{a}\in\arcSet^3\), we ensure the two distances are similar between the first two arcs \(a_1\) and \(a_2\), and we contrast these distances using the third arc \(a_3\).
This translates to the following loss:
\begin{marginparagraph}
	Intuitively, we want to optimize the mean squared error (\textsc{mse}) between the linguistic and topological distances of all pairs of arcs, \((d(a_1, a_2; [1, 0]\transpose) - d(a_1, a_2; [0, 1]\transpose))^2\).
	However, this loss could be optimized by encoding all arcs into a single point.
	The output of \bertcoder{} would then be constant.
	Therefore, we need to regularize the \textsc{mse} loss such that distances that should not be close are not.
	This is the point of the triplet loss; we contrast the positive distance delta with a negative one.
	While \(d(a_1, a_2; [1, 0]\transpose)\) and \(d(a_1, a_2; [0, 1]\transpose)\) should be close to each other (because of \hypothesis{1-neighborhood}), they should not be close to any distance involving a third sample \(a_3\).
	This ensures that our model does not collapse.
\end{marginparagraph}
\begin{equation*}
	\loss{lt}(a_1, a_2, a_3) = \max\left(
	\begin{aligned}
		0, \zeta &
		+ 2 \big(d(a_1, a_2; [1, 0]\transpose) - d(a_1, a_2; [0, 1]\transpose)\big)^2 \\
		& \hspace{5mm} - \big(d(a_1, a_2; [1, 0]\transpose) - d(a_1, a_3; [0, 1]\transpose)\big)^2 \\
		& \hspace{5mm} - \big(d(a_1, a_3; [1, 0]\transpose) - d(a_1, a_2; [0, 1]\transpose)\big)^2
	\end{aligned}
	\right),
\end{equation*}
where \(\zeta > 0\) is a hyperparameter defining the maximum margin we seek to enforce between the positive distance error and the negative distance errors.
By randomly sampling arc triplets \(\vctr{a}\in\arcSet^3\), we can fine-tune \bertcoder{} in an unsupervised fashion such that it captures both linguistic and topological features.
During evaluation, the procedure described in Section~\ref{sec:graph:nonparametric wl} can be reused, such that the linguistic representations refined by the topological structure and the topological representations refined by the linguistic structure are used jointly.
However, both distances could also be used independently, for example if a sample contains unseen entities, or, on the contrary, if we want to assess which relation links two entities without any supporting sentence.
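
To make the training objective concrete, the following sketch expresses \(\loss{lt}\) given a differentiable distance function (for example, built from a differentiable Sinkhorn approximation of \(W_1\), so that gradients can flow back to \bertcoder{}); the function names and the default margin are illustrative assumptions rather than our exact implementation.
\begin{verbatim}
# Illustrative sketch only: `dist(a, a', lambdas)` is assumed to return a
# scalar torch tensor implementing d(a, a'; lambda) in a differentiable way.
import torch

def refine_loss(dist, a1, a2, a3, zeta=1.0):
    """Triplet loss matching the linguistic and topological distances."""
    ling = lambda x, y: dist(x, y, (1.0, 0.0))   # d(., .; [1, 0])
    topo = lambda x, y: dist(x, y, (0.0, 1.0))   # d(., .; [0, 1])
    positive = (ling(a1, a2) - topo(a1, a2)) ** 2
    negative = (ling(a1, a2) - topo(a1, a3)) ** 2 \
             + (ling(a1, a3) - topo(a1, a2)) ** 2
    return torch.clamp(zeta + 2 * positive - negative, min=0.0)
\end{verbatim}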