\section{Experiments}
\label{sec:fitb:experiments}
To compare with previous works, we repeat the experimental setup of \textcite{vae_re} with the \bcubed{} evaluation metric \parencite{bcubed}.
We complement this setup with two additional datasets extracted from \textsc{t-re}x \parencite{trex} and two more metrics commonly used to evaluate clustering tasks: V-measure \parencite{v-measure} and \textsc{ari} \parencite{ari}.
This allows us to capture the characteristics of each approach in more detail.

In this section, we begin by describing the processing of the datasets in Section~\ref{sec:fitb:datasets}.
We then describe the experimental details of the models we evaluated in Section~\ref{sec:fitb:baselines}.
Finally, we give quantitative results in Section~\ref{sec:fitb:quantitative} and qualitative results in Section~\ref{sec:fitb:qualitative}.
The description of the metrics can be found in Section~\ref{sec:relation extraction:clustering}.
Appendix~\ref{chap:datasets} gives further details on the source datasets, their specificities, their sizes and some examples of their content when appropriate.

\subsection{Datasets}
\label{sec:fitb:datasets}
As explained in Section~\ref{sec:relation extraction:unsupervised evaluation}, to evaluate the models, we use labeled datasets, the labels being used for validation and testing.
The first dataset we consider is the one of \textcite{vae_re}, which is similar to the one used by \textcite{rellda}.
This dataset was built through distant supervision (Section~\ref{sec:relation extraction:distant supervision}) by aligning sentences from the New York Times corpus (\textsc{nyt}, Section~\ref{sec:datasets:nyt}, \cite{nyt}) with Freebase (\textsc{fb}, Section~\ref{sec:datasets:freebase}, \cite{freebase}) facts.
Several sentences were filtered out based on features like the length of the dependency path between the two entities, resulting in 2~million sentences, only 41\,000 (2\%) of which are labeled with one of 262 possible relations.
20\% of the labeled sentences were set aside for validation; the remaining 80\% are used to compute the final results.

We also extracted two datasets from \textsc{t-re}x (Section~\ref{sec:datasets:trex}, \cite{trex}), which was built as an alignment of Wikipedia with Wikidata (Section~\ref{sec:datasets:wikidata}, \cite{wikidata}).
We only consider \((s, e_1, e_2)\) triplets where both entities appear in the same sentence.%
\sidenote{
	\textsc{t-re}x provides annotations for whole articles; it should therefore be possible to process broader contexts by defining \(\sentenceSet\) as a set of articles.
	However, in this work, we stay in the traditional sentence-level relation extraction setup.
}
If a single sentence contains multiple triplets, it appears multiple times in the dataset, each time with a different pair of tagged entities.
We built the first dataset, \textsc{ds}, by extracting all triplets of \textsc{t-re}x where the two entities are linked by a relation in Wikidata.
This is the usual distant supervision method.
It results in 1\,189 relations and nearly 12~million sentences, all of them labeled with a relation.
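For illustration, the following minimal sketch shows how such a distant-supervision extraction can be performed; the input format and field names are assumptions made for the example, not the actual \textsc{t-re}x schema.
\begin{verbatim}
# Hedged sketch of the DS extraction: keep every (sentence, e1, e2)
# pair whose entities are linked by a Wikidata relation.
def build_ds(documents, wikidata_facts):
    # documents: iterable of dicts with a "sentences" list, each sentence
    #   being a dict {"text": str, "entities": [entity ids]}
    # wikidata_facts: dict mapping (e1, e2) entity pairs to a relation id
    samples = []
    for document in documents:
        for sentence in document["sentences"]:
            entities = sentence["entities"]
            for e1 in entities:      # every ordered pair of tagged entities
                for e2 in entities:
                    if e1 == e2:
                        continue
                    relation = wikidata_facts.get((e1, e2))
                    if relation is not None:
                        # a sentence appears once per labeled entity pair
                        samples.append((sentence["text"], e1, e2, relation))
    return samples
\end{verbatim}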
In Wikidata, each relation is annotated with a list of associated surface forms; for example, ``\textsl{shares border with}'' can be conveyed by ``borders,'' ``adjacent to,'' ``next to,'' etc.
The second dataset we built, \textsc{spo}, only contains the sentences in which a surface form of the relation also appears, resulting in 763\,000 samples (6\% of the unfiltered dataset) and 615 relations.
This dataset still contains some misalignments, but it should nevertheless be easier for models to extract the correct semantic relation since the set of surface forms is much more restricted and much more regular.

\subsection{Baselines and Models}
\label{sec:fitb:baselines}
We compare our model with three state-of-the-art approaches: the two generative rel-\textsc{lda} models of \textcite{rellda}, the \textsc{vae} model of \textcite{vae_re} and the deep clustering of \textsc{bert} representations by \textcite{selfore}.

The two rel-\textsc{lda} models only differ by the number of features considered.
We use the eight features listed in \textcite{vae_re}:
\begin{enumerate}
	\item the bag of words of the infix;
	\item the surface form of the entities;
	\item the lemma words on the dependency path;
	\item the \textsc{pos} of the infix words;
	\item the type of the entity pair (e.g.\ person--location);
	\item the type of the head entity (e.g.\ person);
	\item the type of the tail entity (e.g.\ location);
	\item the words on the dependency path between the two entities.
\end{enumerate}
Rel-\textsc{lda} uses the first three features, while rel-\textsc{lda}1 is trained by iteratively adding more features until all eight are used.

To assess our two main contributions individually, we evaluate the \textsc{pcnn} classifier and our additional losses separately.
More precisely, we first study the effect of the RelDist losses by looking at the differences between models optimizing \(\loss{ep}+\loss{vae reg}\) and the ones optimizing \(\loss{ep}+\loss{s}+\loss{d}\), with \loss{ep} computed using either the relation classifier of \textcite{vae_re} or our \textsc{pcnn}.
Second, we study the effect of the relation classifier by comparing the feature-based classifier and the \textsc{pcnn} trained with the same losses.
We also give results for our RelDist losses together with a \bertcoder{} classifier.
This latter combination was evaluated by \textcite{selfore} following our experimental setup.
We thus focus mainly on four models:
\begin{itemize}
	\item \(\text{Linear}+\loss{vae reg}\), which corresponds to the model of \textcite{vae_re};
	\item \(\text{Linear}+\loss{s}+\loss{d}\), which uses the feature-based linear encoder of \textcite{vae_re} together with our RelDist losses;
	\item \(\textsc{pcnn}+\loss{vae reg}\), which uses our \textsc{pcnn} encoder together with the regularization of \textcite{vae_re};
	\item \(\textsc{pcnn}+\loss{s}+\loss{d}\), which is our complete model.
\end{itemize}

All models are trained with ten relation classes, a number that, while lower than the number of actual relations, allows us to compare the models faithfully since the distribution of gold relations is very unbalanced.
For feature-based models, the size of the feature domain ranges from 1 to 10~million values depending on the dataset.
We train our models with Adam, using \(L_2\) regularization on all parameters.
To have a good estimate of \(P(\rndm{R})\) in the computation of \(\loss{d}\), we use a batch size of 100.
Our word embeddings are of size 50 and our entity embeddings of size \(m=10\).
We sample \(k=5\) negative samples to estimate \(\loss{ep}\).
Lastly, we set \(\alpha=0.01\) and \(\beta=0.02\).
All three datasets come with a validation set, and following \textcite{vae_re}, we used it for cross-validation, optimizing the \bcubed{} \fone{}.
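To summarize these hyperparameters, here is a minimal sketch of the corresponding training configuration, assuming a PyTorch-style implementation; the learning rate and weight-decay values are illustrative assumptions, as only the use of Adam and \(L_2\) regularization is specified above.
\begin{verbatim}
import torch

RELATION_CLASSES = 10     # number of induced relation clusters
WORD_DIM = 50             # word embedding size
ENTITY_DIM = 10           # entity embedding size m
NEGATIVE_SAMPLES = 5      # k negative samples used to estimate L_ep
BATCH_SIZE = 100          # large enough to estimate P(R) in L_d
ALPHA, BETA = 0.01, 0.02  # coefficients alpha and beta of the RelDist losses

def make_optimizer(model: torch.nn.Module) -> torch.optim.Optimizer:
    # Adam with L2 regularization on all parameters; the exact lr and
    # weight_decay values below are assumptions for this sketch.
    return torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-5)
\end{verbatim}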
\subsection{Results}
\label{sec:fitb:quantitative}
\begin{table*}[t]
	\centering
	\input{mainmatter/fitb/quantitative.tex}
	\scaption[Quantitative results of clustering models.]{
		Results (percentage) on our three datasets.
		The results for rel-\textsc{lda}, rel-\textsc{lda}1, Linear and \textsc{pcnn} are our own, while results for \bertcoder{} and Self\textsc{ore}, marked with \(^\dagger\), are from \textcite{selfore}.
		The best results at the time of publication of our article are in \strong{bold}, while the best results at the time of writing are in \emph{italic}.
		\label{tab:fitb:quantitative}
	}
\end{table*}
The results reported in Table~\ref{tab:fitb:quantitative} are the average test scores of three runs on the \nytfb{} and \trexspo{} datasets, using different random initializations of the parameters---in practice, the variance was low enough for the reported results to be analyzed.
We observe that regardless of the model and metric, the highest measures are obtained on \trexspo{}, then \nytfb{} and finally \trexds{}.
This was to be expected since \trexspo{} was built to be easy, while hard-to-process sentences were filtered out of \nytfb{} \parencite{rellda, vae_re}.
We also observe that the main metrics (\bcubed{}, V-measure and \textsc{ari}) agree in most cases.
Performing a \textsc{pca} on the measures, we observed that V-measure forms a nearly orthogonal axis to \bcubed{} and, to a lesser extent, \textsc{ari}.
Hence, we can focus on \bcubed{} and V-measure in our analysis.

We first measure the benefit of our RelDist losses: on all datasets and metrics, the two models using \(\loss{s}+\loss{d}\) are systematically better than the ones using \loss{vae reg}:
\begin{itemize}
	\item The \textsc{pcnn} models consistently gain between 7 and 11 \bcubed{} \fone{} points from these additional losses;
	\item The feature-based linear classifier benefits from the RelDist losses to a lesser extent, except on the \trexds{} dataset, on which the \(\text{Linear}+\loss{vae reg}\) model without the RelDist losses completely collapses---we hypothesize that this dataset is too hard for the model given the number of parameters to estimate.
\end{itemize}

We now restrict our analysis to discriminative models based on \(\loss{s}+\loss{d}\).
We note that both relation classifiers (Linear and \textsc{pcnn}) exhibit better performance than the generative ones (rel-\textsc{lda}, rel-\textsc{lda}1), with a difference ranging from 2.5/0.6 (on \nytfb{}, for Linear/\textsc{pcnn}) to 11/17.8 (on \trexspo{}).
However, the advantage of \textsc{pcnn}s over feature-based classifiers is not completely clear.
While the \textsc{pcnn} version has a systematically better \bcubed{} \fone{} on all datasets (differences of 1.9/6.8/0.2 respectively for \nytfb{}/\trexspo{}/\trexds{}), the V-measure decreases by 0.4/4.0 on \nytfb{}/\trexds{} respectively, and the \textsc{ari} by 2.1 on \trexds{}.
As the \bcubed{} \fone{} was used for validation, this shows that the \textsc{pcnn} models overfit this metric, either by polluting relatively clean clusters with unrelated sentences or by degrading well-clustered gold relations, splitting them into two clusters.
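As a reference for how these metrics can be computed from gold and predicted cluster assignments, here is a minimal sketch using scikit-learn for the V-measure and \textsc{ari}, together with a direct implementation of the \bcubed{} \fone{}; the toy labels are purely illustrative.
\begin{verbatim}
import numpy as np
from sklearn.metrics import adjusted_rand_score, v_measure_score

def bcubed_f1(gold, pred):
    # B3 F1: harmonic mean of the average per-item precision and recall,
    # computed over same-cluster item pairs (including the item itself).
    gold, pred = np.asarray(gold), np.asarray(pred)
    precisions, recalls = [], []
    for i in range(len(gold)):
        same_pred = pred == pred[i]
        same_gold = gold == gold[i]
        both = np.logical_and(same_pred, same_gold).sum()
        precisions.append(both / same_pred.sum())
        recalls.append(both / same_gold.sum())
    p, r = np.mean(precisions), np.mean(recalls)
    return 2 * p * r / (p + r)

gold = [0, 0, 1, 1, 2]   # toy gold relation labels
pred = [1, 1, 0, 0, 0]   # toy predicted cluster assignments
print(bcubed_f1(gold, pred))            # B3 F1
print(v_measure_score(gold, pred))      # V-measure
print(adjusted_rand_score(gold, pred))  # ARI
\end{verbatim}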
The \bertcoder{} classifier improves all metrics consistently, with the sole exception of the V-measure on the \trexspo{} dataset.
This can be explained both by the larger expressive power of \textsc{bert} and by its pretraining as a language model.
The Self\textsc{ore} model, which is built on top of a \bertcoder{}, further improves the results on all datasets.
Since these results come from a subsequent work \parencite{selfore}, we will not delve into their details.
As mentioned in Section~\ref{sec:relation extraction:selfore}, Self\textsc{ore} is an iterative algorithm; the \hypothesis{uniform} assumption is enforced on the whole dataset at once, thus solving \problem{2}.
To solve \problem{1}, Self\textsc{ore} uses a concentration objective (through the square in the target distribution \(\mtrx{P}\) in Equation~\ref{eq:relation extraction:selfore target}).
While the \bertcoder{} can replace our \textsc{pcnn} classifier and can be evaluated with our regularization losses, the Self\textsc{ore} algorithm is a replacement for \(\loss{ep}+\loss{s}+\loss{d}\) and cannot be used jointly with \(\loss{s}+\loss{d}\).
In theory, the Self\textsc{ore} algorithm could be used with a linear or \textsc{pcnn} encoder.
However, Self\textsc{ore} strongly relies on a good initial representation; such a model would need to be pre-trained as a language model beforehand.

\subsection{Qualitative Analysis}
\label{sec:fitb:qualitative}
\begin{figure*}[t]
	\centering
	\renderConfusions
		{mainmatter/fitb/confusion lda.xml}{Rel-\textsc{lda}1}
		{mainmatter/fitb/confusion vae.xml}{\(\text{Linear}+\loss{vae reg}\)}
		{mainmatter/fitb/confusion regularized vae.xml}{\(\text{Linear}+\loss{s}+\loss{d}\)}
		{mainmatter/fitb/confusion pcnn.xml}{\(\textsc{pcnn}+\loss{s}+\loss{d}\)}
	\vspace{-7mm}
	\scaption[Confusion matrices on the \trexspo{} dataset.]{
		Normalized confusion matrices for the \trexspo{} dataset.
		For each model, each of the 10 columns corresponds to a predicted relation cluster; the columns were sorted to ease comparison.
		The rows identify Wikidata relations sorted by their frequency in the \trexspo{} corpus (reported as a percentage in front of each relation name).
		The area of each circle is proportional to the number of sentences in the cell.
		For clarity, the matrix was normalized so that each row sums to 1; it is therefore more akin to a \bcubed{} per-item recall than a true confusion matrix.
		\label{fig:fitb:confusion}
	}
\end{figure*}

Since all the metrics agree on the \trexspo{} dataset for our models of interest, we plot their confusion matrices in Figure~\ref{fig:fitb:confusion}.
Each row is labeled with the gold Wikidata relation extracted through distant supervision.
For example, the top left cell of each matrix corresponds to the value \(P\mkern1mu\big(c(\rndm{X})=0\mathrel{\big|} g(\rndm{X})=\text{``}\sfTripletHolds{e_1}{located in}{e_2}\text{''}\big)\) using the notation of Section~\ref{sec:relation extraction:unsupervised evaluation}.
Since relations are generally not symmetric, each Wikidata relation appears twice in the table, once for each disposition of the entities in the sentence.
This is particularly problematic for symmetric relations such as ``shares border,'' which end up as two different gold relations that actually convey the same semantic relation.
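The row normalization used in Figure~\ref{fig:fitb:confusion} can be sketched as follows, assuming the gold relations and the predicted clusters are encoded as integer indices; the function name and arguments are illustrative.
\begin{verbatim}
import numpy as np

def row_normalized_confusion(gold, pred, n_gold, n_pred):
    # Cell (g, c) estimates P(c(X) = c | g(X) = g), so each row sums to 1.
    counts = np.zeros((n_gold, n_pred))
    for g, c in zip(gold, pred):
        counts[g, c] += 1
    # Normalize each gold-relation row by its number of sentences; rows of
    # relations absent from the corpus would be undefined (division by
    # zero) and should be filtered out beforehand.
    return counts / counts.sum(axis=1, keepdims=True)
\end{verbatim}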
To interpret Figure~\ref{fig:fitb:confusion}, we have to check whether a predicted cluster (column) contains different gold relations---paying attention to the fact that the most important gold relations are listed in the top rows (the top 5 relations account for 50\% of the sentences).
The first thing to notice is that the confusion matrices of both models using our RelDist losses (\(\loss{s}+\loss{d}\)) are sparser (for each column), which means that our models better separate relations from each other.
We observe that \(\text{Linear}+\loss{vae reg}\) (the model of \cite{vae_re}) is affected by the pitfall \problem{1} (uniform distribution) for many gold clusters.
The \loss{vae reg} loss forces the classifier to be uncertain about which relation is expressed, translating into a dense confusion matrix and resulting in poor performances.
The rel-\textsc{lda}1 model is even worse and fails to identify clear clusters, showing the limitations of a purely generative approach that might focus on features not linked with any relation.

Focusing on our proposed model, \(\textsc{pcnn}+\loss{s}+\loss{d}\) (rightmost matrix), we looked at two different kinds of mistakes.
The first is a gold cluster divided in two (low recall).
When looking at clusters 0 and 1, we did not find any recognizable pattern distinguishing them.
Moreover, the corresponding entity predictor parameters are very similar.
This seems to be a limitation of the distance loss: splitting a large cluster in two may improve \loss{d} but worsen all the evaluation metrics.
The model is then penalized by the fact that it lost one slot to transmit information between the classifier and the entity predictor.
The second kind of mistake is when a predicted cluster corresponds to two gold ones (low precision).
Here, most of the mistakes seem understandable: ``shares border'' is symmetric (cluster 7), while ``located in'' and ``in country'' (cluster 8), or ``cast member'' and ``director of'' (cluster 9), are clearly related.
Note that the other variants are similarly affected, showing that the problem of granularity is complex.