\section{Experiments}
\label{sec:graph:experiments}
Matching the blanks was trained on a huge unsupervised dataset that is not publicly available \parencite{mtb}.
To ensure reproducibility, we instead attempt to train on \textsc{t-re}x (Section~\ref{sec:datasets:trex}, \cite{trex}).
The evaluation is done in the few-shot setting (Section~\ref{sec:relation extraction:few-shot}) on the FewRel dataset (Section~\ref{sec:datasets:fewrel}) in the 5-way 1-shot setup.
Our code is available at \url{https://esimon.eu/repos/gbure}.

The \bertcoder{} model we use is the entity markers--entity start variant described in Section~\ref{sec:relation extraction:mtb sentential}, based on a \texttt{bert-base-cased} transformer.
For the standalone \textsc{bert} model, we use a \bertcoder{} with no post-processing layer.
The \textsc{mtb} model is followed by a layer norm even during pre-training, as described by \textcite{mtb}.
The \textsc{mtb} similarity function remains a dot product but is rescaled to be normally distributed.
When augmenting \textsc{mtb} with a \textsc{gcn}, we tried both the Chebyshev approximation described in Section~\ref{sec:graph:spectral gcn} and the mean aggregator of Section~\ref{sec:graph:spatial gcn}; however, at the time of writing, we were only able to train the Chebyshev variant.
The nonparametric \textsc{wl} algorithm uses a dot product for linguistic similarity and a Euclidean 1-Wasserstein distance for topological distance; the hyperparameters are \(\vctr{\lambda}=[-1, 0.2]\transpose\).

\leavevmode%
\begin{margintable}[0mm]
\centering
\input{mainmatter/graph/quantitative.tex}
\scaption[Preliminary results for FewRel valid accuracies of graph-based approaches.]{
Preliminary results for FewRel valid accuracies of graph-based approaches.
To better evaluate the efficiency of topological features, we report results on the subset of the dataset that is connected in \textsc{t-re}x.
}
\label{tab:graph:quantitative}
\end{margintable}
We report our results in Table~\ref{tab:graph:quantitative}.
The given numbers are accuracies on the subset of FewRel with at least one neighbor in \textsc{t-re}x.
The accuracies on the whole dataset are 73.74\% for linguistic features alone (\textsc{bert}) and 77.54\% for \textsc{mtb}.
Our results for \textsc{mtb} are still slightly below what \textcite{mtb} report because of the \textsc{bert} model size mismatch and the smaller pre-training dataset.
This gap is within expectations, as already reported by other works that used a similar setup in the supervised setting \parencite{mtb_low}.
On the other hand, our accuracy for a standalone \textsc{bert} is higher than what \textcite{mtb} report; we suspect this is due to our removal of the randomly initialized post-processing layer.

The top half of Table~\ref{tab:graph:quantitative} reports results for nonparametric models.
These models were not trained for the relation extraction task; they simply exploit an \textsc{mlm}-pretrained \textsc{bert} in clever ways.
As we can see, while topological features alone are somewhat less expressive for extracting relations, they still contain additional information that can be exploited jointly with linguistic features---this is what the nonparametric \textsc{wl} model does.

For parametric models, we have difficulties training on \textsc{t-re}x because of its relatively small size.
In practice, 66.89\% of FewRel entities are already mentioned in \textsc{t-re}x.
However, a standard 5-way 1-shot problem contains \((1+5)\times 2=12\) different entities.
We measure the empirical probability that all entities of a few-shot problem are connected in \textsc{t-re}x to be around 0.54\%.
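To put this number in perspective, a crude back-of-the-envelope estimate (our own, assuming entities appear independently) gives the probability that all twelve entities are at least mentioned as
\[
0.6689^{12} \approx 0.80\%,
\]
an estimated upper bound consistent with the measured 0.54\%, since being connected in the graph is a stricter requirement than merely being mentioned.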
Furthermore, we observe that \textsc{mtb} augmented with a \textsc{gcn} performs worse than a standalone \textsc{mtb}, despite adding only a single linear layer's worth of parameters (the \bertcoder{}s of the linguistic and topological distances are shared).
These are still preliminary results; however, it seems that the small size of \textsc{t-re}x, coupled with the large amount of additional information presented to the model, causes it to overfit the training data.
We observe a similar problem with the triplet loss model of Section~\ref{sec:graph:refining}.
At the time of writing, our current plan is to attempt training on a larger graph, similar to the unsupervised dataset of \textcite{mtb}.
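As an illustration, the nonparametric \textsc{wl} scoring described above can be sketched as follows. This is a minimal sketch under our own assumptions, not our actual implementation: we assume the two terms are combined linearly with the weights \(\vctr{\lambda}\), with the negative weight applying to the Wasserstein distance; the function names and the brute-force optimal-assignment solver are purely illustrative.

```python
import itertools
import math


def wasserstein_1(xs, ys):
    """Euclidean 1-Wasserstein distance between two equal-size uniform
    point clouds, computed by brute-force optimal assignment
    (tractable only for small neighborhoods)."""
    n = len(xs)
    return min(
        sum(math.dist(xs[i], ys[p[i]]) for i in range(n))
        for p in itertools.permutations(range(n))
    ) / n


def wl_score(ling_x, ling_y, topo_x, topo_y, lam=(-1.0, 0.2)):
    """Linearly combine a topological distance (smaller is better,
    hence the negative weight) with a linguistic dot-product
    similarity (larger is better)."""
    topological = wasserstein_1(topo_x, topo_y)
    linguistic = sum(a * b for a, b in zip(ling_x, ling_y))
    return lam[0] * topological + lam[1] * linguistic
```

With identical neighborhoods the topological term vanishes, and the score reduces to the downweighted linguistic similarity.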