\section{Experiments}
\label{sec:graph:experiments}
Matching the blanks was trained on a huge unsupervised dataset that is not publicly available \parencite{mtb}.
To ensure reproducibility, we instead attempt to train on \textsc{t-re}x (Section~\ref{sec:datasets:trex}, \cite{trex}).
The evaluation is done in the few-shot setting (Section~\ref{sec:relation extraction:few-shot}) on the FewRel dataset (Section~\ref{sec:datasets:fewrel}) in the 5-way 1-shot setup.
Our code is available at \url{https://esimon.eu/repos/gbure}.

The \bertcoder{} model we use is the entity markers--entity start variant described in Section~\ref{sec:relation extraction:mtb sentential}, based on a \texttt{bert-base-cased} transformer.
For the standalone \textsc{bert} model, we use a \bertcoder{} with no post-processing layer.
The \textsc{mtb} model is followed by a layer norm even during pre-training, as described by \textcite{mtb}.
The \textsc{mtb} similarity function remains a dot product but is rescaled to be normally distributed.
When augmenting \textsc{mtb} with a \textsc{gcn}, we tried both the Chebyshev approximation described in Section~\ref{sec:graph:spectral gcn} and the mean aggregator of Section~\ref{sec:graph:spatial gcn}; however, at the time of writing, we were only able to train the Chebyshev variant.
The nonparametric \textsc{wl} algorithm uses a dot product for linguistic similarity and a Euclidean 1-Wasserstein distance for topological distance; the hyperparameters are \(\vctr{\lambda}=[-1, 0.2]\transpose\).

\leavevmode%
\begin{margintable}[0mm]
\centering
\input{mainmatter/graph/quantitative.tex}
\scaption[Preliminary results for FewRel valid accuracies of graph-based approaches.]{
Preliminary results for FewRel valid accuracies of graph-based approaches.
To better evaluate the efficiency of topological features, we report results on the subset of the dataset that is connected in \textsc{t-re}x.
}
\label{tab:graph:quantitative}
\end{margintable}
We report our results in Table~\ref{tab:graph:quantitative}.
The given numbers are accuracies on the subset of FewRel with at least one neighbor in \textsc{t-re}x.
The accuracies on the whole dataset are 73.74\% for linguistic features alone (\textsc{bert}) and 77.54\% for \textsc{mtb}.
Our results for \textsc{mtb} are still slightly below what \textcite{mtb} report because of the \textsc{bert} model size mismatch and the smaller pre-training dataset.
This gap is within expectations, as already reported by other works that used a similar setup in the supervised setting \parencite{mtb_low}.
On the other hand, our accuracy for a standalone \textsc{bert} is higher than what \textcite{mtb} report; we suspect this is due to our removal of the randomly initialized post-processing layer.

The top half of Table~\ref{tab:graph:quantitative} reports results for nonparametric models.
These models were not trained for the relation extraction task; they simply exploit an \textsc{mlm}-pretrained \textsc{bert} in clever ways.
As we can see, while topological features alone are somewhat less expressive for extracting relations, they still contain additional information that can be exploited jointly with linguistic features---this is what the nonparametric \textsc{wl} model does.

For parametric models, we have difficulties training on \textsc{t-re}x because of its relatively small size.
In practice, 66.89\% of FewRel entities are already mentioned in \textsc{t-re}x.
However, a standard 5-way 1-shot problem contains \((1+5)\times 2=12\) different entities.
We measure the empirical probability that all entities of a few-shot problem are connected in \textsc{t-re}x to be around 0.54\%.
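To put this number in perspective, a crude back-of-the-envelope estimate (our own, assuming entities appear independently) gives the probability that all twelve entities are at least mentioned as
\[
0.6689^{12} \approx 0.80\%,
\]
an estimated upper bound consistent with the measured 0.54\%, since being connected in the graph is a stricter requirement than merely being mentioned.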
Furthermore, we observe that \textsc{mtb} augmented with a \textsc{gcn} performs worse than a standalone \textsc{mtb}, despite adding only a single linear layer's worth of parameters (the \bertcoder{}s of the linguistic and topological distances are shared).
These are still preliminary results; however, it seems that the small size of \textsc{t-re}x, coupled with the large amount of additional information presented to the model, causes it to overfit the training data.
We observe a similar problem with the triplet loss model of Section~\ref{sec:graph:refining}.
At the time of writing, our current plan is to attempt training on a larger graph, similar to the unsupervised dataset of \textcite{mtb}.
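As an illustration, the nonparametric \textsc{wl} scoring described above can be sketched as follows. This is a minimal sketch under our own assumptions, not our actual implementation: we assume the two terms are combined linearly with the weights \(\vctr{\lambda}\), with the negative weight applying to the Wasserstein distance; the function names and the brute-force optimal-assignment solver are purely illustrative.

```python
import itertools
import math


def wasserstein_1(xs, ys):
    """Euclidean 1-Wasserstein distance between two equal-size uniform
    point clouds, computed by brute-force optimal assignment
    (tractable only for small neighborhoods)."""
    n = len(xs)
    return min(
        sum(math.dist(xs[i], ys[p[i]]) for i in range(n))
        for p in itertools.permutations(range(n))
    ) / n


def wl_score(ling_x, ling_y, topo_x, topo_y, lam=(-1.0, 0.2)):
    """Linearly combine a topological distance (smaller is better,
    hence the negative weight) with a linguistic dot-product
    similarity (larger is better)."""
    topological = wasserstein_1(topo_x, topo_y)
    linguistic = sum(a * b for a, b in zip(ling_x, ling_y))
    return lam[0] * topological + lam[1] * linguistic
```

With identical neighborhoods the topological term vanishes, and the score reduces to the downweighted linguistic similarity.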