\section{Alternative Models}
\label{sec:fitb:variants}
In this section, we present some variations we considered during the development of our model.
However, we did not manage to obtain satisfactory results with these variants.
When possible, we provide an analysis of why we think these variants did not work, keeping in mind that negative results are difficult to certify: poor results might be improved by a better hyperparameter search.

\paragraph{\textmd{\textsc{lstm}} Relation Classifier}
Instead of a \textsc{pcnn}, we tried using a deep \textsc{lstm} (Section~\ref{sec:context:lstm}) for our relation classifier.
We never managed to obtain any results with it; the training always collapsed into one of \problem{1} or \problem{2}.
An \textsc{lstm} is considerably harder to train than a \textsc{cnn}.
The representation provided by an \textsc{lstm} is the result of several non-linear operator compositions, through which it is hard to backpropagate information.
On the other hand, with a good initialization, the representation extracted by a \textsc{cnn} can be close to its input embeddings (which are pre-trained).
Since the training of the entity predictor heavily depends on the relation classifier, it is not surprising that the training fails with an \textsc{lstm}.
The failure of the \textsc{lstm} to provide a good representation at the beginning of the training procedure pushes the entity predictor to ignore the relation variable \(r\), which therefore does not receive any gradient and thus does not provide any supervision back to the \textsc{lstm}.
Retrospectively, pre-training the sentence representation extractor with a language modeling loss could have overcome this problem.
The initial representation would have been good enough for the entity predictor to provide some gradient back to the relation classifier.
This is confirmed by the work of \textcite{selfore}, who trained a \textsc{bert} relation classifier with our losses.
In the end, what made the \textsc{pcnn} work is its shallowness and the pre-trained GloVe word embeddings.

\paragraph{Gumbel--Softmax}
Another approach to tackling \problem{1} (uniform output) would be to use a discrete distribution for the relation \(r\); instead of marginalizing over all possible relations in Equation~\ref{eq:fitb:entity prediction loss}, we would only take the most likely relation.
However, taking the maximum would not be differentiable.
The Gumbel--softmax technique provides a solution to this problem.
Let \(y_r\in\symbb{R}\) denote the unnormalized score assigned to each relation \(r\in\relationSet\) by the \textsc{pcnn}.
It can be shown \parencite{gumbel_max} that sampling from \(\softmax(\vctr{y})\) is equivalent to taking \(\argmax_{r\in\relationSet} (y_r + \rndm{G}_r)\), where the \(\rndm{G}_r\) are randomly sampled from the Gumbel distribution.
Knowing this, \textcitex{gumbel_softmax} propose to use the following Gumbel--Softmax distribution:
\begin{equation*}
	\pi_r = \frac{\exp((y_r+\rndm{G}_r)\divslash\tau)}{\sum_{r'\in\relationSet}\exp((y_{r'}+\rndm{G}_{r'})\divslash\tau)}
\end{equation*}
This distribution has the advantage of being differentiable with respect to the scores \(\vctr{y}\), the Gumbel variables \(\rndm{G}_r\) being sampled independently of the parameters.
Furthermore, when the temperature \(\tau>0\) is close to 1, this distribution looks like a standard softmax output.
On the other hand, when the temperature is close to 0, this distribution is closer to a one-hot vector with low entropy.
Gradually decreasing the temperature throughout the training process should therefore help us solve \problem{1}.
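As an illustration, the relaxed sampling and the temperature annealing can be sketched as follows (a minimal NumPy sketch with illustrative variable names; in practice the same operations are performed inside an automatic-differentiation framework so that gradients flow through \(\vctr{\pi}\)):
\begin{verbatim}
import numpy as np

def gumbel_softmax(logits, tau, rng):
    """One relaxed sample from the Gumbel-softmax distribution."""
    # Gumbel(0, 1) noise: G = -log(-log(U)) with U ~ Uniform(0, 1)
    gumbel = -np.log(-np.log(rng.uniform(size=logits.shape)))
    scores = (logits + gumbel) / tau
    scores -= scores.max()        # stabilize before exponentiation
    probs = np.exp(scores)
    return probs / probs.sum()    # soft, differentiable stand-in for argmax

rng = np.random.default_rng(0)
y = np.array([2.0, 0.5, -1.0])    # unnormalized relation scores y_r
tau = 1.0                         # initial temperature
for epoch in range(10):
    pi = gumbel_softmax(y, tau, rng)
    # ... use pi in place of the marginalization over relations ...
    tau *= 0.9                    # annealing rate of 0.9 per epoch
\end{verbatim}
As \(\tau\) decreases, the samples \(\vctr{\pi}\) concentrate on a single relation, approaching the hard \(\argmax\) while remaining differentiable.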
\begin{table}[t]
	\centering
	\input{mainmatter/fitb/gumbel.tex}%
	\scaption*[Quantitative results of the Gumbel--Softmax model on the \nytfb{} dataset.]{
		Quantitative results of the Gumbel--Softmax model on the \nytfb{} dataset.
		The \loss{s} solution is used together with \loss{d} and a softmax activation, while the Gumbel--Softmax activation is used with \loss{d} only.
		Therefore, the first row reports the same results presented in Table~\ref{tab:fitb:quantitative}.
		\label{tab:fitb:gumbel}
	}
\end{table}

Following a grid search, we initially set \(\tau=1\) with an annealing rate of 0.9 per epoch.
Table~\ref{tab:fitb:gumbel} compares the best Gumbel--Softmax results of \(\loss{ep}+\loss{d}\) with the standard softmax results of \(\loss{ep}+\loss{s}+\loss{d}\) discussed above.
We do not use \loss{s} with the Gumbel--Softmax since both mechanisms seek to address \problem{1}.
While the Gumbel--Softmax prevents the model from falling entirely into \problem{1}, it still underperforms compared to the \loss{s} regularization of our standard model.

\paragraph{Aligning Sentences and Entity Pairs}
Another model we attempted to train aims to align sentences and entities.
It recombines our \textsc{pcnn} relation classifier with the energy function \(\psi\) into a new layout following a relaxation of the \hypothesis{pullback} assumption.%
\sidenote{This hypothesis, introduced in Section~\refAssumptionSection{pullback}, assumes that the relation can be found from the entities alone, and from the sentence alone.}
In this model, we obtain a distribution over the relations \(P(\rndm{r}_s\mid\operatorname{blanked}(s))\) using a \textsc{pcnn} as described in Section~\ref{sec:fitb:classifier}, but we also extract a distribution \(P(\rndm{r}_e\mid\vctr{e})\) using the energy function \(\psi\) normalized over the relations: \(P(r_e\mid e_1, e_2)\propto \exp(\psi(e_1, r_e, e_2))\).
This model clearly assumes \hypothesis{pullback} since it extracts a relation from the entities and from the sentence separately.
However, in contrast to other models assuming \hypothesis{pullback} (such as \textsc{dipre}, Section~\ref{sec:relation extraction:dipre}), we combine the separate relations into a single one to express the fact that a relation is conveyed both by the sentence and by the entities:
\begin{equation}
	P(\rndm{r}=r\mid s, \vctr{e}; \vctr{\theta}, \vctr{\phi}) = P(\rndm{r}_s=r\mid s; \vctr{\phi}) P(\rndm{r}_e=r\mid \vctr{e}; \vctr{\theta})
	\label{eq:fitb:align product}
\end{equation}
For the final prediction \(\rndm{r}\), the assumption \hypothesis{pullback} is not made, since it depends on both the sentence and the entities.
However, Equation~\ref{eq:fitb:align product} clearly assumes that \(\rndm{r}_s\) and \(\rndm{r}_e\) are independent, and \(\rndm{r}\) does not capture any interaction between \(s\) and \(\vctr{e}\).
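To make Equation~\ref{eq:fitb:align product} concrete, here is a minimal NumPy sketch of how the two relation distributions are combined (variable names and values are illustrative; the actual loss computes these quantities in log-space, see the margin note alongside Equation~\ref{eq:fitb:align loss} below):
\begin{verbatim}
import numpy as np

def softmax(y):
    y = y - y.max()               # numerical stability
    p = np.exp(y)
    return p / p.sum()

# Unnormalized relation scores from the two branches (illustrative values):
y_s = np.array([1.5, 0.2, -0.3])  # PCNN scores from blanked(s)
y_e = np.array([1.2, -0.8, 0.1])  # energies psi(e1, r, e2) for each relation

p_s = softmax(y_s)                # P(r_s | blanked(s))
p_e = softmax(y_e)                # P(r_e | e1, e2)
p_joint = p_s * p_e               # P(r | s, e) of Equation (align product)
# Note: p_joint sums to at most 1 over r; the training loss below pushes
# this sum towards 1, forcing the two distributions to agree.
\end{verbatim}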
To train this model, we force the two distributions to align by minimizing:
\begin{marginparagraph}[-4cm]
	For numerical stability, the first term of Equation~\ref{eq:fitb:align loss} needs to be computed as:
	\begin{multline*}
		- \log \sum_{r\in\relationSet}P(r\mid s, \vctr{e}; \vctr{\theta}, \vctr{\phi}) = \\
		\shoveright{- \log \sum_{r\in\relationSet} \exp (y_r^{(s)} + y_r^{(e)})} \\
		\shoveright{+ \log \sum_{r\in\relationSet} \exp (y_r^{(s)})} \\
		+ \log \sum_{r\in\relationSet} \exp (y_r^{(e)})
	\end{multline*}
	where \(\vctr{y}^{(s)}\) and \(\vctr{y}^{(e)}\) are the logits used for predicting \(\rndm{r}_s\) and \(\rndm{r}_e\) respectively.
\end{marginparagraph}
\begin{marginparagraph}[2cm]
	We also attempted (without success) to align the two distributions by minimizing \(\jsd(\rndm{r}_s \mathrel{\|} \rndm{r}_e)\), where \(\jsd\) is the Jensen--Shannon divergence defined as:
	\begin{align*}
		\jsd(\rndm{r}_s \mathrel{\|} \rndm{r}_e) = \frac{1}{2} \big(
		& \kl(\rndm{r}_s \mathrel{\|} \rndm{m}) \\
		& + \kl(\rndm{r}_e \mathrel{\|} \rndm{m}) \big)
	\end{align*}
	with \(\displaystyle P(\rndm{m}) = \frac{1}{2} \big( P(\rndm{r}_s) + P(\rndm{r}_e) \big)\).
\end{marginparagraph}
\begin{equation}
	\loss{align}(\vctr{\theta}, \vctr{\phi}) = - \log \sum_{r\in\relationSet}P(r\mid s, \vctr{e}; \vctr{\theta}, \vctr{\phi}) + \loss{d}(\vctr{\theta}) + \loss{d}(\vctr{\phi}).
	\label{eq:fitb:align loss}
\end{equation}
Here, \loss{s} is not needed since, in order to maximize the sum of the pointwise product of two probability mass functions, each distribution must be deterministic on a matching relation, which solves \problem{1}.

Table~\ref{tab:fitb:align} gives the results on the \nytfb{} dataset and compares them to the fill-in-the-blank model of Section~\ref{sec:fitb:model}.
The main problem we have with this model is its lack of stability.
The average, maximum and minimum given in Table~\ref{tab:fitb:align} are computed over eight runs.
Similar results were observed with slightly different setups, such as enforcing \loss{d} on the product (\(\rndm{r}\)) instead of on each distribution separately (\(\rndm{r}_s\) and \(\rndm{r}_e\)).
As we can see, the alignment model sometimes reaches excellent performance relative to the fill-in-the-blank model.
However, this happens rarely, and on average, it performs more poorly according to the \bcubed{} and \textsc{ari} metrics.
Its good V-measure scores are nevertheless encouraging.

\begin{table}[t]
	\centering
	\input{mainmatter/fitb/align.tex}
	\scaption[Quantitative results of the alignment models on the \nytfb{} dataset.]{
		Quantitative results of the alignment model on the \nytfb{} dataset.
		The first row reports the same results presented in Table~\ref{tab:fitb:quantitative}.
		Eight alignment models were trained; the average scores are given in the second row, while the third and fourth rows report the best and worst model among the eight.
		\label{tab:fitb:align}
	}
\end{table}
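For reference, the numerically stable computation described in the margin note alongside Equation~\ref{eq:fitb:align loss} amounts to three log-sum-exp reductions over the logits; a minimal NumPy/SciPy sketch (function and variable names are illustrative):
\begin{verbatim}
import numpy as np
from scipy.special import logsumexp

def alignment_nll(y_s, y_e):
    """First term of the alignment loss: -log sum_r P(r | s, e),
    computed in log-space from the logits of r_s and r_e."""
    return -logsumexp(y_s + y_e) + logsumexp(y_s) + logsumexp(y_e)

# The full loss adds the regularizers on both parameter sets:
# L_align = alignment_nll(y_s, y_e) + L_d(theta) + L_d(phi)
\end{verbatim}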