      1 \section{Model description}
      2 \label{sec:fitb:model}
      3 Our model focuses on extracting the relation between two entities in textual data and assumes that an entity chunker has identified named entities in the text.
      4 Furthermore, following Section~\ref{sec:relation extraction:definition}, we limit ourselves to binary relations and therefore consider sentences with two tagged entities, as shown in Figure~\ref{fig:fitb:split}.
      5 These sentences constitute the set \(\sentenceSet\).
      6 We further assume that entity linking was performed and that we have access to entity identifiers from the set \(\entitySet\).
      7 We therefore consider samples from a dataset \(\dataSet\subseteq\sentenceSet\times\entitySet^2\).
      8 From these samples we learn a relation classifier that maps each sample \(x\in\dataSet\) to a relation \(r\in\relationSet\).
      9 As such, our approach is sentential (Section~\ref{sec:relation extraction:definition}).
     10 
To provide a supervision signal to our relation classifier, we follow the \textsc{vae} model of Section~\ref{sec:relation extraction:vae} \parencitex{vae_re}.
However, the interpretation of their model as a \textsc{vae} underlies some of the limitations we observed and conflicts with the modifications we introduce.
We therefore reformulate their approach as a \emph{fill-in-the-blank} task:
     14 \begin{indentedexample}
     15 	``The \uhead{sol} was the currency of \utail{~?~} between 1863 and 1985.''
     16 \end{indentedexample}
     17 To correctly fill in the blank, we could directly learn to predict the missing entity; but in this case, we would not be able to learn a relation classifier.
     18 Instead, we first want to learn that this sentence expresses the semantic relation ``currency used by'' before using this information for a (self-)supervised entity prediction task.
     19 To this end, we make the following assumption:
     20 \begin{assumption}{blankable}
     21 	The relation can be predicted by the text surrounding the two entities alone.
	Formally, using \(\operatorname{blanked}(s)\) to designate the tagged sentence \(s\in\sentenceSet\) from which the entities' surface forms were removed, we can write:
     23 
     24 	\smallskip
     25 	\noindent
     26 	\( \displaystyle \rndm{r} \independent \rndmvctr{e} \mid \operatorname{blanked}(\rndm{s}) \).
     27 \end{assumption}
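As an illustration, \(\operatorname{blanked}\) is a simple string transformation; a minimal Python sketch, where the \texttt{<e1>}/\texttt{<e2>} tag format and the blank token are assumptions made for this sketch only:

```python
import re

def blanked(tagged_sentence):
    """Replace the two entities' surface forms with a blank token, keeping
    the entity tags so the classifier still knows where the arguments are.
    The <e1>/<e2> tag format is a hypothetical choice for this sketch."""
    s = re.sub(r"<e1>.*?</e1>", "<e1>_</e1>", tagged_sentence)
    return re.sub(r"<e2>.*?</e2>", "<e2>_</e2>", s)

print(blanked("The <e1>sol</e1> was the currency of <e2>Peru</e2> between 1863 and 1985."))
# → The <e1>_</e1> was the currency of <e2>_</e2> between 1863 and 1985.
```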
     28 
Furthermore, since the information removed when going from \(\rndm{s}\) to \(\operatorname{blanked}(\rndm{s})\) is precisely the entities \(\rndmvctr{e}\), as a corollary of \hypothesis{blankable}, we have the equality \(P(\rndm{r}\mid \rndm{s}) = P(\rndm{r}\mid \operatorname{blanked}(\rndm{s}))\).
     30 Using this assumption and the above observation about filling blanked entities, we design a surrogate fill-in-the-blank task to train a relation extraction model.
This task adopts the view that a relation is precisely what allows us to predict \(e_2\) from \(e_1\) and vice versa.
     32 Our goal is to predict a missing entity \(e_{-i}\) given the predicted relation \(r\) and the other entity \(e_i\):
     33 \begin{marginparagraph}
     34 	Derivation of Equation~\ref{eq:fitb:model}:\\
     35 	\(P(e_{-i}\mid s, e_i)\)\\
     36 	First introduce and marginalize the latent relation variable \(r\) (``sum rule''):\\
     37 	\null\hfill\(\displaystyle= \sum_{r\in\relationSet} P(r, e_{-i}\mid s, e_i)\)\\
     38 	Apply the definition of conditional probability (``product rule''):\\
     39 	\null\hfill\(\displaystyle= \sum_{r\in\relationSet} P(r\mid s, e_i) P(e_{-i}\mid r, s, e_i)\)\\
     40 	Apply the independence \hypothesis{blankable} assumption on the first term and our definition of a relation on the second:\\
     41 	\null\hfill\(\displaystyle= \sum_{r\in\relationSet} P(r\mid s) P(e_{-i}\mid r, e_i)\)\\
     42 	Furthermore, by applying the corollary of \hypothesis{blankable}, we can write:\\
     43 	\null\hfill\(\displaystyle= \sum_{r\in\relationSet} P(r\mid \operatorname{blanked}(s)) P(e_{-i}\mid r, e_i)\)
     44 \end{marginparagraph}
     45 \begin{equation}
     46 	P(e_{-i} \mid s, e_i) = 
     47 	 \sum_{r\in\relationSet} \underbrace{P(r\mid s)}_{\text{(i) classifier}} \underbrace{P(e_{-i} \mid r, e_i)}_{\text{(ii) entity predictor}} \qquad \text{for } i=1, 2,
     48 	\label{eq:fitb:model}
     49 \end{equation}
where \(e_1, e_2\in\entitySet\) are the two entity identifiers, \(s\in\sentenceSet\) is the sentence mentioning them, and \(r\in\relationSet\) is the relation linking them.
     51 As the entity predictor can consider either entity, we use \(e_i\) to designate the given entity, and \(e_{-i}=\{e_1, e_2\}\setminus \{e_i\}\) the one to predict.
     52 
     53 The relation classifier \(P(r\mid s)\) and entity predictor \(P(e_{-i}\mid r, e_i)\) are trained jointly to discover a missing entity, with the constraint that the entity predictor cannot access the input sentence directly.
     54 Thus, all the required information must be condensed into \(r\), which acts as a bottleneck.
     55 We advocate that this information is the semantic relation between the two entities.
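Equation~\ref{eq:fitb:model} is a plain marginalization over the latent relation; a toy numerical sketch, where every distribution below is made up for illustration:

```python
def predict_missing_entity(p_r_given_s, p_e_given_r_e1):
    """P(e2 | s, e1) = sum_r P(r | s) * P(e2 | r, e1).

    p_r_given_s: relation -> probability (the classifier's output for s).
    p_e_given_r_e1: relation -> {entity -> probability} (the entity predictor)."""
    scores = {}
    for r, p_r in p_r_given_s.items():
        for e, p_e in p_e_given_r_e1[r].items():
            scores[e] = scores.get(e, 0.0) + p_r * p_e
    return scores

# Toy distributions: the classifier is fairly sure the sentence expresses
# "currency used by", which sharpens the entity prediction.
classifier = {"currency_of": 0.9, "capital_of": 0.1}
predictor = {"currency_of": {"Peru": 0.8, "France": 0.2},
             "capital_of": {"Peru": 0.5, "France": 0.5}}
print(predict_missing_entity(classifier, predictor))  # Peru: 0.77, France: 0.23
```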
     56 
     57 Note that \textcite{vae_re} did not make the \hypothesis{blankable} hypothesis.
     58 Instead, their classifier is conditioned on both \(e_i\) and \(e_{-i}\), strongly relying on the fact that \(r\) is an information bottleneck and will not ``leak'' the identity of \(e_{-i}\).
This is possible since they use pre-defined sentence representations; it cannot be enforced with the learned representations of a deep neural network.
     60 
     61 In the following, we first describe the relation classifier \(P(r\mid s)\) in Section~\ref{sec:fitb:classifier} before introducing the entity predictor \(P(e_{-i}\mid r, e_i)\) in Section~\ref{sec:fitb:entity predictor}.
     62 Arguing that the resulting model is unstable, we describe the two new RelDist losses in Section~\ref{sec:fitb:regularization}.
     63 
     64 \begin{figure}[t]
     65 	\centering
     66 	\input{mainmatter/fitb/fitb split.tex}
     67 	\scaption*[Fill-in-the-blanks sentence partition.]{
     68 		A sentence from Wikipedia where the conveyed relation is ``\textsl{currency used by}.''
		In contrast to Figure~\ref{fig:relation extraction:dipre split}, which presented \textsc{dipre}'s split-in-three-affixes, we do not label the entities' surface forms with \(e_1\) and \(e_2\) to avoid confusion with entity identifiers.
     70 		\label{fig:fitb:split}
     71 	}
     72 \end{figure}
     73 
     74 \subsection{Unsupervised Relation Classifier}
     75 \label{sec:fitb:classifier}
     76 
     77 Our model for \(P(r\mid s)\) follows the then state-of-the-art practices for supervised relation extraction by using a piecewise convolutional neural network (\textsc{pcnn}, Section~\ref{sec:relation extraction:pcnn}, \citex{pcnn}).
     78 Similar to \textsc{dipre}'s split-in-three-affixes, the input sentence can be split into three parts separated by the two entities (see Figure~\ref{fig:fitb:split}).
     79 In a \textsc{pcnn}, the model outputs a representation for each part of the sentence.
     80 These are then combined to make a prediction.
     81 Figure~\ref{fig:relation extraction:pcnn} shows the network architecture that we now describe.
     82 
     83 First, each word of \(s\) is mapped to a real-valued vector.
     84 In this work, we use standard word embeddings, initialized with GloVe%
     85 \sidenote{We use the \texttt{6B.50d} pre-trained word embeddings from \url{https://nlp.stanford.edu/projects/glove/}}
     86 (Section~\ref{sec:context:word2vec}, \cite{glove}), and fine-tune them during training.
     87 Based on those embeddings, a convolutional layer detects patterns in subsequences of words.
     88 Then, a max-pooling along the text length combines all features into a fixed-size representation.
     89 Note that in our architecture, we obtained better results by using three distinct convolutions, one for each sentence part (i.e.~the weights are not shared).
     90 We then apply a non-linear function (\(\tanh\)) and sum the three vectors into a single representation for \(s\).
     91 Finally, this representation is fed to a softmax layer to predict the distribution over the relations.
     92 This distribution can be plugged into Equation~\ref{eq:fitb:model}.
Denoting our classifier by \(\operatorname{\textsc{pcnn}}\), we have:
     94 \begin{equation*}
     95 	P(r\mid s) = \operatorname{\textsc{pcnn}}(r; s, \vctr{\phi}),
     96 \end{equation*}
     97 where \(\vctr{\phi}\) are the parameters of the classifier.
     98 Note that we can use the \textsc{pcnn} to predict the relationship for any pair of entities appearing in any sentence since the input will be different for each selected pair (see Figure~\ref{fig:relation extraction:pcnn}).
Furthermore, since the \textsc{pcnn} ignores the entities' surface forms, we have \(P(r\mid s) = P(r\mid \operatorname{blanked}(s))\), which is necessary to enforce \hypothesis{blankable}.
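The forward pass described above can be sketched in plain Python; the dimensions and weights below are random toys, meant to illustrate the architecture rather than reproduce the actual implementation:

```python
import math
import random

def conv_maxpool(embs, filters):
    """1D convolution (window of 2 words) followed by max-pooling over time.
    embs: list of word vectors; filters: flat weight vectors of length 2*dim."""
    if len(embs) < 2:                       # pad very short segments
        embs = embs + [[0.0] * len(embs[0])]
    return [max(sum(wi * xi for wi, xi in zip(w, embs[t] + embs[t + 1]))
                for t in range(len(embs) - 1))
            for w in filters]               # one max-pooled feature per filter

def pcnn(segments, filters_per_segment, out_weights):
    """Piecewise CNN sketch: a distinct convolution per sentence segment
    (weights not shared), tanh on each segment representation, sum of the
    three vectors, then a softmax over relations."""
    reps = [conv_maxpool(seg, f) for seg, f in zip(segments, filters_per_segment)]
    hidden = [sum(math.tanh(x) for x in xs) for xs in zip(*reps)]
    logits = [sum(w * h for w, h in zip(row, hidden)) for row in out_weights]
    z = max(logits)                         # stabilized softmax
    exps = [math.exp(l - z) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy setup: 3-d "embeddings", 4 filters per segment, 5 relations.
rng = random.Random(0)
rand_mat = lambda n, m: [[rng.uniform(-1, 1) for _ in range(m)] for _ in range(n)]
segments = [rand_mat(length, 3) for length in (3, 5, 2)]  # left / middle / right
filters = [rand_mat(4, 6) for _ in range(3)]              # three distinct convolutions
probs = pcnn(segments, filters, rand_mat(5, 4))
print(probs)  # a distribution over 5 relations
```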
    100 
    101 \subsection{Entity Predictor}
    102 \label{sec:fitb:entity predictor}
    103 The purpose of the entity predictor is to provide supervision for the relation classifier.
    104 As such, it needs to be differentiable.
We follow \textcite{vae_re} to model \(P(e_i \mid r, e_{-i})\) and use an energy-based formalism, where \(\psi(e_1, r, e_2)\) is the energy associated with the triplet \((e_1, r, e_2)\). The probability is obtained as follows:
    106 \begin{equation}
    107 	P(e_1 \mid r, e_2) = \frac
    108 		{\exp(\psi(e_1, r, e_2))}
    109 		{\sum_{e'\in\entitySet} \exp(\psi(e', r, e_2))},
    110 	\label{eq:fitb:entity predictor softmax}
    111 \end{equation}
where \(\psi\) is expressed as the sum of two standard relational learning models: selectional preferences (Section~\ref{sec:context:selectional preferences}) and \textsc{rescal} (Section~\ref{sec:context:rescal}):
\begin{equation*}
	\psi(e_1, r, e_2; \vctr{\theta}) = \underbrace{\vctr{u}_{e_1}\transpose \vctr{a}_r + \vctr{u}_{e_2}\transpose \vctr{b}_r}_\text{Selectional Preferences} + \underbrace{\vctr{u}_{e_1}\transpose \mtrx{C}_r \vctr{u}_{e_2}}_\textsc{rescal},
\end{equation*}
where \(\mtrx{U}\in\symbb{R}^{\entitySet\times m}\) is an entity embedding matrix, \(\mtrx{A}, \mtrx{B}\in\symbb{R}^{\relationSet\times m}\) are two matrices encoding the preferences of each relation for certain entities, \(\tnsr{C}\in\symbb{R}^{\relationSet\times m\times m}\) is a three-way tensor encoding the interactions between entities, and the hyperparameter \(m\) is the dimension of the entity embeddings.
The function \(\psi\) also depends on the energy function's parameters \(\vctr{\theta}=\{\mtrx{A}, \mtrx{B}, \tnsr{C}, \mtrx{U}\}\), which we might omit for legibility.
    118 \textsc{rescal} \parencite{rescal} uses a bilinear tensor product to gauge the compatibility of the two entities; whereas, in the Selectional Preferences model, only the predisposition of an entity to appear as the subject or object of a relation is captured.
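A small sketch of this energy function with toy numbers; the entity and relation names, embedding values, and \(m=2\) are all illustrative only:

```python
def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def psi(e1, r, e2, U, A, B, C):
    """Energy of (e1, r, e2): selectional preferences plus RESCAL.
    U: entity embeddings; A, B: per-relation preference vectors a_r, b_r;
    C: per-relation m-by-m interaction matrices C_r."""
    u1, u2 = U[e1], U[e2]
    sel_pref = dot(u1, A[r]) + dot(u2, B[r])          # u_e1^T a_r + u_e2^T b_r
    rescal = dot(u1, [dot(row, u2) for row in C[r]])  # u_e1^T C_r u_e2
    return sel_pref + rescal

# Toy parameters with m = 2, one relation, two entities.
U = {"sol": [1.0, 2.0], "Peru": [3.0, 1.0]}
A = {"currency_of": [0.5, 0.0]}
B = {"currency_of": [0.0, 1.0]}
C = {"currency_of": [[1.0, 0.0], [0.0, 1.0]]}  # identity: RESCAL term = u1 . u2
print(psi("sol", "currency_of", "Peru", U, A, B, C))  # → 6.5
```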
    119 
    120 \paragraph{Negative Sampling}
Since the number of entities is very large, the partition function of Equation~\ref{eq:fitb:entity predictor softmax} cannot be computed efficiently.
    122 To avoid the summation over the set of entities, we follow Section~\ref{sec:context:negative sampling} and use negative sampling \parencite{word2vec_follow-up}; instead of training a softmax classifier, we train a discriminator which tries to recognize real triplets (\(\rndm{D}=1\)) from fake ones (\(\rndm{D}=0\)):
    123 \begin{equation*}
    124 	P(\rndm{D}=1\mid e_1, e_2, r) = \sigmoid \left( \psi(e_1, r, e_2) \right),
    125 \end{equation*}
    126 where \(\sigmoid(x) = 1 \divslash (1 + \exp(-x))\) is the sigmoid function.
    127 This model is then trained by generating negative entities for each position and optimizing the negative log-likelihood:
    128 \begin{equation}
    129 	\begin{split}
    130 		\loss{ep}(\vctr{\theta}, \vctr{\phi}) = \expectation_{\substack{(\rndm{s}, \rndm{e}_1, \rndm{e}_2)\sim \uniformDistribution(\dataSet)\\\rndm{r}\sim \operatorname{\textsc{pcnn}}(\rndm{s}; \vctr{\phi})}} \bigg[ & - \log \sigmoid \left( \psi(\rndm{e}_1, \rndm{r}, \rndm{e}_2; \vctr{\theta}) + b_{\rndm{e}_1}\right) \\
    131 		& - \log \sigmoid \left( \psi(\rndm{e}_1, \rndm{r}, \rndm{e}_2; \vctr{\theta}) + b_{\rndm{e}_2}\right) \\
    132 		& - \sum_{j=1}^k \expectation_{\rndm{e}'\sim\uniformDistribution_\dataSet(\entitySet)} \left[ \log \sigmoid \left( - \psi(\rndm{e}_1, \rndm{r}, \rndm{e}'; \vctr{\theta}) - b_{\rndm{e}'} \right) \right] \\
    133 		& - \sum_{j=1}^k \expectation_{\rndm{e}'\sim\uniformDistribution_\dataSet(\entitySet)} \left[ \log \sigmoid \left( - \psi(\rndm{e}', \rndm{r}, \rndm{e}_2; \vctr{\theta}) - b_{\rndm{e}'} \right) \right] \bigg]
    134 	\end{split}
    135 	\label{eq:fitb:entity prediction loss}
    136 \end{equation}
    137 This loss is defined over the empirical data distribution \(\uniformDistribution(\dataSet)\), i.e.~the samples \((\rndm{s}, \rndm{e}_1, \rndm{e}_2)\) follow a uniform distribution over sentences tagged with two entities; and the empirical entity distribution \(\uniformDistribution_\dataSet(\entitySet)\), that is the categorical distribution over \(\entitySet\) where each entity is weighted by its frequency in \(\dataSet\).
    138 The distribution of the relation \(\rndm{r}\) for the sentence \(\rndm{s}\) is then given by the classifier \(\operatorname{\textsc{pcnn}}(\rndm{s}; \vctr{\phi})\), which corresponds to the \(\sum_{r\in\relationSet} P(r\mid s)\) in Equation~\ref{eq:fitb:model}.
Following standard practice, during training, the expectation over negative entities is approximated by sampling \(k\) random entities from the empirical entity distribution \(\uniformDistribution_\dataSet(\entitySet)\) for each position.
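A Monte Carlo sketch of this loss for a single sample; the score table, biases, and entity pool are made-up toys, and the relation argument of \(\psi\) is folded into the score function for brevity:

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def loss_ep_sample(score, e1, e2, bias, entity_pool, k, rng):
    """Negative-sampling loss estimate for one sample (s, e1, e2).
    score(h, t) stands in for psi(h, r, t) with the relation r already drawn;
    entity_pool lists entities with multiplicity, playing the role of the
    empirical entity distribution U_D(E)."""
    loss = -math.log(sigmoid(score(e1, e2) + bias[e1]))   # positive, head bias
    loss += -math.log(sigmoid(score(e1, e2) + bias[e2]))  # positive, tail bias
    for _ in range(k):
        neg_t = rng.choice(entity_pool)                   # corrupt the tail
        loss += -math.log(sigmoid(-score(e1, neg_t) - bias[neg_t]))
        neg_h = rng.choice(entity_pool)                   # corrupt the head
        loss += -math.log(sigmoid(-score(neg_h, e2) - bias[neg_h]))
    return loss

# Toy run: a score table over three entities, k = 2 negatives per position.
table = {("a", "b"): 2.0, ("a", "c"): -1.0, ("c", "b"): 0.5}
score = lambda h, t: table.get((h, t), 0.0)
bias = {"a": 0.1, "b": 0.2, "c": -0.3}
value = loss_ep_sample(score, "a", "b", bias, ["a", "b", "b", "c"], k=2,
                       rng=random.Random(0))
print(value)  # a positive scalar to be minimized
```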
    140 
    141 \paragraph{Biases}
    142 Following \textcite{vae_re}, we add a bias for entities to \(\psi\).
    143 These biases are parametrized by a single vector \(\vctr{b}\in\symbb{R}^\entitySet\).
They encode the fact that some entities are more likely to appear than others; as such, the \(+b_{e_i}\) terms appear in \loss{ep} where \(P(e_i\mid r, e_{-i})\) would appear in the negative sampling estimation.
    145 
    146 \paragraph{Approximation}
    147 When \(|\relationSet|\) is large, the expectation over \(\rndm{r}\sim \operatorname{\textsc{pcnn}}(\rndm{s}; \vctr{\phi})\) can be slow to evaluate.
To avoid computing \(\psi\) for all possible relations \(r\in\relationSet\), we employ an optimization also used by \textcite{vae_re}.
    149 This optimization is built upon the following approximation:
    150 \begin{equation}
    151 	\expectation_{\rndm{r}\sim \operatorname{\textsc{pcnn}}(\rndm{s}; \vctr{\phi})}[ \log \sigmoid (\psi(\rndm{e}_1, \rndm{r}, \rndm{e}_2; \vctr{\theta}))] \approx \log \sigmoid \left(\expectation_{\rndm{r}\sim \operatorname{\textsc{pcnn}}(\rndm{s}; \vctr{\phi})}\left[\psi(\rndm{e}_1, \rndm{r}, \rndm{e}_2; \vctr{\theta})\right]\right).
    152 \end{equation}
    153 Since the function \(\psi\) is linear in \(r\), we can efficiently compute its expected value over \(r\) using the convex combinations of the relation embeddings.
For example, we can replace the selectional preference of a relation \(r\) for a head entity \(e_1\), \(\vctr{u}_{e_1}\transpose \vctr{a}_r\), by the selectional preference of the distribution \(\operatorname{\textsc{pcnn}}(s; \vctr{\phi})\) for a head entity: \(\vctr{u}_{e_1}\transpose (\operatorname{\textsc{pcnn}}(s; \vctr{\phi})\transpose \mtrx{A})\).
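The linearity argument can be checked numerically; the sketch below compares the relation-by-relation expectation with the mixed-embedding computation on random toy values:

```python
import random

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def expected_sel_pref_naive(u, p_r, A):
    """E_{r~p}[u^T a_r], computed relation by relation (|R| dot products)."""
    return sum(p * dot(u, A[r]) for r, p in enumerate(p_r))

def expected_sel_pref_mixed(u, p_r, A):
    """Same quantity via the convex combination of relation embeddings:
    u^T (sum_r p(r) a_r), a single dot product against the mixed embedding."""
    m = len(A[0])
    a_bar = [sum(p_r[r] * A[r][i] for r in range(len(A))) for i in range(m)]
    return dot(u, a_bar)

rng = random.Random(0)
A = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(3)]  # 3 relations, m = 4
u = [rng.uniform(-1, 1) for _ in range(4)]                      # a head entity embedding
p = [0.2, 0.5, 0.3]                                             # a classifier output
print(expected_sel_pref_naive(u, p, A), expected_sel_pref_mixed(u, p, A))
```

Both calls print the same value, since the expectation commutes with the linear map.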
    155 
    156 \subsection{RelDist losses}
    157 \label{sec:fitb:regularization}
    158 Training the classifier through Equation~\ref{eq:fitb:entity prediction loss} alone is very unstable and dependent on precise hyperparameter tuning.
    159 More precisely, according to our early experiments, the training process usually collapses into one of two regimes:
    160 \begin{marginfigure}
    161 	\centering
    162 	\input{mainmatter/fitb/problem 1.tex}
    163 	\scaption[Illustration of \problem{1}.]{
    164 		Illustration of \problem{1}.
    165 		The classifier assigns roughly the same probability to all relations.
    166 		Instead, we would like the classifier to predict a single relation confidently.
    167 		\label{fig:fitb:problem 1}
    168 	}
    169 \end{marginfigure}%
    170 \begin{marginfigure}
    171 	\vspace{5mm}
    172 	\centering
    173 	\input{mainmatter/fitb/problem 2.tex}
    174 	\scaption[Illustration of \problem{2}.]{
    175 		Illustration of \problem{2}.
    176 		The classifier consistently predicts the same relation.
    177 		This is clearly visible when taking the average distribution (by marginalizing over the sentences \(\rndm{s}\)).
    178 		Instead, we would like the classifier to predict a diverse set of relations.
    179 		\label{fig:fitb:problem 2}
    180 	}
    181 \end{marginfigure}
    182 \begin{description}[style=multiline, labelwidth=\widthof{(\problem{2})\,}, leftmargin=\dimexpr\labelwidth+5mm\relax]
    183 	\item[(\problem{1})] The classifier is very uncertain about which relation is expressed and outputs a uniform distribution over relations (Figure~\ref{fig:fitb:problem 1});
    184 	\item[(\problem{2})] All sentences are classified as conveying the same relation (Figure~\ref{fig:fitb:problem 2}).
    185 \end{description}
    186 In both cases, the entity predictor can do a good job minimizing \loss{ep} by ignoring the output of the classifier, simply exploiting entities' co-occurrences.
    187 More precisely, many entities only appear in one relationship with a single other entity. In this case, the entity predictor can easily ignore the relationship \(r\) and predict the missing entity---and this pressure is even worse at the beginning of the optimization process as the classifier's output is not yet reliable.
    188 
This instability problem is particularly prevalent since the two components (classifier and entity predictor) are strongly interdependent: the classifier cannot be trained without a good entity predictor, which itself cannot take \(r\) into account without a good classifier, resulting in a bootstrapping problem.
    190 To overcome these pitfalls, we developed two additional losses, which we now describe.
    191 
    192 \paragraph{Skewness.}
    193 Firstly, to encourage the classifier to be confident in its output, we minimize the entropy of the predicted relation distribution.
    194 This addresses \problem{1} by forcing the classifier toward outputting one-hot vectors for a given sentence using the following loss:
    195 \begin{equation}
    196 	\loss{s}(\vctr{\phi}) = \expectation_{(\rndm{s}, \rndmvctr{e})\sim \uniformDistribution(\dataSet)} \left[ \entropy(\rndm{R} \mid \rndm{s}, \rndmvctr{e}; \vctr{\phi}) \right],
    197 	\label{eq:fitb:skewness}
    198 \end{equation}
    199 where \(\rndm{R}\) is the random variable corresponding to the predicted relation.
Following the independence \hypothesis{blankable}, the entropy of Equation~\ref{eq:fitb:skewness} is equal to \(\entropy(\rndm{R}\mid \rndm{s})\).
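A sketch of the skewness loss on two extreme classifier outputs, showing why minimizing it pushes toward one-hot predictions:

```python
import math

def skewness_loss(p):
    """H(R | s): entropy of the classifier's output for one sentence.
    Zero-probability relations contribute nothing (0 * log 0 = 0)."""
    return -sum(x * math.log(x) for x in p if x > 0.0)

print(skewness_loss([0.25, 0.25, 0.25, 0.25]))  # log 4 ~ 1.386: maximally uncertain
print(skewness_loss([1.0, 0.0, 0.0, 0.0]))      # 0.0: confident one-hot output
```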
    201 
    202 \paragraph{Distribution Distance.}
    203 Secondly, to ensure that the classifier predicts several relations, we enforce \hypothesis{uniform} by minimizing the Kullback--Leibler divergence between the model prior distribution over relations \(P(\rndm{R}\mid\vctr{\phi})\) and the uniform distribution%
    204 \sidenote[][15mm]{
    205 	Other distributions could be used, but in the absence of further information, this might be the best thing to do.
    206 	See Section~\ref{sec:fitb:conclusion} for a discussion of alternatives.
    207 }
    208 over the set of relations \(\uniformDistribution(\relationSet)\), that is:
    209 \begin{equation}
    210 	\loss{d}(\vctr{\phi}) = \kl(P(\rndm{R}\mid\vctr{\phi}) \mathrel{\|} \uniformDistribution(\relationSet)).
    211 	\label{eq:fitb:uniformity}
    212 \end{equation}
Note that, contrary to \loss{s}, the loss \loss{d} measures the unconditional distribution of \(\rndm{R}\), i.e.~the distribution of predicted relations over all sentences, so as to have a good approximation of \(P(\rndm{R}\mid\vctr{\phi})\).
    214 This addresses \problem{2} by forcing the classifier toward predicting each class equally often over a set of sentences.
    215 
To train the entity predictor and the classifier jointly and satisfactorily, we use the two losses at the same time, resulting in the final loss:
    217 \begin{equation}
    218 	\symcal{L}(\vctr{\theta}, \vctr{\phi}) = \loss{ep}(\vctr{\theta}, \vctr{\phi}) + \alpha \loss{s}(\vctr{\phi}) + \beta \loss{d}(\vctr{\phi}),
    219 	\label{eq:fitb:fullloss}
    220 \end{equation}
    221 where \(\alpha\) and \(\beta\) are both positive hyperparameters.
    222 
    223 All three losses are defined over the real data distribution, but in practice, they are approximated at the level of a mini-batch.
    224 First, both \loss{ep} and \loss{s} can be computed for each sample independently.
To optimize \loss{d}, however, we need to estimate \(P(\rndm{R})\) at the mini-batch level and maximize the entropy of the mean predicted relation distribution.
Formally, let \(s_i\) for \(i=1,\dotsc,B\) be the sentences in a batch of size \(B\); we approximate \loss{d}, up to the additive constant \(\log|\relationSet|\), as:
    227 \begin{equation*}
    228 	\sum_{r\in\relationSet} \left( \sum\limits_{i=1}^B \frac{\operatorname{\textsc{pcnn}}(r; s_i)}{B} \right) \log \left( \sum\limits_{i=1}^B \frac{\operatorname{\textsc{pcnn}}(r; s_i)}{B} \right).
    229 \end{equation*}
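This mini-batch estimate can be sketched as follows; it differs from the divergence of Equation~\ref{eq:fitb:uniformity} only by the additive constant \(\log|\relationSet|\), which does not affect the optimization:

```python
import math

def distance_loss(batch_probs):
    """Mini-batch estimate of L_d: sum_r pbar_r log pbar_r, i.e. the negative
    entropy of the mean predicted relation distribution over the batch."""
    B, R = len(batch_probs), len(batch_probs[0])
    mean = [sum(p[r] for p in batch_probs) / B for r in range(R)]
    return sum(m * math.log(m) for m in mean if m > 0.0)

# Collapsed batch (second failure regime): every sentence gets the same relation.
print(distance_loss([[1.0, 0.0], [1.0, 0.0]]))  # 0.0, the largest (worst) value
# Diverse batch: confident but varied predictions; the mean is uniform.
print(distance_loss([[1.0, 0.0], [0.0, 1.0]]))  # -log 2, the smallest (best) value
```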
    230 
    231 \paragraph{Learning}
    232 We optimize the empirical estimation of Equation~\ref{eq:fitb:fullloss}, learning the \textsc{pcnn} parameters and word embeddings \(\vctr{\phi}\) as well as the entity predictor parameters and entity embeddings \(\vctr{\theta}\) jointly.