PhD

The LaTeX sources of my Ph.D. thesis
git clone https://esimon.eu/repos/PhD.git

sentence.tex (27410B)


\section{Distributed Representation of Sentences}
\label{sec:context:sentence}
Most \textsc{nlp} tasks are tackled at the sentence level.
In the previous section, we saw how to obtain representations of words.
We now focus on how to aggregate these word representations in order to process whole sentences.
Henceforth, given a sentence of length \(m\), we assume symbolic words \(\vctr{w}\in V^m\) are embedded as \(\mtrx{X}\in\symbb{R}^{m\times d}\) in a vector space of dimension \(d\).
This can be achieved through the use of an embedding matrix \(\mtrx{U}\in\symbb{R}^{V\times d}\) such as the one provided by word2vec.
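For concreteness, this embedding step is simply a row lookup in \(\mtrx{U}\): writing \(\mtrx{U}_{w_t}\) for the row of \(\mtrx{U}\) associated with the word \(w_t\) (a notation used only here), the \(t\)-th row of \(\mtrx{X}\) is
\begin{equation*}
	\vctr{x}_t = \mtrx{U}_{w_t} \qquad \text{for } t = 1, \dotsc, m.
\end{equation*}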

An early approach to sentence representation was to use \emph{bag-of-words}, that is, to simply ignore the ordering of the words.
In this section, we focus on more modern, deep learning approaches.
Section~\ref{sec:context:cnn} presents \textsc{cnn}s, which process fixed-length sequences of words to produce representations of sentences.
We then focus on \textsc{rnn}s in Section~\ref{sec:context:rnn}, which can produce representations of sentences through a causal language model.
\textsc{rnn}s can be improved by an attention mechanism, as explained in Section~\ref{sec:context:attention}.
Finally, we present transformers in Section~\ref{sec:context:transformers}, which build upon the concept of attention to extract state-of-the-art contextualized word representations.

\subsection{Convolutional Neural Network}
\label{sec:context:cnn}
\begin{marginfigure}[-45mm]
	\centering
	\input{mainmatter/context/cnn.tex}
	\scaption[Architecture of a single convolutional filter with a pooling layer.]{
		Architecture of a single convolutional filter with a pooling layer.
		The filter is of width 3, which means it works on trigrams.
		A single filter (the \(i\)-th) is shown here; this is repeated \(d'\) times, meaning that \(\vctr{h}_t,\vctr{o}\in\symbb{R}^{d'}\).}
	\label{fig:context:cnn}
\end{marginfigure}

Convolutional neural networks (\textsc{cnn}) can be used to build the representation of a sentence from the representation of its constituent words~\parencite{unified_nlp,cnn_classification}.
These word embeddings can come from word2vec (Section~\ref{sec:context:word2vec}) or can be learned using a \textsc{cnn} with a language model objective (Section~\ref{sec:context:language model}), the latter being the original approach proposed by \textcitex{unified_nlp}.

The basic idea behind \textsc{cnn}s is to recognize patterns in a position-invariant fashion~\parencite{tdnn}.
This is applicable to natural language following the principle of compositionality: the words composing an expression and the rules used to combine them determine its meaning, with little influence from the location of the expression in the text.
So, given a sequence of \(d\)-dimensional embeddings \(\vctr{x}_1, \dotsc, \vctr{x}_m\in\symbb{R}^d\), a one-dimensional \textsc{cnn} works on the \(n\)-grams of the sequence, that is, the subwords%
\sidenote{Here we use \emph{subwords} in its formal language theory meaning. In the simple setting where we deal with words in a sentence, this \emph{subword} actually designates a sequence of consecutive words.}
\(\vctr{x}_{t:t+n-1} = (\vctr{x}_t, \dotsc, \vctr{x}_{t+n-1})\) of length \(n\).
The basic design of a \textsc{cnn} is illustrated in Figure~\ref{fig:context:cnn}.
A convolutional layer is parametrized by \(d'\) filters \(\mtrx{W}^{(i)}\in\symbb{R}^{n\times d}\) of width \(n\), each with a bias \(b^{(i)}\in\symbb{R}\).
The \(t\)-th output of the \(i\)-th filter is defined as:
\begin{equation}
	h^{(i)}_t = f(\mtrx{W}^{(i)} * \vctr{x}_{t:t+n-1} + b^{(i)})
	\label{eq:context:convolution}
\end{equation}
where \(*\) is the convolution operator%
\sidenote{
	Usually, a cross-correlation operator is actually used, which is equivalent up to a mirroring of the filters when they are real-valued.
}
and \(f\) is a non-linear function.
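Written out explicitly (using the cross-correlation convention of the margin note, and writing \((\vctr{x}_{t+j-1})_k\) for the \(k\)-th coordinate of \(\vctr{x}_{t+j-1}\)), Equation~\ref{eq:context:convolution} amounts to:
\begin{equation*}
	h^{(i)}_t = f\left(\sum_{j=1}^{n} \sum_{k=1}^{d} \mtrx{W}^{(i)}_{j,k} (\vctr{x}_{t+j-1})_k + b^{(i)}\right).
\end{equation*}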
As is usual with neural networks, several layers of this kind can be stacked.
To obtain a fixed-size representation---which does not depend on the length of the sequence \(m\)---a pooling layer can be used.
The most common choice is max-over-time pooling~\parencite{maxpool}, which simply takes the maximum activation over time---that is, over the sequence length---for each feature \(i=1, \dotsc, d'\).
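In the notation of Figure~\ref{fig:context:cnn}, and assuming a single convolutional layer without padding (so that there are \(m-n+1\) \(n\)-gram positions), this pooling reads:
\begin{equation*}
	o_i = \max_{1 \leq t \leq m-n+1} h^{(i)}_t \qquad \text{for } i = 1, \dotsc, d'.
\end{equation*}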

In the same way that word2vec produces a real vector space where words with similar meanings are close to each other, the sentence representations \(\vctr{o}\) extracted by a \textsc{cnn} tend to be close to each other when the sentences convey similar meanings.
This is somewhat dependent on the task on which the \textsc{cnn} is trained.
However, a \textsc{cnn} is usually trained to extract the semantics of a sentence, and most tasks are such that sentences with similar meanings should have similar representations.

\subsection{Recurrent Neural Network}
\label{sec:context:rnn}
A limitation of \textsc{cnn}s is the difficulty they have modeling patterns of non-adjacent words.
A second approach to process whole sentences is to use recurrent neural networks (\textsc{rnn}).
\textsc{rnn}s aim to sum up an entire sentence prefix into a fixed-size hidden state, updating this hidden state as the sentence is processed.
This can be used to build a causal language model following the decomposition of Equation~\ref{eq:context:causal lm}.
As showcased by Figure~\ref{fig:context:rnn}, the hidden state \(\vctr{h}_t\) can be used to predict the next word \(w_{t+1}\) with a simple linear layer followed by a softmax, formally:
\begin{marginfigure}
	\centering
	\input{mainmatter/context/rnn lm.tex}
	\scaption{\textsc{rnn} language model unrolled through time.}
	\label{fig:context:rnn}
\end{marginfigure}
\begin{align}
	\vctr{h}_t & = f(\mtrx{W}^{(x)} \vctr{x}_t + \mtrx{W}^{(h)} \vctr{h}_{t-1} + \vctr{b}^{(h)})
		\label{eq:context:rnn} \\
	\hat{w}_{t+1} & = \softmax(\mtrx{W}^{(o)} \vctr{h}_t + \vctr{b}^{(o)})
		\nonumber
\end{align}
where \(\mtrx{W}^{(x)}\), \(\mtrx{W}^{(h)}\), \(\mtrx{W}^{(o)}\), \(\vctr{b}^{(h)}\) and \(\vctr{b}^{(o)}\) are model parameters and \(f\) is a non-linearity, usually a sigmoid \(f(x) = \sigmoid(x) = \frac{1}{1 + \symup{e}^{-x}}\).
This model is usually trained by minimizing the negative log-likelihood:
\marginnote{
	We generally use \(\vctr{\theta}\) to refer to the set of model parameters.
	In this case \(\vctr{\theta} = \{\mtrx{W}^{(x)}, \mtrx{W}^{(h)}, \mtrx{W}^{(o)}, \vctr{b}^{(h)}, \vctr{b}^{(o)}\}\).
}[1cm]
\begin{equation*}
	\loss{rnn}(\vctr{\theta}) = \sum_{t=1}^m - \log P(w_t\mid \vctr{x}_1, \dotsc, \vctr{x}_{t-1}; \vctr{\theta})
\end{equation*}
using the backpropagation-through-time algorithm.
The gradient is propagated back through all the steps of the \textsc{rnn} until reaching the beginning of the sequence.
When the sequence is a sentence, this can easily be achieved.
However, when longer spans of text are considered, the gradient only goes back a fixed number of tokens in order to limit memory usage.

\subsubsection{Long Short-term Memory}
\label{sec:context:lstm}
Standard \textsc{rnn}s tend to have a hard time dealing with long sequences.
This problem is linked to the vanishing and exploding gradient problems.
When the gradient goes through several non-linearities, it tends to carry less and less usable information, and gradient descent no longer leads to satisfying parameters.
In particular, when \(\mtrx{W}^{(h)}\) has a large spectral norm, the values \(\vctr{h}_t\) tend to get bigger and bigger with long sequences; on the other hand, when its spectral norm is small, these values get smaller and smaller.
When \(\vctr{h}_t\) has a large magnitude, the sigmoid activation saturates and \(\frac{\partial \loss{rnn}}{\partial \vctr{h}_t}\) gets close to zero: the gradient vanishes.
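This behavior can be made explicit by unrolling the recurrence of Equation~\ref{eq:context:rnn} with the chain rule:
\begin{equation*}
	\frac{\partial \vctr{h}_t}{\partial \vctr{h}_k} = \prod_{s=k+1}^{t} \frac{\partial \vctr{h}_s}{\partial \vctr{h}_{s-1}} = \prod_{s=k+1}^{t} \operatorname{diag}\big(f'(\vctr{a}_s)\big) \mtrx{W}^{(h)},
\end{equation*}
where \(\vctr{a}_s\) denotes the pre-activation of step \(s\) (a notation introduced only for this equation).
The norm of this product tends to grow or shrink geometrically with \(t-k\), depending on the spectral norm of \(\mtrx{W}^{(h)}\) and on the saturation of \(f\).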
\textsc{rnn} variants are used to alleviate this vanishing gradient problem, the most common being long short-term memory (\textsc{lstm}, \citex{lstm}).
\begin{figure}[ht!]
	\centering
	\input{mainmatter/context/lstm.tex}
	\scaption[Architecture of an \textsc{lstm} cell.]{
		\label{fig:context:lstm cell}
		Architecture of an \textsc{lstm} cell.
		In its simplest form, this block replaces the linear layer at the bottom of Figure~\ref{fig:context:rnn}.
		The link between \(\vctr{c}_t\) and \(\vctr{c}_{t-1}\) is illustrated by a self-loop but could be seen as an additional input and output.
	}
\end{figure}

\textsc{lstm}s redefine the recurrence of \textsc{rnn}s (Equation~\ref{eq:context:rnn}) by adding multiplicative gates as illustrated by Figure~\ref{fig:context:lstm cell}.
This recurrence is governed by the following set of equations:
\begin{equation*}
\def\arraystretch{1.25}
\begin{array}{l l l l}
		\vctr{x}'_t        & = & \begin{bmatrix}\vctr{x}_t\\ \vctr{h}_{t-1}\end{bmatrix}                               & \text{Recurrent input}\\[3.9mm]
		\tilde{\vctr{c}}_t & = & \tanh(\mtrx{W}^{(c)} \vctr{x}'_t + \vctr{b}^{(c)})                                    & \text{Cell candidate}\\
		\vctr{i}_t         & = & \sigmoid(\mtrx{W}^{(i)} \vctr{x}'_t + \mtrx{U}^{(i)} \vctr{c}_{t-1} + \vctr{b}^{(i)}) & \text{Input gate}\\
		\vctr{f}_t         & = & \sigmoid(\mtrx{W}^{(f)} \vctr{x}'_t + \mtrx{U}^{(f)} \vctr{c}_{t-1} + \vctr{b}^{(f)}) & \text{Forget gate}\\
		\vctr{c}_t         & = & \vctr{i}_t\odot \tilde{\vctr{c}}_t + \vctr{f}_t\odot \vctr{c}_{t-1}                   & \text{New cell}\\
		\vctr{o}_t         & = & \sigmoid(\mtrx{W}^{(o)} \vctr{x}'_t + \mtrx{U}^{(o)} \vctr{c}_t + \vctr{b}^{(o)})     & \text{Output gate}\\
		\vctr{h}_t         & = & \vctr{o}_t\odot \tanh(\vctr{c}_t)                                                     & \text{Hidden layer output}\\
\end{array}
\end{equation*}
\marginnote{
	\(\odot\) is the element-wise multiplication and \(\sigmoid\) the sigmoid function.
}[-18mm]
\marginnote{
	As with \textsc{rnn}, \(\vctr{\theta} = \{ \mtrx{W}^{(c)}, \mtrx{W}^{(i)}, \mtrx{U}^{(i)},\\\mtrx{W}^{(f)}, \mtrx{U}^{(f)}, \mtrx{W}^{(o)}, \mtrx{U}^{(o)}, \vctr{b}^{(c)}, \vctr{b}^{(f)}, \vctr{b}^{(i)},\\\vctr{b}^{(o)} \}\) are model parameters.
}[-8mm]

The main peculiarity of \textsc{lstm} is the presence of multiple gates used as masks or mixing factors in the unit.
\textsc{lstm} units are interpreted as having an internal cell memory \(\vctr{c}_t\) which is an additional (internal) state alongside \(\vctr{h}_t\) and is used as input of the cell alongside \(\vctr{x}_t\) and \(\vctr{h}_{t-1}\).
When computing its activation, we first compute a cell candidate \(\tilde{\vctr{c}}_t\) which is the potential successor to \(\vctr{c}_t\).
Then, the multiplicative gates come into play: the cell \(\vctr{c}_t\) is partially updated with a mix of \(\vctr{c}_{t-1}\) and \(\tilde{\vctr{c}}_t\) controlled by the input and forget gates \(\vctr{i}_t\) and \(\vctr{f}_t\).
Finally, the output of the unit is masked by the output gate \(\vctr{o}_t\).%
\sidenote{Note that the output gate \(\vctr{o}_t\) has its value computed from the new cell value \(\vctr{c}_t\) instead of \(\vctr{c}_{t-1}\), in contrast to the expressions of \(\vctr{i}_t\) and \(\vctr{f}_t\).}

It has been theorized~\parencite{lstm_vanishing} that the gates are what makes \textsc{lstm}s so powerful.
The multiplications allow the model to learn to control the flow of information in the unit, thus counteracting the vanishing gradient problem.
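To give an intuition, consider the gradient of the cell state: ignoring the (indirect) dependence of the gates on \(\vctr{c}_{t-1}\), the update \(\vctr{c}_t = \vctr{i}_t\odot \tilde{\vctr{c}}_t + \vctr{f}_t\odot \vctr{c}_{t-1}\) gives
\begin{equation*}
	\frac{\partial \vctr{c}_t}{\partial \vctr{c}_{t-1}} \approx \operatorname{diag}(\vctr{f}_t),
\end{equation*}
so, as long as the forget gate stays close to 1, the gradient can flow through the cell state over many steps without being repeatedly squashed by a saturating non-linearity.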
The basic building block of multiplicative gates has been reused for other \textsc{rnn} cell designs such as gated recurrent unit (\textsc{gru}, \cite{nmt_encdec}).
Furthermore, random cell designs using multiplicative gates can be shown to perform as well as \textsc{lstm}~\parencite{lstm_odyssey}.
However, standard practice is to always use \textsc{lstm} or \textsc{gru} for recurrent neural networks.

\subsubsection{\textsc{elm}o}
\label{sec:context:elmo}
Recurrent neural networks with \textsc{lstm} cells were widely used for language modeling, both at the character-level~\parencite{charrnn} and at the word-level~\parencite{lm_limits}.
The first language model to become widely used for extracting contextual word embeddings was \textsc{elm}o (Embeddings from Language Model, \citex{elmo}), which uses several \textsc{lstm} layers.

The peculiarity of the word embeddings extracted by \textsc{elm}o is that they are contextualized (see Section~\ref{sec:context:language model}).
Static word embedding models like word2vec (Section~\ref{sec:context:word2vec}) map each word to a unique vector.
However, this fares poorly with polysemic words and homographs whose meaning depends on the context in which they are used.
\begin{marginparagraph}
	Before \textsc{elm}o, \textcite{cove} already trained contextualized word representations using an \textsc{nmt} task.
\end{marginparagraph}
Contextualized word embeddings provide an answer to this problem.
Given a sentence, \textsc{elm}o proposes to use the hidden states \(\vctr{h}_t\) as a representation of each constituent word \(w_t\).
These representations are hence a function of the whole sentence.%
\sidenote{
	In order to encode both the left and right context of a word, \textsc{elm}o uses bidirectional \textsc{lstm}s, meaning that each layer contains two \textsc{lstm}s, one running left-to-right and one right-to-left.}
Thus words are mapped to different vectors in different contexts.

\subsection{Attention Mechanism}
\label{sec:context:attention}
To obtain a vector representation of a sentence from an \textsc{rnn}, two straightforward methods are to use the last hidden state \(\vctr{h}_m\) or to use a pooling layer similar to the one used in \textsc{cnn}s, such as max-over-time pooling.
However, both of these approaches present shortcomings: the last hidden state tends to encode little information about the beginning of the sentence, while pooling is too indiscriminate and influenced by unimportant words.
Using an attention mechanism is a way to avoid these shortcomings.
Furthermore, an attention mechanism is parametrized by a \emph{query}, which allows us to select the piece of information we want to extract from the sentence.

The concept of attention first appeared in neural machine translation (\textsc{nmt}) under the name ``alignment''~\parencitex{attention} before becoming ubiquitous in \textsc{nlp}.
The same principle was also presented under the name \emph{memory network}~\parencite{memory_networks, memory_networks_end-to-end}.
It is also the building block of transformers, which are presented \hyperref[sec:context:transformers]{next}.
With this in mind, we use the vocabulary of memory networks to describe the attention mechanism.

\begin{figure}[ht!]
	\centering
	\input{mainmatter/context/attention.tex}
	\scaption[Schema of an attention mechanism.]{
		Schema of an attention mechanism.
		The attention scores are obtained by an inner product between the query and the memory.
		The output is obtained as a sum of the memory weighted by the softmax of the attention scores.
		\label{fig:context:attention}
	}
\end{figure}

\subsubsection{Attention as a Mechanism for \textsc{rnn}}
The principle of an attention layer on top of an \textsc{rnn} is illustrated by Figure~\ref{fig:context:attention}.
The layer takes three inputs: a query \(\vctr{q}\in\symbb{R}^d\), memory keys \(\mtrx{K}\in\symbb{R}^{\ell\times d}\) and memory values \(\mtrx{V}\in\symbb{R}^{\ell\times d'}\).
Originally, more often than not, the keys and values were taken to be equal: \(\mtrx{K}=\mtrx{V}\).
In the model of Figure~\ref{fig:context:attention}, the memory corresponds to the hidden states of the \textsc{rnn}, which was the most common architecture when attention was introduced in 2014.
First, attention weights are computed from the query \(\vctr{q}\) and keys \(\mtrx{K}\), and then these weights are used to compute the output \(\vctr{o}\in\symbb{R}^{d'}\) as a convex combination of the values \(\mtrx{V}\)\,:
\begin{marginparagraph}[-11mm]
	Here \(\softmax\) is a smooth version of the \(\argmax\) function.
	It can also be seen as a multi-dimensional sigmoid, defined as:
	\begin{equation*}
		\softmax(\vctr{x})_i = \frac{\exp x_i}{\sum_j \exp x_j}
	\end{equation*}
\end{marginparagraph}
\begin{equation}
	\vctr{o} = \softmax(\mtrx{K}\vctr{q})\mtrx{V}.
\end{equation}
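Writing \(\vctr{\alpha} = \softmax(\mtrx{K}\vctr{q})\in\symbb{R}^\ell\) for the attention weights and \(\mtrx{V}_j\in\symbb{R}^{d'}\) for the \(j\)-th row of \(\mtrx{V}\) (both notations introduced here for convenience), this is simply a weighted average of the memory values, which makes the ``convex combination'' explicit:
\begin{equation*}
	\vctr{o} = \sum_{j=1}^{\ell} \alpha_j \mtrx{V}_j, \qquad \alpha_j \geq 0, \qquad \sum_{j=1}^{\ell} \alpha_j = 1.
\end{equation*}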

In \textsc{nmt}, the memory is built from the hidden states of an \textsc{rnn} running on the sentence to be translated (meaning \(\ell=m\)), while the query is the state of the translated sentence (``what was already translated''); the attention is then recomputed for each output position.
In other words, a new representation of the source sentence is recomputed for each word in the target sentence.
The attention weights---that is, the output of the softmax---can provide an interpretation of what the model is focusing on when making a prediction.
In the case of \textsc{nmt}, the attention for producing a translated word usually focuses on the corresponding word or group of words in the source sentence.

\subsubsection{Attention as a Standalone Model}
\label{sec:context:attention lm}
Since the attention mechanism produces a fixed-size representation (\(\vctr{o}\)) from a variable-length sequence (\(\mtrx{K}\), \(\mtrx{V}\)\,), it can actually be used by itself without an \textsc{rnn}.
This was already mentioned by \textcitex{memory_networks_end-to-end}[-10mm] and used for language modeling.
We now succinctly present their approach.
As shown in Figure~\ref{fig:context:memory network lm}, this is a causal language model (Section~\ref{sec:context:language model}): at each step, \(P(w_t\mid w_1,\dotsc,w_{t-1})\) is modeled.
While the previous words constitute the memory of the attention mechanism, there is no natural value for the query.
As such, for the first layer, it is simply taken to be a constant vector \(q^{(1)}_i = 0.1\) for all \(i=1,\dotsc, d\).
When several attention layers are stacked, the output \(o^{(l)}\) of a layer \(l\) is used as the query \(q^{(l+1)}\) for the layer \(l+1\).
Furthermore, residual connections with linear layers and modified ReLU non-linearities%
\sidenote[][-35.5mm]{
	While the standard ReLU activation~\parencite{relu} is defined as \(\ReLU(x)=\max(0, x)\), the non-linearity used in this model is \(\ReLU_{\halfCircleScript}\), which applies the ReLU activation to half of the units in the layer.}
are introduced between layers thus: \(\vctr{q}^{(l+1)} = \ReLU_{\halfCircleScript}(\mtrx{W}^{(l)} \vctr{q}^{(l)} + \vctr{o}^{(l)})\), where the matrices \(\mtrx{W}^{(l)}\in\symbb{R}^{d\times d}\) are parameters of the model.
As usual, the next word prediction \(\hat{w}_i\) is made using a softmax layer.
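Putting the pieces together for the two-layer model of Figure~\ref{fig:context:memory network lm}, a sketch of the full computation is (the final projection, written \(\mtrx{W}^{(\text{out})}\) here, is our notation; the combination at the top layer follows the same residual form as above):
\begin{align*}
	\vctr{o}^{(1)} & = \softmax(\mtrx{K}\vctr{q}^{(1)})\mtrx{V}, & \vctr{q}^{(2)} & = \ReLU_{\halfCircleScript}(\mtrx{W}^{(1)} \vctr{q}^{(1)} + \vctr{o}^{(1)}), \\
	\vctr{o}^{(2)} & = \softmax(\mtrx{K}\vctr{q}^{(2)})\mtrx{V}, & \hat{w}_i & = \softmax\big(\mtrx{W}^{(\text{out})} \ReLU_{\halfCircleScript}(\mtrx{W}^{(2)} \vctr{q}^{(2)} + \vctr{o}^{(2)})\big).
\end{align*}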

\begin{marginfigure}[-30mm]
	\centering
	\input{mainmatter/context/memory network lm.tex}
	\scaption[Schema of a memory network language model with two layers.]{
		Schema of a memory network language model with two layers.
		Each red block corresponds to an attention mechanism as illustrated by Figure~\ref{fig:context:attention}.
	}
	\label{fig:context:memory network lm}
\end{marginfigure}

\paragraph{Temporal Encoding}
The attention mechanism as described above is invariant to a permutation of the memory.
This is not a problem when an \textsc{rnn} is run on the sentence, as it can encode the relative positions of each token.
However, in the \textsc{rnn}-less approach of \textcite{memory_networks_end-to-end}, this information is lost, which is quite damaging for language modeling.
Indeed, this would mean that shuffling the words in a sentence---like inverting the subject and object of a verb---does not change its meaning.
In order to solve this problem, temporal encoding is introduced.
When predicting \(w_i\), each word embedding \(\vctr{x}_j\) in the memory is summed with a relative position embedding \(\vctr{e}_{i-j}\).
These position embeddings are trained through back-propagation like any other parameters.
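With the notation of the previous section, and reusing \(\vctr{x}_j\) for both keys and values (as in the basic setting where \(\mtrx{K}=\mtrx{V}\); the actual model uses separate embeddings for each), the memory built to predict \(w_i\) can be sketched as:
\begin{equation*}
	\mtrx{K}_j = \mtrx{V}_j = \vctr{x}_j + \vctr{e}_{i-j} \qquad \text{for } j = 1, \dotsc, i-1.
\end{equation*}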

\bigskip

Attention mechanisms form the basis of current state-of-the-art approaches in \textsc{nlp}.
One of the explanations behind their success is that, in a sense, they are shallower than \textsc{rnn}s.
Indeed, when computing \(\frac{\partial \hat{w}_i}{\partial \vctr{x}_j}\) for the language model of \textcite{memory_networks_end-to-end}, one can see that part of the gradient goes through only a few non-linearities.
In contrast, the information from \(\vctr{x}_j\) to \(\hat{w}_i\) must go through the composition of at least \(i-j\) non-linearities in an \textsc{rnn}, which may cause the gradient to vanish.
However, an attention mechanism has linear complexity in the length of the sequence, for a total of \(\Theta(m\times d^2)\) operations at each step.
When \(m\) is large, this can be prohibitive compared to \textsc{rnn}s, which have a \(\Theta(d^2)\) complexity at each step.
On the other hand, an attention layer can easily be parallelized, while an \textsc{rnn} always necessitates \(\Omega(m)\) sequential operations.

\subsection{Transformers}
\label{sec:context:transformers}
Transformers~\parencitex{transformers} were originally introduced for \textsc{nmt}.
They are similar to the memory network language model presented above, but introduce several slight modifications to its architecture which make them the current state of the art for most \textsc{nlp} tasks.
For conciseness, we present the concept of transformers as used by \textsc{bert} (Bidirectional Encoder Representations from Transformers, \citex{bert}).
\textsc{bert} is a language model used to extract contextualized embeddings similarly to \textsc{elm}o, but using attention layers in place of \textsc{lstm} layers.

\subsubsection{Transformer Attention}
\label{sec:context:transformer attention}
The attention layers used by transformers are slightly modified.
\marginpar{Note that in contrast to the classical attention mechanism presented in Section~\ref{sec:context:attention}, transformers have \(\mtrx{K}\neq\mtrx{V}\).}
First, it is often desirable that all activations in a neural network approximately follow a standard normal distribution \(\normalDistribution(0, 1)\).
In order to achieve this, transformers use scaled attention:
\begin{equation}
	\operatorname{Attention}(\vctr{q}, \mtrx{K}, \mtrx{V}) = \softmax\left(\frac{\mtrx{K}\vctr{q}}{\sqrt{d}}\right)\mtrx{V}.
\end{equation}
This ensures that if the entries of \(\mtrx{K}\) and \(\vctr{q}\) follow a standard normal distribution, the inputs of the softmax have zero mean and unit variance.
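The scaling factor comes from a simple variance computation: if the entries of \(\vctr{q}\) and of a key \(\vctr{k}\in\symbb{R}^d\) (one row of \(\mtrx{K}\)) are independent with zero mean and unit variance, then
\begin{equation*}
	\operatorname{Var}\Big(\sum_{j=1}^{d} k_j q_j\Big) = \sum_{j=1}^{d} \operatorname{Var}(k_j q_j) = d,
\end{equation*}
so dividing the scores by \(\sqrt{d}\) brings their variance back to 1.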

Second, multi-head attention is used: each layer actually applies \(h=8\) attentions in parallel.
To ensure each individual head captures a different part of the semantics, its input is projected by different matrices, one set for each attention head:
\begin{equation*}
	\operatorname{MultiHeadAttention}(\vctr{q}, \mtrx{K}, \mtrx{V}) =
	\begin{bmatrix}
		\operatorname{head}_1(\vctr{q}, \mtrx{K}, \mtrx{V}) \\
		\operatorname{head}_2(\vctr{q}, \mtrx{K}, \mtrx{V}) \\
		\vdots\\
		\operatorname{head}_h(\vctr{q}, \mtrx{K}, \mtrx{V}) \\
	\end{bmatrix} \mtrx{W}^{(o)}
\end{equation*}
\begin{equation*}
	\operatorname{head}_i(\vctr{q}, \mtrx{K}, \mtrx{V}) = \operatorname{Attention}(\vctr{q}\mtrx{W}_i^{(q)}, \mtrx{K}\mtrx{W}_i^{(k)}, \mtrx{V}\mtrx{W}_i^{(v)}).
\end{equation*}
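To fix ideas about the dimensions (taking \(d'=d\) for simplicity and following the original transformer, where each head works in a subspace of dimension \(d_h = d/h\)), one consistent assignment is:
\begin{equation*}
	\mtrx{W}_i^{(q)}, \mtrx{W}_i^{(k)}, \mtrx{W}_i^{(v)} \in \symbb{R}^{d\times d_h}, \qquad \mtrx{W}^{(o)} \in \symbb{R}^{h d_h\times d},
\end{equation*}
so that concatenating the \(h\) head outputs and projecting by \(\mtrx{W}^{(o)}\) brings the result back to dimension \(d\).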

Lastly, on top of each attention layer is a linear layer with ReLU activation and a linear layer followed by layer normalization~\parencite{layernorm}.
These linear layers are applied identically at every position along the sequence, akin to a convolution with kernel size 1.
While the query of each layer is the output of the preceding layer, similarly to the model of \textcite{memory_networks_end-to-end}, the initial query is now the current word itself \(\vctr{x}_t\).
This architecture is illustrated in Figure~\ref{fig:context:bert}.

\Textcite{bert} introduce two \textsc{bert} architectures dubbed \bertArch{small} and \bertArch{large}.
As their names imply, \bertArch{small} has fewer parameters than \bertArch{large}; in particular, \bertArch{small} is composed of 12 layers while \bertArch{large} is composed of 24 layers.

\begin{marginfigure}
	\centering
	\input{mainmatter/context/bert.tex}
	\scaption[Schema of \textsc{bert}, a transformer masked language model.]{
		Schema of \textsc{bert}, a transformer masked language model.
		The schema is focused on the prediction for a single position \(t\); this is repeated for the whole sentence \(t=1, \dotsc, m\).
		The model presented is the \bertArch{small} variant containing only 12 layers.
		The input vectors \(\vctr{\tilde{x}}_t\) are obtained from the corrupted sentence \(\vctr{\tilde{w}}\) using an embedding layer.
		To obtain \(\hat{w}_t\) from the last \textsc{bert} layer output, a linear layer with softmax over the vocabulary is used.
		\label{fig:context:bert}
	}
\end{marginfigure}

\subsubsection{Masked Language Model}
\label{sec:context:mlm}
While some transformer models such as \textsc{gpt} (Generative Pre-Training, \cite{gpt}) are causal language models, \textsc{bert} is a \emph{masked} language model (\textsc{mlm}).
Instead of following Equation~\ref{eq:context:causal lm}, the following approximation is used:
\begin{equation}
	P(\vctr{w}) \propto \prod_{t\in C} P(w_t \mid \tilde{\vctr{w}})
\end{equation}
where \(C\) is a random set of indices, 15\% of tokens being uniformly selected to be part of \(C\), and \(\tilde{\vctr{w}}\) is a corrupted sequence defined as follows:
\begin{equation*}
	\tilde{w}_t = \left\{\begin{array}{@{}ll@{}}
		w_t & \text{if } t\not\in C \\
		\left.\begin{array}{@{}ll@{}}
			\text{\blanktag{} token} & \text{with probability } 80\% \\
			\text{random token} & \text{with probability } 10\% \\
			w_t & \text{with probability } 10\% \\
		\end{array}\right\} & \text{if } t\in C\\
	\end{array}\right.
\end{equation*}
The masked tokens \blanktag{} make up the majority of the set \(C\) of tokens predicted by the model, thus the name ``masked language model''.
The main advantage of this approach compared to a causal language model is that the probability distribution at a given position is conditioned on the whole sentence, including both the left and right context of a token.
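As a purely illustrative example (the sentence and the sampled indices are ours, and we show the most likely corruption, in which both selected tokens are replaced by \blanktag{}), take:
\begin{equation*}
	\vctr{w} = (\text{the}, \text{cat}, \text{sat}, \text{on}, \text{the}, \text{mat}), \qquad C = \{2, 5\}, \qquad \tilde{\vctr{w}} = (\text{the}, \text{\blanktag{}}, \text{sat}, \text{on}, \text{\blanktag{}}, \text{mat}).
\end{equation*}
The model is then trained to predict \(w_2=\text{cat}\) and \(w_5=\text{the}\), each prediction being conditioned on the whole corrupted sentence \(\tilde{\vctr{w}}\).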

\subsubsection{Transfer Learning}
\label{sec:context:transfer learning}
The main purpose of \textsc{bert} is to be used on a \emph{downstream task}, transferring the knowledge gained from masked language modeling to a different problem.
As with \textsc{elm}o, the hidden states of the topmost layer, just before the final linear and softmax layers, can be used as contextualized word representations.
Furthermore, the embedding of the first token, usually called ``beginning of sentence'' but dubbed \textsc{cls} in \textsc{bert}, can be used as a representation of the whole sentence.%
\sidenote{
	This is by virtue of an additional \emph{next sentence prediction} loss with which \textsc{bert} is trained.
	We do not detail this task here as it is not essential to \textsc{bert}'s training.
	Furthermore, the embedding of the \textsc{cls} token is considered a poor representation of the sentence and is rarely used~\parencite{xlm, xlnet}.}
In contrast with \textsc{elm}o, \textsc{bert} is usually fully fine-tuned on the downstream task.
In the original article~\parencite{bert}, this was shown to outperform previous models on a wide variety of tasks, from question answering to textual entailment.

\bigskip

In this section, we presented several \textsc{nlp} models which allow us to obtain distributed representations of words, of sentences, and of words contextualized in sentences.
These representations can then be used on a downstream task, such as relation extraction, as we do from Chapter~\ref{chap:relation extraction} onward.
We now focus on the other kind of data handled in this thesis: knowledge bases.