PhD

The LaTeX sources of my Ph.D. thesis
git clone https://esimon.eu/repos/PhD.git

word.tex (17602B)


      1 \section{Distributed Representation of Words}
      2 \label{sec:context:word}
      3 Natural language processing (\textsc{nlp}) deals with the automatic manipulation of natural language by algorithms.
       4 Nowadays, a large part of \textsc{nlp} concerns itself with the question of how to obtain good distributed representations from textual inputs.
      5 What constitutes a good representation may vary, but it is usually measured by performance on a task of interest.
       6 Natural language inputs present themselves as tokens or sequences of tokens, usually in the form of words strung together into sentences.
      7 The goal is then to map these sequences of symbolic units to distributed representations.
      8 This section and the next present several methods designed to achieve this goal which have become ubiquitous in \textsc{nlp} research.
      9 We first describe how to obtain good representations of words---or of smaller semantic units in Section~\ref{sec:context:bpe}---before studying how to use these representations to process whole sentences in Section~\ref{sec:context:sentence}.
     10 
      11 Given a vocabulary, that is, a set of words \(V=\{\text{a}, \text{aardvark}, \text{aback}, \dotsc\}\), our goal is to map each word \(w\in V\) to an embedding \(u_w\in\symbb{R}^d\) where \(d\) is a hyperparameter.
     12 An example of an embedding space is given in Figure~\ref{fig:context:word2vec pca}.
     13 \begin{marginparagraph}[-1cm]
     14 	In contrast, a symbolic representation of words would simply map each word to an index \(V\to\{1, \dotsc, |V|\}\).
     15 \end{marginparagraph}
      16 One of the early methods for embedding words in this way is latent semantic analysis (\textsc{lsa}, \citex{lsa}).
     17 Interestingly, \textsc{lsa} was popularized by the information retrieval field under the name latent semantic indexing (\textsc{lsi}).
      18 The basis of \textsc{lsa} is a term--document matrix indicating how many times each word appears in each document.
      19 A naive approach would be to take the rows of this matrix as they are: we would obtain a vector representation of each word whose dimension \(d\) is the number of documents.
     20 The similarity of two words is then evaluated by taking the cosine similarity of the associated vectors; in the simple case described above, this value would be high if the two words often appear together in the same documents and low otherwise.
     21 We can already see that this representation is distributed since each document makes up a small fraction of the representation of the words it contains.
      22 However, this approach is not practical: either \(d\) is too large (when the corpus contains many documents), or the representations obtained tend to be noisy (when the number of documents is relatively small).
     23 So \textsc{lsa} goes one step further and builds a low-rank approximation of this matrix such that \(d\) can be chosen as small as we want.
      24 This idea of modeling word co-occurrences underlies most word embedding techniques.
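To make this concrete, the following sketch (purely illustrative and not taken from an existing implementation; it assumes a precomputed term--document count matrix \texttt{M} with one row per word) builds rank-\(d\) \textsc{lsa} embeddings with a truncated singular value decomposition:
\begin{verbatim}
import numpy as np

def lsa_embeddings(M, d):
    """Rank-d LSA word embeddings from a term-document count matrix M.

    M has one row per word and one column per document; the rows of
    the returned matrix are the d-dimensional word representations.
    """
    # Truncated SVD: keep only the d largest singular values.
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U[:, :d] * s[:d]

def cosine_similarity(u, v):
    """Similarity between two word vectors, as used above to compare words."""
    return (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))
\end{verbatim}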
     25 
     26 \begin{marginfigure}
     27 	\centering
     28 	\renderEmbeddings{mainmatter/context/word2vec embeddings.xml}
     29 	\scaption[Word2vec embeddings \textsc{pca}.]{
     30 		Selected word2vec embeddings of dimension \(d=300\), projected into two dimensions using \textsc{pca} (explained variance ratio \(\explainedvarx + \explainedvary\)).
     31 		The representations encode a strong separation between countries and capitals.
     32 		Furthermore, the relative position of each country with respect to its associated capital is somewhat similar.}
     33 	\label{fig:context:word2vec pca}
     34 \end{marginfigure}
     35 
      36 In this section, we focus on the representation of words, yet most \textsc{nlp} tasks need to process longer chunks of text; this will be the subject of Section~\ref{sec:context:sentence}.
     37 We center our overview of word representations on word2vec in Section~\ref{sec:context:word2vec}.
      38 With the advent of deep learning, word2vec became the most widely used word embedding technique.
     39 Additionally, it introduced negative sampling, a technique that we make use of in Chapter~\ref{chap:fitb}.
      40 Section~\ref{sec:context:language model} introduces the notion of a language model, which is central to several representation extraction techniques in \textsc{nlp}; we also present several alternatives to word2vec used before the transition to the sentence-level approaches of Section~\ref{sec:context:transformers}.
      41 Finally, while the models presented in this section focus on words, smaller semantic units can similarly be used.
     42 This is especially needed for languages in which words have a complex internal structure, but it can also be applied to English.
     43 Section~\ref{sec:context:bpe} will explore alternative levels at which we can apply methods from Sections~\ref{sec:context:word2vec} and~\ref{sec:context:language model}.
     44 
     45 \subsection{Word2vec}
     46 \label{sec:context:word2vec}
     47 Word2vec~\parencite{word2vec, word2vec_follow-up}\sidecite{word2vec_follow-up} is one of the first \textsc{nlp} models widely used for the representations it produces.
     48 As its name implies, word2vec outputs word representations; however, its general framework can be used on other kinds of tokens.
     49 Word2vec relies strongly on the distributional hypothesis: its goal is to model the context of a word to produce a representation of the word itself, a technique which was pioneered by \textcitex{nplm}.
     50 Several variants of the word2vec model exist, but for the sake of conciseness, this section focuses on the skip-gram with negative sampling (\textsc{sgns}) approach.
     51 
     52 \subsubsection{Skip-gram}
     53 \label{sec:context:skip-gram}
     54 Given a word, the idea behind skip-gram is to model its context.%
     55 \sidenote{The context of a word \(w\) is defined as all words appearing in a fixed-size window around \(w\) in the text. In the case of word2vec, this window is of size five in both directions.}
      56 The probability that a word \(c\in V\) appears in the context of a word \(w\in V\) is modeled by the following softmax:
     57 \begin{marginparagraph}
     58 	Here, we omit the conditioning on the parameters.
     59 	More formally, \(P(c\mid w)\) should be written \(P(c\mid w; \mtrx{U}, \mtrx{U}')\).
     60 \end{marginparagraph}
     61 \begin{equation}
     62 	P(c\mid w) = \frac{\exp(\vctr{u}_w\transpose \vctr{u}'_c)}{\sum_{c'\in V} \exp(\vctr{u}_w\transpose \vctr{u}'_{c'})}
     63 	\label{eq:context:word2vec softmax}
     64 \end{equation}
     65 where \(V\) is the vocabulary, and \(\mtrx{U},\mtrx{U}'\in\symbb{R}^{V\times d}\) are the model parameters assigning a vector representation to all words in the vocabulary.
      66 The rows \(\vctr{u}_w\) and \(\vctr{u}'_w\) of these matrices are what is of interest when word2vec is used for transfer learning.
     67 Once the model has been trained, \(\vctr{u}_w\) can be used as a distributed representation for \(w\), capturing its associated semantics.
     68 See Figure~\ref{fig:context:word2vec pca} for an example of extracted vectors.
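To make Equation~\ref{eq:context:word2vec softmax} concrete, here is a minimal sketch (randomly initialized parameters and illustrative variable names; this is not the actual word2vec implementation):
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)
V, d = 10000, 300                           # vocabulary size and dimension
U = rng.normal(scale=0.1, size=(V, d))      # word embeddings u_w
U_ctx = rng.normal(scale=0.1, size=(V, d))  # context embeddings u'_c

def p_context_given_word(w):
    """P(. | w): softmax over the whole vocabulary for a word index w."""
    scores = U_ctx @ U[w]      # u_w^T u'_c for every candidate context c
    scores -= scores.max()     # subtract the max for numerical stability
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()
\end{verbatim}
Note that the denominator requires a sum over the whole vocabulary; this is precisely the cost that the training methods described next avoid.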
     69 
     70 \subsubsection{Noise Contrastive Estimation}
     71 \label{sec:context:nce}
     72 Evaluating Equation~\ref{eq:context:word2vec softmax} is quite expensive since the normalization term involves all the words in the vocabulary.
     73 Noise Contrastive Estimation (\textsc{nce}, \citex{nce}) is a training method that removes the need to compute the partition function of probabilistic models explicitly.
     74 To achieve this, \textsc{nce} reframes the model as a binary classification problem by modeling the probability that a data point---in word2vec's case a word-context pair---comes from the observed dataset \(P(\rndm{D}=1\mid w,c)\).
     75 \begin{marginparagraph}
     76 	We use \(\empP\) to refer to empirical distributions, whereas \(P\) denotes a modeled probability.
      77 	For example, \(\empP(c\mid w)\) is the actual frequency of the word \(c\in V\) in the context of \(w\in V\),
      78 	while \(P(c\mid w)\) is the probability word2vec assigns to a given pair \((c, w)\in V^2\).
     79 \end{marginparagraph}
      80 This probability is contrasted with \(k\) samples from a noise distribution following the unigram distribution \(\empP(\rndm{W})\), that is, the empirical word frequency.%
      81 \sidenote{Word2vec actually scales this distribution and uses various other tricks to lessen the effect of frequent words; refer to \textcite{word2vec_follow-up} for details.}
      82 This translates to \(P(c\mid \rndm{D}=1, w) = \empP(c\mid w)\) and \(P(c\mid \rndm{D}=0, w) = \empP(\rndm{W}=c)\).
     83 Using the prior \(P(\rndm{D}=0)=\frac{k}{k+1}\), the posterior can be expressed as:
     84 \begin{equation}
     85 	P(\rndm{D}=1\mid w, c) = \frac{\empP(c\mid w)}{\empP(c\mid w) + k \empP(c)}.
     86 	\label{eq:context:word2vec nce posterior}
     87 \end{equation}
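This expression follows from Bayes' rule applied to the two conditionals above with the prior \(P(\rndm{D}=1)=\frac{1}{k+1}\):
\[
P(\rndm{D}=1\mid w, c)
= \frac{\frac{1}{k+1}\empP(c\mid w)}{\frac{1}{k+1}\empP(c\mid w) + \frac{k}{k+1}\empP(c)}
= \frac{\empP(c\mid w)}{\empP(c\mid w) + k\empP(c)}.
\]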
     88 
     89 Restating Equation~\ref{eq:context:word2vec softmax} as \(P(c\mid w) = \exp(\vctr{u}_w\transpose\vctr{u}'_c) \times \gamma_w\) and treating \(\gamma_w\) as another model parameter, \textsc{nce} allows us to train \(\mtrx{U}\) and \(\mtrx{U}'\) without computing the denominator of Equation~\ref{eq:context:word2vec softmax}.
     90 Furthermore, estimating \(\gamma_w\) is not even necessary, since \textcite{nplm_nce} showed that using \(\gamma_w=1\) for all \(w\) works well in practice.
      91 The final objective maximized by \textsc{nce} is the log-likelihood of the classification data:
     92 \begin{equation}
     93 	\mathop{J_\textsc{nce}}(w,c) = \log P(\rndm{D}=1\mid w,c) + \sum_{i=1}^k \expectation_{c'_i\sim P(\rndm{W})} \left[ \log P(\rndm{D}=0\mid w, c'_i) \right].
     94 	\label{eq:context:word2vec nce objective}
     95 \end{equation}
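In practice, the restated model distribution \(P(c\mid w) = \exp(\vctr{u}_w\transpose\vctr{u}'_c) \times \gamma_w\) is substituted for \(\empP(c\mid w)\) in the posterior of Equation~\ref{eq:context:word2vec nce posterior}, which then takes a logistic form:
\[
P(\rndm{D}=1\mid w, c) = \frac{P(c\mid w)}{P(c\mid w) + k\empP(c)} = \sigmoid\big(\vctr{u}_w\transpose\vctr{u}'_c + \log\gamma_w - \log(k\empP(c))\big).
\]
With \(\gamma_w=1\), dropping the \(\log(k\empP(c))\) correction recovers the \(\sigmoid(\vctr{u}_w\transpose\vctr{u}'_c)\) terms of the negative sampling objective introduced below.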
     96 
     97 \Textcite{nce} showed that optimizing \(J_\textsc{nce}\) is equivalent to maximizing the log-likelihood using Equation~\ref{eq:context:word2vec softmax} under some reasonable assumptions.
     98 
     99 \subsubsection{Negative Sampling}
    100 \label{sec:context:negative sampling}
    101 However, \textsc{sgns} uses a different approximation of Equation~\ref{eq:context:word2vec softmax} called negative sampling.
     102 The difference is mainly visible in the expression of the objective, which simplifies to:
    103 \begin{equation}
    104 	\mathop{J_\textsc{neg}}(w,c) = \log \sigmoid(\vctr{u}_w\transpose \vctr{u}'_c) + \sum_{i=1}^k \expectation_{c'_i\sim P(\rndm{W})} \left[ \log \sigmoid(-\vctr{u}_w\transpose \vctr{u}'_{c'_i}) \right].
    105 	\label{eq:context:word2vec neg objective}
    106 \end{equation}
     107 This can be shown to be similar to \textsc{nce}, where the posterior of Equation~\ref{eq:context:word2vec nce posterior} is replaced by the following:
    108 \begin{equation}
    109 	P(\rndm{D}=1\mid w, c) = \frac{\empP(c\mid w)}{\empP(c\mid w) + 1}.
    110 	\label{eq:context:word2vec neg posterior}
    111 \end{equation}
    112 
    113 Optimizing the objective of Equation~\ref{eq:context:word2vec neg objective} is not equivalent to maximizing the log-likelihood of the language model.
    114 But even though this is not an approximation of the softmax of Equation~\ref{eq:context:word2vec softmax}, this method has proven to be quite effective at producing good word representations.
    115 \Textcitex{word2vec_pmi} explain the effectiveness of word2vec by showing that \textsc{sgns} can be interpreted as factoring the pointwise mutual information (\textsc{pmi}) matrix between words and contexts.
     116 Relatedly, GloVe~\parencite{glove} produces word embeddings by directly factorizing a matrix of word co-occurrence statistics.
    117 
     118 The negative sampling algorithm is one of the main contributions of word2vec; it can be used outside \textsc{nlp} to optimize softmaxes over large domains.
    119 In particular, we make use of negative sampling to approximate a softmax over a large number of entities in Chapter~\ref{chap:fitb}.
     120 Furthermore, even though it was initially introduced for words, the algorithm can be used on other kinds of tokens, as we will see in Section~\ref{sec:context:bpe}.
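As an illustration, the objective of Equation~\ref{eq:context:word2vec neg objective} can be estimated for a single word--context pair as follows (a minimal sketch with illustrative variable names; the expectation is approximated by \(k\) Monte Carlo samples):
\begin{verbatim}
import numpy as np

def sgns_objective(U, U_ctx, w, c, noise_dist, k, rng):
    """Monte Carlo estimate of J_neg(w, c) for one (word, context) pair.

    U, U_ctx: |V| x d embedding matrices; w, c: word and context indices;
    noise_dist: noise (unigram) probabilities over the vocabulary.
    """
    sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
    positive = np.log(sigmoid(U[w] @ U_ctx[c]))
    # Draw k negative samples from the noise distribution.
    negatives = rng.choice(len(noise_dist), size=k, p=noise_dist)
    negative = np.sum(np.log(sigmoid(-U_ctx[negatives] @ U[w])))
    return positive + negative  # maximized with respect to U and U_ctx
\end{verbatim}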
    121 
    122 \subsection{Language Modeling for Word Representation}
    123 \label{sec:context:language model}
     124 Word2vec is part of a large class of algorithms that seek to learn word representations from raw text.
    125 More precisely, to obtain distributed representations of natural language inputs, most modern approaches rely on language models.
    126 A language model specifies a probability distribution over sequences of tokens \(P(w_1, \dotsc, w_m)\).
    127 The tokens \(\vctr{w}\) are usually words, but as we see in Section~\ref{sec:context:bpe}, they need not be.
    128 This distribution is often decomposed into a product of conditional distributions on tokens.
    129 The most common approach is the so-called \emph{causal} language model, which uses the following decomposition:
    130 \begin{equation}
    131 	P(w_1, \dotsc, w_m) = \prod_{t=1}^m P(w_t\mid w_1, \dotsc, w_{t-1}).
    132 	\label{eq:context:causal lm}
    133 \end{equation}
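For example, for a sentence of three words such as ``the cat sat,'' this decomposition reads:
\[
P(\text{the}, \text{cat}, \text{sat}) = P(\text{the}) \cdot P(\text{cat}\mid\text{the}) \cdot P(\text{sat}\mid\text{the}, \text{cat}).
\]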
     134 Modeling the tokens one by one not only enables the model to factorize the handling of local information, but also makes it easy to sample from the model to generate new utterances.
     135 However, most language models do not use an exact decomposition but either approximate \(P(\vctr{w})\) directly or use the decomposition of Equation~\ref{eq:context:causal lm} together with an approximation of the conditionals \(P(w_t\mid w_1, \dotsc, w_{t-1})\).
     136 This is for example the case of word2vec, which conditions each word on its close neighbors instead of on the whole sentence.
    137 
     138 The use of language models is motivated by transfer learning, the idea that by solving one problem, we can gain knowledge about a different but related problem.
     139 To assign a probability to a sequence, language models extract intermediate latent factors, which have been shown to capture the semantic information contained in the sequence.
     140 Using these latent factors as distributed representations for natural language inputs has improved performance on most \textsc{nlp} tasks.
     141 The effectiveness of language models can be justified by the externalist approach and the distributional hypothesis presented in Section~\ref{sec:context:history}: a word is defined by the distribution of the other words with which it co-occurs.
    142 
    143 Since language models process sequences of words, we will delve into the details of these approaches in Section~\ref{sec:context:sentence}.
     144 Apart from the neural probabilistic language model of \textcite{nplm}, a precursor to word embedding techniques was the \textsc{cnn}-based approach of \textcite{unified_nlp}; both learn distributed word representations by approximating \(P(\vctr{w})\) using a context window somewhat similar to that of word2vec.
    145 
    146 All of these methods learn \emph{static} word embeddings, meaning that the vector assigned to a word such as ``bank'' is the same regardless of the context in which the word appears.
    147 In the last few years, \emph{contextualized} word embeddings have grown in popularity; in these approaches, the word ``bank'' is assigned different embeddings in the phrases ``robbing a bank'' and ``bank of a river.''
    148 These methods were first based on recurrent neural networks (Section~\ref{sec:context:rnn}) such as \textsc{elm}o but are now primarily based on transformers (Section~\ref{sec:context:transformers}).
     149 Among contextualized word embeddings built using transformers, some are based on the causal decomposition of Equation~\ref{eq:context:causal lm} (e.g.~\textsc{gpt}) while others are based on masked language models (e.g.~\textsc{bert}), a different approximation of \(P(\vctr{w})\) introduced in Section~\ref{sec:context:mlm}.
    150 
    151 \subsection{Subword Tokens}
    152 \label{sec:context:bpe}
    153 We defined word2vec and language models for a vocabulary \(V\) composed of words.
    154 This may seem natural in the case of English and other somewhat analytic languages,%
    155 \sidenote[][-21mm]{
    156 	An analytic language is a language with a low ratio of morphemes to words.
    157 	This is in contrast to synthetic languages, where words have a complex inner structure.
    158 	Take for example the Nahuatl word ``Nimitztētlamaquiltīz'' (I-you-someone-something-give-\textsc{causative}-\textsc{future}) meaning ``I shall make somebody give something to you'' \parencite{nahuatl}.
    159 	For this kind of language, word-level approaches fail.
    160 	Older models preprocessed the text with a morphological segmentation algorithm, while modern approaches directly work on subword units.
    161 }
    162 but it cannot directly be applied to all languages.
    163 Furthermore, language models that work at the word level tend to have difficulties working with rare words.
    164 A first solution to this problem is to use character-level models, but these tend to have a hard time dealing with the resulting long sequences. 
    165 
     166 Modern approaches work neither at the word level nor at the character level; instead, they use an intermediate subword vocabulary.
    167 The standard method to build this vocabulary nowadays is to use the byte pair encoding algorithm (\textsc{bpe}, \cite{bpe}).
     168 \textsc{bpe}, listed as Algorithm~\ref{alg:context:bpe}, consists in iteratively replacing the most common bigram \(c_1c_2\) in a corpus with a new token \(c_\text{new}\).
     169 This new token can then itself appear in the most common bigram, say \(c_\text{new}c_3\); this pair is in turn replaced with a new token \(c'_\text{new}\), which represents a trigram in the original alphabet: \(c_1c_2c_3\).
    170 This is repeated until the desired vocabulary size is reached.
     171 In this way, \textsc{bpe} extracts tokens close to morphemes, the smallest linguistic units carrying meaning.
    172 As an example, by using this algorithm, the word ``pretrained'' can be split into three parts: ``pre-,'' ``-train-'' and ``-ed.''
    173 
    174 \begin{marginalgorithm}
    175 	\input{mainmatter/context/bpe.tex}
    176 	\scaption{The byte pair encoding algorithm.}
    177 	\label{alg:context:bpe}
    178 \end{marginalgorithm}
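As a complement to Algorithm~\ref{alg:context:bpe}, the merge loop can be sketched as follows (simplified: the corpus is a flat list of characters and word boundaries are ignored, unlike in practical implementations):
\begin{verbatim}
from collections import Counter

def bpe(corpus, target_vocab_size):
    """Learn BPE merges; corpus is a list of symbols (initially characters)."""
    vocab = set(corpus)
    merges = []
    while len(vocab) < target_vocab_size:
        # Count all adjacent pairs (bigrams) of symbols in the corpus.
        pairs = Counter(zip(corpus, corpus[1:]))
        if not pairs:
            break
        (c1, c2), _ = pairs.most_common(1)[0]
        new_token = c1 + c2
        merges.append((c1, c2))
        vocab.add(new_token)
        # Replace every occurrence of the most frequent pair by the new token.
        merged, i = [], 0
        while i < len(corpus):
            if i + 1 < len(corpus) and corpus[i] == c1 and corpus[i + 1] == c2:
                merged.append(new_token)
                i += 2
            else:
                merged.append(corpus[i])
                i += 1
        corpus = merged
    return vocab, merges
\end{verbatim}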
    179 
     180 Word2vec can be applied both to words and to subwords extracted by \textsc{bpe} or other algorithms.
     181 This is the case of fastText~\parencite{fasttext}, which applies the word2vec algorithm to fixed-size subwords (character \(n\)-grams).
    182 All the models discussed in this section and the next have very loose requirements on the vocabulary \(V\).
     183 However, they might work best using a smaller \(V\); this is especially the case for transformers, the current state-of-the-art approach introduced in Section~\ref{sec:context:transformers}.