PhD

The LaTeX sources of my Ph.D. thesis
git clone https://esimon.eu/repos/PhD.git
unsupervised.tex


      1 \section{Unsupervised Extraction Models}
      2 \label{sec:relation extraction:unsupervised}
      3 \begin{epigraph}
      4 		{Yann LeCun}
      5 		{Inaugural Lecture at Collège de France}
      6 		{2016}
      7 	% This translation comes from a Facebook post made by Yann LeCun himself, but the Collège de France reference should be easier to find.
      8 	If intelligence was a cake, unsupervised learning would be the cake, supervised learning would be the icing on the cake, and reinforcement learning would be the cherry on the cake.
      9 \end{epigraph}
     10 In the unsupervised setting, no samples are labeled with a relation, i.e.~all samples are triplets (sentence, head entity, tail entity) from \(\dataSet\subseteq\sentenceSet\times\entitySet^2\).
     11 Furthermore, no information about the relation set \(\relationSet\) is available.
This is problematic since it is not well defined whether a specific semantic link deserves to appear in \(\relationSet\).
     13 Having so little information about what constitutes a relation makes the problem intractable if we do not impose some restrictions upon \(\relationSet\).
None of the unsupervised models presented in this section is universal: they all make some kind of assumption about the structure of the data or about its underlying knowledge base.
     15 However, developing unsupervised relation extraction models is still interesting for three reasons: they
     16 \begin{itemize*}
     17 	\item[(1)] do not necessitate labeled data except for validating the models;
     18 	\item[(2)] can uncover new relation types; and
     19 	\item[(3)] can be trained from large unlabeled datasets and then fine-tuned for specific relations.
     20 \end{itemize*}
     21 
For all models, we list the important modeling hypotheses, such as \hypothesis{1-adjacency} and \hypothesis{pullback} introduced previously.
     23 Appendix~\ref{chap:assumptions} contains a list of assumptions with some counterexamples and references to the sections where they were introduced.
We strongly encourage the reader to refer to it, especially when the implications of a modeling hypothesis are not immediately clear.
     25 
     26 \subsection{Evaluation}
     27 \label{sec:relation extraction:unsupervised evaluation}
The outputs of unsupervised models vary widely.
The main approaches fall into two categories:
     30 \begin{description}
     31 	\item[Clustering] A first approach is to cluster the samples such that all samples in the same cluster convey the same relation and samples in different clusters convey different relations.
     32 	\item[Similarity Space] A second approach is to associate each sample with an element of a vector space equipped with a similarity function.
     33 		If two samples are similar in this vector space, they convey similar relations.
     34 		This can be seen as a soft version of the clustering approach.
     35 \end{description}
     36 
     37 This distinction has an impact on how we evaluate the models.
     38 In the first case, standard clustering metrics are used.
     39 We introduce \bcubed{} \parencite{bcubed}, V-measure \parencite{v-measure} and \textsc{ari} \parencite{ari} in Section~\ref{sec:relation extraction:clustering}.
They are the most prevalent metrics in cluster evaluation; \bcubed{} in particular is widely used in unsupervised relation extraction.
     41 In the second case, a few-shot evaluation can be used \parencite{fewrel}.
     42 We introduce this approach in Section~\ref{sec:relation extraction:few-shot}.
     43 
     44 A difficulty of evaluating unlabeled clusters is that we do not know which cluster should be compared to which relation.
     45 A possible solution to this problem is to use a small number of labeled samples, which can be used to constrain the output of a model to fall into a specific relation set \(\relationSet\).
     46 This setup is actually similar to semi-supervised approaches such as label propagation (Section~\ref{sec:relation extraction:label propagation}), except that the model must be trained in an unsupervised fashion before being fine-tuned on the supervised dataset.
Similar to the label propagation model evaluation, unsupervised models evaluated by fine-tuning on a supervised dataset usually report performance while varying the number of training labels.
     48 These performances are measured using the standard supervised metrics introduced in Section~\ref{sec:relation extraction:supervised evaluation}.
Evaluating performance as a pre-training method is possible for all unsupervised models, in particular similarity-space-based approaches.
     50 
     51 \subsubsection{Clustering Metrics}
     52 \label{sec:relation extraction:clustering}
     53 In this section, we describe three metrics used to evaluate clustering approaches.
The first metric, \bcubed{}, was introduced to unsupervised relation extraction by rel-\textsc{lda} (\cite{rellda}, Section~\ref{sec:relation extraction:rellda}), while the other two were proposed as complements by \textcite{fitb}, presented in Chapter~\ref{chap:fitb}.
     55 
     56 \begin{epigraph}
     57 		{Valve}
     58 		{``Portal''}
     59 		{2007}
     60 		The cake is a lie.
     61 \end{epigraph}
     62 To clearly describe these different clustering metrics, we propose a common probabilistic formulation---in practice, these probabilities are estimated on the validation and test sets---and use the following notations.
     63 Let \(\rndm{X}\) and \(\rndm{Y}\) be random variables corresponding to samples in the dataset.
     64 Following Section~\ref{sec:relation extraction:supervised evaluation}, we denote by \(c(\rndm{X})\) the predicted cluster of \(\rndm{X}\) and \(g(\rndm{X})\) its conveyed gold relation.%
     65 \sidenote{
     66 	This implies that a labeled dataset is sadly necessary to evaluate an unsupervised clustering model.
     67 }
     68 
     69 \paragraph{B\texorpdfstring{\textsuperscript{3}}{³}}
     70 The metric most commonly computed for unsupervised model evaluation is a generalization of \fone{} for clustering tasks called \bcubed{} \parencitex{bcubed}.
     71 The \bcubed{} precision and recall are defined as follows:
     72 \begin{align*}
     73 	\bcubed \operatorname{precision}(g, c) & = \expectation_{\rndm{X},\rndm{Y}\sim\uniformDistribution(\dataSet_\relationSet)} P\left(g(\rndm{X})=g(\rndm{Y}) \mid c(\rndm{X})=c(\rndm{Y})\right) \\
	\bcubed \operatorname{recall}(g, c)    & = \expectation_{\rndm{X},\rndm{Y}\sim\uniformDistribution(\dataSet_\relationSet)} P\left(c(\rndm{X})=c(\rndm{Y}) \mid g(\rndm{X})=g(\rndm{Y})\right)
\end{align*}
     76 As precision and recall can be trivially maximized by putting each sample in its own cluster or by clustering all samples into a single class, the main metric \bcubed{} \fone{} is defined as the harmonic mean of precision and recall:
     77 \begin{equation*}
     78 	\bcubed \fone{}(g, c) = \frac{2}{\bcubed{} \operatorname{precision}(g, c)^{-1} + \bcubed{} \operatorname{recall}(g, c)^{-1}}
     79 \end{equation*}
     80 
     81 While the usual precision (Section~\ref{sec:relation extraction:supervised evaluation}) can be seen as the probability that a sample with a given prediction is correct, the \bcubed{} precision cannot use the correct relation as a reference to determine the correctness of a prediction.
     82 Instead, whether an assignment is correct is computed as the expectation that a sample is accurately classified relatively to all other samples grouped in the same cluster.
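These definitions can be estimated directly on a labeled evaluation set. The sketch below (the function name and list-based sample representation are my own) computes \bcubed{} precision, recall, and \fone{}, treating the expectations as averages over all pairs of samples, including a sample paired with itself.

```python
def bcubed(gold, pred):
    """B-cubed precision, recall and F1.

    gold: gold relation label of each sample.
    pred: predicted cluster id of each sample.
    """
    n = len(gold)
    precision = recall = 0.0
    for i in range(n):
        # Samples grouped with i by the clustering, and samples
        # conveying the same gold relation as i (both include i itself).
        same_cluster = [j for j in range(n) if pred[j] == pred[i]]
        same_gold = [j for j in range(n) if gold[j] == gold[i]]
        correct_in_cluster = sum(1 for j in same_cluster if gold[j] == gold[i])
        correct_in_gold = sum(1 for j in same_gold if pred[j] == pred[i])
        precision += correct_in_cluster / len(same_cluster)
        recall += correct_in_gold / len(same_gold)
    precision /= n
    recall /= n
    f1 = 2 / (1 / precision + 1 / recall)
    return precision, recall, f1
```

As a sanity check, clustering all samples into a single class yields a perfect recall while the precision collapses toward the relative size of each relation.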
     83 
     84 \paragraph{V-measure}
     85 Another metric is the entropy-based V-measure \parencitex{v-measure}.
     86 This metric is defined by homogeneity and completeness, which are akin to \bcubed{} precision and recall but rely on conditional entropy.
     87 For a cluster to be homogeneous, we want most of its elements to convey the same gold relation.
     88 In other words, the distribution of gold relations inside a cluster must have low entropy.
     89 This entropy is normalized by the unconditioned entropy of the gold relations to ensure that it does not depend on the size of the dataset:
     90 \begin{equation*}
	\operatorname{homogeneity}(g, c) = 1 - \frac{\entropy\left(g(\rndm{X})\mid c(\rndm{X})\right)}{\entropy\left(g(\rndm{X})\right)}.
     92 \end{equation*}
     93 Similarly, for a cluster to be complete, we want all the elements conveying the same gold relation to be captured by this cluster.
     94 In other words, the distribution of clusters inside a gold relation must have low entropy:
     95 \begin{equation*}
	\operatorname{completeness}(g, c) = 1 - \frac{\entropy\left(c(\rndm{X})\mid g(\rndm{X})\right)}{\entropy\left(c(\rndm{X})\right)}.
     97 \end{equation*}
As with \bcubed{}, the V-measure is summarized by the \fone{} value:
     99 \begin{equation*}
    100 	\operatorname{V-measure}(g, c) = \frac{2}{\operatorname{homogeneity}(g, c)^{-1} + \operatorname{completeness}(g, c)^{-1}}.
    101 \end{equation*}
    102 \begin{marginfigure}
    103 	\centering
    104 	\input{mainmatter/relation extraction/clustering metrics.tex}
    105 	\scaption[Comparison of \bcubed{} and V-measure.]{
    106 		Comparison of \bcubed{} and V-measure.
    107 		Samples conveying three different relations indicated by shape and color are clustered into three boxes.
    108 		The two rows represent two different clusterings, \bcubed{} favors the first one while V-measure favors the second.
    109 		V-measure prefers the second clustering since the blue star cluster is kept pure; on the other hand, the green circle cluster is impure no matter what, so its purity is not taken as much into account by the V-measure compared to \bcubed{}.
    110 		\label{fig:relation extraction:clustering metrics}
    111 	}
    112 \end{marginfigure}
    113 Compared to \bcubed{}, the V-measure penalizes small impurities in a relatively ``pure'' cluster more harshly than in less pure ones.
    114 Symmetrically, it penalizes a degradation of a well-clustered relation more than of a less-well-clustered one.
    115 This difference is illustrated in Figure~\ref{fig:relation extraction:clustering metrics}.
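The conditional entropies can likewise be estimated from label counts. Below is a minimal sketch (helper names are my own) following the published definition of the V-measure: homogeneity normalizes \(\entropy(g\mid c)\) by \(\entropy(g)\), completeness normalizes \(\entropy(c\mid g)\) by \(\entropy(c)\), and degenerate zero-entropy cases are conventionally set to 1.

```python
import math
from collections import Counter

def entropy(labels):
    """Empirical entropy of a list of labels."""
    n = len(labels)
    return -sum(c / n * math.log(c / n) for c in Counter(labels).values())

def conditional_entropy(a, b):
    """H(a | b): entropy of a inside each b-group, weighted by group size."""
    n = len(a)
    h = 0.0
    for group in set(b):
        members = [a[i] for i in range(n) if b[i] == group]
        h += len(members) / n * entropy(members)
    return h

def v_measure(gold, pred):
    h_gold, h_pred = entropy(gold), entropy(pred)
    homogeneity = 1 - conditional_entropy(gold, pred) / h_gold if h_gold else 1.0
    completeness = 1 - conditional_entropy(pred, gold) / h_pred if h_pred else 1.0
    if homogeneity + completeness == 0:
        return 0.0
    # Harmonic mean of homogeneity and completeness.
    return 2 * homogeneity * completeness / (homogeneity + completeness)
```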
    116 
    117 \paragraph{Adjusted Rand Index}
The Rand index (\textsc{ri}, \cite{ri}) is the last clustering metric we consider; it is defined as the probability that cluster and gold assignments are compatible:
    119 \begin{equation*}
    120 	\operatorname{\textsc{ri}}(g, c) = \expectation\limits_{\rndm{X},\rndm{Y}} \left[ P\left(
    121 			    c(\rndm{X})=c(\rndm{Y}) \Leftrightarrow g(\rndm{X})=g(\rndm{Y}) 
    122 	\right) \right]
    123 \end{equation*}
    124 In other words, given two samples, the \textsc{ri} is improved when both samples are in the same cluster and convey the same gold relation or when both samples are in different clusters and convey different relations; otherwise, the \textsc{ri} deteriorates.
    125 The adjusted Rand index (\textsc{ari}, \citex{ari}) is a normalization of the Rand index such that a random assignment has an \textsc{ari} of 0, and the maximum is 1:
    126 \begin{equation*}
    127 	\operatorname{\textsc{ari}}(g, c) =
    128 		\frac{\displaystyle\operatorname{\textsc{ri}}(g, c) - \expectation_{c\sim\uniformDistribution(\relationSet^\dataSet)}[\operatorname{\textsc{ri}}(g, c)]}
    129 		{\displaystyle\max_{c\in\relationSet^\dataSet} \operatorname{\textsc{ri}}(g, c) - \expectation_{c\sim\uniformDistribution(\relationSet^\dataSet)}[\operatorname{\textsc{ri}}(g, c)]}
    130 \end{equation*}
    131 In practice, the \textsc{ari} can be computed from the elements of the confusion matrix.
    132 Compared to the previous metrics, \textsc{ari} will be less sensitive to a discrepancy between precision--homogeneity and recall--completeness since it is not a harmonic mean of both.
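Concretely, the pair counts needed for the \textsc{ari} can be read off the confusion matrix between clusters and gold relations. A minimal sketch of the usual closed form (function name mine):

```python
from collections import Counter
from math import comb

def adjusted_rand_index(gold, pred):
    """ARI computed from the pair counts of the confusion matrix."""
    n = len(gold)
    cells = Counter(zip(gold, pred))   # confusion matrix entries
    rows = Counter(gold)               # gold relation sizes
    cols = Counter(pred)               # cluster sizes
    index = sum(comb(v, 2) for v in cells.values())
    sum_rows = sum(comb(v, 2) for v in rows.values())
    sum_cols = sum(comb(v, 2) for v in cols.values())
    expected = sum_rows * sum_cols / comb(n, 2)
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:  # degenerate case, e.g. everything identical
        return 1.0
    return (index - expected) / (max_index - expected)
```

A clustering identical to the gold partition (up to cluster renaming) scores 1, while anti-correlated assignments go below 0.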
    133 
    134 \subsubsection{Few-shot}
    135 \label{sec:relation extraction:few-shot}
    136 \begin{marginparagraph}
This section only presents few-shot evaluation.
    138 	It is possible---and quite common---to train a model using a few-shot objective, usually as a fine-tuning phase before a few-shot evaluation.
    139 	Since we are mostly interested in unsupervised approaches, we do not delve into few-shot training.
    140 	See \textcite{fewrel} for details.
    141 \end{marginparagraph}
    142 Clustering metrics are problematic since producing a clustering with no a priori knowledge on the relation schema \(\relationSet\) leads to unsolvable problems:
    143 \begin{itemize}
    144 	\item Should the relation \textsl{sibling} be cut into \textsl{brother} and \textsl{sister}?
    145 	\item Is the relation between a country and its capital the same as the one between a county and its seat?
    146 	\item Is the ear \textsl{part of} the head in the same fashion that the star Altair is \textsl{part of} the Aquila constellation?
    147 \end{itemize}
    148 All of these questions can be answered differently depending on the design of the underlying knowledge base.
    149 However, unsupervised clustering algorithms do not depend on \(\relationSet\).
    150 They must decide whether ``Phaedra is the sister of Ariadne'' and ``Castor is the brother of Pollux'' go inside the same cluster independently of these design choices.
    151 
    152 Fine-tuning on a supervised dataset solves this problem but adds another.
    153 The evaluation no longer assesses the proficiency of a model to learn from unlabeled data alone; it also evaluates its ability to adapt to labeled samples.
Furthermore, the smaller the labeled dataset, the higher the variance of the results.
On the other hand, the larger the labeled dataset, the less the experiment evaluates the unsupervised phase.
    156 
A few-shot evaluation can be used to address these caveats.
    158 Instead of evaluating a clustering of the samples, few-shot experiments evaluate a similarity function between samples: \(\operatorname{sim}\colon \dataSet\times\dataSet\to\symbb{R}\).
    159 Given a query sample \(x^{(q)}\) and a set of candidates \(\vctr{x}^{(c)}=\{x_i^{(c)}\mid i=1,\dotsc,C\}\), %
    160 \begin{marginparagraph}
    161 	\(C\) is the number of candidates, in Table~\ref{tab:relation extraction:few-shot problem} we have \(C=5\).
    162 \end{marginparagraph}
    163 the model is evaluated on whether it is able to find the candidate conveying the same relation as the query.
    164 This is simply reported as an accuracy by comparing \(\argmax_{x\in\vctr{x}^{(c)}} \operatorname{sim}(x^{(q)}, x)\) with the correct candidate.
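Assuming episodes stored as \((\text{query}, \text{candidates}, \text{gold index})\) tuples (a representation chosen here purely for illustration), this accuracy can be sketched as:

```python
def few_shot_accuracy(episodes, sim):
    """Fraction of episodes where the candidate most similar to the
    query is the one conveying the same relation.

    episodes: iterable of (query, candidates, gold_index) tuples.
    sim: similarity function over pairs of samples.
    """
    correct = 0
    for query, candidates, gold_index in episodes:
        scores = [sim(query, candidate) for candidate in candidates]
        # argmax over candidates, compared with the correct one.
        if scores.index(max(scores)) == gold_index:
            correct += 1
    return correct / len(episodes)
```

Any similarity function over sample representations (e.g.\ a dot product between vectors) can be plugged in as \texttt{sim}.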
    165 
    166 \begin{table}[ht!]
    167 	\centering
    168 	\input{mainmatter/relation extraction/few-shot problem.tex}
    169 	\sidecaption[Few-shot problem.]{
    170 		Few-shot problem.
    171 		For ease of reading, the entity identifiers---such as \wdent{450036} for ``Hörsel''---are not given.
    172 		Both the query and the third candidate convey the relation \wdrel{206} \textsl{located in or next to body of water}.
    173 		\label{tab:relation extraction:few-shot problem}
    174 	}
    175 \end{table}
    176 
    177 Table~\ref{tab:relation extraction:few-shot problem} gives an example of a few-shot problem.
    178 It illustrates the five-way one-shot problem, meaning that we must choose a relation among five and that each of the five relations is represented by a single sample.
    179 Another popular variant is the ten-way five-shot problem: the candidates are split into ten bags of five samples each, all samples in a bag convey the same relation, and the goal is to predict the bag in which the query belongs.
    180 \begin{marginparagraph}
    181 	Quite confusingly, they can also be referred to as ``meta-train'' and ``meta-test.''
Indeed, to follow the usual semantics of the ``meta-'' prefix, the ``meta-sets'' should refer to sets of \((\text{query}, \text{candidates})\) tuples, not the candidates themselves.
    183 \end{marginparagraph}
    184 Candidates are sometimes referred to as ``train set'' and the query as ``test set'' since this can be seen as an extremely small dataset with five training samples and one test sample.
    185 
    186 FewRel, described in Section~\ref{sec:datasets:fewrel}, is the standard few-shot dataset.
    187 In FewRel, Altair is not \wdrel{361} \textsl{part of} Aquila, it is \wdrel{59} \textsl{part of constellation} Aquila.
    188 However, this design decision does not influence the evaluation.
    189 Given the query ``Altair is located in the Aquila constellation,'' a model ought to rank this sample as more similar to samples conveying \textsl{part of constellation} than to those conveying other kinds of \textsl{part of} relationships.
    190 If FewRel made the opposite design choice, the model would still be able to achieve high accuracy by ensuring \textsl{part of} samples are similar.
Whether or not to split the \textsl{part of} relation should be of no concern to the unsupervised model.
    192 
    193 \subsection{Open Information Extraction}
    194 \label{sec:relation extraction:oie}
In open information extraction (\textsc{oie}, \citex{oie}), the closed-domain assumption (Section~\ref{sec:relation extraction:domain restriction}) is made neither for relations nor for entities, which are extracted jointly.
Instead, \(\entitySet\) and \(\relationSet\) are implicitly defined by the language itself; typically, a fact \((e_1, r, e_2)\) is expressed as a triplet such as (noun phrase, verb phrase, noun phrase).
    197 This makes \textsc{oie} particularly interesting when processing large amounts of data from the web, where there can be many unanticipated relations of interest.
    198 
    199 This section focuses on TextRunner, the first model implementing \textsc{oie}.
    200 It uses an aggregate extraction setup where \(\dataSet\) is directly mapped to \(\kbSet\), with the peculiarity that \(\kbSet\) is defined using surface forms only.
    201 The hypothesis on which TextRunner relies is that the surface form of the relation conveyed by a sentence appears in the path between the two entities in its dependency tree.
    202 In the \textsc{oie} setup, these surface forms can then be used as labels for the conveyed relations, thereby using the language itself as the relation domain \(\relationSet\).
    203 TextRunner can be split into three parts:
    204 \begin{description}
    205 	\item[The Learner] is a naive Bayes classifier, trained on a small dataset to predict whether a fact \((e_1, r, e_2)\) is trustworthy.
    206 		To extract a set of samples for this task, a dependency parser (Figure~\ref{fig:relation extraction:dependency tree}) is run on the dataset and tuples \((e_1, r, e_2)\) are extracted where \(e_1\) and \(e_2\) are base noun phrases and \(r\) is the dependency path between the two entities.
    207 		The tuples are then automatically labeled as trustworthy or not according to a set of heuristics such as the length of the dependency path and whether it crosses a sentence boundary.
    208 		The naive Bayes classifier is then trained to predict the trustworthiness of a tuple given a set of hand-engineered features (Section~\ref{sec:relation extraction:hand-designed features}).
    209 	\item[The Extractor] extracts trustworthy facts on the whole dataset.
    210 		The features on which the Learner is built only depend on part-of-speech (\textsc{pos}) tags (noun, verb, adjective\dots) such that the Extractor does not need to run a dependency parser on all the sentences in the entire dataset.
    211 		\begin{marginparagraph}
    212 			Dependency parsers tend to be a lot slower than \textsc{pos} taggers.
    213 		\end{marginparagraph}
    214 		While the Learner uses the dependency path for \(r\), the Extractor uses the infix from which non-essential phrases (such as adverbs) are eliminated heuristically.
    215 		Thus the Extractor simply runs a \textsc{pos} tagger on all sentences, finds all possible entities \(e\), estimates a probable relation \(r\) and filters them using the Learner to output a set of trustworthy facts.
    216 	\item[The Assessor] assigns a probability that a fact is true from redundancy in the dataset using the urns model of \textcite{textrunner_assessor}.
    217 		This model uses a binomial distribution to model the probability that a correct fact appears \(k\) times among \(n\) extractions with a fixed repetition rate.
    218 		Furthermore, it assumes both correct and incorrect facts follow different Zipf's laws.
The shape parameter \(s_I\) of the distribution of incorrect facts is assumed to be 1, while the shape parameter \(s_C\) of the distribution of correct facts, as well as the number of correct facts \(N_C\), are estimated using an expectation--maximization algorithm.
    221 		\begin{marginparagraph}[-3mm]
    222 			Zipf's law comes from the externalist linguistic school.
It follows from the observation that the second most common word appears half as often as the most frequent one, that the third most common word appears a third as often, and so on.
    224 			The same distribution can often be observed in information extraction.
    225 			Zipf's law is parametrized by a shape \(s\) and the number of elements \(N\):
    226 			\begin{equation*}
    227 				P(x\mid s) \propto
    228 					\begin{cases}
    229 						x^{-s} & \text{ for } x\in\{1,\dotsc,N\} \\
    230 						0 & \text{ otherwise } \\
    231 					\end{cases}
    232 			\end{equation*}
    233 			A Zipf's law is easily recognizable on a \(\log\)--\(\log\) scale, its probability mass function being a straight line.
    234 			Take for example the Zipf's law with parameters \(s=2\) and \(N=10\):
    235 			\begin{center}
    236 				\input{mainmatter/relation extraction/zipf.tex}
    237 			\end{center}
    238 		\end{marginparagraph}
    239 		In the expectation step, the binomial and Zipf distribution assumptions can be combined using Bayes' theorem to estimate whether a fact is correct or not.
    240 		In the maximization step, the parameters \(s_C\) and \(N_C\) are estimated.
    241 \end{description}
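The Zipf's law assumed by the Assessor (see the margin note) is straightforward to normalize; the sketch below makes the classical word-frequency observation concrete for \(s = 1\).

```python
def zipf_pmf(s, N):
    """Probability mass function of a Zipf's law with shape s over {1, ..., N}."""
    weights = [x ** -s for x in range(1, N + 1)]
    total = sum(weights)  # normalization constant
    return [w / total for w in weights]
```

With \(s = 1\), the second element has half the probability of the first and the third a third of it, matching the observation that motivated the law.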
    242 
    243 \textcite{oie} compare their approach to KnowItAll, an earlier work similar to \textsc{oie} but needing a list of relations (surface forms) as input to define the target relation schema \(\relationSet\).
    244 On a set of ten relations, they manually labeled the extracted facts as correct or not, obtaining an error rate of 12\% for TextRunner and 18\% for KnowItAll.
    245 They further run their model on 9 million web pages, extracting 7.8 million facts.
    246 
    247 A limitation of the \textsc{oie} approach is that it heavily depends on the raw surface form and suffers from bad generalization.
    248 The two facts ``Bletchley Park \textsl{known as} Station X'' and ``Bletchley Park \textsl{codenamed} Station X'' are considered different by TextRunner since the surface forms conveying the relations in the underlying sentences are different.
    249 Subsequent \textsc{oie} approaches try to address this problem, such as \textcite{textrunner_synonym}, which extend TextRunner with a resolver \parencite{textrunner_resolver} to merge synonyms.
However, this problem has not yet been overcome and is still an active area of research.
Furthermore, since the input of \textsc{oie} systems is often taken to be the largest possible chunk of the web, and since the extracted facts do not follow a strict nomenclature, a fair evaluation of \textsc{oie} systems against each other or against other unsupervised relation extraction models is still not feasible.
    252 
    253 \subsection{Clustering Surface Forms}
    254 \label{sec:relation extraction:hasegawa}
    255 The first unsupervised relation extraction model was the clustering approach of \textcitex{hasegawa}.
    256 It is somewhat similar to \textsc{dirt} (Section~\ref{sec:relation extraction:dirt}) in that it uses a similarity between samples.
    257 However, their work goes one step further by using this similarity to build relation classes.
Furthermore, \textcite{hasegawa} do not assume \hypothesis{pullback}, i.e.~they do not assume that the sentence and entities convey the relation separately, on their own.
Instead, their basic assumption is that the infix between the two entities is the expression of the conveyed relation.
    260 \begin{marginparagraph}
    261 	As a reminder, the infix is the span of text between the two entities in the sentence.
    262 \end{marginparagraph}
    263 As such, if two infixes are similar, the sentences convey similar relations.
    264 Furthermore, \textsc{ner} (see the introduction of Chapter~\ref{chap:relation extraction}) is performed on the text instead of simple entity chunking.
    265 This means that all entities are tagged with a type such as ``organization'' and ``person.''
    266 These types strongly constrain the relations through the following assumption:
    267 \begin{marginparagraph}[-9mm]
    268 	Following Section~\ref{sec:context:relation algebra}, \(\breve{r}\) is the converse relation of \(r\), i.e.~the relation with \(e_1\) and \(e_2\) in the reverse order.
    269 	\(\relationComposition\) is the composition operator and \(\relationOne_X\) the complete relation over \(X\).
    270 	\(r\relationComposition\breve{r}\) is the relation linking all the entities which appear as subject (\(e_1\), on the left hand side) of \(r\) to themselves.
    271 	This relation is constrained to be between entities in \(X\).
    272 	Less relevant to this formula, \(r\relationComposition\breve{r}\) also links together entities linked by \(r\) to the same object.
    273 \end{marginparagraph}
    274 \begin{assumption}{type}
    275 	All entities have a unique type, and all relations are left and right restricted to one of these types.
    276 
    277 	\smallskip
    278 	\noindent
    279 	\(
    280 		\exists \symcal{T} \textup{ partition of } \entitySet :
    281 		\forall r\in\relationSet :
    282 		\exists X, Y\in \symcal{T} :
    283 		r\relationComposition \breve{r} \relationOr \relationOne_X = \relationOne_X
    284 		\; \land \;
    285 		\breve{r}\relationComposition r \relationOr \relationOne_Y = \relationOne_Y
    286 	\)
    287 \end{assumption}
    288 \begin{marginparagraph}[9mm]
    289 	Here, we assume that the partition \(\symcal{T}\) is not degenerate and somewhat looks like a standard \textsc{ner} classification output.
    290 	Otherwise, \(\symcal{T}=\{\entitySet\}\) is a valid partition of \(\entitySet\), and this assumption is tautological.
    291 \end{marginparagraph}
    292 This is a natural assumption for many relations; for example, the relation \textsl{born in} is always between a person and a geopolitical entity (\textsc{gpe}).
    293 
    294 Given a pair of entities \((e_1, e_2)\in\entitySet^2\), \textcite{hasegawa} collect all samples in which they appear and extract a single vector representation from all these samples.
    295 This representation is built from the bag of words of the infixes weighted by \textsc{tf--idf} (term frequency--inverse document frequency).
    296 Since a bag of words discards the ordering of the words or entities, the variant of \textsc{tf--idf} used takes into account the directionality:
    297 \begin{align*}
    298 	\textsc{tf}(w, e_1, e_2) = &
    299 			 \text{number of times \(w\) appears between \(e_1\) and \(e_2\)} \\
    300 			& \hspace{5mm} - \text{number of times \(w\) appears between \(e_2\) and \(e_1\)} \\
    301 	\textsc{idf}(w) = & (\text{number of documents in which \(w\) appears})^{-1}
    302 	\\
    303 	\textsc{tf--idf}(w, e_1, e_2) = & \textsc{tf}(w, e_1, e_2) \cdot \textsc{idf}(w)
    304 \end{align*}
    305 
    306 From this definition we obtain a representation \(\vctr{z}_{e_1, e_2}\in\symbb{R}^{V}\) of the pair \((e_1, e_2)\in\entitySet^2\) by taking the value of \(\textsc{tf--idf}(w, e_1, e_2)\) for all \(w\in V\).
Given two entity pairs, their similarity is defined as follows:
    308 \begin{equation*}
    309 	\operatorname{sim}(\vctr{e}, \vctr{e}')
    310 	= \cos(\vctr{z}_\vctr{e}, \vctr{z}_\vctr{e'})
    311 	= \frac{\vctr{z}_\vctr{e} \cdot \vctr{z}_\vctr{e'}}{\|\vctr{z}_\vctr{e}\| \|\vctr{z}_\vctr{e'}\|}.
    312 \end{equation*}
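The directional \textsc{tf--idf} representation and the cosine similarity above can be sketched as follows; sparse dictionaries stand in for \(\vctr{z}_{e_1,e_2}\in\symbb{R}^{V}\), and the function names and the \texttt{idf} argument are illustrative choices:

```python
import math
from collections import Counter

def pair_vector(forward_infixes, backward_infixes, idf):
    """Sparse directional tf-idf vector for an entity pair (e1, e2).

    forward_infixes: words seen between e1 and e2 (counted positively).
    backward_infixes: words seen between e2 and e1 (counted negatively).
    idf: word -> inverse document frequency (defaults to 1 if missing).
    """
    tf = Counter(forward_infixes)
    tf.subtract(Counter(backward_infixes))
    return {w: c * idf.get(w, 1.0) for w, c in tf.items() if c != 0}

def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(x * v.get(w, 0.0) for w, x in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0
```

Note how a word appearing equally often in both directions cancels out, which is exactly the directionality trick of the \textsc{tf} definition above.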
    313 
    314 Using this similarity function, the complete-linkage clustering algorithm%
    315 \sidenote{
    316 	The complete-linkage algorithm is an agglomerative hierarchical clustering method also called farthest neighbor clustering.
	The algorithm starts with each sample in its own cluster, then merges clusters two by two until reaching the desired number of clusters.
    318 	At each step, the two closest clusters are merged together, with the distance between clusters being defined as the distance between their farthest elements.
    319 }
\parencite{complete_linkage} is used to extract relation classes.
Since each pair ends up in a single cluster, this assumes \hypothesis{1-adjacency}.
    322 \Textcite{hasegawa} evaluate their method on articles from the New York Times (\textsc{nyt}).
They extract relation classes by first clustering all \(\vctr{z}_{e_1, e_2}\) where \(e_1\) has the type person and \(e_2\) has the type \textsc{gpe}, and then by clustering all \(\vctr{z}_{e_1, e_2}\) where both \(e_1\) and \(e_2\) are organizations.
    324 By clustering separately different type combinations, they ensure that \hypothesis{type} is enforced.
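The complete-linkage procedure described in the side note can be sketched as follows; a generic distance function is assumed here (for the cosine similarity above, one could use \(1 - \operatorname{sim}\)):

```python
def complete_linkage(points, k, dist):
    """Agglomerative clustering: repeatedly merge the two clusters whose
    farthest elements are closest, until k clusters remain."""
    clusters = [[p] for p in points]  # start with singleton clusters
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Complete linkage: cluster distance = farthest pair.
                d = max(dist(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] += clusters[j]
        del clusters[j]
    return clusters
```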
    325 
    326 They furthermore experiment with automatic labeling of the clusters with the most frequent word appearing in the samples.
    327 Apart from the relation \textsl{prime minister}, which is simply labeled ``minister'' since only unigrams are considered, the labels are rather on point.
    328 To measure the performance of their model, they use a classical supervised \fone{} where each cluster is labeled by the majority gold relation.
Using this somewhat ill-suited metric, they reach an \fone{} of 82\% on person--\textsc{gpe} pairs and an \fone{} of 77\% on organization--organization pairs.
    330 This relatively high score compared to subsequent models can be explained by the small size of their dataset, which is further split by entity type.
    331 Furthermore, note that some generic relations such as \textsl{part of} do not follow \hypothesis{type} and, as such, cannot be captured.
    332 
    333 \subsection{Rel-\textsc{lda}}
    334 \label{sec:relation extraction:rellda}
    335 Rel-\textsc{lda} \parencitex{rellda} is a probabilistic generative model inspired by \textsc{lda}.
    336 It works by clustering sentences: each relation defines a distribution over a handcrafted set of sentence features (Section~\ref{sec:relation extraction:hand-designed features}) describing the relationship between the two entities in the text.
    337 Furthermore, rel-\textsc{lda} models the propensity of a relation at the level of the document; thus, it is not strictly speaking a sentence-level relation extractor.
    338 The idea behind modeling this additional information is that when a relation such as \wdrel{413} \textsl{position played on team} appears in a document, other relations pertaining to sports are more likely to appear.
    339 Figure~\ref{fig:relation extraction:rellda plate} gives the plate diagram for the rel-\textsc{lda} model.
    340 It uses the following variables:
    341 \begin{itemize}[nosep,label={}]
    342 	\item \(\rndmvctr{f}_i\) the features of the \(i\)-th sample, where \(\rndm{f}_{ij}\) is its \(j\)-th feature
    343 	\item \(\rndm{r}_i\) the relation of the \(i\)-th sample
    344 	\item \(\rndm{\theta}_d\) the distribution of relations in the document \(d\)
	\item \(\rndm{\phi}_{rj}\) the probability of the \(j\)-th feature occurring for the relation \(r\)
    346 	\item \(\alpha\) the Dirichlet prior for \(\rndm{\theta}_d\)
    347 	\item \(\beta\) the Dirichlet prior for \(\rndm{\phi}_{rj}\)
    348 \end{itemize}
    349 The generative process is listed as Algorithm~\ref{alg:relation extraction:rellda generation}.
    350 The learning process uses the expectation--maximization algorithm.
    351 In the variational E-step, the relation for each sample \(r_i\) is sampled from the categorical distribution:
    352 \begin{equation*}
    353 	P(r_i\mid \vctr{f}_i, d) \propto P(r_i\mid d) \prod_{j=1}^m P(f_{ij}\mid r_i)
    354 \end{equation*}
    355 \begin{marginfigure}
	\kern1mm% The top plate rises a bit too high over the top of the margin column otherwise.
    357 	\centering
    358 	\input{mainmatter/relation extraction/rellda plate.tex}
    359 	\scaption[Rel-\textsc{lda} plate diagram.]{
    360 		Rel-\textsc{lda} plate diagram.
    361 		\(D\) is the number of documents in the dataset and \(n_d\) is the number of samples in the document \(d\).
		For each sample \(i\), there are several features \(\rndm{f}_{i1}, \rndm{f}_{i2}, \dotsc, \rndm{f}_{im}\); accordingly, for each relation \(r\), there are also several feature priors \(\rndm{\phi}_{r1}, \dotsc, \rndm{\phi}_{rm}\). However, for simplicity, a single prior is shown here.
    363 		\label{fig:relation extraction:rellda plate}
    364 	}
    365 \end{marginfigure}%
    366 \begin{marginalgorithm}
    367 	\centering
    368 	\input{mainmatter/relation extraction/rellda.tex}
    369 	\scaption[The rel-\textsc{lda} generative process.]{
    370 		The rel-\textsc{lda} generative process.
    371 		\(\operatorname{Dir}\) are Dirichlet distributions.
    372 		\(\operatorname{Cat}\) are categorical distributions.
    373 		\label{alg:relation extraction:rellda generation}
    374 	}
    375 \end{marginalgorithm}%
    376 where \(P(r\mid d)\) is defined by \(\theta_d\) and \(P(f_{ij}\mid r)\) is defined by \(\phi_{rj}\).
In the M-step, the value of \(\theta_d\) is computed from the number of times each relation appears in \(d\) and from the hyperprior \(\alpha\); the value of \(\phi_{rj}\) is computed from the number of co-occurrences of the \(j\)-th feature with the relation \(r\) and from \(\beta\).
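As a rough illustration of these two steps, here is a minimal numpy sketch of the expectation--maximization loop on toy data. It uses the soft posterior directly instead of sampling \(r_i\) as in the variational E-step, and all names (\texttt{docs}, \texttt{theta}, \texttt{phi}) are illustrative rather than taken from the original implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
R, m, V = 3, 4, 10       # relations, features per sample, feature vocabulary
alpha, beta = 0.1, 0.1   # Dirichlet hyperpriors

# Toy corpus: docs[d] is an array of samples, each a vector of m feature ids.
docs = [rng.integers(0, V, size=(5, m)) for _ in range(2)]

theta = np.full((len(docs), R), 1.0 / R)           # P(r | d)
phi = rng.dirichlet(np.full(V, 1.0), size=(R, m))  # P(f_j = v | r)

for _ in range(20):
    # E-step: posterior P(r | f_i, d) proportional to P(r | d) prod_j P(f_ij | r)
    posts = []
    for d, samples in enumerate(docs):
        logp = np.stack([
            np.log(theta[d]) + np.log(phi[:, np.arange(m), f]).sum(axis=1)
            for f in samples
        ])
        p = np.exp(logp - logp.max(axis=1, keepdims=True))
        posts.append(p / p.sum(axis=1, keepdims=True))

    # M-step: re-estimate theta and phi from expected counts and the priors.
    for d, p in enumerate(posts):
        theta[d] = (p.sum(axis=0) + alpha) / (p.sum() + R * alpha)
    counts = np.full((R, m, V), beta)
    for samples, p in zip(docs, posts):
        for f, w in zip(samples, p):
            counts[:, np.arange(m), f] += w[:, None]
    phi = counts / counts.sum(axis=2, keepdims=True)
```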
    378 
    379 \Textcite{rellda} evaluate their model on the New York Times by comparing their clusters to relations in Freebase.
    380 However, because of the incompleteness of knowledge bases, they only evaluate the recall on Freebase and use manual annotation to estimate the precision.
    381 Even though the original article lacks a significant comparison, subsequent approaches often compare to rel-\textsc{lda}.
    382 
A first limitation of their approach is that given the relation \(r\), the features \(f\) are independent.
Since the entities are among those features, this means that \(P(e_2\mid e_1, r) = P(e_2\mid r)\), which is clearly false.
    385 \begin{assumption}{biclique}
    386 	Given a relation, the entities are independent of one another: \( \displaystyle \rndm{e}_1 \independent \rndm{e}_2 \mid \rndm{r} \).
    387 	In other words, given a relation, all possible head entities are connected to all possible tail entities.
    388 	
    389 	\smallskip
    390 	\noindent
    391 	\(\forall r\in\relationSet:\exists A,B\subseteq\entitySet: r\relationComposition\breve{r}=\relationOne_A\land\breve{r}\relationComposition r=\relationOne_B\)
    392 \end{assumption}
    393 This is a widespread problem with generative models which are inclined to make extensive independence assumptions.
Furthermore, generative models have an implicit bias that all observed features are related to relation extraction, even though they might measure other aspects of the sample (style, idiolectal word choice, etc.).
This might result in the model focusing on features unrelated to the relation extraction task.
    396 
    397 Several extensions of rel-\textsc{lda} were proposed.
Type-\textsc{lda} \parencite{rellda} proposes to model entity types as latent variables generating entity features, themselves generated from the relation variable \(r\), thus softly enforcing \hypothesis{type}.
Sense-\textsc{lda} \parencitex{rellda_sense} uses an \textsc{lda}-like model for each different dependency path.
    400 Clusters for different paths are then merged into relation clusters.
    401 
    402 Rel-\textsc{lda} is an important work in that it proposes a simple evaluation framework; in particular, it introduces the \bcubed{} metric to unsupervised relation extraction.
    403 However, it predates the advent of neural networks and distributed representations in relation extraction, by which it was bound to be replaced.
    404 
    405 \subsection{Variational Autoencoder for Relation Extraction}
    406 \label{sec:relation extraction:vae}
\Textcitex{vae_re} were the first to propose a discriminative unsupervised relation extraction model.
    408 Discriminative models directly solve the inference problem of finding the posterior \(P(r\mid x)\).
    409 This is in contrast to generative models such as rel-\textsc{lda} which determine \(P(x\mid r)\) and then use Bayes' theorem to compute \(P(r\mid x)\) and make a prediction.
    410 The model of \textcite{vae_re} is closely related to the approach presented in Chapter~\ref{chap:fitb}.
    411 It is a clustering model, meaning that it produces clusters of samples where the samples in each cluster all convey the same relation.
    412 To do so, it uses a variational autoencoder model (\textsc{vae}, \citex{vae}) that we now describe.
    413 
    414 \paragraph{Variational Autoencoder}
    415 \begin{marginfigure}
    416 	\centering
    417 	\input{mainmatter/relation extraction/vae plate.tex}
    418 	\scaption[\textsc{vae} plate diagram.]{
    419 		\textsc{vae} plate diagram.
    420 		\(N\) is the number of samples in the dataset.
    421 		\label{fig:relation extraction:vae plate}
    422 	}
    423 \end{marginfigure}
    424 The goal of a variational autoencoder is to learn a latent variable \(\vctr{z}\) which explains the distribution of an observed variable \(\vctr{x}\).
    425 For our problem, the latent variable corresponds to the relation conveyed by the sample \(\vctr{x}\).
    426 We assume we know the generative process \(P(\vctr{x}\mid \vctr{z}; \vctr{\theta})\), i.e.~this process is the ``decoder'' (parametrized by \(\vctr{\theta}\)): given the latent variable it produces a sample.
    427 However, the process of interest to us is to estimate the latent variable---the relation---from a sample, that is \(P(\vctr{z}\mid\vctr{x}; \vctr{\theta})\).
    428 Using Bayes' theorem we can reformulate this posterior as \(P(\vctr{x}\mid \vctr{z}; \vctr{\theta})P(\vctr{z}\mid \vctr{\theta}) \divslash P(\vctr{x}\mid \vctr{\theta})\).
    429 However, computing \(P(\vctr{x}\mid \vctr{\theta})\) is often intractable, especially when the likelihood \(P(\vctr{x}\mid \vctr{z}; \vctr{\theta})\) is modeled using a complicated function like a neural network.
    430 To solve this problem, a variational approach is used: another model \(Q\) parametrized by \(\vctr{\phi}\) is used to approximate \(P(\vctr{z}\mid\vctr{x}; \vctr{\theta})\) as well as possible.
    431 This approximation \(Q(\vctr{z}\mid\vctr{x};\vctr{\phi})\) is the ``encoder'' since it finds the latent variable associated with a sample.
    432 The model can then be trained by maximizing the log-likelihood given the latent variable estimated by \(Q\) and by minimizing the difference between the latent variable predicted by \(Q\) and the desired prior \(P(\vctr{z}\mid\vctr{\theta})\):
    433 \begin{equation}
    434 	J_\textsc{elbo}(\vctr{\theta}, \vctr{\phi}) = \expectation_{Q(\vctr{z}\mid \vctr{x}; \vctr{\phi})}[\log P(\vctr{x}\mid\vctr{z};\vctr{\theta})] - \kl(Q(\vctr{z}\mid \vctr{x}; \vctr{\phi}) \mathrel{\|} P(\vctr{z}\mid\vctr{\theta}))
    435 	\label{eq:relation extraction:elbo}
    436 \end{equation}
A justification for this objective can also be found in the fact that it is a lower bound of the log marginal likelihood \(\log P(\vctr{x}\mid \vctr{\theta})\), hence its name: evidence lower bound (\textsc{elbo}).
    438 The first part of the objective is often referred to as the negative reconstruction loss since it seeks to reconstruct the sample \(\vctr{x}\) after it went through the encoder \(Q\) and the decoder \(P\).
One last problem with the \textsc{vae} approximation relates to the reconstruction loss: the estimation of the expectation over \(Q(\vctr{z}\mid\vctr{x};\vctr{\phi})\) is not differentiable, which makes the model---in particular \(\vctr{\phi}\)---untrainable by gradient descent.
This is usually solved using the reparameterization trick: sampling from \(Q(\vctr{z}\mid\vctr{x};\vctr{\phi})\) can often be done in a two-step process: sampling from a simple distribution like \(\epsilon\sim\normalDistribution(0, 1)\), then transforming this sample using a deterministic process parametrized by \(\vctr{\phi}\).
The plate diagram of the \textsc{vae} is given in Figure~\ref{fig:relation extraction:vae plate}, where the model \(P\) is marked with solid lines and the variational approximation \(Q\) is marked with dashed lines.
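For the common case of a diagonal Gaussian posterior, the reparameterization trick can be sketched in a few lines of numpy; here \texttt{mu} and \texttt{log\_var} are stand-ins for a hypothetical encoder output, and the closed-form Kullback--Leibler divergence corresponds to the second term of the \textsc{elbo} when the prior is a standard normal.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for the output of an encoder Q(z | x; phi) for one sample x.
mu = np.array([0.5, -1.0])
log_var = np.array([-0.2, 0.1])

# Reparameterization trick: z ~ N(mu, sigma^2) rewritten as a deterministic,
# differentiable function of (mu, sigma) and auxiliary noise eps ~ N(0, 1).
eps = rng.standard_normal(mu.shape)
z = mu + np.exp(0.5 * log_var) * eps

# Closed-form KL(N(mu, sigma^2) || N(0, I)): the Kullback--Leibler term of
# the ELBO when the prior P(z) is a standard normal.
kl = 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var)
```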
    442 
    443 \begin{marginfigure}
    444 	\centering
    445 	\input{mainmatter/relation extraction/marcheggiani plate.tex}
    446 	\scaption[\textcite{vae_re} plate diagram.]{
    447 		\textcite{vae_re} plate diagram.
    448 		\label{fig:relation extraction:marcheggiani plate}
    449 	}
    450 \end{marginfigure}
    451 
    452 \bigskip
    453 
    454 Coming back to the model of \textcite{vae_re}, it is a conditional \(\beta\)\textsc{-vae},%
    455 \sidenote{
    456 	The \(\beta\) in ``\(\beta\)\textsc{-vae}'' simply indicates that the Kullback--Leibler term in Equation~\ref{eq:relation extraction:elbo} is weighted by a hyperparameter \(\beta\).
    457 	More details are given in Chapter~\ref{chap:fitb}.
    458 }
    459 i.e.~the whole process is conditioned on an additional variable.
    460 Indeed, in their approach, only the entities \(\vctr{e}\in\entitySet^2\) are reconstructed, while the sentence \(s\in\sentenceSet\) simply conditions the whole process.
    461 The latent variable explaining the observed entities is expected to be the relation conveyed by the sample.
    462 The resulting model's plate diagram is given in Figure~\ref{fig:relation extraction:marcheggiani plate}.
    463 This approach is defined by two models:
    464 \begin{description}
    465 	\item[The Encoder] \(Q(\rndm{r}\mid\vctr{e}, s; \vctr{\phi})\) is the relation extraction model properly speaking.
    466 		It is defined as a linear model on top of handcrafted features (Section~\ref{sec:relation extraction:hand-designed features}).
    467 		For each sample, the model outputs a distribution over a predefined number of relations.
    468 	\item[The Decoder] \(P(\vctr{e}\mid r; \vctr{\theta})\) is a model estimating how likely it is for two entities to be linked by a relation.
    469 		It is a reconstruction model since the entities \(\vctr{e}\) are known and need to be retrieved from the latent relation \(r\) sampled from the encoder.
    470 		It is defined using selectional preferences (Section~\ref{sec:context:selectional preferences}) and \textsc{rescal} (Section~\ref{sec:context:rescal}).
    471 \end{description}
    472 Note that to label a sample \((\vctr{e}, s)\in\dataSet\), \textcite{vae_re} simply select \(\argmax_{r\in\relationSet} Q(r\mid \vctr{e}, s; \vctr{\phi})\), meaning that the decoder is not used during evaluation.
    473 Its sole purpose is to provide a supervision signal to the encoder through the maximization of \(J_\textsc{elbo}\).
    474 The whole autoencoder can also be interpreted as being trained by a surrogate task of filling-in entity blanks.
    475 This is the interpretation we use in Chapter~\ref{chap:fitb}.
    476 
    477 For Equation~\ref{eq:relation extraction:elbo} to be well defined, a prior on the relations must also be selected; \textcite{vae_re} make the following assumption:
    478 \begin{assumption}{uniform}
    479 	All relations occur with equal frequency.
    480 	
    481 	\smallskip
    482 	\noindent
    483 	\( \displaystyle \forall r\in\relationSet\colon P(r) = \frac{1}{|\relationSet|} \)
    484 \end{assumption}
    485 
    486 They evaluate their approach on the New York Times distantly supervised by Freebase.
    487 By inducing 100 clusters, they show an improvement of the \bcubed{} \fone{} compared to \textsc{dirt} (Section~\ref{sec:relation extraction:dirt}) and rel-\textsc{lda} (Section~\ref{sec:relation extraction:rellda}).
    488 They also experiment using semi-supervised evaluation (Section~\ref{sec:relation extraction:unsupervised evaluation}) by pre-training their decoder on a subset of Freebase before training their encoder as described above; this additional supervision improves the \fone{} by more than 27\%.
These results were further improved by \textcite{vae_re2}, who proposed to split the latent variable into a relation \(r\) and sentence information \(z\), with \(z\) conditioned on \(r\), and to use a loss including the reconstruction of the sentence \(s\) from \(z\).
    490 
    491 \subsection{Matching the Blanks}
    492 \label{sec:relation extraction:mtb}
    493 Matching the blanks (\textsc{mtb}, \citex{mtb}) is an unsupervised method that does not attempt to cluster samples but rather learns a representation of the relational semantics they convey.
    494 More precisely, this representation is used to measure the similarity between samples such that similar samples convey similar relations.
    495 As such, it is either evaluated as a supervised pre-training method (Section~\ref{sec:relation extraction:unsupervised evaluation}) or using a few-shot dataset (Section~\ref{sec:relation extraction:few-shot}).
    496 The \textsc{mtb} article introduces several methods to extract an entity-aware representation of a sentence using \textsc{bert}; this was discussed in Section~\ref{sec:relation extraction:mtb sentential}.
    497 This section focuses on the unsupervised training.
As a reminder, we refer to the sentence encoder of \textsc{mtb} as the function \(\bertcoder\colon\sentenceSet\to\symbb{R}^d\), illustrated in Figure~\ref{fig:relation extraction:emes}.
    499 Given this encoder, \textsc{mtb} defines the similarity between samples as:
    500 \begin{equation}
    501 	\operatorname{sim}(s, s') = \sigmoid(\bertcoder(s)\transpose\bertcoder(s'))
    502 	\label{eq:relation extraction:mtb similarity}
    503 \end{equation}
    504 This similarity function can be used to evaluate the model on a few-shot task.
Note that this function completely ignores entity identifiers (e.g.\ \wdent{211539}) but can still exploit the entity surface forms (e.g.\ ``Peter Singer'') through the sentence \(s\in\sentenceSet\).
This model can be used as is, without any training other than the masked language model pre-training of \textsc{bert} (Section~\ref{sec:context:mlm}), and reaches an accuracy of 72.9\% on the FewRel 5 way 1 shot dataset.
    507 
    508 \Textcite{mtb} propose a training objective to fine-tune \textsc{bert} for the unsupervised relation extraction task.
    509 This objective is called matching the blanks.
    510 It assumes that two sentences containing the same entities convey the same relation.
This is exactly \hypothesis{1-adjacency} as given in Section~\refAssumptionSection{oneadjacency}.
    512 The probability that two sentences convey the same relation (\(\rndm{D}=1\)) is taken from the similarity function: \(P(\rndm{D}=1\mid s, s')=\operatorname{sim}(s, s')\).
    513 Given this, the \hypothesis{1-adjacency} assumption is translated into the following negative sampling (Section~\ref{sec:context:negative sampling}) loss:
    514 \begin{equation}
    515 	\loss{mtb} = \frac{-1}{|\dataSet|^2} \sum_{\substack{(\vctr{e}, s)\in\dataSet\\(\vctr{e}', s')\in\dataSet}}
    516 	\begin{array}[t]{l}
    517 		\delta_{\vctr{e},\vctr{e}'} \log P(\rndm{D}=1\mid s, s') \\
    518 		\hspace{1cm} + (1 - \delta_{\vctr{e},\vctr{e}'}) \log P(\rndm{D}=0\mid s, s') \\
    519 	\end{array}
    520 	\label{eq:relation extraction:mtb loss}
    521 \end{equation}
    522 This loss is minimized through gradient descent by sampling random positive and negative sentence pairs.
These pairs can be obtained by comparing entity identifiers without the need for any supervision.
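On toy data, \loss{mtb} can be sketched as follows; the random vectors stand in for the \(\bertcoder\) outputs, which in practice would be trained by backpropagating through this loss.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Toy dataset of (entity pair, sentence encoding) samples; the random
# vectors stand in for the sentence encoder outputs.
data = [
    (("Q1", "Q2"), rng.standard_normal(4)),
    (("Q1", "Q2"), rng.standard_normal(4)),
    (("Q3", "Q4"), rng.standard_normal(4)),
]

def mtb_loss(data):
    """Average the pairwise loss over all sample pairs: pairs with the same
    entities are pulled together, all other pairs are pushed apart."""
    total = 0.0
    for e, h in data:
        for e2, h2 in data:
            p_same = sigmoid(h @ h2)  # P(D = 1 | s, s')
            if e == e2:
                total += np.log(p_same)
            else:
                total += np.log(1.0 - p_same)
    return -total / len(data) ** 2

loss = mtb_loss(data)
```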
    524 
A problem with this approach is that the \bertcoder{} model can simply learn to perform entity linking on the entity surface forms in the sentences \(s\), thus minimizing Equation~\ref{eq:relation extraction:mtb loss} by predicting whether \(\vctr{e}=\vctr{e}'\).
    526 We want to avoid this since this would only work on samples seen during training and would not generalize to unseen entities.
    527 To ensure the model predicts whether the samples convey the same relation from the sentences \(s\) and \(s'\) alone, blanks are introduced.
A special token \blanktag{} is substituted for the entities as follows:
    529 \begin{indentedexample}
    530 	\uhead{\blanktag}, inspired by Cale's earlier cover, recorded one of the most acclaimed versions of ``\utail{\blanktag}.''\\
    531 	\smallskip
    532 	\uhead{\blanktag}'s rendition of ``\utail{\blanktag}'' has been called ``one of the great songs'' by Time\dots
    533 \end{indentedexample}
This is similar to the sample corruption of \textsc{bert} (Section~\ref{sec:context:mlm}); indeed, like \textsc{bert}, the entity surface forms are blanked only a fraction%
    535 \sidenote{\Textcite{mtb} blanks each entity with a probability of 70\%, meaning that only 9\% of training samples have both of their entity surface forms intact.}
of the time so as not to confuse the model when real entities appear during evaluation.
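The blanking step itself is straightforward; a possible sketch, where the function name, the example sentence, and the token spans of the two entities are all illustrative:

```python
import random

BLANK = "[BLANK]"

def blank_entities(tokens, head_span, tail_span, p=0.7, rand=random.random):
    """Replace each entity surface form by BLANK with probability p.
    Spans are (start, end) token indices; they are processed right to left
    so that an earlier replacement does not shift the later span."""
    out = list(tokens)
    for start, end in sorted([head_span, tail_span], reverse=True):
        if rand() < p:
            out[start:end] = [BLANK]
    return out

sent = "Jeff Buckley recorded an acclaimed version of Hallelujah".split()
# With p=1.0, both surface forms are deterministically blanked:
blanked = blank_entities(sent, head_span=(0, 2), tail_span=(7, 8), p=1.0)
```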
    537 
    538 Another problem with Equation~\ref{eq:relation extraction:mtb loss} is that the negative sample space \(\vctr{e}\neq\vctr{e}'\) is extremely large.
    539 Instead of taking negative samples randomly in this space, \textcite{mtb} propose to take only samples which are likely to be close to positive ones.
    540 To this end, the \(\vctr{e}\neq\vctr{e}'\) condition is actually replaced with the following one:
    541 \begin{equation*}
    542 	|\{e_1, e_2\} \cap \{e_1', e_2'\}| = 1
    543 \end{equation*}
    544 These are called ``strong negatives'': negative samples that have precisely one entity in common.
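The strong-negative condition is simple to state in code; the entity identifiers below are arbitrary placeholders:

```python
def is_strong_negative(e, e_prime):
    """Strong negatives share exactly one entity between the two samples."""
    return len(set(e) & set(e_prime)) == 1

assert is_strong_negative(("Q1", "Q2"), ("Q1", "Q3"))      # one shared entity
assert not is_strong_negative(("Q1", "Q2"), ("Q1", "Q2"))  # positive pair
assert not is_strong_negative(("Q1", "Q2"), ("Q3", "Q4"))  # no shared entity
```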
    545 Negative sampling, especially with strong negatives, leads to another unfortunate assumption:
    546 \begin{assumption}[onetoone]{\(1\to1\)}
    547 	All relations are one-to-one.
    548 
    549 	\smallskip
    550 	\noindent
    551 	\( \forall r\in\relationSet\colon
    552 	r\relationComposition \breve{r} \relationOr \relationIdentity
    553 	= \breve{r}\relationComposition r \relationOr \relationIdentity
    554 	= \relationIdentity \)
    555 \end{assumption}
Indeed, if a relation is not one-to-one, then there exist two facts \tripletHolds{e_1}{r}{e_2} and \tripletHolds{e_1}{r}{e_3} (or respectively with \(\breve{r}\)); however, these two facts form a strong negative pair, and therefore, as per \loss{mtb}, their representations must be pulled away from one another.
    557 
Despite these assumptions, \textsc{mtb} showcases impressive results, both as a few-shot and as a supervised pre-training method.
    559 It obtained state-of-the-art results both on the SemEval 2010 Task 8 dataset with a macro-\(\overHalfdirected{\fone}\) %
    560 \begin{marginparagraph}[-6mm]
	As a reminder, \(\overHalfdirected{\fone}\) is the half-directed metric described in Section~\ref{sec:relation extraction:supervised evaluation}.
    562 	It is referred to as ``taking directionality into account'' in the SemEval dataset.
    563 \end{marginparagraph}
    564 of 82.7\% and on FewRel with an accuracy of 90.1\% on the 5 way 1 shot task.
    565 
    566 \subsection{Self\textsc{ore}}
    567 \label{sec:relation extraction:selfore}
Self\textsc{ore} \parencitex{selfore} is a clustering approach similar to that of \textcite{hasegawa} presented in Section~\ref{sec:relation extraction:hasegawa} but using deep neural network models for extracting sentence representations and for grouping these representations into relation clusters.
    569 Since they follow the experimental setup of \textcite{fitb}, which we present in Chapter~\ref{chap:fitb}, their results are listed in that chapter.
    570 
    571 Self\textsc{ore} uses \textsc{mtb}'s entity markers--entity start \bertcoder{} sentence representation.
A clustering algorithm could be run to produce relation classes from these representations \`a la \textcite{hasegawa}.
    573 However, \textcite{selfore} introduce an iterative scheme to purify the clusters.
    574 This scheme is illustrated in Figure~\ref{fig:relation extraction:selfore} and works by alternatively optimizing two losses \loss{ac} and \loss{rc}.
    575 
    576 The first loss \loss{ac} is the clustering loss which comes from \textsc{dec} \parencitex{dec}.
    577 \textsc{dec} is a deep clustering algorithm that uses a denoising autoencoder \parencite{dae} to compress the input.
    578 In their case, the input \(\vctr{h}\) is the sentence encoded by \bertcoder{}.
    579 The denoising autoencoder is trained layer by layer with a small bottleneck which produces a compressed representation of the sentence \(\vctr{z}=\operatorname{Encoder}(\vctr{h})\).
    580 This is the space in which the clustering occurs.
    581 For each cluster \(j=1,\dotsc,K\), a centroid%
    582 \sidenote{
    583 	The \(k\)-means clustering algorithm is used to initialize the centroids.
    584 	In practice, the \(k\)-means clusters could directly be used as soft labels.
    585 	However, \textcite{selfore} show that this underperforms compared to refining the clusters with \loss{ac}.
    586 }
    587 \(\vctr{\mu}_j\) is learned such that a sentence is part of the cluster whose centroid is the closest to its compressed representation.
    588 This is modeled with a Student's \(t\)-distribution with one degree of freedom centered around the centroid:
    589 \begin{marginfigure}
    590 	\centering
    591 	\input{mainmatter/relation extraction/selfore.tex}
    592 	\scaption[Self\textsc{ore} iterative algorithm.]{
    593 		Self\textsc{ore} iterative algorithm.
    594 		\label{fig:relation extraction:selfore}
    595 	}
    596 \end{marginfigure}
    597 \begin{equation*}
    598 	q_{ij} = \frac{(1+\|\vctr{z}_i-\vctr{\mu}_j\|^2)^{-1}}{\sum_k (1+\|\vctr{z}_i-\vctr{\mu}_k\|^2)^{-1}}
    599 \end{equation*}
    600 To force the initial clusters to be more distinct, a target distribution \(p\) is defined as:
    601 \begin{equation}
    602 	p_{ij} = \frac{q_{ij}^2 \divslash f_j}{\sum_k q_{ik}^2 \divslash f_k}
    603 	\label{eq:relation extraction:selfore target}
    604 \end{equation}
    605 where \(f_j=\sum_i q_{ij}\) are soft cluster frequencies.
    606 To push \(\mtrx{Q}\) towards \(\mtrx{P}\), a Kullback--Leibler divergence is used:
    607 \begin{equation*}
    608 	\loss{ac} = \kl(\mtrx{P}\mathrel{\|}\mtrx{Q}) = \sum_{i=1}^{|\dataSet|} \sum_{j=1}^K p_{ij} \log \frac{p_{ij}}{q_{ij}}
    609 \end{equation*}
    610 This loss is minimized by backpropagating to the cluster centroids \(\vctr{\mu}_j\) and to the encoder's parameters in the \textsc{dae}.
    611 Note that the decoder of the \textsc{dae} is only used for initializing the encoder such that the input can be reconstructed.
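On toy data, the soft assignments, the sharpened target distribution, and \loss{ac} can be computed as follows; the compressed representations and centroids are random stand-ins for the outputs of the \textsc{dae} encoder and the learned \(\vctr{\mu}_j\).

```python
import numpy as np

rng = np.random.default_rng(0)
Z = rng.standard_normal((6, 2))   # compressed representations z_i
MU = rng.standard_normal((3, 2))  # cluster centroids mu_j

# Soft assignments q_ij: Student's t-distribution with one degree of freedom.
d2 = ((Z[:, None, :] - MU[None, :, :]) ** 2).sum(axis=2)
Q = 1.0 / (1.0 + d2)
Q /= Q.sum(axis=1, keepdims=True)

# Target p_ij: squared and renormalized by the soft cluster frequencies f_j,
# making the distribution more concentrated and more uniform across clusters.
f = Q.sum(axis=0)
P = Q**2 / f
P /= P.sum(axis=1, keepdims=True)

# Clustering loss: KL(P || Q), backpropagated to MU and the encoder.
loss_ac = (P * np.log(P / Q)).sum()
```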
    612 
    613 Optimizing \loss{ac} is the first step of Self\textsc{ore}; it assigns a pseudo-label to each sample in the dataset.
    614 The second step is to train a classifier to predict these pseudo-labels.
    615 The classifier is a simple multi-layer perceptron trained with the usual cross-entropy classification loss, which is called \loss{rc} in Self\textsc{ore}.
This loss also backpropagates to the \bertcoder{}, thus changing the sentence representations \(\vctr{h}\).
    617 Self\textsc{ore} is an iterative algorithm: changing the \(\vctr{h}\) modifies the clustering found by \textsc{dec}.
    618 Thus, the two steps, clustering and classification, are repeated several times until a stable label assignment is found.
    619 
    620 The central assumption of Self\textsc{ore} is that \bertcoder{} already produces a good representation for relation extraction, which, as we saw with the non-fine-tuned \bertcoder{} score on FewRel in Section~\ref{sec:relation extraction:mtb}, is rather accurate.
    621 However, Self\textsc{ore} also assumes \hypothesis{uniform}, i.e.~that all relations appear with the same frequency.
    622 This assumption is enforced by \loss{ac}, through the normalization of the target distribution \(\mtrx{P}\) by soft cluster frequencies \(f_j\).%
    623 \sidenote{For further details, \textcite{dec} contains an analysis of the \textsc{dec} clustering algorithm on imbalanced \textsc{mnist} data.}
Indeed, the distribution \(\mtrx{P}\) is the original distribution \(\mtrx{Q}\) made more concentrated (because of the square) and more uniform (because of the normalization by \(f_j\)).
    625 
    626 The interpretation of the concentration effect in terms of modeling hypotheses is more complex.
    627 The variable \(\vctr{h}\) is the concatenation of the two entity embeddings.
    628 Let's break down the \bertcoder{} function into two components: \(\operatorname{ctx}_1(s)\) and \(\operatorname{ctx}_2(s)\).
These are simply the two contextualized embeddings of \texttt{<e1>} and \texttt{<e2>} (Section~\ref{sec:relation extraction:mtb}); in other words, the function \(\operatorname{ctx}\) contextualizes an entity surface form inside its sentence.
When two sentence representations \(\vctr{h}\) and \(\vctr{h}'\) are close, their pseudo-labels tend to be the same, and thus their relations also tend to be the same.
    631 In other words:
    632 \begin{assumption}[ctxoneadjacency]{\ctxoneadj}
    633 	Two samples with the same contextualized representation of their entities' surface forms convey the same relation.
    634 
    635 	\smallskip
    636 	\noindent
    637 	\( \forall (s, \vctr{e}, r), (s', \vctr{e}', r')\in\dataSet_\relationSet\colon \)\\
    638 	\null\hfill\( \operatorname{ctx}_1(s)=\operatorname{ctx}_1(s') \land \operatorname{ctx}_2(s)=\operatorname{ctx}_2(s') \implies r=r' \)
    639 \end{assumption}
If we assume \bertcoder{} only performs entity linking of the entity surface forms, then \(\operatorname{ctx}_i(s)=e_i\) for \(i=1,2\); in this case, \hypothesis{\ctxoneadj} collapses to \hypothesis{1-adjacency}: the contextualization inside the sentence \(s\) is ignored.
On the other hand, if we assume \bertcoder{} provides no information about the entities and only encodes the sentence, then \(\operatorname{ctx}_i(s)=s\) for \(i=1,2\), and \hypothesis{\ctxoneadj} only states that the entity identifiers \(\vctr{e}\in\entitySet^2\) should have no influence on the relation.
The effective repercussion of \hypothesis{\ctxoneadj} lies somewhere halfway between these two extremes.