PhD

The LaTeX sources of my Ph.D. thesis
git clone https://esimon.eu/repos/PhD.git

conclusion.tex (11861B)


\chapter{Conclusion}
\label{chap:conclusion}
During this Ph.D.\ candidacy, I---mostly%
\sidenote{
	With the occasional---and deeply appreciated---distraction of Syrielle Montariol on unrelated \textsc{nlp} projects \parencite{mmsrl}.
}%
---focused on the study of unsupervised relation extraction.
In this task, given a set of tagged sentences and pairs of entities, we seek the set of conveyed facts \((e_1, r, e_2)\), such that \(r\) embodies the relationship between \(e_1\) and \(e_2\) expressed in some sample.
To tackle this task, we follow two main axes of research: first, the question of how to train a deep neural network for unsupervised relation extraction; second, the question of how to leverage the structure of an unsupervised dataset to gain additional information for the relation extraction task.

\section*{Summary of Contributions}
For more than a decade now, the field of machine learning has been overrun by deep learning approaches.
Since I started working on unsupervised relation extraction in late 2017, the task has followed the same trajectory.
The \textsc{vae} model of \textcitex{vae_re} began introducing deep learning methods to the task.
However, it was still limited by a sentence representation based on hand-engineered features.
My first axis of research was to take part in this deep learning transition (Chapter~\ref{chap:fitb}).
Subsequently, the use of deep learning was made simpler by the replacement of \textsc{cnn}- and \textsc{lstm}-based models with pre-trained transformers.
Indeed, a model like \textsc{bert} \parencite{bert} performs reasonably well on unsupervised relation extraction ``out of the box.''
This was exploited by others, in the clustering setup by Self\textsc{ore} \parencitex{selfore}, and in the few-shot setup by \textsc{mtb} \parencitex{mtb}.
My second axis of research was to exploit the regularities of the dataset to leverage additional information from its structure (Chapter~\ref{chap:graph}).
While some works already used this information in supervised relation extraction \parencite{label_propagation_re, epgnn}, unsupervised models made no attempt at modeling it explicitly.
Our proposed approaches are based on a graph representation of the dataset.
As we have shown, they are part of a broader revival of graph-based approaches in deep learning \parencite{gcn_spectral_semi, graphsage}.
We now describe the three main contributions we can draw from our work.

\paragraph{Literature review with formalized modeling assumptions.}\leavevmode\null\\
In Chapter~\ref{chap:relation extraction}, we presented relevant relation extraction models from the late 1990s until today.
We first introduced supervised approaches, which we split into two main blocks:
\begin{description}[font=\mdseries\itshape]
	\item[Sentential methods] extract a relation for each sample in isolation.
		In this setup, there is no difference between evaluating a model on a single dataset with a thousand samples or a thousand datasets containing one sample each.
		Indeed, these methods do not model the interactions between samples.
	\item[Aggregate methods] map a set of unsupervised samples to a set of facts at once.
		There is not necessarily a direct correspondence between extracted facts and samples in the dataset, even though most aggregate models still provide a sentential prediction.
		In this setup, a dataset containing a single sentence would be meaningless; it would boil down to a sentential approach.
\end{description}
This distinction can also be made for unsupervised models, and indeed Chapter~\ref{chap:fitb} follows mostly a sentential approach, whereas Chapter~\ref{chap:graph} sets out to introduce the aggregate approach to the unsupervised setting.
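Schematically, and only as an informal illustration rather than a formal definition, a sentential extractor can be typed as a function acting on a single sample, while an aggregate extractor acts on the dataset as a whole:
\begin{equation*}
	f_{\text{sentential}}\colon \mathcal{S} \to \mathcal{R}
	\qquad\text{versus}\qquad
	F_{\text{aggregate}}\colon 2^{\mathcal{S}} \to 2^{\mathcal{E} \times \mathcal{R} \times \mathcal{E}},
\end{equation*}
where \(\mathcal{S}\) denotes the set of samples, \(\mathcal{E}\) the set of entities and \(\mathcal{R}\) the set of relations; this notation is purely illustrative.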

In Chapter~\ref{chap:relation extraction}, we also presented unsupervised relation extraction models.
Unsupervised models need to rely on modeling hypotheses to capture the notion of relation.
\begin{marginparagraph}
	As a reminder, the modeling hypotheses are listed in Appendix~\ref{chap:assumptions}.
\end{marginparagraph}
While these hypotheses are not always clearly stated in articles, they are central to the design of unsupervised approaches.
For our review, we decided to exhibit the key modeling hypotheses of relevant models.
Formalizing these hypotheses gives us a clear understanding of which kinds of relations a given model cannot capture.
Furthermore, it simplifies the usually challenging task of designing an unsupervised relation extraction loss.

\paragraph{Regularizing discriminative approaches for deep encoders.}\leavevmode\null\\
In Chapter~\ref{chap:fitb}, we introduced the first unsupervised relation extraction model that does not rely on hand-engineered features.
In particular, we identified two critical weaknesses of previous discriminative models which hindered the use of deep neural networks.
These weaknesses relate to the model's output, which tends to collapse to a trivial---either deterministic or uniform---distribution.
We introduced two relation distribution losses to alleviate these problems: a skewness loss pushes the prediction away from a uniform distribution, and a distribution distance loss prevents the output from collapsing to a deterministic distribution.
This allowed us to train a \textsc{pcnn} model to group unsupervised samples into clusters conveying the same relation.
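To fix ideas, one possible instantiation of such regularizers, given here only as an illustrative sketch whose notation does not necessarily match the exact definitions of Chapter~\ref{chap:fitb}, penalizes the entropy of each individual prediction while keeping the batch-averaged prediction close to uniform:
\begin{equation*}
	\mathcal{L}_{\text{skewness}} = \operatorname{H}\big(p_\theta(r \mid s)\big)
	\qquad\text{and}\qquad
	\mathcal{L}_{\text{distance}} = \operatorname{KL}\bigg(\mathcal{U}(\mathcal{R}) \:\bigg\|\: \frac{1}{B} \sum_{i=1}^{B} p_\theta(r \mid s_i)\bigg),
\end{equation*}
where \(p_\theta(r \mid s)\) is the relation distribution predicted for a sample \(s\), \(B\) is the batch size and \(\mathcal{U}(\mathcal{R})\) is the uniform distribution over the relation set \(\mathcal{R}\).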

\paragraph{Exploiting the dataset structure using graph-based models.}\leavevmode\null\\
In Chapter~\ref{chap:graph}, we investigated aggregate approaches for unsupervised relation extraction.
We encoded the relation extraction problem as a graph labeling---or attributing---problem.
We then showed that information can be leveraged from this structure by probing distributional regularities of random paths.
To exploit this information, we drew on our experience from Chapter~\ref{chap:relation extraction} to design an assumption that leverages the structure of the graph to supervise a relation extraction model.
We then proposed an approach based on this hypothesis by modifying the Weisfeiler--Leman isomorphism test to use a 1-Wasserstein distance.
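As a schematic reminder of the two ingredients (the precise formulation is the one given in Chapter~\ref{chap:graph}), the discrete Weisfeiler--Leman iteration refines the label of a vertex from the multiset of its neighbors' labels,
\begin{equation*}
	c^{(t+1)}(v) = \operatorname{hash}\Big(c^{(t)}(v), \big\{ c^{(t)}(u) \mid u \in \mathcal{N}(v) \big\}\Big),
\end{equation*}
where the braces denote a multiset; a continuous relaxation can then compare two vertices through the 1-Wasserstein distance \(W_1(\mu_v, \mu_w)\) between the empirical distributions of their neighborhoods' representations instead of testing discrete labels for equality.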

\bigskip

From a higher vantage point, we can say that we first contributed to the development of deep learning approaches for the task of unsupervised relation extraction, and then helped open a new direction of research on aggregate approaches in the unsupervised setup using graph-based models.
Both of these lines of research were somewhat natural developments following current trends in machine learning research.

\section*{Perspectives}

\paragraph{Using language modeling for relation extraction.}
A recent trend in \textsc{nlp} has been to frame all tasks as language modeling problems.
The main embodiment of this trend is T5 \parencitex{t5}.
\begin{marginparagraph}
	The name T5 comes from ``Text-To-Text Transfer Transformer'' since it recasts every \textsc{nlp} task as a text-to-text problem.
\end{marginparagraph}
T5 is trained as a masked language model (\textsc{mlm}, Section~\ref{sec:context:mlm}) on a sizeable ``Common Crawl'' of the web.
Then, it is fine-tuned by prefixing the sequence with a task-specific prompt such as ``translate English to German:''.
Relation extraction can also be trained as a text-to-text model in the supervised setup \parencite{text_to_text_re}.
Extending this model to the unsupervised setup---for example, through the creation of pseudo-labels---could allow us to leverage the large amount of linguistic information contained in the T5 parameters.
In the same vein, \textcitex{lm_rp} propose to use predefined and learned prompts for relation prediction, for example by filling in the following template: ``Today, I finally discovered the relation between \(e_1\) and \(e_2\): \(e_1\) is the \blanktag{} of \(e_2\).''
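To make this text-to-text framing concrete, a relation extraction sample could, for instance, be linearized with a task prefix before being fed to the model; the prompt and the target verbalization below are illustrative and are not necessarily those of \textcite{text_to_text_re}:
\begin{equation*}
	\text{``relation extraction: \textsl{Tokyo} is the capital of \textsl{Japan}''} \;\longmapsto\; \text{``capital of''.}
\end{equation*}
In the unsupervised setup, pseudo-labels would take the place of the gold target strings on the right-hand side.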

More generally, relation extraction is closely related to language models.
The first model we experimented on during this Ph.D.\ candidacy was a pre-trained language model used to fill sentences such as ``The capital of Japan is \blanktag.''
While \textcitex{transformers} was already published at the time, pre-trained transformer language models were not widely available yet.
We used a basic \textsc{lstm}, which was strongly biased in favor of entities often appearing in the dataset.
In practice, the model predicted ``London'' as the capital of most small countries.
However, as we showcased in Section~\ref{sec:relation extraction:mtb}, large transformer-based models such as \textsc{bert} \parencite{bert} perform well out-of-the-box on unsupervised relation extraction.
An additional argument in favor of transformer-based language models comes from Chapter~\ref{chap:fitb}.
Indeed, the \emph{fill-in-the-blank} model seeks to predict an entity blanked in the input; this is similar to the \textsc{mlm} task.
More abstractly, language aims to describe a reality which can be understood---among other things---through the concept of relation.
And indeed, if one understands language, one must understand the relations conveyed by language.
Using a model of language as a basis for a model of relations is promising, as long as the semantic fragment of language unrelated to relations can be discarded.
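The parallel between the two objectives can be summarized as follows, with notation that is only indicative and glosses over the details of Chapter~\ref{chap:fitb}:
\begin{equation*}
	\text{\textsc{mlm}}\colon\ \max_\theta\, p_\theta(w_i \mid w_1, \dotsc, \text{\blanktag}, \dotsc, w_n)
	\qquad\text{versus}\qquad
	\text{fill-in-the-blank}\colon\ \max_\theta\, p_\theta(e_1 \mid s, e_2),
\end{equation*}
where the masked language model recovers a blanked token from its context, while the fill-in-the-blank model recovers a blanked entity from the rest of the sample, the prediction being mediated by the relation expressed in \(s\).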

\paragraph{Dataset-level modeling hypotheses.}
In the past few years, graph-based approaches have gained traction in the information extraction field \parencite{graphie, graphrel}\sidecite{graphie}, and we can only expect this interest to continue growing in the future.
While knowledge of the language should be sufficient to understand the relation underlying most samples, it is challenging to design an unsupervised loss solely relying on linguistic information.
Furthermore, following distributional linguistics, language---and thus the relations conveyed by language---is acquired through structured repetitions.
The concept of repetition captured by graph adjacency can therefore also provide a theoretical basis for the design of modeling hypotheses.
We can even argue that capturing the structure of the data constitutes an ontologically prior level of modeling.
For this reason, we think that relation graphs should provide a better basis for the formulation of modeling hypotheses.

\paragraph{Complex relations.}
Several simplifying assumptions were made to define the relation extraction task.
For example, we assume all relations to be binary, holding between exactly two entities.
However, \(n\)-ary relations are needed to model complex interrelationships.
For example, encoding the fact that ``a drug \(e_1\) can be used to treat a disease \(e_2\) when the patient has genetic mutation \(e_3\)'' necessitates a ternary relation.
This problem has been tackled for a long time \parencite{n-ary_old, n-ary_recent}.
Graph-based approaches have a natural extension to \(n\)-ary relations in the form of hypergraphs, that is, graphs whose edges can connect any number of vertices.
Since the hypergraph isomorphism problem can be polynomially reduced to the standard graph isomorphism problem \parencite{gicomplete}, we can expect \(n\)-ary extensions of graph-based relation extraction approaches to work as well as standard relation extraction.
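As an illustration, using notation that is not employed elsewhere in this thesis, the ternary fact above would simply become a labeled hyperedge linking three entity vertices,
\begin{equation*}
	\big(\{e_1, e_2, e_3\},\ r\big) \in \mathcal{H} \subseteq 2^{\mathcal{E}} \times \mathcal{R},
\end{equation*}
generalizing the binary, relation-labeled edges between entity pairs used in Chapter~\ref{chap:graph}.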

A related problem is that of fact qualification.
The fact ``Versailles \textsl{capital of} France'' only held until the 1789 revolution.
In the Wikidata parlance, these kinds of details are called \emph{qualifiers}.
In particular, the temporal qualification can be critical to certain relation extraction datasets \parencite{time_aware_re}.
Some information extraction datasets already include this information \parencite{knowledgenet}; however, little work has been done in this direction yet.
Qualifiers could be generated from representations of relations in a continuous manifold such as the one induced by a similarity space for few-shot evaluation.
However, learning to map relation embeddings to qualifiers in an unsupervised fashion might prove difficult.
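In a schematic Wikidata-style notation, given purely as an illustration, such a qualified fact could be written as \((\text{Versailles}, \textsl{capital of}, \text{France}; \textsl{end time} = 1789)\), that is, a standard triple extended with qualifier--value pairs that an extraction model would also have to predict.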