\chapter{Introduction}
\begin{onehalfspace}
The world is endowed with a structure, which enables us to understand it.
This structure is most apparent through repetitions of sensory experiences.
Sometimes, we can see a cat, then another cat.
Entities emerge from the repetitions of catness we experience.
From time to time, we can also observe a cat \textsl{inside} a cardboard box or a person \textsl{inside} a room.
\begin{marginparagraph}
Relations---albeit in a more restrictive sense---are one of Aristotle's ten \emph{praedicamenta}, the categories of objects of human apprehension \parencite{sep_medieval_categories}.
\end{marginparagraph}
Relations are the explanatory device underlying this second kind of repetition.
A relation governs an interaction between two or more objects.
We assume an \textsl{inside} relation exists because we repeatedly experienced the same interaction between a container and its content.
The twentieth century saw the rise of structuralism, which regarded the interrelations of phenomena as more enlightening than the study of phenomena in isolation.
In other words, we might better understand what a cat is by studying its relationships to other entities rather than by listing the characteristics of catness.
From this point of view, the concept of relation is crucial to our understanding of the world.

\begin{marginparagraph}
\includegraphics[width=\marginparwidth]{frontmatter/Cheshire Cat.png}
The Cheshire Cat from \textcite{cat} provides you with an experience of catness.
\end{marginparagraph}

Natural languages capture the underlying structure of these repetitions through a process we do not fully understand.
One of the endeavors of artificial intelligence, called natural-language understanding, is to mimic this process with definite algorithms.
Since this goal remains elusive, we strive to model only parts of the process.
This thesis, following the structuralist perspective, focuses on extracting relations conveyed by natural language.
Assuming natural language is representative of the underlying structure of sensory experiences,%
\sidenote{
The repetitions of sensory experiences and of words need not be alike.
We are only concerned with the possibility of resolving references here.
Even though our experiences of trees are more often than not accompanied by experiences of bark, the words ``tree'' and ``bark'' do not co-occur as often in natural language utterances.
However, their meronymic relationship is understandable both through experiences of trees and, inter alia, through the use of the preposition ``of'' in textual mentions of bark.
}
we should be able to capture relations through the exploitation of repetitions alone---i.e.\ in an unsupervised fashion.

Extracting relations can improve our understanding of how languages work.
For example, whether language can be acquired from a small amount of data is still a somewhat open question in linguistics.
The poverty-of-the-stimulus argument states that the linguistic data children are exposed to is too scarce for them to acquire their proficiency without innate predispositions.
It is one of the major arguments in favor of the controversial universal grammar theory.
Capturing relations from nothing more than a small number of natural language utterances would be a step towards disproving the poverty-of-the-stimulus claim.
This kind of incentive for tackling the relation extraction problem stems from an \emph{episteme}%
\sidenote{From the Ancient Greek \foreignlanguage{greek}{ἐπιστήμη}: knowledge, know-how.}
endeavor.
However, most of the traction for this problem stems from a \emph{techne}%
\sidenote{From the Ancient Greek \foreignlanguage{greek}{τέχνη}: craft, art.}
undertaking.
The end goal is to build a system with real-world applications.
From this perspective, the point of artificial intelligence is to replace or assist humans on specific tasks.
Most tasks of interest necessitate some form of technical knowledge (e.g.\ diagnosing a disease requires knowledge of the relationship between symptoms and diseases).
The principal vector of knowledge is language (e.g.\ through education).
Thus, knowledge acquisition from natural language is fundamental for systems intended for such applications.

For an analysis of the real-world impact of systems extracting knowledge from text, refer to \textcitex{assisted_curation}.
Their article shows that human curators can use a machine learning system to better extract a set of protein--protein interactions from the biomedical literature.
This is clearly a \emph{techne} endeavor: the protein--protein interactions are not new knowledge since they are already published; however, the system improves the work of the human operator.

\begin{epigraph}
{Willard Van Orman Quine}
{\citetitle{quine_two_dogma}}
{\cite*{quine_two_dogma}}
[][-26mm]
Once the theory of meaning is sharply separated from the theory of reference, it is a short step to recognizing as the business of the theory of meaning simply the synonymy of linguistic forms and the analyticity of statements; meanings themselves, as obscure intermediary entities, may well be abandoned.
\end{epigraph}

This example of application is revealing of the larger problem of information explosion.
The quantity of published information has grown relentlessly throughout the last decades.
Machine learning can be used to filter or aggregate this large amount of data.
In this case, the object of interest is not the text in itself but the semantics it conveys, its meaning.
This raises the question: how do we define the meaning we seek to process?
Indeed, foundational theories of meaning are the object of much discussion in the philosophy community \parencite{sep_meaning}.
While some skeptics, like Quine, do not recognize meaning as a concept of interest, they reckon that a minimal description of meaning should at least encompass the recognition of synonymy.
This follows from the above discussion about the recognition of repetitions: if \input{frontmatter/gavagai 1.tex} is a repetition of \input{frontmatter/gavagai 2.tex}, we should be able to say that \input{frontmatter/gavagai 1.tex} and \input{frontmatter/gavagai 2.tex} are synonymous.
In practice, this implies that we ought to be able to extract classes of linguistic forms with the same meaning or referent---the difference between the two is not relevant to our problem.

\begin{marginparagraph}[-47mm]
\includegraphics[width=\marginparwidth]{frontmatter/Paris Quadrifolia.jpg}
Paris (\wdent{162121}) is neither the capital of France nor a prince of Troy; it is the genus of the true lover's knot plant.
The capital of France would be Paris (\wdent{90}) and the prince of Troy, son of Priam, Paris (\wdent{167646}).
Illustration from \textcite{paris_quadrifolia}.
\label{margin:introduction:paris quadrifolia}
\end{marginparagraph}

While the above discussion of meaning is essential to define our objects of interest, relations, it is important to note that we work on language; we want to extract relations from language, not from repetitions of abstract entities.
Yet, the mapping between linguistic signifiers and their meanings is not bijective.
We can distinguish two kinds of misalignment between the two: either two expressions refer to the same object (synonymy), or the same expression refers to different objects depending on the context in which it appears (homonymy).
The first variety of misalignment is the most common one, especially at the sentence level.
For example, ``Paris is the capital of France'' and ``the capital of France is Paris'' convey the same meaning despite having different written and spoken forms.
On the other hand, the second kind is principally visible at the word level.
For example, the preposition ``from'' in the phrases ``retinopathy from diabetes'' and ``Bellerophon from Corinth'' conveys either a \textsl{has effect} relationship or a \textsl{birthplace} one.
To distinguish these two uses of ``from,'' we can use relation identifiers such as \wdrel{1542} for \textsl{has effect} and \wdrel{19} for \textsl{birthplace}.
An example with entity identifiers---which serve to uniquely identify entity concepts---is provided in the margin of page~\pageref{margin:introduction:paris quadrifolia}.

\begin{marginparagraph}[-2cm]
Throughout this thesis, we will be using Wikidata identifiers (\url{https://www.wikidata.org}) to index entities and relations.
Entity identifiers start with \texttt{Q}, while relation identifiers start with \texttt{P}.
For example, \wdent{35120} is an entity.
\end{marginparagraph}

While the preceding discussion makes it seem as if all objects can fit nicely into clearly defined concepts, in practice, this is far from the truth.
Early in the knowledge-representation literature, \textcite{is-a_analysis} remarked on the difficulty of clearly defining even seemingly simple relations such as \textsl{instance of} (\wdrel{31}).
This problem ensues from the assumption that synonymy is transitive and therefore induces equivalence classes.
This assumption is fairly natural since it already applies to the link between language and its referents: even though two cats might be very unlike one another, we still group them under the same signifier.
However, language is flexible.
When trying to capture the entity ``cat,'' it is not entirely clear whether we should group ``a cat with the body of a cherry pop tart'' with regular experiences of catness.%
\sidenote{The reader who would describe this as a cat is invited to replace various body parts of this imaginary cat with food items until they stop experiencing catness.}
To circumvent this issue, some recent works \parencite{fewrel} on the relation extraction problem define synonymy as a continuous, intransitive association.
Instead of grouping linguistic forms into clear-cut classes with a single meaning, they learn a similarity function quantifying how alike two objects are.
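
To make this contrast concrete, here is a minimal formal sketch; the notations $f$, $s$, and $u$ are illustrative placeholders rather than the definitions used in the remainder of this thesis.
A classification approach learns a map from a sample $x$ (a sentence together with a tagged entity pair) to a relation class:
\[ f\colon \mathcal{X} \to \mathcal{R}. \]
A similarity-based approach instead learns a function scoring how close the relations conveyed by two samples are, for instance through the cosine between learned representations:
\[ s(x_1, x_2) = \cos\big(u(x_1), u(x_2)\big), \]
where $u\colon \mathcal{X} \to \mathbb{R}^d$ maps a sample to a vector.
The cosine is merely one common instantiation; the important point is that $s$ is not required to induce equivalence classes, so no hard partition of samples into relations is presupposed.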

Now that we have conceptualized our problem, let us focus on our proposed technical approach.
First, to summarize, this thesis focuses on unsupervised relation extraction from text.%
\sidenote[][-11mm]{We use text as it is the most definite and easy-to-process rendition of language.}
Since relations are objects capturing the interactions between entities, our task is to find the relation linking two given entities in a piece of text.
For example, in the following three samples, where entities are underlined:
\begin{marginparagraph}[-11mm]
\includegraphics[width=\marginparwidth]{frontmatter/Ship of Theseus.jpg}
Ariadne waking on the shore of Naxos where she was abandoned, wall painting from Herculaneum in the collection of \textcite{ship_of_theseus}.
The ship in the distance can be identified as the ship of Theseus, for now.
Depending on the philosophical view of the reader (\wdent{1050837}), its identity as the ship of Theseus might not linger for long.
\end{marginparagraph}
\begin{indentedexample}
\uhead{Megrez} is a star in the northern circumpolar constellation of \utail{Ursa Major}.
\smallskip

\uhead{Posidonius} was a Greek philosopher, astronomer, historian, mathematician, and teacher native to \utail{Apamea, Syria}.
\smallskip

\uhead{Hipparchus} was born in \utail{Nicaea, Bithynia}, and probably died on the island of Rhodes, Greece.
\end{indentedexample}
we wish to find that the last two sentences convey the same relation---in this case, \sfTripletHolds{e_1}{born in}{e_2} (\wdrel{19})---or, at the very least, following the discussion in the preceding paragraph about the difficulty of defining clear relation classes, that the relations conveyed by the last two samples are closer to each other than to the one conveyed by the first sample.
We propound that this can be performed by machine learning algorithms.
In particular, we study how to approach this task using deep learning.
While relation extraction can be tackled as a standard supervised classification problem, labeling a dataset with precise relations is a tedious task, especially with technical documents such as the biomedical literature studied by \textcite{assisted_curation}.
Another problem commonly encountered by annotators is the question of the applicability of a relation; for example, should ``the \uhead{country}'s founding \utail{father}'' be labeled with the \textsl{product--producer} relation?%
\sidenote{
The annotator of this sample in the SemEval~2010 Task~8 dataset (Section~\ref{sec:datasets:semeval}) decided that it does convey the \textsl{product--producer} relation.
The difficulty of applying a definition is an additional argument in favor of similarity-function-based approaches over classification approaches.
}
We now discuss how deep learning became the most promising technique to tackle natural language processing problems.

The primary subject matter of the relation extraction problem is language.
Natural language processing (\textsc{nlp}) was already a prominent research interest in the early years of artificial intelligence.
From the \emph{episteme} viewpoint, this can be seen in the seminal paper of \textcitex{turing_test}.
This paper proposes mastery of language as evidence of intelligence, in what is now known as the Turing test.
Language was also a subject of interest for \emph{techne} objectives.%
\begin{epigraph}
{Leon Dostert}
{``701~translator'' \textsc{ibm} press release}
{1954}
Five, perhaps three years hence, interlingual meaning conversion by electronic process in important functional areas of several languages may well be an accomplished fact.
\end{epigraph}
In January 1954, the Georgetown--\textsc{ibm} experiment tried to demonstrate the possibility of translating Russian into English using computers \parencite{georgetown-ibm}.
The experiment showcased the translation of sixty sentences using a bilingual dictionary to translate words individually and six kinds of grammatical rules to reorder tokens as needed.
Initial experiments created a buildup of expectations, which was followed by an unavoidable disappointment, resulting in an ``\textsc{ai} winter'' during which research funding was restricted.
While translating word by word is somewhat easy in most cases, translating whole sentences is a lot harder.
Scaling up the set of grammatical rules in the Georgetown--\textsc{ibm} experiment proved impractical.
This limitation was not a technical one.
With the improvement of computing machinery, more rules could easily have been encoded.
One of the issues identified at the time was the commonsense knowledge problem \parencite{commonsense}.
In order to translate or, more generally, process a sentence, one needs to understand it in the context of the world in which it was uttered.
Simple rewriting rules cannot capture this process.%
\sidenote[][-32mm]{
Furthermore, grammar is still an active area of research.
We do not perfectly understand the underlying reality captured by most words and are thus unable to write down complete formal rules for their usage.
For example, \textcite{over_grammar} is a 43-page cognitive linguistics paper attempting to explain the various uses of the English preposition ``over.''
This is one of the arguments for unsupervised approaches: we should avoid hand-labeled datasets if we want to outperform the human annotators.
}
In order to handle whole sentences, a paradigm shift was necessary.

A first shift occurred in the 1990s with the advent of statistical \textsc{nlp} \parencite{statistical_methods}.
This evolution can be partly attributed to the increase in computational power, but also to the progressive abandonment of essentialist linguistics precepts%
\sidenote{
Noam Chomsky, one of the most---if not the most---prominent essentialist linguists, considers that manipulating probabilities of text excerpts is not the way to acquire a better understanding of language.
Following the success of statistical approaches, he only recognized statistical \textsc{nlp} as a \emph{techne} achievement.
For an answer to this position, see \textcite{statistical_methods, norvig_chomsky}.
}
in favor of distributionalist ones.
Instead of relying on human experts to input a set of rules, statistical approaches leveraged the repetitions in large text corpora to infer these rules automatically.
Therefore, this progression can also be seen as a transition away from symbolic artificial intelligence models and towards statistical ones.
Coincidentally, the relation extraction task was formalized at this time.
While the earliest approaches were based on symbolic models using handwritten rules, statistical methods quickly became the norm after the 1990s.
However, statistical \textsc{nlp} models still relied on linguistic knowledge.
Relation extraction systems were usually split into a first phase of hand-specified linguistic feature extraction and a second phase in which a relation was predicted from these features using shallow statistical models.

\tatefix{3mm}{5mm}{6mm}
\begin{cjkepigraph}[\traditionalChinese]{45mm}
{\begin{epigraphcontent}[35mm]
{}
{``Gongsun Longzi'' Chapter~2}
{circa~300~\textsc{bce}}
White horse is not horse.
\end{epigraphcontent}}
[%
A well-known paradox in early Chinese philosophy illustrating the difficulty of clearly defining the meaning conveyed by natural languages.
This paradox can be resolved by disambiguating the word ``horse.''
Does it refer to the ``whole of all horse kind'' (the mereological view) or to ``horseness'' (the Platonic view)?
The mereological interpretation was famously---and controversially---introduced by \textcite{hansen_mass_noun_hypothesis}; see \textcite{chinese_ontology} for a discussion of early Chinese ontological views of language.
]
白馬非馬
\end{cjkepigraph}

A second shift occurred in the 2010s when deep learning approaches erased the split between feature extraction and prediction.
Deep learning models are trained to directly process raw data, in our case text excerpts.
To achieve this feat, neural networks able to approximate any function are used.
However, the downside of these models is that they usually require large amounts of labeled data to be trained.
This is a particularly salient problem throughout this thesis since we deal with an unsupervised problem.
As the latest and most effective technique available, deep learning proved to be a natural choice to tackle relation extraction.
However, this natural evolution came with serious complications that we try to address in this manuscript.

\begin{marginparagraph}
\includegraphics[width=\marginparwidth]{frontmatter/OuCuiPo.jpg}
Frontispiece of the OuCuiPian Library by \textcite{oucuipo}.
A different kind of cooking with letters.
\end{marginparagraph}

The evolution of unsupervised relation extraction methods closely follows that of the \textsc{nlp} methods described above.
The first deep learning approach was that of \textcite{vae_re}.
However, only part of their model relied on deep learning techniques; the extraction of features was still done manually.
The reason why feature extraction could not be done automatically, as is standard in deep learning approaches, is closely related to the unsupervised nature of the problem.
Our first contribution is a technique enabling the training of fully deep relation extraction approaches in an unsupervised fashion.
Afterward, different ways to tackle the relation extraction task emerged.
First, recent approaches use a softer definition of relations by extracting a similarity function instead of a classifier.
Second, they exploit a broader context: instead of processing each sentence individually, they take into account the global consistency of the extracted relations.
However, this second development was mostly confined to the supervised setting, with limited use in the unsupervised one.
Our second contribution concerns the use of this broader context for unsupervised relation extraction, in particular for approaches defining a similarity function.
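
To give a rough idea of what exploiting this broader context can mean, consider the following deliberately simplified sketch, which is not an objective actually optimized in this thesis: we could encourage two samples mentioning the same entity pair to convey similar relations,
\[ \mathcal{L} = - \sum_{\substack{x_i, x_j \\ e(x_i) = e(x_j)}} s(x_i, x_j), \]
where $e(x)$ denotes the entity pair tagged in sample $x$ and $s$ is a similarity function as sketched above.
Taken alone, this objective is degenerate, since it is maximized by declaring all such samples similar; it would need to be balanced by a term pushing unrelated samples apart.
It nevertheless conveys the intuition that relational information can propagate between sentences through the entities they share.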

During the preparation of this thesis, we also published an article on multimodal semantic role labeling with Syrielle Montariol and her team \parencite{mmsrl}; since it is somewhat unrelated to unsupervised relation extraction, we do not include it in this thesis.
\begin{marginparagraph}
Syrielle Montariol,\textsuperscript{*} Étienne Simon,\textsuperscript{*} Arij Riabi, Djamé Seddah. \citefield{mmsrl}[linkedtitle]{title} \citefield{mmsrl}{shortseries}~\cite*{mmsrl}

\raggedleft\scriptsize\textsuperscript{*}\,Equal contributions
\end{marginparagraph}

We now describe the organization of the thesis.
Chapter~\ref{chap:context} provides the necessary background for using deep learning to tackle the relation extraction problem.
In particular, we focus on the concept of distributed representation, first of language, then of entities and relations.
Chapter~\ref{chap:relation extraction} formalizes the relation extraction task and presents the evaluation framework and relevant related works.
This chapter focuses first on supervised relation extraction using local information only, then on aggregate extraction, which exploits repetitions more directly, before delving into unsupervised relation extraction.
In Chapter~\ref{chap:fitb}, we propose a solution to train deep relation extraction models in an unsupervised fashion.
The problem we tackle is a stability problem between a powerful universal approximator and a weak supervision signal transpiring through the repetitions in the data.
This chapter was the object of a publication at \textsc{acl} \parencite{fitb}.
\begin{marginparagraph}
\hbadness=8000% Can't do better… :'(
Étienne Simon, Vincent Guigue, Benjamin Piwowarski. \citefield{fitb}[linkedtitle]{title} \citefield{fitb}{shortseries}~\cite*{fitb}
\end{marginparagraph}
Chapter~\ref{chap:graph} explores methods to exploit the structure of the data more directly through the use of graph-based models.
\begin{marginparagraph}
The work presented in Chapter~\ref{chap:graph} still needs to be polished with more experimental work and is yet unpublished at the time of writing.
\end{marginparagraph}
In particular, we draw parallels with the Weisfeiler--Leman isomorphism test to design new methods that use topological (dataset-level) and linguistic (sentence-level) features jointly.
Appendix~\ref{chap:french} contains the state-mandated thesis summary in French.
The other appendices provide valuable information that can be used for reference.
We strongly encourage the reader to refer to them for additional details on the datasets (Appendix~\ref{chap:datasets}), but even more so for the list of assumptions made by relation extraction models (Appendix~\ref{chap:assumptions}).
These modeling hypotheses are central to the design of unsupervised approaches.
In addition to their definitions and references to the sections introducing them, Appendix~\ref{chap:assumptions} provides counterexamples, which might help the reader understand the nature of these assumptions.
\end{onehalfspace}