PhD

The LaTeX sources of my Ph.D. thesis
git clone https://esimon.eu/repos/PhD.git

history.tex (10925B)


\section{Historical Development}
\label{sec:context:history}
In this section, we present the rationale for applying deep learning to relation extraction, how the related fields emerged and why the task is relevant.
Since the first algorithms for training generic deep neural networks were introduced~\parencite{deepbeeliefnets,relu}, most problems tackled by machine learning can now be approached with deep learning methods.
Over the last few years, deep learning has been very successful in a variety of tasks such as image classification~\parencite{cnn_imagenet}, machine translation~\parencite{nmt_encdec} and audio synthesis~\parencite{wavenet}.
It is therefore not surprising that deep learning is now applied to further tasks traditionally tackled by other machine learning methods, such as in this thesis, where we apply it to relation extraction.

From a historical point of view, machine learning---and hence deep learning---is deeply anchored in \emph{empiricism}.
Empiricism is the epistemological paradigm in which knowledge is grounded in sensory experiences of the world, which are called empirical evidence.
This is not to say that there are no theoretical arguments motivating the use of certain machine learning methods; the universal approximation theorems~\parencite{universal_approximator_sigmoid, universal_approximator_nonpolynomial} can be seen as a theoretical argument for deep learning.
But in the end, a machine learning method draws its legitimacy from the observation that it performs strongly on real datasets.
This is in stark contrast to the rationalist paradigm, which posits that knowledge comes primarily from reason.

This strong leaning on empiricism can also be seen in \textsc{nlp}.
\textsc{nlp} comes from the \emph{externalist} approach to linguistic theorizing, which focuses its analyses on actual utterances.
One linguistic tool widely used by other schools but often avoided by externalists is elicitation through prospective questioning: ``Is this sentence grammatical?''
Externalists consider that language is acquired through distributional properties of words and other constituents;%
\sidenote{In other words, language is acquired by observing empirical co-occurrences: where words go and where they don't in actual utterances tell us where they can go and where they can't.}
they study these properties by collecting corpora of naturally occurring utterances.
The associated school of structural linguistics is part of the broader view of \emph{structuralism}, the belief that phenomena are intelligible through a concept of structure that connects them, the focus being on these interrelations rather than on each individual object.
In the case of linguistics, this view was pioneered by Ferdinand de Saussure, who stated in his Course in General Linguistics:
\begin{quote}
	\begin{epigraph}{Ferdinand de Saussure}{\citetitle{linguistique_generale}}{\cite*{linguistique_generale}}
		La langue est un système dont toutes les parties peuvent et doivent être considérées dans leur solidarité synchronique.
	\end{epigraph}
	Language is a system whose parts can and must all be considered in their synchronic%
	\sidenote{
		Saussure makes a distinction between syn\-chron\-ic---at a certain point in time---and dia\-chron\-ic---changing over time---analyses.
		This does not mean that the meaning of a word is not influenced by its history, but that this influence is entirely captured by the relations of the word with others at the present time and that conditioned on these relations, the current meaning of the word is independent of its past meaning.
	}
	solidarity.\\
	\null\hfill---
	\begin{minipage}[t]{5cm}
		Ferdinand de Saussure, \citetitle{linguistique_generale}~(\cite*{linguistique_generale})
	\end{minipage}
\end{quote}
This train of thought gave rise to \emph{distributionalism}, whose ideas are best illustrated by the distributional hypothesis stated in \textcite{distributional_hypothesis}:
\begin{spacedblock}
	\strong{Distributional Hypothesis:}
	\emph{Words that occur in similar contexts convey similar meanings.}
\end{spacedblock}
This can be pushed further by stating that a word is characterized solely by the contexts in which it appears.

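To make the distributional hypothesis concrete, the following sketch builds word vectors from co-occurrence counts on a small invented corpus and compares them with cosine similarity; the corpus and the sentence-level context window are illustrative assumptions, not data used in this thesis.
\begin{verbatim}
# Minimal sketch of the distributional hypothesis on an invented toy corpus.
from collections import Counter
from math import sqrt

corpus = [
    "the cat drinks milk",
    "the dog drinks water",
    "the cat eats fish",
    "the dog eats meat",
]

# For each word, count the other words occurring in the same sentence.
contexts = {}
for sentence in corpus:
    words = sentence.split()
    for word in words:
        contexts.setdefault(word, Counter())
        contexts[word].update(w for w in words if w != word)

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[w] * v[w] for w in u)
    norm = lambda x: sqrt(sum(c * c for c in x.values()))
    return dot / (norm(u) * norm(v))

# "cat" and "dog" share most of their contexts, so their similarity
# (0.75) is higher than that of "cat" and "milk" (about 0.61).
print(cosine(contexts["cat"], contexts["dog"]))
print(cosine(contexts["cat"], contexts["milk"]))
\end{verbatim}
Words appearing in similar contexts thus end up with similar vectors, which is the property exploited by the distributed representations discussed below.
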
On the artificial intelligence side, deep learning is usually contrasted with symbolic approaches.
The distinction originates in the way information is represented by the system.
In the symbolic approach, information is carried by strongly structured representations in which a concept is usually associated with a single entity, such as a variable in a formula or in a probabilistic graphical model.
On the other hand, deep learning uses distributed representations in which there is a many-to-many relationship between concepts and neurons: each concept is represented by many neurons, and each neuron represents many concepts.
The idea that mental phenomena can be represented using this paradigm is known as \emph{connectionism}.
One particular argument in favor of connectionism is the ability to degrade gracefully: deleting a unit in a symbolic representation amounts to deleting a concept, while deleting a unit in a distributed representation merely lowers the precision with which concepts are defined.
Note that connectionism is not necessarily incompatible with a symbolic theory of cognition.
Distributed representations can be seen as a low-level explanation of cognition, with symbolic representations being, from this point of view, a high-level interpretation encoded by the distributed ones.%
\sidenote[][-3cm]{
	This view on the relation between distributed and symbolic representations already appears in the early neural network literature, for example in \textcite{concept_backprop}, which is often cited for its formalization of the backpropagation algorithm.
	More recently, \textcite{binding_symbolic} investigate the binding problem between symbols and distributed representations.
}

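The graceful degradation argument can be illustrated with a small, purely hypothetical sketch: removing the single unit carrying a concept in a symbolic representation deletes the concept entirely, whereas zeroing one coordinate of a distributed vector only partially degrades its similarity to related vectors; the vectors below are made up for illustration.
\begin{verbatim}
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = lambda x: math.sqrt(sum(a * a for a in x))
    return dot / (norm(u) * norm(v))

# Symbolic representation: one unit per concept.
symbolic = {"capital_of": ("Paris", "France")}
del symbolic["capital_of"]          # the concept disappears entirely
print(symbolic.get("capital_of"))   # None: nothing can be recovered

# Distributed representation: a concept is spread over many units.
paris  = [0.8, 0.1, 0.7, 0.3, 0.5, 0.2, 0.6, 0.4]
france = [0.7, 0.2, 0.6, 0.4, 0.5, 0.3, 0.5, 0.5]
print(cosine(paris, france))        # about 0.98

damaged = list(paris)
damaged[2] = 0.0                    # delete one unit
print(cosine(damaged, france))      # about 0.88: degraded, not destroyed
\end{verbatim}
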
Furthermore, we can make a distinction based on how structured the data is.
In this thesis, we will especially focus on the relationship between unstructured text%
\sidenote{
	Of course, language does have a structure.
	We do not deny the existence of grammar but merely state that text is less structured than the other kinds of data studied in this chapter (see Section~\ref{sec:context:knowledge base}).
}
and structured data (in the form of knowledge bases).
To give a sense of this difference, compare the following text from the Paris Wikipedia page to facts from the Wikidata knowledge base:
\begin{spacedblock}
\null\hfill%
\begin{minipage}{0.425\textwidth}
	Paris is the capital and most populous city of France.
	The City of Paris is the centre and seat of government of the region and province of Île-de-France.
\end{minipage}%
\hfill%
\begin{minipage}{0.425\textwidth}
	\begin{itemize}[label={},leftmargin=0mm]
		\item Paris \textsl{capital of} France
		\item Paris \textsl{located in the administrative territorial entity} Île-de-France
	\end{itemize}
\end{minipage}%
\hfill\null%
\end{spacedblock}
\begin{marginparagraph}[-1cm]
	We use \textsl{slanted text} to indicate a relational surface form such as ``\textsl{capital of}'' in the fact ``Paris \textsl{capital of} France.''
\end{marginparagraph}%

Through this example, we see that both natural languages and knowledge bases encode meaning.
To talk about what they encode, we assume the existence of a semantic space containing all possible meanings.
We do not assume any particular theory of meaning to define this space; this allows us to remain neutral on whether language is ontologically prior to propositional attitudes, and on its link with reality or with semantically evaluable mental states.
In the same way that different natural languages are different means of addressing this semantic space, knowledge bases seek to refer to the same semantic space%
\sidenote{
	Strictly speaking, practical knowledge bases only seek to index a subset of this space, see note~\ref{note:context:knowledge vs meaning} in the margin of page \pageref{note:context:knowledge vs meaning}.
}
with an extremely rigid grammar.

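As a minimal, hypothetical sketch of this rigid grammar, the two facts from the example above can be stored as (subject, relation, object) triples; the string identifiers below follow the example rather than actual Wikidata identifiers.
\begin{verbatim}
# Hypothetical sketch: knowledge base facts as (subject, relation, object) triples.
# Real knowledge bases such as Wikidata use identifiers (e.g. Q90, P36), not strings.
facts = {
    ("Paris", "capital of", "France"),
    ("Paris", "located in the administrative territorial entity", "Île-de-France"),
}

# The rigid structure makes querying trivial compared to free text.
def objects(subject, relation):
    return {o for s, r, o in facts if s == subject and r == relation}

print(objects("Paris", "capital of"))  # {'France'}
\end{verbatim}
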
Both natural language and knowledge bases are discrete systems.
For both of these systems, we can use the distributional hypothesis to obtain continuous distributed representations.
These representations aim to capture semantics in a simple topological space, such as a Euclidean vector space in which distance encodes dissimilarity, as shown in Figure~\ref{fig:context:word2vec pca}.
Moreover, using a differentiable manifold allows us to train these representations through backpropagation using neural architectures.

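As a toy illustration of this last point (not one of the architectures studied later in this thesis), the following sketch treats two-dimensional embeddings as free parameters and uses a few hand-written gradient steps, standing in for backpropagation, to pull together the vectors of words assumed to occur in similar contexts, so that Euclidean distance comes to encode dissimilarity; the vocabulary, word pairs and hyperparameters are all invented for the example.
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)
vocab = ["cat", "dog", "milk", "water"]
emb = {w: rng.normal(size=2) for w in vocab}

# Pairs of words assumed to occur in similar / dissimilar contexts.
positive = [("cat", "dog"), ("milk", "water")]
negative = [("cat", "water"), ("dog", "milk")]

lr = 0.1
for _ in range(200):
    for a, b in positive:                  # pull similar words together
        grad = 2 * (emb[a] - emb[b])       # gradient of ||e_a - e_b||^2
        emb[a] -= lr * grad
        emb[b] += lr * grad
    for a, b in negative:                  # push dissimilar words apart
        if np.linalg.norm(emb[a] - emb[b]) < 2.0:
            grad = 2 * (emb[a] - emb[b])
            emb[a] += lr * grad
            emb[b] -= lr * grad

# Distance now encodes dissimilarity: "cat" ends up much closer to "dog"
# than to "water".
print(np.linalg.norm(emb["cat"] - emb["dog"]))
print(np.linalg.norm(emb["cat"] - emb["water"]))
\end{verbatim}
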
The question of how to process texts algorithmically has evolved over the last fifty years.
Language being conveyed through symbolic representations, it is quite natural for us to manipulate them.
As such, early machine learning models strongly relied on them.
For a long time, symbolic approaches had an empirical advantage: they worked better.
However, in the last few years, distributed representations have achieved unmatched results, and most tasks are now tackled with deep learning using distributed representations.
\begin{marginparagraph}
	This transition from rule-based models to statistical models to neural network models can also be seen in relation extraction with Hearst (\cite*{hearst_hyponyms}, symbolic rule-based, Section~\ref{sec:relation extraction:bootstrap}), \textsc{sift} (\cite*{sift}, symbolic statistical, Section~\ref{sec:relation extraction:hand-designed features}) and \textsc{pcnn} (\cite*{pcnn}, distributed neural, Section~\ref{sec:relation extraction:pcnn}).
\end{marginparagraph}
As an example, this can be seen in the machine translation task.
Early models from the 1950s onward were rule-based.
Starting in the 1990s, statistical approaches were used, first relying on statistics of words, then of phrases.
Looking at the Workshop on Statistical Machine Translation (\textsc{wmt}): at the beginning of the last decade, no neural approaches were used and the report~\parencite{wmt2010} deplored the disappearance of rule-based systems; by the end of the decade, most systems were based on distributed representations~\parencite{wmt2020}.%
\sidenote{To be more precise, most models use transformers, a kind of neural network introduced in Section~\ref{sec:context:transformers}.}
While this transition occurred in \textsc{nlp}, knowledge representation remained a stronghold of symbolic approaches until very recently.
The research reported in this thesis aims to develop the distributed approach to knowledge representation for the task of relation extraction.
In the remainder of this chapter, we first describe the distributed approaches to \textsc{nlp}, which have produced state-of-the-art results over the last decade, before presenting a structured symbolic representation, knowledge bases, and some methods to obtain distributed representations from them.