\begin{marginparagraph}
	\citationBadness
	This chapter is an adaptation of an article published at \textsc{acl} with some supplementary results:\\
	\fullcite{fitb}
\end{marginparagraph}
All the works presented thus far follow the same underlying dynamic.
There is a movement away from symbolic representations toward distributed ones, as well as a movement away from shallow models toward deeper ones.
This can be seen in word, sentence and knowledge base representations (Chapter~\ref{chap:context}), as well as in relation extraction (Chapter~\ref{chap:relation extraction}).
As we discussed in Chapter~\ref{chap:relation extraction}, a considerable amount of work has been conducted on supervised or weakly-supervised relation extraction (Sections~\ref{sec:relation extraction:sentential} and~\ref{sec:relation extraction:aggregate}), with recent state-of-the-art models using deep neural networks (Section~\ref{sec:relation extraction:pcnn}).
However, human annotation of text with knowledge base triplets is expensive and quickly becomes impractical when the number of relations is large.
Weakly-supervised methods such as distant supervision (Section~\ref{sec:relation extraction:distant supervision}) are also restricted to a handcrafted relation domain.
Going further, purely unsupervised relation extraction methods working on raw text, without any access to a knowledge base, have been developed (Section~\ref{sec:relation extraction:unsupervised}).

The first unsupervised models used a clustering (Section~\ref{sec:relation extraction:hasegawa}) or generative (Section~\ref{sec:relation extraction:rellda}) approach.
The latter, which obtained state-of-the-art performance, still makes several strong simplifying hypotheses, such as \hypothesis{biclique}, which assumes that the two entities are conditionally independent of each other given the relation.
We posit that discriminative approaches can further improve expressiveness, especially considering recent results obtained with neural network models.
The open question then becomes how to provide a sufficient learning signal to the classifier.
The \textsc{vae} model of \textcite{vae_re} introduced in Section~\ref{sec:relation extraction:vae} followed this path by leveraging representation learning for modeling knowledge bases and proposed an auto-encoder model: its encoder extracts a relation from a sentence, which its decoder then uses to predict a missing entity.
However, this encoder is still limited compared to its supervised counterparts (e.g.~\textsc{pcnn}) and relies on handcrafted features extracted by natural language processing tools (Section~\ref{sec:relation extraction:hand-designed features}).
These features tend to contain errors and prevent the discovery of new patterns, which might hinder performance.

While the transition to deep learning approaches can bring more expressive models to the task, it also raises new problems.
This chapter tackles a problem specific to unsupervised discriminative relation extraction models.
In particular, we focus on the \textsc{vae} model of Section~\ref{sec:relation extraction:vae}.
These models tend to be hard to train because of the way \hypothesis{uniform} is enforced, that is, how we ensure that all relations are conveyed equally often.%
\sidenote{This problem can however be generalized to how we ensure that all relations are conveyed reasonably often.}
To tackle this issue, we propose two new regularizing losses on the distribution of relations.
With these, we hope to leverage the expressivity of discriminative approaches, in particular of deep neural network classifiers, while staying in an unsupervised setting.
Indeed, these models are hard to train without supervision, and the solutions proposed at the time were unstable: discriminative approaches have less inductive bias, which makes them more sensitive to noise.

Our initial experiments confirmed that the \textsc{vae} relation extraction model was unstable, especially when using a deep neural network relation classifier.
Depending on the hyperparameter settings, it converges to one of two degenerate regimes: either it always predicts the same relation, or it predicts a uniform distribution over relations for every sentence.
To overcome these limitations, we propose two new losses to be used alongside an entity prediction loss based on a fill-in-the-blank task, and we show experimentally that they are key to training deep neural network models.
Our contributions are the following:
\begin{itemize}
	\item We propose two RelDist losses: a skewness loss, which encourages the classifier to predict a class with confidence for a single sentence, and a distribution distance loss, which encourages the classifier to scatter a set of sentences into different classes (an informal sketch is given after this list);
	\item We perform extensive experiments on the standard \nytfb{} dataset, as well as on two new datasets;
	\item We show that our RelDist losses allow us to train a deep \textsc{pcnn} classifier and improve the performance of feature-based models.
\end{itemize}
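
To fix ideas, these two losses can for instance be instantiated from the output distribution \(q(r \mid s)\) of the relation classifier over a set of relation classes \(\mathcal{R}\); the notation here is purely illustrative:
\begin{align*}
	\mathcal{L}_{\text{skewness}} & = \operatorname{E}_{s}\left[\operatorname{H}\left(q(r \mid s)\right)\right], \\
	\mathcal{L}_{\text{distance}} & = \operatorname{KL}\left(\operatorname{E}_{s}\left[q(r \mid s)\right] \;\middle\|\; \mathcal{U}(\mathcal{R})\right),
\end{align*}
where \(\operatorname{H}\) denotes the entropy and \(\mathcal{U}(\mathcal{R})\) the uniform distribution over relations.
Minimizing the first term pushes each individual sentence toward a confident, low-entropy relation assignment, while minimizing the second pushes the relation distribution aggregated over a set of sentences toward uniformity, preventing a collapse onto a single relation.
The precise formulation we use is given in Section~\ref{sec:fitb:model}.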

In this chapter, we first describe our model in Section~\ref{sec:fitb:model} before revisiting the related work pertinent to the experimental setup in Section~\ref{sec:fitb:related work}.
We present our main experimental results in Section~\ref{sec:fitb:experiments} before studying some possible improvements we considered in Section~\ref{sec:fitb:variants}.