\chapter{Datasets}
\label{chap:datasets}
In this appendix, we present the primary datasets used throughout this thesis.
Each section corresponds to a dataset or group of datasets.
We focus on the peculiarities that make each dataset unique and provide some statistics relevant to our task.

\section{\textsc{ace}}
\label{sec:datasets:ace}
Automatic content extraction (\textsc{ace}) is a \textsc{nist} program that developed several datasets for the evaluation of entity chunking and relation extraction.
It is the spiritual successor of \textsc{muc} (Section~\ref{sec:datasets:muc}).
In their nomenclature, the task of relation extraction is called relation detection and categorization (\textsc{rdc}).
Datasets for relation extraction were released yearly between 2002 and 2005.%
\sidenote{
The dataset from September~2002 is called \textsc{ace-2}.
This refers to the ``second phase'' of \textsc{ace}.
The pilot and first phase corpora only dealt with entity detection.
}
This makes comparison difficult; for example, in Chapter~\ref{chap:relation extraction}, we mention an \textsc{ace} dataset for several models (Sections~\ref{sec:relation extraction:hand-designed features}, \ref{sec:relation extraction:kernel}, \ref{sec:relation extraction:label propagation} and~\ref{sec:relation extraction:epgnn}); however, the versions of the dataset differ.

A peculiarity of the \textsc{ace} dataset is its hierarchy of relations.
For example, the \textsc{ace-2003} dataset contains a \textsl{social} relation type, which is divided into several relation subtypes such as \textsl{grandparent} and \textsl{sibling}.
Results can be reported either on the relation types or subtypes, usually using an \fone{} measure or a custom metric designed by \textsc{ace} \parencitex{ace_evaluation} to handle directionality and the ``\textsl{other}'' relation (Section~\ref{sec:relation extraction:other}).

\section{FewRel}
\label{sec:datasets:fewrel}
FewRel \parencitex{fewrel} is a few-shot relation extraction dataset.
Given a query and several candidates, the model must decide which candidate conveys the relation closest to the one conveyed by the query.
Therefore, FewRel is used to evaluate continuous relation representations; it is not typically used to evaluate a clustering model.
For details on the few-shot setup, refer to Section~\ref{sec:relation extraction:few-shot}.

The dataset was first constructed by aligning Wikipedia with Wikidata (Section~\ref{sec:datasets:wikidata}) using distant supervision (Section~\ref{sec:relation extraction:distant supervision}).
Human annotators then hand-labeled the samples.
The resulting dataset is perfectly balanced: all relations are represented by precisely 700 samples.
The set of the 100 most common relations with good inter-annotator agreement was then divided into three splits, whose sizes are given in Table~\ref{tab:datasets:fewrel}.
Since common relations were strongly undersampled to obtain a balanced dataset, individual entities rarely appear in more than a few samples.
The attributed multigraph (Section~\ref{sec:graph:encoding}) corresponding to the train split of FewRel is composed of several connected components.
The largest one covers approximately 21\% of the vertices, while more than half of all vertices lie in connected components of size three or less.
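These statistics can be computed along the following lines.
The sketch below is only illustrative, not the exact script we used: it assumes the layout of the public train split (a file named \texttt{train\_wiki.json} in the official release), in which samples are grouped by relation and each sample stores its head and tail entities as a (surface form, Wikidata identifier, mention positions) triple, and it relies on the \texttt{networkx} library.
\begin{verbatim}
import json

import networkx as nx

# Build the entity graph of the FewRel train split: one vertex per
# Wikidata entity, one edge per sample, labelled with its relation.
with open("train_wiki.json") as f:       # assumed name of the train split
    samples_by_relation = json.load(f)   # {relation id: [samples]}

graph = nx.MultiDiGraph()
for relation, samples in samples_by_relation.items():
    for sample in samples:
        head, tail = sample["h"][1], sample["t"][1]  # Wikidata identifiers
        graph.add_edge(head, tail, relation=relation)

# Sizes of the connected components, ignoring edge direction.
components = sorted(nx.weakly_connected_components(graph),
                    key=len, reverse=True)
largest = len(components[0]) / graph.number_of_nodes()
small = sum(len(c) for c in components if len(c) <= 3) / graph.number_of_nodes()
print(f"largest component: {largest:.0%} of vertices")
print(f"components of size three or less: {small:.0%} of vertices")
\end{verbatim}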

\begin{margintable}
\centering
\input{backmatter/datasets/fewrel.tex}
\scaption[Statistics of the FewRel dataset.]{
Statistics of the FewRel dataset.
The test relations and samples are not publicly available.
\label{tab:datasets:fewrel}
}
\end{margintable}

FewRel can be used for \(n\)-way \(k\)-shot evaluation, where usually \(n\in\{5,10\}\) and \(k\in\{1,5\}\).
For reference, \textcite{fewrel} report human performance on 5-way 1-shot (92.22\% accuracy) and 10-way 1-shot (85.88\% accuracy).

A subsequent dataset released by the same team, called FewRel~2.0 \parencitex{fewrel2}, revisited the task by adding two variations:
\begin{description}[nosep]
\item[Domain adaptation,] the training set of the original FewRel is used (Wikipedia--Wikidata), but the model is evaluated on biomedical literature (PubMed--\textsc{umls}) containing relations such as \textsl{may treat} and \textsl{manifestation of}.
\item[Detecting the \textsl{other} relation,] also called none-of-the-above, when the relation conveyed by the query does not appear among the candidates.
\end{description}
While domain adaptation is an interesting problem, for unsupervised approaches, the detection of \textsl{other} seems to defeat the point of modeling a similarity space instead of clustering relations.
Furthermore, we only use FewRel as an evaluation tool and never train on it; using this second dataset therefore made little sense.

\section{Freebase}
\label{sec:datasets:freebase}
Freebase \parencitex{freebase} is a knowledge base (Section~\ref{sec:context:knowledge base}) started in~2007 and discontinued in~2016.
\begin{margintable}
\centering
\input{backmatter/datasets/freebase.tex}
\scaption[Statistics of the Freebase knowledge base.]{
Statistics of the Freebase knowledge base at the time of its termination.
Most relations (around 81\%) appear only once in the knowledge base.
\label{tab:datasets:freebase}
}
\end{margintable}
As one of the first widely available knowledge bases containing general knowledge, Freebase was commonly used for weak supervision.
In particular, it is the knowledge base used in the original distant supervision article \parencite{distant}.
Freebase was a collaborative knowledge base; as such, its content evolved throughout its existence.
Therefore, even though \textcite{distant}, \textcite{rellda} and \textcite{vae_re} all run experiments on Freebase, their results are not comparable since they use different versions of the dataset.
Data dumps are still provided by \textcite{freebase_data}; however, most of the facts were transferred to the Wikidata knowledge base (Section~\ref{sec:datasets:wikidata}).
Some statistics about the latest version of Freebase are provided in Table~\ref{tab:datasets:freebase}.
However, note that most relations in Freebase are scarcely used; only 6\,760 relations appear in more than 100 facts.
Furthermore, the concept of entity is quite broad in Freebase; in particular, it makes use of a concept called mediator \parencite{freebase_processing}:
\begin{indentedexample}
\texttt{/m/02mjmr} \textsl{/topic/notable\_for} \textcolor{Dark2-B}{\texttt{/g/125920}}\\
\textcolor{Dark2-B}{\texttt{/g/125920}} \textsl{/c…/notable\_for/object} \texttt{/gov…/us\_president}\\
\textcolor{Dark2-B}{\texttt{/g/125920}} \textsl{/c…/notable\_for/predicate} \texttt{/type/object/type}
\end{indentedexample}
Here \texttt{/m/02mjmr} refers to ``Barack Obama,'' while \texttt{/g/125920} is the mediator entity used to group together several statements about \texttt{/m/02mjmr}.

\section{\textsc{muc-7 tr}}
\label{sec:datasets:muc}
The message understanding conferences (\textsc{muc}) were organized by \textsc{darpa} in the 1980s and 1990s.
The seventh---and last---conference \parencitex{muc7} introduced a relation extraction task called ``template relation'' (\textsc{tr}).
Three relations needed to be extracted: \textsl{employee of}, \textsl{location of} and \textsl{product of}.
Both the train set and the evaluation set contained 100 articles.
The task was still very much in the ``template filling'' mindset, as can be seen in the following example of an extracted fact:
\begin{indentedexample}
\texttt{<\textsc{employee\_of}-9602040136-5> :=}\\
\null\qquad\texttt{\textsc{person}: <\textsc{entity}-9602040136-11>}\\
\null\qquad\texttt{\textsc{organization}: <\textsc{entity}-9602040136-1>}

\medskip

\texttt{<\textsc{entity}-9602040136-11> :=}\\
\null\qquad\texttt{\textsc{ent\_name}: "Dennis Gillespie"}\\
\null\qquad\texttt{\textsc{ent\_type}: \textsc{person}}\\
\null\qquad\texttt{\textsc{ent\_descriptor}: "Capt."}\\
\null\qquad\texttt{/ "the commander of Carrier Air Wing 11"}\\
\null\qquad\texttt{\textsc{ent\_category}: \textsc{per\_mil}}

\medskip

\texttt{<\textsc{entity}-9602040136-1> :=}\\
\null\qquad\texttt{\textsc{ent\_name}: "\textsc{navy}"}\\
\null\qquad\texttt{\textsc{ent\_type}: \textsc{organization}}\\
\null\qquad\texttt{\textsc{ent\_category}: \textsc{org\_govt}}
\end{indentedexample}

\section{New York Times}
\label{sec:datasets:nyt}
The New York Times Annotated Corpus (\textsc{nyt}, \citex{nyt}) was widely used for relation extraction.
The full dataset contains 1.8 million articles from 1987 to 2007; however, smaller---and sadly, different---subsets are in use.
The subset we use in Chapter~\ref{chap:fitb} was first extracted by \textcitex{vae_re} and is supposed to be similar---but not identical---to that of \textcite{rellda}.
This \textsc{nyt} subset only contains articles from 2000 to 2007 from which ``noisy documents'' were filtered out.
Semi-structured information such as tables and lists was also removed.
The version of the dataset we received from Diego Marcheggiani was already preprocessed, with the features listed in Section~\ref{sec:fitb:baselines} already extracted.

The original dataset can be obtained from the following website:
\begin{center}
\url{https://catalog.ldc.upenn.edu/LDC2008T19}
\end{center}
At the time of writing, once the license fee is paid, the only way to obtain the subset of \textcite{vae_re} used in Chapter~\ref{chap:fitb} is through someone with access to this specific subset.
This burdensome---and expensive---procedure is one of the reasons we introduced \textsc{t-re}x-based alternatives in Chapter~\ref{chap:fitb}.

\section{SemEval 2010 Task 8}
\label{sec:datasets:semeval}
SemEval is the international workshop on semantic evaluation, which was started in~1998 (then called Senseval) with the goal of emulating the message understanding conferences (Section~\ref{sec:datasets:muc}).
In~2010, eighteen different tasks were evaluated.
Task number~8 was relation extraction.
SemEval~2010 Task~8 \parencitex{semeval2010task8} therefore refers to the dataset provided at the time of this challenge.
It is a supervised relation extraction dataset without entity linking and with non-unique entity reference (Section~\ref{sec:relation extraction:entity}).
Its statistics are listed in Table~\ref{tab:datasets:semeval}.
\begin{margintable}
\input{backmatter/datasets/semeval.tex}
\scaption[Statistics of the SemEval~2010 Task~8 dataset.]{
Statistics of the Sem\-Eval~2010 Task~8 dataset.
\label{tab:datasets:semeval}
}
\end{margintable}%
All samples were hand-labeled by human annotators with one of 19 relations.
These 19 relations are built from 9 base relations, each of which can appear in both directions (Section~\ref{sec:relation extraction:directionality}), plus the \textsl{other} relation (Section~\ref{sec:relation extraction:other}).
The 9 base relations in the dataset are:
\begin{itemize}[nosep]
\item \textsl{cause--effect}
\item \textsl{instrument--agency}
\item \textsl{product--producer}
\item \textsl{content--container}
\item \textsl{entity--origin}
\item \textsl{entity--destination}
\item \textsl{component--whole}
\item \textsl{member--collection}
\item \textsl{message--topic}
\end{itemize}
SemEval~2010 Task~8 introduced an extensive evaluation system, most of which is described in Section~\ref{sec:relation extraction:supervised evaluation}.
In particular, the official score of the competition was the half-directed macro-\(\overHalfdirected{\fone}\) (described in Section~\ref{sec:relation extraction:supervised evaluation}), which was referred to as ``\(9+1\)-way evaluation taking directionality into account.''

\section{\textsc{t-re}x}
\label{sec:datasets:trex}
\textsc{t-re}x \parencitex{trex} is an alignment of Wikipedia with Wikidata.
In particular, \textsc{t-re}x uses \textsc{db}pedia abstracts \parencite{dbpedia_abstracts}, that is, the introductory paragraphs of Wikipedia articles.
Its statistics are listed in Table~\ref{tab:datasets:trex}.

\begin{margintable}
\centering
\input{backmatter/datasets/trex.tex}
\scaption[Statistics of the \textsc{t-re}x dataset.]{
Statistics of the \textsc{t-re}x dataset.
\label{tab:datasets:trex}
}
\end{margintable}

In the final dataset, entities are linked using the \textsc{db}pedia spotlight entity linker \parencite{spotlight}.
Furthermore, indirect entity links are extracted using coreference resolution and a ``NoSub Aligner,'' which assumes that the title of the article is implicitly mentioned by all sentences.
Finally, some sequences of words are also linked to relations using exact matches of Wikidata relation names.
The datasets used in Chapters~\ref{chap:fitb} and~\ref{chap:graph} only consider entities extracted by the spotlight entity linker (tagged \texttt{Wikidata\_Spotlight\_Entity\_Linker}).
The two datasets of Chapter~\ref{chap:fitb} were filtered based on the tag of the predicate:
\textsc{spo} only contains samples whose predicate's surface form appears in the sentence (tagged \texttt{Wikidata\_Property\_Linker}), while \textsc{ds} contains all samples with the two entities occurring in the same sentence (in other words, all samples except those tagged \texttt{NoSubject-Triple-aligner}).
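As an illustration, the following sketch shows how such a filtering could be implemented on the distributed \textsc{json} files.
The field names (\texttt{triples}, \texttt{subject}, \texttt{predicate}, \texttt{object} and the \texttt{annotator} tags) reflect our reading of the public \textsc{t-re}x release and should be checked against the version at hand; the sketch is not the exact preprocessing script used in Chapters~\ref{chap:fitb} and~\ref{chap:graph}.
\begin{verbatim}
import json

# Tags as they appear in the public T-REx release (assumed field layout).
SPOTLIGHT = "Wikidata_Spotlight_Entity_Linker"
PROPERTY_LINKER = "Wikidata_Property_Linker"
NO_SUBJECT = "NoSubject-Triple-aligner"

def spo_and_ds(path):
    """Split the triples of one T-REx JSON file into SPO and DS subsets."""
    spo, ds = [], []
    with open(path) as f:
        documents = json.load(f)  # one file contains a list of documents
    for document in documents:
        for triple in document["triples"]:
            # Keep only entities found by the spotlight entity linker.
            if (triple["subject"]["annotator"] != SPOTLIGHT
                    or triple["object"]["annotator"] != SPOTLIGHT):
                continue
            # DS: both entities in the same sentence, that is, every
            # triple except those produced by the "NoSub" aligner.
            if triple["annotator"] != NO_SUBJECT:
                ds.append(triple)
            # SPO: the predicate's surface form appears in the sentence.
            if triple["predicate"]["annotator"] == PROPERTY_LINKER:
                spo.append(triple)
    return spo, ds
\end{verbatim}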

\section{Wikidata}
\label{sec:datasets:wikidata}
Wikidata \parencitex{wikidata} is a knowledge base (Section~\ref{sec:context:knowledge base}) started in~2012.
Like the other projects of the Wikimedia Foundation, it is a collaborative enterprise; anyone can contribute new facts and entities.
New relations are introduced through the consensus of long-term contributors, which avoids the explosion of relation types observed on Freebase (Section~\ref{sec:datasets:freebase}).

\begin{figure}
\centering
\input{backmatter/datasets/wikidata.tex}
\scaption[Structure of a Wikidata page.]{
Structure of a Wikidata page.
Facts related to two relations are shown (``statement groups'' in Wikidata parlance).
This page can be translated into three \(\entitySet^2\times\relationSet\) facts; the first has four additional qualifiers and the second has two additional qualifiers.
\label{fig:datasets:wikidata}
}
\end{figure}

Contrary to the way knowledge bases are presented in Section~\ref{sec:context:knowledge base}, Wikidata is not structured as a set of \(\entitySet^2\times\relationSet\) triplets.
Instead, in Wikidata, all entities have a page that lists the facts of which the entity is the subject.
These constitute our set \(\kbSet\subseteq\entitySet^2\times\relationSet\).
Furthermore, Wikidata facts can be qualified by additional \(\relationSet\times\entitySet\) pairs.
For example, Douglas Adams was \textsl{educated at} St John's College \underLine{\textsl{until} 1974}.
This structure is illustrated in Figure~\ref{fig:datasets:wikidata}.
To be more precise, Wikidata could be modeled as a set of qualified facts, where a qualified fact is an element of \(\entitySet^2\times\relationSet\times2^{\relationSet\times\entitySet}\).
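To make this modeling concrete, the following minimal sketch encodes such a qualified fact for the Douglas Adams example above.
The Wikidata identifiers in the comments (\texttt{Q42} for Douglas Adams, \texttt{P69} for \textsl{educated at}, \texttt{P582} for the \textsl{end time} qualifier) are given from memory and only serve as an illustration; the class itself is not part of any of our pipelines.
\begin{verbatim}
from dataclasses import dataclass

Entity = str    # Wikidata entity identifier, e.g. "Q42"
Relation = str  # Wikidata relation (property) identifier, e.g. "P69"

@dataclass(frozen=True)
class QualifiedFact:
    """An element of E^2 x R x 2^(R x E): a (subject, object) pair in E^2,
    a relation in R, and a set of qualifiers, each a pair in R x E."""
    subject: Entity
    object: Entity
    relation: Relation
    qualifiers: frozenset[tuple[Relation, Entity]] = frozenset()

# "Douglas Adams was educated at St John's College until 1974."
fact = QualifiedFact(
    subject="Q42",       # Douglas Adams
    object="Q691283",    # St John's College
    relation="P69",      # educated at
    # In Wikidata this qualifier value is a point in time; it is treated
    # as a plain identifier here for simplicity.
    qualifiers=frozenset({("P582", "1974")}),  # end time: 1974
)
\end{verbatim}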