Étienne Simon

Reasoning on representations learnt by neural networks

Problem

Given a set of entities E, a set of relations R, and a training set of triplets (h, r, t) with h, t ∈ E and r ∈ R (e.g. (Rome, capital of, Italy)), learn representations for E and R in order to predict unseen triplets.

In [1], Tomas Mikolov et al. present figure 1, which nicely shows what Antoine Bordes et al. exploit in [2]: when learning embeddings for words, relations between them appear as translations.

In [2], Antoine Bordes et al. propose the following model: entities and relations are represented as d-dimensional vectors (with d ≈ 50). If (h, r, t) is true, then h + r ≈ t. Each triplet (h, r, t) is given a score sim(h + r, t), where sim is a similarity function (usually the L1 norm or cosine). To learn the matrices E and R, a contrastive objective is used: for each valid triplet (h, r, t), two corrupted entities h', t' ∈ E are picked, and the left score of the triplet is defined as $\max\left(0, \gamma + sim\left(h + r, t\right) - sim\left(h' + r, t\right)\right)$, where γ is the contrastive margin. Similarly, the right score is defined by replacing the third member of the triplet instead of the first. The final objective is to minimize the sum of the left and right scores.
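As a rough illustration of the objective above, the following NumPy sketch computes the left and right scores for a single triplet. It assumes sim is the L1 norm of the difference, so lower values mean a closer match; the function names and the example vectors are illustrative choices, not taken from [2] or from the reimplementation.

```python
import numpy as np

def l1_sim(a, b):
    """L1 norm of a - b (assumed choice of sim; lower means closer)."""
    return np.abs(a - b).sum(axis=-1)

def contrastive_scores(h, r, t, h_neg, t_neg, gamma, sim=l1_sim):
    """Return the (left, right) scores of the triplet (h, r, t).

    h_neg and t_neg are the corrupted head and tail entities,
    gamma is the contrastive margin.
    """
    valid = sim(h + r, t)
    left = np.maximum(0.0, gamma + valid - sim(h_neg + r, t))
    right = np.maximum(0.0, gamma + valid - sim(h + r, t_neg))
    return left, right

# Toy 2-dimensional example where h + r equals t exactly.
h = np.array([0.0, 0.0])
r = np.array([1.0, 0.0])
t = np.array([1.0, 0.0])
h_neg = np.array([1.0, 0.0])  # corrupted head, at L1 distance 1 from the target
t_neg = np.array([1.0, 3.0])  # corrupted tail, at L1 distance 3 from the target

left, right = contrastive_scores(h, r, t, h_neg, t_neg, gamma=2.0)
objective = left + right  # the quantity summed over the training set
```

In this toy case the corrupted head lies inside the margin and contributes a non-zero left score, while the corrupted tail is already far enough away to contribute nothing.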

A good example of such a dataset is Freebase: it shows that a solution to this problem could serve as an inference engine for general question answering.

Solutions

Reimplementation : Link to the repository.

Other transformations

The first attempt was to replace the translation x+r with another transformation. The candidates are listed below:

| Name | Equation | Similarity | FB15k micro mean | FB15k micro top10 (%) | FB15k macro mean | FB15k macro top10 (%) |
|---|---|---|---|---|---|---|
| Translation | x+r | L1 | 233.61 | 36.74 | 361.27 | 32.98 |
| | | L2 | 309.67 | 29.85 | 257.22 | 46.79 |
| | | Cosine | 690.36 | 14.37 | 690.95 | 19.64 |
| Point reflection | 2r-x | L1 | 298.27 | 31.91 | 626.19 | 32.00 |
| | | L2 | 419.04 | 23.75 | 582.00 | 39.52 |
| | | Cosine | 1564.66 | 16.43 | 2244.51 | 21.48 |
| Reflection | $x - 2 \left(x·r\right)/\left(r·r\right) r$ | L1 | 249.16 | 29.27 | 269.25 | 29.15 |
| | | L2 | 378.53 | 21.45 | 456.99 | 27.92 |
| | | Cosine | 401.18 | 20.24 | 383.67 | 25.43 |
| Offsetted reflection | $x - 2 \left(x·r₀ - r₁\right)/\left(r₀·r₀\right) r₀$ | L1 | 251.96 | 33.26 | 275.61 | 29.51 |
| | | L2 | 431.67 | 20.87 | 412.07 | 28.00 |
| | | Cosine | 408.90 | 20.89 | 328.66 | 30.33 |
| Anisotropic scaling | $x \odot r$ | L1 | 258.74 | 33.91 | 459.69 | 39.71 |
| | | L2 | 550.67 | 18.48 | 1150.81 | 18.89 |
| | | Cosine | 470.64 | 14.41 | 987.91 | 15.36 |
| Homotheties | $r₀+r₁\left(x-r₀\right)$ | L1 | 400.44 | 23.67 | 627.74 | 27.63 |
| | | L2 | 501.25 | 21.17 | 750.74 | 31.90 |
| | | Cosine | 403.52 | 26.89 | 938.44 | 36.51 |
| Anisotropic homotheties | $r₀+r₁\odot \left(x-r₀\right)$ | L1 | 268.54 | 32.53 | 472.43 | 39.48 |
| | | L2 | 457.00 | 21.51 | 760.54 | 30.35 |
| | | Cosine | 434.15 | 19.89 | 752.71 | 24.93 |
| Element-wise affine | $r₀\odot x+r₁$ | L1 | 262.95 | 33.21 | 420.30 | 40.09 |
| | | L2 | 417.50 | 22.87 | 692.14 | 32.32 |
| | | Cosine | 401.14 | 21.20 | 738.38 | 26.19 |

Here $\odot$ is the element-wise product and · is the dot product.
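To make the equations in the table concrete, here is a minimal NumPy sketch of each transformation applied to a single entity vector x. The function names are hypothetical, and the shapes of the relation parameters are an assumed reading of the equations: r₁ is taken to be a scalar in the offsetted reflection and in the homothety, and a vector elsewhere.

```python
import numpy as np

def translation(x, r):
    return x + r

def point_reflection(x, r):
    # Reflection of x through the point r.
    return 2 * r - x

def reflection(x, r):
    # Reflection across the hyperplane through the origin with normal r.
    return x - 2 * (x @ r) / (r @ r) * r

def offset_reflection(x, r0, r1):
    # Same as reflection, with the hyperplane shifted by the scalar r1.
    return x - 2 * (x @ r0 - r1) / (r0 @ r0) * r0

def anisotropic_scaling(x, r):
    # One scaling factor per dimension.
    return x * r

def homothety(x, r0, r1):
    # Uniform scaling by the scalar r1 around the center r0.
    return r0 + r1 * (x - r0)

def anisotropic_homothety(x, r0, r1):
    # Per-dimension scaling by the vector r1 around the center r0.
    return r0 + r1 * (x - r0)

def elementwise_affine(x, r0, r1):
    return r0 * x + r1

x = np.array([1.0, 2.0])
r = np.array([3.0, 4.0])
translated = translation(x, r)
mirrored = reflection(x, r)
```

Both reflections and the point reflection are involutions, so applying one of them twice returns the original vector; this gives a quick sanity check for an implementation.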

FB15k is a subset of Freebase with 14951 entities, 1345 relations and 483142 training triplets.

When presented with a training triplet, the left and right scores are computed for all 14951 entities and then sorted. The micro mean is the mean rank of the correct entity, and the micro top10 is the percentage of correct entities ranked in the top 10. The macro mean is the mean over relations of the per-relation mean rank: the mean rank is first computed for each relation, and these per-relation means are then averaged (the macro top10 is defined similarly).
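The difference between the micro and macro metrics can be sketched as follows. This is a minimal illustration, assuming lower scores are better and that a rank has already been obtained for each evaluated triplet; the function names and the toy ranks are hypothetical.

```python
import numpy as np
from collections import defaultdict

def rank_of(scores, correct):
    """1-based rank of the correct entity, assuming lower scores are better."""
    return int((scores < scores[correct]).sum()) + 1

def micro_macro(ranks, relations):
    """Return (micro mean, micro top10 %, macro mean, macro top10 %).

    ranks[i] is the rank of the correct entity for the i-th evaluation,
    relations[i] is the relation of the corresponding triplet.
    """
    ranks = np.asarray(ranks, dtype=float)
    micro_mean = ranks.mean()
    micro_top10 = 100.0 * (ranks <= 10).mean()

    # Group ranks by relation, then average the per-relation statistics.
    per_relation = defaultdict(list)
    for rank, relation in zip(ranks, relations):
        per_relation[relation].append(rank)
    groups = [np.array(v) for v in per_relation.values()]
    macro_mean = np.mean([g.mean() for g in groups])
    macro_top10 = np.mean([100.0 * (g <= 10).mean() for g in groups])
    return micro_mean, micro_top10, macro_mean, macro_top10

# Toy example: relation "a" is frequent and easy, relation "b" is rare and hard.
metrics = micro_macro([1, 3, 20, 100], ["a", "a", "a", "b"])
```

With these toy ranks, the rare relation weighs as much as the frequent one in the macro metrics, which is why the micro and macro columns of the table can differ substantially.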

Detailed visualisation of results.

Archive with (obsolete) results.

Combining models

The relations in the dataset are hard to analyse since we only have a subset of the graph defining them; however, it seems natural that the optimal class of transformation depends on the relation. A first approach was implemented as a product-of-experts / mixture-of-models style combination of models.
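The exact composition used here is not specified, so the following is only one possible reading of a product of experts, given as a sketch. It assumes each expert k assigns every candidate entity a score s_k that acts as an energy, i.e. p_k ∝ exp(-s_k); the product of the experts is then proportional to exp(-Σ s_k), so candidates can be ranked by the (optionally weighted) sum of scores. The function name and the weights parameter are hypothetical.

```python
import numpy as np

def product_of_experts_scores(expert_scores, weights=None):
    """Combine per-expert scores into one score per candidate.

    expert_scores has shape (n_experts, n_candidates); lower is better.
    Multiplying unnormalised densities exp(-s_k) amounts to summing the s_k.
    """
    expert_scores = np.asarray(expert_scores, dtype=float)
    if weights is None:
        weights = np.ones(expert_scores.shape[0])
    return np.asarray(weights, dtype=float) @ expert_scores

# Two hypothetical experts scoring three candidate entities.
combined = product_of_experts_scores([[1.0, 2.0, 3.0],
                                      [3.0, 1.0, 0.0]])
best = int(np.argmin(combined))
```

A candidate is ranked highly only if no expert gives it a very bad score, which is the usual intuition behind products of experts [3].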

Results:

| Models | Composition | FB15k micro mean | FB15k micro top10 (%) | FB15k macro mean | FB15k macro top10 (%) |
|---|---|---|---|---|---|
| All | L1× | 185.82 | 47.20 | 111.36 | 60.88 |

Still working on it. See [3].

Archive with (obsolete) results.

References

[1] Mikolov, Tomas, et al. "Distributed representations of words and phrases and their compositionality." Advances in Neural Information Processing Systems. 2013.
[2] Bordes, Antoine, et al. "Translating embeddings for modeling multi-relational data." Advances in Neural Information Processing Systems. 2013.
[3] Hinton, Geoffrey E. "Training products of experts by minimizing contrastive divergence." Neural Computation 14.8 (2002): 1771-1800.