
      1 <TeXmacs|1.99.2>
      2 
      3 <style|generic>
      4 
      5 <\body>
      6   <doc-data|<doc-title|Taxi Destination Prediction Challenge<next-line>Winner
      7   Team's Report>|<doc-author|<author-data|<\author-affiliation>
      8     <em|Montréal, July 2015>
      9   </author-affiliation>>>>
     10 
     11   <center|<tabular*|<tformat|<table|<row|<cell|<name|Alex
     12   Auvolat>>|<cell|<name|Alexandre de Brébisson>>|<cell|<name|Étienne
     13   Simon>>>|<row|<cell|ENS Paris>|<cell|Université de Montréal>|<cell|ENS
     14   Cachan>>|<row|<cell|France>|<cell|Québec,
  Canada>|<cell|France>>|<row|<cell|<verbatim|alexis211@gmail.com>>|<cell|<verbatim|adbrebs@gmail.com>>|<cell|<verbatim|esimon@esimon.eu>>>>>>>
     16 
     17   <section|Summary>
     18 
  Our model is a multi-layer perceptron (MLP) trained by stochastic gradient
  descent (SGD) on the training trajectories. Its inputs are the first 5 and
  last 5 positions of the known part of the trajectory, together with
  embeddings of the context information (date, client and taxi
  identification). The embeddings are trained by SGD jointly with the MLP
  parameters. The MLP outputs probabilities for 3392 target points, and the
  probability-weighted mean of these points gives the single predicted
  destination. We did no ensembling and used no external data.
     27 
     28   <section|Feature Selection/Extraction>
     29 
  We ran a mean-shift clustering algorithm on the destination points of all
  the training trajectories, which produced 3392 cluster centroids used as
  candidate destinations. These centroids define the fixed softmax output
  layer of the MLP architecture.
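
  For concreteness, here is a minimal sketch of this clustering step using
  the <verbatim|MeanShift> implementation of <verbatim|scikit-learn> (the
  actual script is <verbatim|data_analysis/cluster_arrival.py>; the file
  names and the bandwidth below are illustrative assumptions, not necessarily
  the exact values we used):

  <\code>
    import numpy as np

    from sklearn.cluster import MeanShift

    # (latitude, longitude) of the final point of every training trajectory;
    # the file name here is hypothetical
    dests = np.load("arrival_points.npy")

    # bin_seeding makes mean-shift tractable on the full set of arrival
    # points; the bandwidth value is an assumption
    ms = MeanShift(bandwidth=0.001, bin_seeding=True)
    ms.fit(dests)

    centroids = ms.cluster_centers_  # about 3392 centres with our data
    np.save("arrival_clusters.npy", centroids)
  </code>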
     33 
  We used the embedding method common in neural language modeling approaches
  (see [1]) to take the meta-information into account in our model. The
  following embeddings were used (listed with their corresponding
  dimensionality):
     38 
     39   <big-table|<tabular|<tformat|<table|<row|<cell|<tabular|<tformat|<cwith|1|1|1|-1|cell-bborder|1px>|<table|<row|<cell|<strong|Meta-data>>|<cell|<strong|Embedding
     40   Dimension>>|<cell|<strong|Number of classes>>>|<row|<cell|Unique caller
     41   number>|<cell|10>|<cell|57125>>|<row|<cell|Unique stand
     42   number>|<cell|10>|<cell|64>>|<row|<cell|Unique taxi
     43   number>|<cell|10>|<cell|448>>|<row|<cell|Week of
     44   year>|<cell|10>|<cell|54>>|<row|<cell|Day of
  week>|<cell|10>|<cell|7>>|<row|<cell|Quarter-hour of the
  day>|<cell|10>|<cell|96>>|<row|<cell|Day type (invalid
     47   data)>|<cell|10>|<cell|3>>>>>>>>>>|Embeddings and corresponding dimensions
     48   used by the model>
     49 
  The embeddings were initialized to random values and then left free to
  evolve with SGD along with the other model parameters.
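
  Conceptually, each embedding is a trainable lookup table indexed by the
  class of the corresponding meta-data value. The real implementation uses
  Theano/Blocks lookup tables (see <verbatim|model/__init__.py>); the numpy
  sketch below only illustrates the idea, and the initialization scale is an
  assumption:

  <\code>
    import numpy as np

    rng = np.random.RandomState(0)

    # one table per meta-data field: (number of classes, embedding dimension)
    tables = {
        "client_id": rng.normal(scale=0.01, size=(57125, 10)),
        "stand_id": rng.normal(scale=0.01, size=(64, 10)),
        "taxi_id": rng.normal(scale=0.01, size=(448, 10)),
        "week_of_year": rng.normal(scale=0.01, size=(54, 10)),
        "day_of_week": rng.normal(scale=0.01, size=(7, 10)),
        "quarter_hour": rng.normal(scale=0.01, size=(96, 10)),
        "day_type": rng.normal(scale=0.01, size=(3, 10)),
    }

    def embed(meta):
        """Concatenate the embedding vectors of one example's meta-data."""
        return np.concatenate([tables[k][v] for k, v in sorted(meta.items())])

    # e.g. a ride called by client 1337, on a Monday at 18:00 (quarter-hour 72)
    vec = embed({"client_id": 1337, "stand_id": 0, "taxi_id": 42,
                 "week_of_year": 33, "day_of_week": 0,
                 "quarter_hour": 72, "day_type": 0})
    # vec has dimension 7 * 10 = 70; during training the tables are updated
    # by SGD exactly like the MLP weights
  </code>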
     52 
  The geographical input to the network is a centered and normalized version
  of the GPS data points.
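
  That is, with the normalization statistics computed once over the training
  positions, something equivalent to the following sketch (the function names
  are ours, not those of the code):

  <\code>
    import numpy as np

    def fit_normalizer(train_points):
        """Per-coordinate mean and standard deviation of the training
        (latitude, longitude) points, given as an array of shape (n, 2)."""
        return train_points.mean(axis=0), train_points.std(axis=0)

    def normalize(points, mean, std):
        """Center and rescale GPS coordinates before feeding them to the MLP."""
        return (points - mean) / std
  </code>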
     55 
     56   We did no other preprocessing or feature selection.
     57 
     58   <section|Modelling Techniques and Training>
     59 
     60   Here is a brief description of the model we used:
     61 
     62   <\itemize>
     63     <item><strong|Input.> The input layer of the MLP is the concatenation of
     64     the following inputs:
     65 
     66     <\itemize>
      <item>The first five and last five points of the known part of the
      trajectory.
     69 
     70       <item>Embeddings for all the metadata.
     71     </itemize>
     72 
    <item><strong|Hidden layer.> We use an MLP with a single hidden layer.
    The hidden layer has size 500, and the activation function is a Rectified
    Linear Unit (i.e. <math|f<around*|(|x|)>=max<around*|(|0,x|)>>) [2].
     76 
     77     <item><strong|Output layer.> The output layer predicts a probability
     78     vector for the 3392 output classes that we obtained with our clustering
    preprocessing step. If <math|\<b-p\>> is the probability vector produced
    by the softmax layer of our MLP and <math|c<rsub|i>> is the centroid of
    cluster <math|i>, our prediction is given by:
     82 
     83     <\eqnarray*>
     84       <tformat|<table|<row|<cell|<wide|y|^>>|<cell|=>|<cell|<big|sum><rsub|i>p<rsub|i>*c<rsub|i>>>>>
     85     </eqnarray*>
     86 
    Since <math|\<b-p\>> sums to one, this is a convex combination of the
    cluster centroids, hence a valid point on the map. A simplified forward
    pass implementing this computation is sketched after this list.
     88 
    <item><strong|Cost.> We train directly on an approximation
    (equirectangular projection) of the mean Haversine distance as the cost;
    a sketch of this approximation is also given after the list.
     91 
     92     <item><strong|SGD and optimization.> We used a minibatch size of 200. The
     93     optimization algorithm is simple SGD with a fixed learning rate of 0.01
     94     and a momentum of 0.9.
     95 
    <item><strong|Validation.> To generate our validation set, we tried to
    create a set that resembled the test set. To do so, we generated
    ``cuts'' from the training set, i.e. we extracted all the taxi rides
    that were occurring at given times. The times we selected for our
    validation set are similar to those of the test set, only one year
    earlier:
    101 
    102     <code|1376503200, # 2013-08-14 18:00<next-line>1380616200, # 2013-10-01
    103     08:30<next-line>1381167900, # 2013-10-07 17:45<next-line>1383364800, #
    104     2013-11-02 04:00<next-line>1387722600 \ # 2013-12-22 14:30>
    105   </itemize>
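
  To make the input/hidden/output description above concrete, here is a
  simplified numpy forward pass. The real model is built with Theano/Blocks
  in <verbatim|model/dest_mlp_tgtcls.py>; the shapes and names below are
  illustrative only:

  <\code>
    import numpy as np

    def forward(x, W_h, b_h, W_o, b_o, centroids):
        """Simplified forward pass of the destination MLP.

        x         : (90,) concatenation of 10 normalized GPS points
                    (20 values) and the 70-dimensional meta-data embeddings
        centroids : (3392, 2) fixed cluster centres found by mean-shift
        """
        # single hidden layer of size 500 with ReLU activation
        h = np.maximum(0.0, x @ W_h + b_h)

        # softmax over the 3392 destination clusters
        logits = h @ W_o + b_o
        p = np.exp(logits - logits.max())
        p /= p.sum()

        # prediction = probability-weighted mean of the cluster centroids,
        # i.e. a convex combination that always lies on the map
        return p @ centroids
  </code>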
    106 
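  The cost we minimize is the mean over a minibatch of an equirectangular
  approximation of the Haversine distance between predicted and true
  destinations. The exact expression is in <verbatim|error.py>; the sketch
  below reflects our understanding of the standard approximation:

  <\code>
    import numpy as np

    EARTH_RADIUS_KM = 6371.0

    def equirectangular_distance(pred, target):
        """Approximate great-circle distance (in km) between two
        (latitude, longitude) points given in degrees. Under an
        equirectangular projection the Haversine distance reduces to a
        plain Euclidean distance, which is cheaper and easier to optimize."""
        lat1, lon1 = np.radians(pred[..., 0]), np.radians(pred[..., 1])
        lat2, lon2 = np.radians(target[..., 0]), np.radians(target[..., 1])
        x = (lon2 - lon1) * np.cos(0.5 * (lat1 + lat2))
        y = lat2 - lat1
        return EARTH_RADIUS_KM * np.sqrt(x * x + y * y)
  </code>
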
    107   <big-figure|<image|winning_model.png|577px|276px||>|Illustration of the
    108   winning model.>
    109 
    110   <section|Code Description>
    111 
    112   Here is a brief description of the Python files in the archive:
    113 
    114   <\itemize>
    115     <item><verbatim|config/*.py> : configuration files for the different
    116     models we have experimented with
    117 
    The configuration that produced the winning solution is
    <verbatim|dest_mlp_tgtcls_1_cswdtx_alexandre.py>
    120 
    121     <item><verbatim|data/*.py> : files related to the data pipeline:
    122 
    123     <\itemize>
    124       <item><verbatim|__init__.py> contains some general statistics about the
    125       data
    126 
      <item><verbatim|csv_to_hdf5.py> : converts the CSV data file into an
    128       HDF5 file usable directly by Fuel
    129 
    130       <item><verbatim|hdf5.py> : utility functions for exploiting the HDF5
    131       file
    132 
    133       <item><verbatim|init_valid.py> : initializes the HDF5 file for the
    134       validation set
    135 
      <item><verbatim|make_valid_cut.py> : generates a validation set using a
    137       list of time cuts. Cut lists are stored in Python files in
    138       <verbatim|data/cuts/> (we used a single cut file)
    139 
    140       <item><verbatim|transformers.py> : Fuel pipeline for transforming the
    141       training dataset into structures usable by our model
    142     </itemize>
    143 
    144     <item><strong|<verbatim|data_analysis/*.py>> : scripts for various
    145     statistical analyses on the dataset
    146 
    147     <\itemize>
    148       <item><verbatim|cluster_arrival.py> : the script used to generate the
    149       mean-shift clustering of the destination points, producing the 3392
    150       target points
    151     </itemize>
    152 
    153     <item><verbatim|model/*.py> : source code for the various models we tried
    154 
    155     <\itemize>
    156       <item><verbatim|__init__.py> contains code common to all the models,
    157       including the code for embedding the metadata
    158 
    159       <item><verbatim|mlp.py> contains code common to all MLP models
    160 
      <item><verbatim|dest_mlp_tgtcls.py> contains code for our MLP
    162       destination prediction model using target points for the output layer
    163     </itemize>
    164 
    165     <item><verbatim|error.py> contains the functions for calculating the
    166     error based on the Haversine Distance
    167 
    168     <item><verbatim|ext_saveload.py> contains a Blocks extension for saving
    169     and reloading the model parameters so that training can be interrupted
    170 
    171     <item><verbatim|ext_test.py> contains a Blocks extension that runs the
    172     model on the test set and produces an output CSV submission file
    173 
    174     <item><verbatim|train.py> contains the main code for the training and
    175     testing
    176   </itemize>
    177 
    178   In the archive we have included only the files listed above, which are the
    179   strict minimum to reproduce our results. More files for the other models we
  tried are available on GitHub at <hlink|https://github.com/adbrebs/taxi|>.
    181 
    182   <section|Dependencies>
    183 
  We used the following packages developed at the MILA lab:
    185 
    186   <\itemize>
    <item><strong|Theano.> A general GPU-accelerated Python math library,
    188     with an interface similar to numpy (see [3, 4]).
    189     <hlink|http://deeplearning.net/software/theano/|>
    190 
    191     <item><strong|Blocks.> A deep-learning and neural network framework for
    192     Python based on Theano. <hlink|https://github.com/mila-udem/blocks|>
    193 
    194     <item><strong|Fuel.> A data pipelining framework for Blocks.
    195     <hlink|https://github.com/mila-udem/fuel|>
    196   </itemize>
    197 
  We also used the <verbatim|scikit-learn> Python library for its mean-shift
  clustering implementation. <verbatim|numpy>, <verbatim|cPickle> and
  <verbatim|h5py> are also used in various places.
    201 
    202   <section|How To Generate The Solution>
    203 
    204   <\enumerate>
    205     <item>Set the <verbatim|TAXI_PATH> environment variable to the path of
    206     the folder containing the CSV files.
    207 
    <item>Run <verbatim|data/csv_to_hdf5.py> to generate the HDF5 file (it
    is written to <verbatim|TAXI_PATH>, alongside the CSV files). This takes
    around 20 minutes on our machines.
    211 
    212     <item>Run <verbatim|data/init_valid.py> to initialize the validation set
    213     HDF5 file.
    214 
    215     <item>Run <verbatim|data/make_valid_cut.py test_times_0> to generate the
    216     validation set. This can take a few minutes.
    217 
    218     <item>Run <verbatim|data_analysis/cluster_arrival.py> to generate the
    219     arrival point clustering. This can take a few minutes.
    220 
    <item>Create a folder <verbatim|model_data> and a folder
    <verbatim|output> (next to the training script); they will receive,
    respectively, periodic snapshots of the model parameters and the
    submission files generated from the model at regular intervals.
    225 
    <item>Run <verbatim|./train.py dest_mlp_tgtcls_1_cswdtx_alexandre> to
    train the model. Output solutions are generated in <verbatim|output/>
    every 1000 iterations. Training can be interrupted at any time with
    three consecutive Ctrl+C. The training script is set to stop after
    10 000 000 iterations, but a submission file produced after fewer than
    2 000 000 iterations already gives the winning solution. We trained our
    model on a GeForce GTX 680 card, and it took about an afternoon to
    produce the winning solution.
    234 
    When running the training script, set the <verbatim|THEANO_FLAGS>
    environment variable as follows to exploit GPU parallelism:
    237 
    238     <verbatim|THEANO_FLAGS=floatX=float32,device=gpu,optimizer=FAST_RUN>
    239 
    240     Theano is only compatible with CUDA, which requires an Nvidia GPU.
    241     Training on the CPU is also possible but much slower.
    242   </enumerate>
    243 
    244   <section|Additional Comments and Observations>
    245 
  The training examples fed to the model are not complete trajectories (the
  destination of a complete trajectory is simply its last point, so there
  would be nothing to predict), but prefixes of those trajectories, generated
  on-the-fly by a Fuel transformer, <verbatim|TaxiGenerateSplits>, whose code
  is in <verbatim|data/transformers.py>. The data pipeline is as follows:
    251 
    252   <\itemize>
    253     <item>Select a random full trajectory from the dataset
    254 
    <item>Generate at most 100 prefixes for that trajectory. If the
    trajectory has fewer than 100 data points, generate all possible
    prefixes; otherwise, choose a random subset of 100 prefixes. Keep the
    final destination aside, as it is the training target (a simplified
    sketch is given after this list).
    <item>Keep only the first 5 and last 5 points of each prefix.
    261 
    <item>At this point we have a stream of prefixes taken successively from
    different trajectories. We create minibatches of size 200 from the items
    of this stream, taken in the order in which they come. The prefixes
    generated from a single trajectory may therefore end up split across two
    successive batches, or all in a single batch.
    267   </itemize>
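
  A simplified version of this prefix-generation logic is sketched below.
  The real transformer is <verbatim|TaxiGenerateSplits> in
  <verbatim|data/transformers.py>; the function below only illustrates the
  idea, and its handling of very short prefixes is not exactly what the
  pipeline does:

  <\code>
    import random

    MAX_SPLITS = 100  # at most 100 prefixes per trajectory

    def generate_prefixes(trajectory, rng=random):
        """Yield (prefix, destination) pairs from one complete trajectory.

        trajectory is the list of (latitude, longitude) points of a full
        ride; its last point is the target shared by all the prefixes."""
        destination = trajectory[-1]
        lengths = list(range(1, len(trajectory) + 1))
        # keep every possible prefix when the ride is short, otherwise a
        # random subset of at most MAX_SPLITS prefixes
        rng.shuffle(lengths)
        for k in lengths[:MAX_SPLITS]:
            prefix = trajectory[:k]
            # the model only sees the 5 first and 5 last points of a prefix
            # (very short prefixes repeat points here)
            yield prefix[:5] + prefix[-5:], destination
  </code>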
    268 
    269   <section|References>
    270 
    271   <\enumerate>
    272     <item><label|gs_cit0>Bengio, Y., Ducharme, R., Vincent, P., & Janvin, C.
    273     (2003). A neural probabilistic language model.
    274     <with|font-shape|italic|The Journal of Machine Learning Research>,
    275     <with|font-shape|italic|3>, 1137-1155.
    276 
    277     <item><label|gs_cit0>Glorot, X., Bordes, A., & Bengio, Y. (2011). Deep
    278     sparse rectifier neural networks. In <with|font-shape|italic|International
    279     Conference on Artificial Intelligence and Statistics> (pp. 315-323).
    280 
    281     <item><label|gs_cit0>Bergstra, J., Bastien, F., Breuleux, O., Lamblin,
    282     P., Pascanu, R., Delalleau, O., ... & Bengio, Y. (2011). Theano: Deep
    learning on GPUs with Python. In <with|font-shape|italic|NIPS 2011,
    284     BigLearning Workshop, Granada, Spain>.
    285 
    286     <item><label|gs_cit0>Bastien, F., Lamblin, P., Pascanu, R., Bergstra, J.,
    287     Goodfellow, I., Bergeron, A., ... & Bengio, Y. (2012). Theano: new
    288     features and speed improvements. <with|font-shape|italic|arXiv preprint
    289     arXiv:1211.5590>.
    290   </enumerate>
    291 </body>
    292 
    293 <\initial>
    294   <\collection>
    295     <associate|page-medium|paper>
    296     <associate|page-screen-margin|true>
    297   </collection>
    298 </initial>
    299 
    300 <\references>
    301   <\collection>
    302     <associate|auto-1|<tuple|1|1>>
    303     <associate|auto-10|<tuple|8|?>>
    304     <associate|auto-2|<tuple|2|1>>
    305     <associate|auto-3|<tuple|1|1>>
    306     <associate|auto-4|<tuple|3|1>>
    307     <associate|auto-5|<tuple|1|2>>
    308     <associate|auto-6|<tuple|4|3>>
    309     <associate|auto-7|<tuple|5|3>>
    310     <associate|auto-8|<tuple|6|4>>
    311     <associate|auto-9|<tuple|7|4>>
    312     <associate|firstHeading|<tuple|1|?>>
    313     <associate|footnote-1|<tuple|1|?>>
    314     <associate|footnr-1|<tuple|1|?>>
    315     <associate|gs_cit0|<tuple|4|4>>
    316   </collection>
    317 </references>
    318 
    319 <\auxiliary>
    320   <\collection>
    321     <\associate|table>
    322       <tuple|normal|Embeddings and corresponding dimensions used by the
    323       model|<pageref|auto-3>>
    324     </associate>
    325     <\associate|toc>
    326       <vspace*|1fn><with|font-series|<quote|bold>|math-font-series|<quote|bold>|1<space|2spc>Summary>
    327       <datoms|<macro|x|<repeat|<arg|x>|<with|font-series|medium|<with|font-size|1|<space|0.2fn>.<space|0.2fn>>>>>|<htab|5mm>>
    328       <no-break><pageref|auto-1><vspace|0.5fn>
    329 
    330       <vspace*|1fn><with|font-series|<quote|bold>|math-font-series|<quote|bold>|2<space|2spc>Feature
    331       Selection/Extraction> <datoms|<macro|x|<repeat|<arg|x>|<with|font-series|medium|<with|font-size|1|<space|0.2fn>.<space|0.2fn>>>>>|<htab|5mm>>
    332       <no-break><pageref|auto-2><vspace|0.5fn>
    333 
    334       <vspace*|1fn><with|font-series|<quote|bold>|math-font-series|<quote|bold>|3<space|2spc>Modelling
    335       Techniques and Training> <datoms|<macro|x|<repeat|<arg|x>|<with|font-series|medium|<with|font-size|1|<space|0.2fn>.<space|0.2fn>>>>>|<htab|5mm>>
    336       <no-break><pageref|auto-4><vspace|0.5fn>
    337 
    338       <vspace*|1fn><with|font-series|<quote|bold>|math-font-series|<quote|bold>|4<space|2spc>Code
    339       Description> <datoms|<macro|x|<repeat|<arg|x>|<with|font-series|medium|<with|font-size|1|<space|0.2fn>.<space|0.2fn>>>>>|<htab|5mm>>
    340       <no-break><pageref|auto-5><vspace|0.5fn>
    341 
    342       <vspace*|1fn><with|font-series|<quote|bold>|math-font-series|<quote|bold>|5<space|2spc>Dependencies>
    343       <datoms|<macro|x|<repeat|<arg|x>|<with|font-series|medium|<with|font-size|1|<space|0.2fn>.<space|0.2fn>>>>>|<htab|5mm>>
    344       <no-break><pageref|auto-6><vspace|0.5fn>
    345 
    346       <vspace*|1fn><with|font-series|<quote|bold>|math-font-series|<quote|bold>|6<space|2spc>How
    347       To Generate The Solution> <datoms|<macro|x|<repeat|<arg|x>|<with|font-series|medium|<with|font-size|1|<space|0.2fn>.<space|0.2fn>>>>>|<htab|5mm>>
    348       <no-break><pageref|auto-7><vspace|0.5fn>
    349 
    350       <vspace*|1fn><with|font-series|<quote|bold>|math-font-series|<quote|bold>|7<space|2spc>Additional
    351       Comments and Observations> <datoms|<macro|x|<repeat|<arg|x>|<with|font-series|medium|<with|font-size|1|<space|0.2fn>.<space|0.2fn>>>>>|<htab|5mm>>
    352       <no-break><pageref|auto-8><vspace|0.5fn>
    353 
    354       <vspace*|1fn><with|font-series|<quote|bold>|math-font-series|<quote|bold>|8<space|2spc>References>
    355       <datoms|<macro|x|<repeat|<arg|x>|<with|font-series|medium|<with|font-size|1|<space|0.2fn>.<space|0.2fn>>>>>|<htab|5mm>>
    356       <no-break><pageref|auto-9><vspace|0.5fn>
    357     </associate>
    358   </collection>
    359 </auxiliary>