taxi

Winning entry to the Kaggle taxi competition
git clone https://esimon.eu/repos/taxi.git
Log | Files | Refs | README

commit c5187418bc93c34e3fdce4fdc1a3b5316812b69a
parent 793be7b049cecba43072858341dc7006fef352e7
Author: Alex Auvolat <alex@adnab.me>
Date:   Fri, 10 Jul 2015 19:20:57 -0400

Add documentation

Diffstat:
M.gitignore | 7++++++-
AMakefile | 23+++++++++++++++++++++++
Adoc/biblio.bib | 36++++++++++++++++++++++++++++++++++++
Adoc/report.tm | 353+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
4 files changed, 418 insertions(+), 1 deletion(-)

diff --git a/.gitignore b/.gitignore @@ -1,5 +1,9 @@ .idea/* +# Source archive +submission.tgz +*.pdf + # Byte-compiled / optimized / DLL files __pycache__/ *.py[cod] @@ -58,8 +62,9 @@ docs/_build/ # PyBuilder target/ -# Vim tmpfiles +# Vim & other tmpfiles *.swp +*~ # Random datafiles *.csv diff --git a/Makefile b/Makefile @@ -0,0 +1,23 @@ +FILES=config/__init__.py \ + config/dest_mlp_tgtcls_1_cswdtx_alexandre.py \ + data/cuts/__init__.py \ + data/cuts/test_times_0.py \ + data/__init__.py \ + data/csv_to_hdf5.py \ + data/hdf5.py \ + data/init_valid.py \ + data/make_valid_cut.py \ + data/transformers.py \ + model/__init__.py \ + model/mlp.py \ + model/dest_mlp_tgtcls.py \ + data_analysis/cluster_arrival.py \ + doc/report.pdf \ + __init__.py \ + error.py \ + ext_saveload.py \ + ext_test.py \ + train.py + +submission.tgz: $(FILES) + tar czf $@ $(FILES) diff --git a/doc/biblio.bib b/doc/biblio.bib @@ -0,0 +1,36 @@ + +@article{bengio2003neural, + title={A neural probabilistic language model}, + author={Bengio, Yoshua and Ducharme, R{\'e}jean and Vincent, Pascal and Janvin, Christian}, + journal={The Journal of Machine Learning Research}, + volume={3}, + pages={1137--1155}, + year={2003}, + publisher={JMLR. org} +} + +@inproceedings{bergstra2010theano, + title={Theano: a CPU and GPU math expression compiler}, + author={Bergstra, James and Breuleux, Olivier and Bastien, Fr{\'e}d{\'e}ric and Lamblin, Pascal and Pascanu, Razvan and Desjardins, Guillaume and Turian, Joseph and Warde-Farley, David and Bengio, Yoshua}, + booktitle={Proceedings of the Python for scientific computing conference (SciPy)}, + volume={4}, + pages={3}, + year={2010}, + organization={Austin, TX} +} + +@article{bastien2012theano, + title={Theano: new features and speed improvements}, + author={Bastien, Fr{\'e}d{\'e}ric and Lamblin, Pascal and Pascanu, Razvan and Bergstra, James and Goodfellow, Ian and Bergeron, Arnaud and Bouchard, Nicolas and Warde-Farley, David and Bengio, Yoshua}, + journal={arXiv preprint arXiv:1211.5590}, + year={2012} +} + +@inproceedings{glorot2011deep, + title={Deep sparse rectifier neural networks}, + author={Glorot, Xavier and Bordes, Antoine and Bengio, Yoshua}, + booktitle={International Conference on Artificial Intelligence and Statistics}, + pages={315--323}, + year={2011} +} + diff --git a/doc/report.tm b/doc/report.tm @@ -0,0 +1,352 @@ +<TeXmacs|1.99.2> + +<style|generic> + +<\body> + <doc-data|<doc-title|Taxi Destination Prediction Challenge<next-line>Winner + Team's Report>|<doc-author|<author-data|<\author-affiliation> + <em|Montréal, July 2015> + </author-affiliation>>>> + + <center|<tabular*|<tformat|<table|<row|<cell|<name|Alex + Auvolat>>|<cell|<name|Alexandre De Brébisson>>|<cell|<name|Étienne + Simon>>>|<row|<cell|ENS Paris>|<cell|Université de Montréal>|<cell|ENS + Cachan>>|<row|<cell|France>|<cell|Québec, + Canada>|<cell|France>>|<row|<cell|<verbatim|alexis211@gmail.com>>|<cell|<verbatim|<strong|adbrebs@gmail.com>>>|<cell|<verbatim|esimon@esimon.eu>>>>>>> + + <section|Summary> + + Our model is based on a multi-layer perceptron (MLP), a simple feed-forward + neural network architecture. Our MLP model is trained by stochastic + gradient descent (SGD) on the training trajectories. The inputs to our MLP + are the 5 first and 5 last positions of the known part of the trajectory, + as well as embeddings for the context information (date, client and taxi + identification). \ The embeddings are trained with SGD jointly with the MLP + parameters. The MLP outputs probabilities for 3392 target points, and a + mean is calculated to get a unique destination point as an output. We did + no ensembling and used no external data. + + <section|Feature Selection/Extraction> + + We used a mean-shift algorithm on the destination points of all the + training trajectories to extract 3392 classes for the destination point. + These classes were used as a fixed output layer for the MLP architecture. + + We used the embedding method which is common in neural language modeling + approaches (see [1]) to take the metainformation into account in our model. + The following embeddings were used (listed with corresponding + dimensionnality): + + <big-table|<tabular|<tformat|<table|<row|<cell|<tabular|<tformat|<cwith|1|1|1|-1|cell-bborder|1px>|<table|<row|<cell|<strong|Meta-data>>|<cell|<strong|Embedding + Dimension>>|<cell|<strong|Number of classes>>>|<row|<cell|Unique caller + number>|<cell|10>|<cell|57125>>|<row|<cell|Unique stand + number>|<cell|10>|<cell|64>>|<row|<cell|Unique taxi + number>|<cell|10>|<cell|448>>|<row|<cell|Week of + year>|<cell|10>|<cell|54>>|<row|<cell|Day of + week>|<cell|10>|<cell|7>>|<row|<cell|1/4 of hour of the + day>|<cell|10>|<cell|96>>|<row|<cell|Day type (invalid + data)>|<cell|10>|<cell|3>>>>>>>>>>|Embeddings and corresponding dimensions + used by the model> + + The embeddings were first initialized to random variables and were then let + to evolve freely with SGD along with the other model parameters. + + The geographical data input in the network is a centered and normalized + version of the GPS data points. + + We did no other preprocessing or feature selection. + + <section|Modelling Techniques and Training> + + Here is a brief description of the model we used: + + <\itemize> + <item><strong|Input.> The input layer of the MLP is the concatenation of + the following inputs: + + <\itemize> + <item>Five first and five last points of the known part of the + trajectory. + + <item>Embeddings for all the metadata. + </itemize> + + <item><strong|Hidden layer.> We use a single hidden layer MLP. The hidden + layer is of size 500, and the activation function is a Rectifier Linear + Unit (ie <math|f<around*|(|x|)>=max<around*|(|0,x|)>>). See [2] for more + information about ReLUs. + + <item><strong|Output layer.> The output layer predicts a probability + vector for the 3392 output classes that we obtained with our clustering + preprocessing step. If <math|\<b-p\>> is the probability vector output by + our MLP (output by a softmax layer) and <math|c<rsub|i>> is the centroid + of cluster <math|i>, our prediciton is given by: + + <\eqnarray*> + <tformat|<table|<row|<cell|<wide|y|^>>|<cell|=>|<cell|<big|sum><rsub|i>p<rsub|i>*c<rsub|i>>>>> + </eqnarray*> + + Since <math|\<b-p\>> sums to one, this is a valid point on the map. + + <item><strong|Cost.> We directly train using an approximation of the mean + Haversine Distance as a cost. + + <item><strong|SGD and optimization.> We used a minibatch size of 200. The + optimization algorithm is simple SGD with a learning rate of 0.01 and a + momentum of 0.9. + + <item><strong|Validation.> To generate our validation set, we tried to + create a set that looked like the training set. For that we generated + ``cuts'' from the training set, i.e. extracted all the taxi rides that + were occuring at given times. The times we selected for our validation + set are similar to those of the test set, only one year before. + </itemize> + + <section|Code Description> + + Here is a brief description of the Python files in the archive: + + <\itemize> + <item><verbatim|config/*.py> : configuration files for the different + models we have experimented with + + The model which gets the best solution is + <verbatim|mlp_tgtcls_1_cswdtx_alexandre.py> + + <item><verbatim|data/*.py> : files related to the data pipeline: + + <\itemize> + <item><verbatim|__init__.py> contains some general statistics about the + data + + <item><verbatim|csv_to_hdf5.py> : convert the CSV data file into an + HDF5 file usable directly by Fuel + + <item><verbatim|hdf5.py> : utility functions for exploiting the HDF5 + file + + <item><verbatim|init_valid.py> : initializes the HDF5 file for the + validation set + + <item><verbatim|make_valid_cut.py> : generate a validation set using a + list of time cuts. Cut lists are stored in Python files in + <verbatim|data/cuts/> (we used a single cut file) + + <item><verbatim|transformers.py> : Fuel pipeline for transforming the + training dataset into structures usable by our model + </itemize> + + <item><strong|<verbatim|data_analysis/*.py>> : scripts for various + statistical analyses on the dataset + + <\itemize> + <item><verbatim|cluster_arrival.py> : the script used to generate the + mean-shift clustering of the destination points, producing the 3392 + target points + </itemize> + + <item><verbatim|model/*.py> : source code for the various models we tried + + <\itemize> + <item><verbatim|__init__.py> contains code common to all the models, + including the code for embedding the metadata + + <item><verbatim|mlp.py> contains code common to all MLP models + + <item><verbatim|dest_mlp_tgtcls.py> containts code for our MLP + destination prediction model using target points for the output layer + </itemize> + + <item><verbatim|error.py> contains the functions for calculating the + error based on the Haversine Distance + + <item><verbatim|ext_saveload.py> contains a Blocks extension for saving + and reloading the model parameters so that training can be interrupted + + <item><verbatim|ext_test.py> contains a Blocks extension that runs the + model on the test set and produces an output CSV submission file + + <item><verbatim|train.py> contains the main code for the training and + testing + </itemize> + + In the archive we have included only the files listed above, which are + strictly necessary for reproducing our results. More files for the other + models we have tried are available on GitHub at + <hlink|https://github.com/adbrebs/taxi|><hlink||https://github.com/adbrebs/taxi>. + + <section|Dependencies> + + We used the following packages developped at the MILA lab: + + <\itemize> + <item><strong|Thano.> A general GPU-accelerated python math library, with + an interface similar to numpy (see [3, 4]). + <hlink|http://deeplearning.net/software/theano/|> + + <item><strong|Blocks.> A deep-learning and neural network framework for + Python based on Theano. <hlink|https://github.com/mila-udem/blocks|> + + <item><strong|Fuel.> A data pipelining framework for Blocks. + <hlink|https://github.com/mila-udem/fuel|> + </itemize> + + We also used the <verbatim|scikit-learn> Python library for their + mean-shift clustering algorithm. <verbatim|numpy>, <verbatim|cPickle> and + <verbatim|h5py> are also used at various places. + + <section|How To Generate The Solution> + + <\enumerate> + <item>Set the <verbatim|TAXI_PATH> environment variable to the path of + the folder containing the CSV files. + + <item>Run <verbatim|data/csv_to_hdf5.py> to generate the HDF5 file (which + is generated in <verbatim|TAXI_PATH>, along the CSV files). This takes + around 20 minutes on our machines. + + <item>Run <verbatim|data/init_valid.py> to initialize the validation set + HDF5 file. + + <item>Run <verbatim|data/make_valid_cut.py test_times_0> to generate the + validation set. This can take a few minutes. + + <item>Run <verbatim|data_analysis/cluster_arrival.py> to generate the + arrival point clustering. This can take a few minutes. + + <item>Create a folder <verbatim|model_data> and a folder + <verbatim|output> (next to the training script), which will recieve + respectively a regular save of the model parameters and many submission + files generated from the model at a regular interval. + + <item>Run <verbatim|./train.py dest_mlp_tgtcls_1_cswdtx_alexandre> to + train the model. Output solutions are generated in <verbatim|output/> + every 1000 iterations. Interrupt the model with three consecutive Ctrl+C + at any times. The training script is set to stop training after 10 000 + 000 iterations, but a result file produced after less than 2 000 000 + iterations is already the winning solution. The training is quite long + though: we trained our model on a GeForce GTX 680 card and it took about + an afternoon to generate the winning solution. + + When running the training script, set the following Theano flags + environment variable to exploit GPU parallelism: + + <verbatim|THEANO_FLAGS=floatX=float32,device=gpu,optimizer=FAST_RUN> + + Theano is only compatible with CUDA, which requires an Nvidia GPUs. + Training on the CPU is also possible but much slower. + </enumerate> + + <section|Additional Comments and Observations> + + The training examples fed to the model are not full trajectories, since + that would make no sense, but prefixes of those trajectories that are + generated on-the-fly by a Fuel transformer, <verbatim|TaxiGenerateSplits>, + whose code is available in <verbatim|data/transformers.py>. The data + pipeline is as follows: + + <\itemize> + <item>Select a random full trajectory from the dataset + + <item>Generate a maximum of 100 prefixes for that trajectory. If the + trajectory is smaller than 100 data points, generate all possible + prefixes. Otherwise, chose a random subset of prefixes. Keep the final + destination somewhere as it is used as a target for the training. + + <item>Take only the 5 first and 5 last points of the trajectory. + + <item>At this points we have a stream of prefixes sucessively taken from + different trajectories. We create batches of size 200 with the items of + the previous stream, taken in the order in which they come. The prefixes + generated from a single trajectory may end up in two sucessive batches, + or all in a single batch. + </itemize> + + <section|References> + + <\enumerate> + <item><label|gs_cit0>Bengio, Y., Ducharme, R., Vincent, P., & Janvin, C. + (2003). A neural probabilistic language model. + <with|font-shape|italic|The Journal of Machine Learning Research>, + <with|font-shape|italic|3>, 1137-1155. + + <item><label|gs_cit0>Glorot, X., Bordes, A., & Bengio, Y. (2011). Deep + sparse rectifier neural networks. In <with|font-shape|italic|International + Conference on Artificial Intelligence and Statistics> (pp. 315-323). + + <item><label|gs_cit0>Bergstra, J., Bastien, F., Breuleux, O., Lamblin, + P., Pascanu, R., Delalleau, O., ... & Bengio, Y. (2011). Theano: Deep + learning on gpus with python. In <with|font-shape|italic|NIPS 2011, + BigLearning Workshop, Granada, Spain>. + + <item><label|gs_cit0>Bastien, F., Lamblin, P., Pascanu, R., Bergstra, J., + Goodfellow, I., Bergeron, A., ... & Bengio, Y. (2012). Theano: new + features and speed improvements. <with|font-shape|italic|arXiv preprint + arXiv:1211.5590>. + </enumerate> +</body> + +<\initial> + <\collection> + <associate|page-medium|paper> + <associate|page-screen-margin|true> + </collection> +</initial> + +<\references> + <\collection> + <associate|auto-1|<tuple|1|1>> + <associate|auto-10|<tuple|8|?>> + <associate|auto-2|<tuple|2|1>> + <associate|auto-3|<tuple|1|1>> + <associate|auto-4|<tuple|3|2>> + <associate|auto-5|<tuple|4|2>> + <associate|auto-6|<tuple|5|3>> + <associate|auto-7|<tuple|6|3>> + <associate|auto-8|<tuple|7|4>> + <associate|auto-9|<tuple|8|4>> + <associate|gs_cit0|<tuple|4|4>> + </collection> +</references> + +<\auxiliary> + <\collection> + <\associate|table> + <tuple|normal|Embeddings and corresponding dimensions used by the + model|<pageref|auto-3>> + </associate> + <\associate|toc> + <vspace*|1fn><with|font-series|<quote|bold>|math-font-series|<quote|bold>|1<space|2spc>Summary> + <datoms|<macro|x|<repeat|<arg|x>|<with|font-series|medium|<with|font-size|1|<space|0.2fn>.<space|0.2fn>>>>>|<htab|5mm>> + <no-break><pageref|auto-1><vspace|0.5fn> + + <vspace*|1fn><with|font-series|<quote|bold>|math-font-series|<quote|bold>|2<space|2spc>Feature + Selection/Extraction> <datoms|<macro|x|<repeat|<arg|x>|<with|font-series|medium|<with|font-size|1|<space|0.2fn>.<space|0.2fn>>>>>|<htab|5mm>> + <no-break><pageref|auto-2><vspace|0.5fn> + + <vspace*|1fn><with|font-series|<quote|bold>|math-font-series|<quote|bold>|3<space|2spc>Modelling + Techniques and Training> <datoms|<macro|x|<repeat|<arg|x>|<with|font-series|medium|<with|font-size|1|<space|0.2fn>.<space|0.2fn>>>>>|<htab|5mm>> + <no-break><pageref|auto-4><vspace|0.5fn> + + <vspace*|1fn><with|font-series|<quote|bold>|math-font-series|<quote|bold>|4<space|2spc>Code + Description> <datoms|<macro|x|<repeat|<arg|x>|<with|font-series|medium|<with|font-size|1|<space|0.2fn>.<space|0.2fn>>>>>|<htab|5mm>> + <no-break><pageref|auto-5><vspace|0.5fn> + + <vspace*|1fn><with|font-series|<quote|bold>|math-font-series|<quote|bold>|5<space|2spc>Dependencies> + <datoms|<macro|x|<repeat|<arg|x>|<with|font-series|medium|<with|font-size|1|<space|0.2fn>.<space|0.2fn>>>>>|<htab|5mm>> + <no-break><pageref|auto-6><vspace|0.5fn> + + <vspace*|1fn><with|font-series|<quote|bold>|math-font-series|<quote|bold>|6<space|2spc>How + To Generate The Solution> <datoms|<macro|x|<repeat|<arg|x>|<with|font-series|medium|<with|font-size|1|<space|0.2fn>.<space|0.2fn>>>>>|<htab|5mm>> + <no-break><pageref|auto-7><vspace|0.5fn> + + <vspace*|1fn><with|font-series|<quote|bold>|math-font-series|<quote|bold>|7<space|2spc>Additional + Comments and Observations> <datoms|<macro|x|<repeat|<arg|x>|<with|font-series|medium|<with|font-size|1|<space|0.2fn>.<space|0.2fn>>>>>|<htab|5mm>> + <no-break><pageref|auto-8><vspace|0.5fn> + + <vspace*|1fn><with|font-series|<quote|bold>|math-font-series|<quote|bold>|8<space|2spc>References> + <datoms|<macro|x|<repeat|<arg|x>|<with|font-series|medium|<with|font-size|1|<space|0.2fn>.<space|0.2fn>>>>>|<htab|5mm>> + <no-break><pageref|auto-9><vspace|0.5fn> + </associate> + </collection> +</auxiliary>+ \ No newline at end of file