Merge branch 'master' of github.com:adbrebs/taxi - taxi - Winning entry to the Kaggle taxi competition

commit 771dad76442a632c37656701f0ec6988a28c5a8d
parent b53bb5ec249b31226bcb4a5c6a0c6bed12e959f6
Author: Alex Auvolat <katchup@adnab.me>
Date:   Tue, 14 Jul 2015 09:19:01 -0400

Merge branch 'master' of github.com:adbrebs/taxi

Diffstat:
M .gitignore  | 2 +-
M README.md  | 53 +++++++++++++++++++++++++++++++++++++++++++++++++++++
M data/transformers.py  | 2 +-
A doc/heatmap_3_5.png  | 0 
M doc/kaggle_blog_post.pptx  | 0 
M doc/memory_taxi.png  | 0 
A doc/short_report.pdf  | 0 
A prepare.sh  | 106 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

8 files changed, 161 insertions(+), 2 deletions(-)
diff --git a/.gitignore b/.gitignore
@@ -2,7 +2,7 @@
 
 # Source archive
 submission.tgz
-*.pdf
+#*.pdf
 
 # Byte-compiled / optimized / DLL files
 __pycache__/
diff --git a/README.md b/README.md
@@ -1,3 +1,56 @@
 Winning entry to the Kaggle ECML/PKDD destination competition.
 
 https://www.kaggle.com/c/pkdd-15-predict-taxi-service-trajectory-i
+
+
+## Dependencies
+
+We used the following packages developed at the MILA lab:
+
+* Theano. A general GPU-accelerated python math library, with an interface similar to numpy (see [3, 4]). http://deeplearning.net/software/theano/
+* Blocks. A deep-learning and neural network framework for Python based on Theano. https://github.com/mila-udem/blocks
+* Fuel. A data pipelining framework for Blocks. https://github.com/mila-udem/fuel 
+
+We also used the scikit-learn Python library for its mean-shift clustering algorithm. numpy, cPickle and h5py are also used in various places.
+
+
+## Structure
+
+Here is a brief description of the Python files in the archive:
+
+* `config/*.py`: configuration files for the different models we have experimented with; the model which gets the best solution is `mlp_tgtcls_1_cswdtx_alexandre.py`
+* `data/*.py` : files related to the data pipeline:
+  * `__init__.py` contains some general statistics about the data
+  * `csv_to_hdf5.py` : convert the CSV data file into an HDF5 file usable directly by Fuel
+  * `hdf5.py` : utility functions for exploiting the HDF5 file
+  * `init_valid.py` : initializes the HDF5 file for the validation set
+  * `make_valid_cut.py` : generate a validation set using a list of time cuts. Cut lists are stored in Python files in `data/cuts/` (we used a single cut file)
+  * `transformers.py` : Fuel pipeline for transforming the training dataset into structures usable by our model
+* `data_analysis/*.py` : scripts for various statistical analyses on the dataset
+  * `cluster_arrival.py` : the script used to generate the mean-shift clustering of the destination points, producing the 3392 target points
+* `model/*.py` : source code for the various models we tried
+  * `__init__.py` contains code common to all the models, including the code for embedding the metadata
+  * `mlp.py` contains code common to all MLP models
+  * `dest_mlp_tgtcls.py` contains code for our MLP destination prediction model using target points for the output layer
+* `error.py` contains the functions for calculating the error based on the Haversine Distance
+* `ext_saveload.py` contains a Blocks extension for saving and reloading the model parameters so that training can be interrupted
+* `ext_test.py` contains a Blocks extension that runs the model on the test set and produces an output CSV submission file
+* `train.py` contains the main code for the training and testing
+  
+## How to reproduce the winning results?
+
+There is a helper script `prepare.sh` which might help you (it performs steps 1-6 and some other checks), but if you encounter an error and run it again, the script will re-execute all the steps from the beginning (of the steps before the actual training, steps 2, 4 and 5 are quite long).
+
+Note that some scripts expect the repository to be in your PYTHONPATH (go to the root of the repository and type `export PYTHONPATH="$PWD:$PYTHONPATH"`).
+  
+1. Set the `TAXI_PATH` environment variable to the path of the folder containing the CSV files.
+2. Run `data/csv_to_hdf5.py "$TAXI_PATH" "$TAXI_PATH/data.hdf5"` to generate the HDF5 file (which is generated in `TAXI_PATH`, alongside the CSV files). This takes around 20 minutes on our machines.
+3. Run `data/init_valid.py valid.hdf5` to initialize the validation set HDF5 file.
+4. Run `data/make_valid_cut.py test_times_0` to generate the validation set. This can take a few minutes.
+5. Run `data_analysis/cluster_arrival.py` to generate the arrival point clustering. This can take a few minutes.
+6. Create a folder `model_data` and a folder `output` (next to the training script), which will receive respectively a regular save of the model parameters and many submission files generated from the model at a regular interval.
+7. Run `./train.py dest_mlp_tgtcls_1_cswdtx_alexandre` to train the model. Output solutions are generated in `output/` every 1000 iterations. Interrupt training with three consecutive Ctrl+C presses at any time. The training script is set to stop training after 10 000 000 iterations, but a result file produced after fewer than 2 000 000 iterations is already the winning solution. We trained our model on a GeForce GTX 680 card and it took about an afternoon to generate the winning solution.
+   When running the training script, set the `THEANO_FLAGS` environment variable as follows to exploit GPU parallelism:
+   `THEANO_FLAGS=floatX=float32,device=gpu,optimizer=FAST_RUN`
+
+*More information in this pdf: https://github.com/adbrebs/taxi/blob/master/doc/short_report.pdf*
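
Note on `error.py` (not part of the commit): the README above says the error is computed from the Haversine distance. The following is a minimal plain-Python sketch of that formula, not the repository's implementation. The function name `haversine_km` and the constant `EARTH_RADIUS_KM` are assumptions made for illustration.

```python
import math

# Mean Earth radius in kilometres. This value is an assumption of the
# sketch; the repository may use a different constant.
EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (latitude, longitude) points given in degrees."""
    phi1 = math.radians(lat1)
    phi2 = math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    # Clamp to 1.0 to guard against rounding just above the asin domain.
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, a)))
```

Calling `haversine_km(lat_pred, lon_pred, lat_true, lon_true)` gives the distance between a predicted and a true destination; averaging it over trips gives a mean error in kilometres.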
diff --git a/data/transformers.py b/data/transformers.py
@@ -14,7 +14,7 @@ fuel.config.default_seed = 123
 
 def at_least_k(k, v, pad_at_begin, is_longitude):
     if len(v) == 0:
-        v = numpy.array([data.porto_center[1 if is_longitude else 0]], dtype=theano.config.floatX)
+        v = numpy.array([data.train_gps_mean[1 if is_longitude else 0]], dtype=theano.config.floatX)
     if len(v) < k:
         if pad_at_begin:
             v = numpy.concatenate((numpy.full((k - len(v),), v[0]), v))
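
Note on the hunk above (not part of the commit): `at_least_k` guarantees that a coordinate sequence has at least `k` entries, and this commit changes the fallback for an empty sequence from `data.porto_center` to `data.train_gps_mean`. The sketch below mirrors the visible logic with plain Python lists in place of NumPy arrays. The `else` branch (padding at the end with the last value) is not shown in the hunk and is an assumption, as are the name `pad_to_at_least_k` and the `fallback` parameter, which stands in for the chosen component of `data.train_gps_mean`.

```python
def pad_to_at_least_k(k, v, pad_at_begin, fallback):
    """Return a copy of v with at least k elements.

    An empty v is first replaced by [fallback]. A short v is padded by
    repeating its first element at the beginning, or (assumed behaviour)
    its last element at the end.
    """
    v = list(v)
    if not v:
        v = [fallback]
    if len(v) < k:
        missing = k - len(v)
        if pad_at_begin:
            v = [v[0]] * missing + v
        else:
            v = v + [v[-1]] * missing
    return v
```

A sequence that already has `k` or more elements is returned unchanged, so only short or empty trajectory prefixes are affected.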
diff --git a/doc/heatmap_3_5.png b/doc/heatmap_3_5.png
Binary files differ.
diff --git a/doc/kaggle_blog_post.pptx b/doc/kaggle_blog_post.pptx
Binary files differ.
diff --git a/doc/memory_taxi.png b/doc/memory_taxi.png
Binary files differ.
diff --git a/doc/short_report.pdf b/doc/short_report.pdf
Binary files differ.
diff --git a/prepare.sh b/prepare.sh
@@ -0,0 +1,106 @@
+#!/bin/sh
+
+RESET=`tput sgr0`
+BOLD="`tput bold`"
+RED="$RESET`tput setaf 1`$BOLD"
+GREEN="$RESET`tput setaf 2`"
+YELLOW="$RESET`tput setaf 3`"
+BLUE="$RESET`tput setaf 4`$BOLD"
+
+export PYTHONPATH="$PWD:$PYTHONPATH"
+
+echo "${YELLOW}This script will prepare the data."
+echo "${YELLOW}You should run it from inside the repository."
+echo "${YELLOW}You should set the TAXI_PATH variable to where the data downloaded from kaggle is."
+echo "${YELLOW}Three data files are needed: ${BOLD}train.csv${YELLOW}, ${BOLD}test.csv${YELLOW} and ${BOLD}metaData_taxistandsID_name_GPSlocation.csv.zip${YELLOW}. They can be found at the following url: ${BOLD}https://www.kaggle.com/c/pkdd-15-predict-taxi-service-trajectory-i/data"
+if [ ! -e train.py ]; then
+    echo "${RED}train.py not found, you are not inside the taxi repository."
+    exit 1
+fi
+
+
+echo -e "\n$BLUE# Checking dependencies"
+
+python_import(){
+    echo -n "${YELLOW}$1... $RESET"
+    if ! python -c "import $1; print '${GREEN}version', $1.__version__, '${YELLOW}(we used version $2)'"; then
+        echo "${RED}failed, $1 is not installed"
+        exit 1
+    fi
+}
+
+python_import h5py 2.5.0
+python_import theano 0.7.0.dev
+python_import fuel 0.0.1
+python_import blocks 0.0.1
+python_import sklearn 0.16.1
+
+
+echo -e "\n$BLUE# Checking data"
+
+echo "${YELLOW}TAXI_PATH is set to $TAXI_PATH"
+
+md5_check(){
+    echo -n "${YELLOW}md5sum $1... $RESET"
+    if [ ! -e "$TAXI_PATH/$1" ]; then
+        echo "${RED}file not found, are you sure you set the TAXI_PATH variable correctly?"
+        exit 1
+    fi
+    md5=`md5sum "$TAXI_PATH/$1" | sed -e 's/ .*//'`
+    if [ "$md5" = "$2" ]; then
+        echo "$GREEN$md5 ok"
+    else
+        echo "$RED$md5 failed"
+        exit 1
+    fi
+}
+
+md5_check train.csv 68cc499ac4937a3079ebf69e69e73971
+md5_check test.csv f2ceffde9d98e3c49046c7d998308e71
+md5_check metaData_taxistandsID_name_GPSlocation.csv.zip fecec7286191af868ce8fb208f5c7643
+
+
+echo -e "\n$BLUE# Extracting metadata"
+
+echo -n "${YELLOW}unzipping... $RESET"
+unzip -o "$TAXI_PATH/metaData_taxistandsID_name_GPSlocation.csv.zip" -d "$TAXI_PATH"
+echo "${GREEN}ok"
+
+echo -n "${YELLOW}patching error in metadata csv... $RESET"
+sed -e 's/41,Nevogilde,41.163066654-8.67598304213/41,Nevogilde,41.163066654,-8.67598304213/' -i "$TAXI_PATH/metaData_taxistandsID_name_GPSlocation.csv"
+echo "${GREEN}ok"
+
+md5_check metaData_taxistandsID_name_GPSlocation.csv 724805b0b1385eb3efc02e8bdfe9c1df
+
+
+echo -e "\n$BLUE# Conversion of training set to HDF5"
+echo "${YELLOW}This might take some time$RESET"
+data/csv_to_hdf5.py "$TAXI_PATH" "$TAXI_PATH/data.hdf5"
+
+
+echo -e "\n$BLUE# Generation of validation set"
+echo "${YELLOW}This might take some time$RESET"
+
+echo -n "${YELLOW}initialization... $RESET"
+data/init_valid.py
+echo "${GREEN}ok"
+
+echo -n "${YELLOW}cutting... $RESET"
+data/make_valid_cut.py test_times_0
+echo "${GREEN}ok"
+
+
+echo -e "\n$BLUE# Generation of destination cluster"
+echo "${YELLOW}This might take some time$RESET"
+echo -n "${YELLOW}generating... $RESET"
+data_analysis/cluster_arrival.py
+echo "${GREEN}ok"
+
+
+echo -e "\n$BLUE# Creating output folders"
+echo -n "${YELLOW}mkdir model_data... $RESET"; mkdir -p model_data; echo "${GREEN}ok"
+echo -n "${YELLOW}mkdir output... $RESET"; mkdir -p output; echo "${GREEN}ok"
+
+echo -e "\n$GREEN${BOLD}The data was successfully prepared"
+echo "${YELLOW}To train the winning model on gpu, you can now run the following command:"
+echo "${YELLOW}THEANO_FLAGS=floatX=float32,device=gpu,optimizer=FAST_RUN ./train.py dest_mlp_tgtcls_1_cswdtx_alexandre"
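
Note on `md5_check` in `prepare.sh` (not part of the commit): the script verifies the downloaded files with `md5sum`. On a system without `md5sum`, the same check could be sketched in Python with the standard `hashlib` module. The function name `md5_matches` and the `chunk_size` parameter are hypothetical; the expected digests would be the ones listed in the script.

```python
import hashlib


def md5_matches(path, expected_hex, chunk_size=1 << 20):
    """Return True if the MD5 hex digest of the file at path equals expected_hex."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # Read in chunks so that large files such as train.csv are never
        # loaded into memory all at once.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected_hex
```

For example, `md5_matches(os.path.join(os.environ["TAXI_PATH"], "train.csv"), "68cc499ac4937a3079ebf69e69e73971")` would repeat the first check made by the script (this needs `import os`).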
