Evolutionary Computation for Big Data and Big Learning Workshop
Data Mining Competition 2014: Self-deployment track



The competition is closed now.

The results of the competition were presented at the ECBDL'14 workshop within GECCO-2014 in Vancouver, July 13th, 2014

Competition report

Aim

The data mining competition of the Evolutionary Computation for Big Data and Big Learning workshop aims to assess the state of the art in evolutionary computation methods for big data and big learning.

The main competition exercise of the workshop, the Deployment-as-a-Service track, provides a framework that enables large-scale data mining tasks to be distributed in cloud environments with minimal changes to the core machine learning methods. It allows us to perform a very fair comparison because we can ensure that all methods are allocated a uniform amount of resources. The framework controls the overall learning strategy, and the participants just provide the methods.

In contrast, the aim of this self-deployment track is to give the participants total flexibility, so that they can use any training strategy with their own resources. We just provide a large dataset (details below) and receive predictions from the participants.

Instructions

  1. Participants first need to register their team. Afterwards, they will receive a team code, which is required to submit predictions.
  2. To prepare a predictions submission, participants need to create a plain-text file with one line for each instance in the test set, containing the predicted class (0 or 1); see the sketch after this list.
  3. Predictions can be submitted here. Participants need to enter their team name and code, upload the list of predictions, and provide a brief description of the method and the resources used for its training.
  4. Participants can see the performance of their method in the ranking page.
  5. Participants can submit as many predictions as they like, but we would be very grateful if participants submit at most one prediction per hour.
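
As a reference, here is a minimal sketch in Python of producing a submission file in the required format. The file name and the predictions list are placeholders for illustration, not part of the competition specification.

    # Minimal sketch: one line per test instance, each line containing the
    # predicted class (0 or 1). File name and predictions are placeholders.
    predictions = [0, 1, 0, 0, 1]  # in practice, one prediction per test instance

    with open("predictions.txt", "w") as out:
        for label in predictions:
            out.write(f"{label}\n")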

Dataset

The dataset selected for this competition comes from the Protein Structure Prediction field, and it was originally generated to train a predictor for the residue-residue contact prediction track of the CASP9 competition. The dataset has 32 million instances, 631 attributes, 2 classes, 98% negative examples, and occupies, when uncompressed, about 56GB of disk space. The details of the dataset generation and a learning strategy used to train a method for this problem using evolutionary computation are available at http://bioinformatics.oxfordjournals.org/content/28/19/2441. The dataset is available in the ARFF format of the WEKA machine learning package.
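
Given the roughly 56GB uncompressed size, participants will likely want to process the file in a streaming fashion rather than loading it into memory. The sketch below shows one possible way to iterate over the data section of a dense, comma-separated ARFF file in Python; the file name and the assumption that the class is the last attribute are ours, not part of the dataset description.

    # Minimal sketch of streaming a dense ARFF file row by row, assuming the
    # data section is comma-separated. The file name is a placeholder.
    def iter_arff_instances(path):
        """Yield each data row of an ARFF file as a list of string values."""
        in_data = False
        with open(path) as f:
            for line in f:
                line = line.strip()
                if not line or line.startswith("%"):   # skip blanks and comments
                    continue
                if not in_data:
                    if line.lower() == "@data":        # header ends at @data
                        in_data = True
                    continue
                yield line.split(",")

    # Example: count class frequencies, assuming the class is the last attribute.
    counts = {}
    for row in iter_arff_instances("ecbdl14_train.arff"):  # placeholder file name
        counts[row[-1]] = counts.get(row[-1], 0) + 1
    print(counts)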

Evaluation

For each prediction we will compute four metrics: true positive rate (TPR), true negative rate (TNR), accuracy, and a final score of TPR · TNR. We have chosen this final score because of the huge class imbalance of the dataset: we want to reward methods that try to predict the minority class of the problem well. During the workshop we will qualitatively evaluate the balance between the final scores and the amount of resources used by each predictor.
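
For reference, the sketch below shows how these metrics can be computed from a list of true labels and a list of predicted labels (both 0/1). The function and variable names are ours; the metric definitions follow the description above.

    # Minimal sketch of the evaluation metrics described above, computed from
    # 0/1 true labels and 0/1 predictions. Names are placeholders.
    def evaluate(y_true, y_pred):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
        tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
        tpr = tp / (tp + fn) if (tp + fn) else 0.0   # true positive rate
        tnr = tn / (tn + fp) if (tn + fp) else 0.0   # true negative rate
        accuracy = (tp + tn) / len(y_true)
        return {"TPR": tpr, "TNR": tnr, "accuracy": accuracy, "score": tpr * tnr}

    print(evaluate([1, 0, 0, 1, 0], [1, 0, 1, 0, 0]))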

Deadline: We will accept predictions until June 30th, 2014.