Machine Learning Reproducibility at Scale using Open Source Tooling

Workshop / Overview

Workshop documents are available on GitHub

Despite the many successful applications of statistics, machine learning, and visualization in industry, many attempts at production-scale machine learning are anything but scientific. Specifically, ML workflows often lack reproducibility, a key tenet of science in general and a precursor to a truly collaborative environment. In this workshop, we will discuss the importance of reproducibility and data provenance in any applied machine learning workflow. We will then implement a realistic machine learning workflow, emphasizing these points and using open source tooling to overcome the challenges associated with reproducibility.

Roadmap:

(1) We will first expose a breakdown in reproducibility using common machine learning tooling and processes. We will use Python to implement training and inference for a common classification problem, and we will see that, even in this common use case, it is a challenge to maintain any semblance of reproducibility and result provenance.

(2) We will solve some of the environment-related challenges of scaling machine learning workflows using a first open source tool, Docker. Docker will allow us to run our training and inference reliably and as expected in environments that are different from our development environment.

(3) We will address the challenge of maintaining reproducibility for multi-stage workflows by introducing the open source orchestration tool Kubernetes and the open source data pipelining tool Pachyderm. These systems will allow us to predictably deploy, run, and trigger our multi-stage ML workflows.

(4) We will update our model and run our ML workflow multiple times. We will then demonstrate how you can link particular results to particular versions of your models and particular versions of your data. This linking, or provenance, gives you full reproducibility over time.
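
To make the idea of provenance in step (4) concrete, the following is a minimal sketch in plain Python, not the workshop's actual code and not how Pachyderm implements it. It assumes a toy one-feature classifier, and the helper names (`fingerprint`, `train_and_evaluate`, `run_with_provenance`) and the `train_fraction` setting are hypothetical. Each result is stored together with a content hash of the data, a content hash of the model configuration, and the random seed, so a given result can be traced back to the exact inputs that produced it.

```python
import hashlib
import json
import random


def fingerprint(obj) -> str:
    """Short SHA-256 hash of an object's JSON form, used here as a version id."""
    payload = json.dumps(obj, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:12]


def train_and_evaluate(rows, config, seed):
    """Toy training run: seeded shuffle, train/test split, threshold classifier."""
    rng = random.Random(seed)  # local seeded generator, no hidden global state
    shuffled = list(rows)
    rng.shuffle(shuffled)
    cut = int(config["train_fraction"] * len(shuffled))
    train, test = shuffled[:cut], shuffled[cut:]

    # "Model": the midpoint between the two class means of the single feature.
    xs0 = [x for x, y in train if y == 0]
    xs1 = [x for x, y in train if y == 1]
    threshold = (sum(xs0) / len(xs0) + sum(xs1) / len(xs1)) / 2

    correct = sum(int(x > threshold) == y for x, y in test)
    return threshold, correct / len(test)


def run_with_provenance(rows, seed, config):
    """Run the workflow and attach the versions of everything it depended on."""
    threshold, accuracy = train_and_evaluate(rows, config, seed)
    return {
        "data_version": fingerprint(rows),     # changes when the data changes
        "model_version": fingerprint(config),  # changes when the config changes
        "seed": seed,
        "threshold": threshold,
        "accuracy": accuracy,
    }


# Hypothetical data set: one feature, label 1 when the feature is at least 5.0.
rows = [(i / 10, int(i >= 50)) for i in range(100)]
record = run_with_provenance(rows, seed=7, config={"train_fraction": 0.8})
print(json.dumps(record, indent=2))
```

Running the same data, configuration, and seed twice yields an identical record, while changing the data changes only `data_version`. Tools such as Pachyderm track this kind of linkage for whole pipelines rather than for a single function call.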

Workshop / Outcome

Each participant in the workshop will deploy their own data pipeline on their own production cluster, to which they will be given access during the workshop. This will allow participants to work hands-on in a realistic environment and gain practical skills that they can apply after the workshop. After the workshop, they will be able to Docker-ize and deploy their own multi-stage ML workflows in a scalable and reproducible manner.

Workshop / Difficulty

Intermediate level

Workshop / Prerequisites

Laptop with the ability to ssh into a remote machine
Some familiarity with the command line

Track / Co-organizers

Daniel Whitenack

Data Scientist, SIL International

AMLD EPFL 2018 / Workshops
