Active Learning in the Real World

Workshop / Overview

⚠️ A valid COVID certificate must be presented on site to enter the event. ⚠️

Supervised machine learning algorithms require ever more labeled data to train. When datasets are large or labeling is expensive, selecting only the most informative samples makes it possible to reach better performance at lower cost. This is the promise of Active Learning (AL), which relies on the following iterative process: a model is first trained on the data that is already labeled, then this model and insights from the data are used to select new samples and query an oracle for their labels. The newly labeled data is added to the labeled pool and the procedure starts again. Surprisingly, modern active learning techniques bring only marginal improvements over older ones. Moreover, no method is guaranteed to beat a strategy as simple as random sampling on an unseen dataset. For this reason, we advocate the use of metrics to monitor exactly what is happening in an active learning experiment.
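
The iterative process described above can be sketched in a few lines. This is only an illustration, not the workshop's notebook code: it assumes scikit-learn is installed, uses a synthetic dataset, simulates the oracle by revealing labels that are already known, and picks least-confidence sampling as one possible query strategy. The names `uncertainty_query` and `active_learning_loop` are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier


def uncertainty_query(model, X_pool, batch_size):
    """Return positions (within the pool) of the least confident predictions."""
    proba = model.predict_proba(X_pool)
    confidence = proba.max(axis=1)  # probability of the predicted class
    return np.argsort(confidence)[:batch_size]


def active_learning_loop(X, y, n_init=20, batch_size=20, n_iter=5, seed=0):
    """Pool-based loop: train, query, label, repeat (hypothetical helper)."""
    rng = np.random.RandomState(seed)
    labeled = np.zeros(len(X), dtype=bool)
    # Start from a small random set of already labeled samples.
    labeled[rng.choice(len(X), size=n_init, replace=False)] = True
    model = RandomForestClassifier(n_estimators=50, random_state=seed)
    for _ in range(n_iter):
        model.fit(X[labeled], y[labeled])
        pool_idx = np.flatnonzero(~labeled)
        picked = uncertainty_query(model, X[pool_idx], batch_size)
        # Simulated oracle: the true labels in y become available for training.
        labeled[pool_idx[picked]] = True
    model.fit(X[labeled], y[labeled])
    return model, labeled


X, y = make_classification(n_samples=500, n_features=10, random_state=0)
model, labeled = active_learning_loop(X, y)
print("labeled samples:", labeled.sum())
```

In a real experiment the oracle is a human annotator, so the line that flips `labeled` would be replaced by an annotation step.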

The workshop will be organised as follows:

Setting up, 15 minutes
Introduction to Active Learning, 45 minutes – Presentation of Active Learning: the basic methods, the more advanced ones, the various applications, and state-of-the-art results.
Setting up an active learning experiment, 15 minutes – Thanks to guided notebooks, set up the main loop of the experiment.
Testing various active learning strategies, 30 minutes – We propose to discover the cardinal, alipy, and modAL Python packages. Each of them features specific strategies.
Break, 15 minutes
Evaluating the performance of active learning methods online, 30 minutes – After having simulated active learning experiments and plotted performance graphs on known datasets, we propose to use a notebook-based interactive interface to perform an actual experiment on an unknown dataset. Based on the previous observations and on active learning metrics, each participant has to determine which method to use and when to switch. Results are then compared among participants.
Industry talk, 45 minutes – see below
Closing remarks, 15 minutes
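
As a rough idea of what the simulation part of the schedule involves, the sketch below runs the same experiment with two query strategies and records test accuracy after each iteration, which is the kind of curve one would plot and monitor. It is an assumption-laden illustration: it relies on scikit-learn only (not on cardinal, alipy, or modAL), uses synthetic data, and the function names are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split


def random_query(model, X_pool, batch_size, rng):
    """Baseline: pick pool samples uniformly at random."""
    return rng.choice(len(X_pool), size=batch_size, replace=False)


def least_confidence_query(model, X_pool, batch_size, rng):
    """Pick the pool samples whose predicted class has the lowest probability."""
    confidence = model.predict_proba(X_pool).max(axis=1)
    return np.argsort(confidence)[:batch_size]


def run_experiment(query, X_train, y_train, X_test, y_test,
                   n_init=20, batch_size=20, n_iter=5, seed=0):
    """Simulate an experiment and return test accuracy at each iteration."""
    rng = np.random.RandomState(seed)
    labeled = np.zeros(len(X_train), dtype=bool)
    labeled[rng.choice(len(X_train), size=n_init, replace=False)] = True
    accuracies = []
    for _ in range(n_iter):
        model = RandomForestClassifier(n_estimators=50, random_state=seed)
        model.fit(X_train[labeled], y_train[labeled])
        accuracies.append(model.score(X_test, y_test))
        pool_idx = np.flatnonzero(~labeled)
        picked = query(model, X_train[pool_idx], batch_size, rng)
        labeled[pool_idx[picked]] = True  # simulated oracle
    return accuracies


X, y = make_classification(n_samples=600, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

results = {
    "random": run_experiment(random_query, X_train, y_train, X_test, y_test),
    "least_confidence": run_experiment(
        least_confidence_query, X_train, y_train, X_test, y_test),
}
for name, curve in results.items():
    print(name, [round(a, 3) for a in curve])
```

Which strategy comes out ahead depends on the dataset and the seed, which is why the random baseline is worth keeping in every comparison.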

Industry talk, 45 minutes:
Asset Integrity Inspection using (Active) Machine Learning, by Nader Salman, Project Manager Data Science Platform.

General visual inspection, required to assess subsea asset integrity in working conditions that are hazardous for humans, is carried out by remotely operated vehicles, which are gradually being replaced by autonomous underwater vehicles with onboard real-time camera imagery. These images are analyzed using state-of-the-art object detection techniques that are able to detect and track defects despite low resolution and high noise. While these techniques may look impressive in practice and in many cases surpass human performance, they require a large amount of data manually labeled by experts.

Often considered slow to change, the oil and gas industry has tackled the labeling problem by pushing on the research front to optimize the cost of labeling while ensuring top performance of object detectors. With such technology in place, one can provide automated anomaly detection on real large-scale subsea infield and pipeline inspection data.
In his talk, Nader will show a concrete example of active learning, including advanced metrics, and how they can help produce efficient workflows for computer vision: ingesting the data, labeling it collaboratively, and training machine learning models for automated feature detection onboard the robots. Very promising efficiencies can be gained through the machine learning workflow and through automated anomaly detection.

Workshop / Outcome

After this workshop, participants will understand the basics of Active Learning and will be able to set up an experiment. They will know the landscape of query strategies and how to select the best fit for their use case. They will also know how to monitor an experiment and make decisions based on what they observe. Finally, the industry talk will give them a glimpse of how Active Learning is implemented in the real world.

Workshop / Difficulty

Intermediate level

Workshop / Prerequisites

Active Learning is an experimental framework that makes use of machine learning models. We will use Random Forest classifiers, but they will not be explained.
A general knowledge of Machine Learning is therefore preferred.
Laptop with access to Google Colab or a Python 3.5+ installation featuring:
- jupyter notebooks
- cardinal (and all its dependencies)
- modAL
- alipy
Notebook: https://github.com/dataiku-research/active-learning-tutorial
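
For a local installation, a quick check along the following lines may help confirm that the prerequisites are in place before the workshop. It is a sketch that assumes the import names are `notebook`, `cardinal`, `modAL`, and `alipy`; the helper `missing_packages` is hypothetical and not part of the tutorial repository.

```python
import sys
from importlib.util import find_spec

# Assumed import names for the packages listed in the prerequisites.
REQUIRED = ["notebook", "cardinal", "modAL", "alipy"]


def missing_packages(names):
    """Return the names that cannot be found in the current environment."""
    return [name for name in names if find_spec(name) is None]


print("Python 3.5+:", sys.version_info >= (3, 5))
print("Missing packages:", missing_packages(REQUIRED))
```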

Track / Co-organizers

Alexandre Abraham

Research Scientist, Dataiku

Léo Dreyfus-Schmidt

Research Director, Dataiku

Nader Salman

Project Manager Data Science Platform, Schlumberger

AMLD EPFL 2021 Active Learning in the Real World