Document Digitization Challenge

Workshop / Overview

We propose a challenge that consists of extracting information from scanned incorporation documents of companies taking loans from a large financial company. The challenge has two main aspects. The first is Optical Character Recognition (OCR), which must be performed on documents that in some cases have heavy noise, water stains, stamps or background patterns. The second is text understanding, where relevant passages have to be returned according to the business requirements. A set of sample documents will be provided for testing purposes, together with the exact text elements that need to be extracted.

Positioned between a tutorial and a mini-hackathon, the workshop aims to offer something useful for all levels of Machine Learning skill. For beginners, we will provide an initial hands-on tutorial that goes step by step through a baseline implementation of the full processing stack, from image processing and optical character recognition all the way to natural language processing. After the tutorial, they will be able to work on the challenge itself. Advanced users will be able to work on the challenge from the very beginning, using a set of documents that have already been passed through OCR.
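
To give a rough idea of what the text understanding step could look like on documents that have already been passed through OCR, here is a minimal sketch in Python. It is not the workshop baseline: the field names, the regular expressions and the sample text are all hypothetical, chosen only for illustration, and the actual elements to extract are defined by the challenge's business requirements.

```python
import re
from typing import Dict, Optional

# Hypothetical target fields and patterns (assumptions for illustration only;
# the real challenge specifies its own elements to extract).
FIELD_PATTERNS = {
    "company_name": re.compile(r"name of the company is\s+(.+?)\.", re.IGNORECASE),
    "share_capital": re.compile(r"share capital\D*?(\d[\d',.]*\d)", re.IGNORECASE),
}


def normalize_ocr_text(text: str) -> str:
    """Rejoin words hyphenated across line breaks and collapse whitespace."""
    text = re.sub(r"-\s*\n\s*", "", text)  # "INCORPO-\nRATION" -> "INCORPORATION"
    text = re.sub(r"\s+", " ", text)       # newlines and repeated spaces -> one space
    return text.strip()


def extract_fields(ocr_text: str) -> Dict[str, Optional[str]]:
    """Return the first match for each field, or None when nothing is found."""
    clean = normalize_ocr_text(ocr_text)
    result: Dict[str, Optional[str]] = {}
    for field, pattern in FIELD_PATTERNS.items():
        match = pattern.search(clean)
        result[field] = match.group(1).strip() if match else None
    return result


# Made-up OCR output, including a hyphenated line break and uneven spacing.
sample = (
    "ARTICLES OF INCORPO-\nRATION\n\n"
    "The name of the   company is Acme Holdings AG.\n"
    "The share capital amounts to CHF 100,000 divided into shares."
)

if __name__ == "__main__":
    print(extract_fields(sample))
```

Fixed patterns like these are brittle against OCR errors and varied wording (a company name containing a period, for instance, would be cut short), so they are best seen as a starting point that participants could improve on.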

Workshop / Outcome

Participants will learn about a practical use case of automating a business process using NLP.
Participants will explore a baseline implementation for this use case going through the full processing pipeline, from OCR to the automated business decision.
Participants will work in teams to improve the baseline result.

Workshop / Difficulty

Beginner level

Workshop / Prerequisites

A laptop with Anaconda and Tesseract 4.0 pre-installed
An AIcrowd account
Python programming knowledge

Track / Co-organizers

Mihai Gurban

Business Intelligence Tools, Credit Suisse

Raquel Terrés Cristofani

Assistant Vice President - RegTech Program Manager, Credit Suisse

AMLD EPFL 2019 / Workshops

TensorFlow Basics 2019 – Saturday

With Bartek Wołowiec, Megan Ruthven, Ruslan Habalov & Andreas Steiner

09:00-16:30 January 26

TensorFlow Basics 2019 – Sunday

With Bartek Wołowiec, Megan Ruthven, Ruslan Habalov & Andreas Steiner

09:00-16:30 January 27, 4ABC

Blue Brain Nexus, a knowledge graph for data driven projects

With Mohameth François Sy, Samuel Kerrien & Huanxiang Lu

09:00-16:30 January 27, 2BC

AMLD EPFL 2019 Document Digitization Challenge