We propose a challenge that consists of extracting information from scanned incorporation documents for companies taking loans from a large financial company. The challenge has two main aspects. The first is Optical Character Recognition (OCR), which must be performed on documents that in some cases contain heavy noise, water stains, stamps, or background patterns. The second is text understanding, where relevant passages must be returned according to the business requirements. A set of sample documents will be provided for testing purposes, along with the exact elements of text that need to be extracted.
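To make the text understanding step more concrete, here is a minimal sketch of pulling fields out of OCR output with regular expressions. The field names (`company_name`, `incorporation_date`), the patterns, and the sample sentence are hypothetical illustrations; the actual elements to extract are the ones specified for the challenge, and a real solution would likely need something more robust than fixed patterns.

```python
import re

# Hypothetical field patterns: the real elements to extract are defined by
# the challenge, not by this sketch.
FIELD_PATTERNS = {
    "company_name": re.compile(
        r"name of the company is\s+(.+?)(?:\.|$)", re.IGNORECASE | re.MULTILINE
    ),
    "incorporation_date": re.compile(
        r"incorporated on\s+(\d{1,2}\s+\w+\s+\d{4})", re.IGNORECASE
    ),
}


def normalise(text):
    """Collapse runs of spaces and tabs, which OCR output often contains."""
    return re.sub(r"[ \t]+", " ", text).strip()


def extract_fields(ocr_text):
    """Return a dict mapping each field name to the matched text, or None."""
    cleaned = normalise(ocr_text)
    results = {}
    for field, pattern in FIELD_PATTERNS.items():
        match = pattern.search(cleaned)
        results[field] = match.group(1).strip() if match else None
    return results


if __name__ == "__main__":
    sample = (
        "The  name of the company is ACME HOLDINGS LIMITED.\n"
        "The company was incorporated on 12 March 2015 in London."
    )
    print(extract_fields(sample))
```

Returning `None` for a missing field, rather than raising, keeps one unreadable passage from stopping the processing of a whole batch of documents.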
Part tutorial and part mini-hackathon, the workshop aims to provide something useful for all levels of Machine Learning skill. For beginners, we will provide an initial hands-on tutorial that goes step by step through a baseline implementation of the full processing stack, from optical character recognition and image processing all the way to natural language processing. After the tutorial, they will be able to work on the challenge itself. Advanced users will be able to work on the challenge from the very beginning, using a set of documents that have already been passed through OCR.
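The processing stack described above can be sketched as three chained stages. This is not the workshop's baseline, only an illustration under stated assumptions: it assumes the Pillow and `pytesseract` packages are installed alongside Tesseract, uses a median filter as one possible way to reduce noise, and substitutes a naive keyword search for the natural language processing step. The function names are hypothetical.

```python
def preprocess_image(path):
    """Load a scan, convert it to grayscale, and apply a median filter."""
    # Assumption: Pillow is installed; imported lazily so the other stages
    # can be used on their own.
    from PIL import Image, ImageFilter, ImageOps

    image = Image.open(path)
    gray = ImageOps.grayscale(image)
    return gray.filter(ImageFilter.MedianFilter(size=3))


def run_ocr(image):
    """Run Tesseract on a preprocessed image and return the raw text."""
    # Assumption: pytesseract is installed and can find the tesseract binary.
    import pytesseract

    return pytesseract.image_to_string(image)


def find_passages(text, keywords):
    """Return paragraphs mentioning any keyword (naive stand-in for NLP)."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    lowered = [k.lower() for k in keywords]
    return [p for p in paragraphs if any(k in p.lower() for k in lowered)]


def process_document(path, keywords, preprocess=preprocess_image, ocr=run_ocr):
    """Chain the stages: image preprocessing, OCR, then passage selection."""
    image = preprocess(path)
    text = ocr(image)
    return find_passages(text, keywords)
```

Because `preprocess` and `ocr` are passed in as parameters, participants who start from documents that have already been through OCR can skip the first two stages and call `find_passages` directly on the provided text.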
Participants will learn about a practical use case of automating a business process using NLP.
Participants will explore a baseline implementation for this use case that covers the full processing pipeline, from OCR to the automated business decision.
Participants will work in teams to improve the baseline result.
Beginner level
- A laptop with Anaconda and Tesseract 4.0 pre-installed
- An AIcrowd account
- Python programming knowledge
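
One possible way to check the laptop setup before the session is sketched below. It only reports the Python version and whether a `tesseract` executable is on the `PATH`; the function name `check_environment` is hypothetical, and the check does not verify the Anaconda installation or the AIcrowd account.

```python
import shutil
import subprocess
import sys


def check_environment():
    """Report the Python version and the Tesseract version line, if found."""
    report = {"python": sys.version.split()[0], "tesseract": None}
    exe = shutil.which("tesseract")
    if exe:
        out = subprocess.run([exe, "--version"], capture_output=True, text=True)
        # Depending on the build, the version banner may go to stdout or stderr.
        lines = (out.stdout or out.stderr).splitlines()
        report["tesseract"] = lines[0] if lines else "unknown version"
    return report


if __name__ == "__main__":
    for name, value in check_environment().items():
        print(f"{name}: {value or 'not found'}")
```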