Using PySpark and interactive Jupyter notebook on Amazon Clusters

Workshop / Overview

Working with big data sometimes requires access to remote distributed systems such as Amazon Web Services or Google Cloud. In this workshop, I will show how to set up PySpark on Amazon Elastic MapReduce (EMR) and do interactive data processing and machine learning on EMR from a Jupyter notebook on your local computer.

Workshop / Outcome

At the end of the workshop, participants will be able to use PySpark for data processing and machine learning on Amazon EMR. They will also learn how to set up an interactive Jupyter notebook that connects to Amazon EMR clusters.

Workshop / Difficulty

Intermediate level

Workshop / Prerequisites

Important: Please create an AWS account before the workshop. Note that even though you will get some free usage, you need to provide your credit card information during the AWS registration process. If you are a student, you can register for an AWS Educate account, in which case you will get more free-tier usage and may not need to provide your credit card information. Please be aware that the verification of student accounts may take up to 48 hours.
Know how to use PySpark (or have already participated in the "PySpark: Big Data Processing and Machine Learning with Python" workshop)
Please download all data from here to save time during the workshop

Track / Co-organizers

Hamed Razavi

Scientist, EPFL

AMLD EPFL 2019 / Workshops

TensorFlow Basics 2019 – Saturday

With Bartek Wołowiec, Megan Ruthven, Ruslan Habalov & Andreas Steiner

09:00-16:30 January 26

Document Digitization Challenge

With Mihai Gurban & Raquel Terrés Cristofani

09:00-16:30 January 26 (2A)

TensorFlow Basics 2019 – Sunday

With Bartek Wołowiec, Megan Ruthven, Ruslan Habalov & Andreas Steiner

09:00-16:30 January 27 (4ABC)
