PySpark: Big Data Processing and Machine Learning with Python

Workshop / Overview

Pandas and scikit-learn are popular and easy to use for data processing and machine learning in Python, but they are not designed for Big Data: Pandas can handle only as much data as fits in the machine's RAM. When the data is larger than that (even just on the order of gigabytes), PySpark is a good alternative. PySpark is the Python API for Apache Spark. It can process data with SQL-like queries, and it comes with machine learning libraries as well.

Workshop / Outcome

At the end of the workshop, participants will be able to use PySpark for Big Data processing and machine learning.

Workshop / Difficulty

Intermediate level

Workshop / Prerequisites

Be familiar with the Pandas library for Python
Bring your laptop with Anaconda already installed
It is highly recommended that participants install PySpark in advance. Please follow the instructions here
Don't hesitate to send an email to razavi@umich if you have any questions about installing PySpark or any other questions regarding the workshop
Please download all data from here to save time during the workshop
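For participants who install PySpark in advance, a quick self-check along the following lines may help confirm the setup before the workshop. This is only a sketch and not part of the official installation instructions: the function name check_setup is hypothetical, and the Java check only looks for a java executable on the PATH (Spark needs a Java runtime, but a runtime configured only through JAVA_HOME would not be detected here).

```python
import importlib.util
import shutil


def check_setup():
    """Report whether PySpark and a Java runtime appear to be available."""
    return {
        # True if the pyspark package can be found in the current environment.
        "pyspark": importlib.util.find_spec("pyspark") is not None,
        # True if a `java` executable is on the PATH (assumption: PATH only).
        "java": shutil.which("java") is not None,
    }


if __name__ == "__main__":
    for name, found in check_setup().items():
        print(f"{name}: {'found' if found else 'NOT found'}")
```

Run it from the same Anaconda environment that will be used during the workshop, since packages installed in one environment are not visible from another.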

Track / Co-organizers

Hamed Razavi

Scientist, EPFL

AMLD EPFL 2019 PySpark: Big Data Processing and Machine Learning with Python