Pandas and Scikit-learn are popular and easy to use for data processing and machine learning in Python, but they are not designed for Big Data. Pandas loads data into memory, so it can handle only as much data as the machine's RAM allows. What if the data is larger than that (even just on the order of gigabytes)? In that case, PySpark is a great solution. PySpark is the Python API for Spark. It can process data with SQL-like queries, and it comes with machine learning libraries as well.
At the end of the workshop, participants will be able to use PySpark for Big Data processing and machine learning.
Intermediate level
- Be familiar with the Pandas library for Python
- Have your laptop with Anaconda already installed
- It is highly recommended that participants install PySpark in advance. Please follow the instructions here
- Don't hesitate to send an email to razavi@umich if you have any questions about installing PySpark or about the workshop in general
- Please download all data from here to save time during the workshop