Spark Fundamentals II
Building on your foundational knowledge of Spark, take this opportunity to move your skills to the next level. With a focus on Spark Resilient Distributed Dataset operations, this course exposes you to concepts that are critical to your success in this field.
ABOUT THIS COURSE
Expand your knowledge of the concepts discussed in Spark Fundamentals I with a focus on RDDs (Resilient Distributed Datasets). RDDs are the main abstraction Spark provides to enable parallel processing across the nodes of a Spark cluster.
Get in-depth knowledge of Spark’s architecture and how data is distributed and tasks are parallelized.
Learn how to optimize your data for joins using Spark’s memory caching.
Learn how to use the more advanced operations available in the API.
The lab exercises for this course are performed exclusively in the cloud, using a notebook interface.
IBM Data Science Experience provides you with Jupyter notebooks that are already connected to Spark and support Python, R, and Scala, so that you can start creating your Spark projects and collaborating with other data scientists. When you sign up, you get free access to Data Science Experience and all other IBM services for 30 days. Start now and take advantage of this offer.
COURSE SYLLABUS
Module 1 – Introduction to Notebooks
Understand how to use Zeppelin in your Spark projects
Identify the various notebooks you can use with Spark
Module 2 – Spark RDD Architecture
Understand how Spark generates RDDs
Manage partitions to improve RDD performance
Module 3 – Optimizing Transformations and Actions
Use advanced Spark RDD operations
Identify what operations cause shuffling
Module 4 – Caching and Serialization
Understand how and when to cache RDDs
Understand storage levels and their uses
Module 5 – Developing and Testing
Understand how to use sbt to build Spark projects
Understand how to use Eclipse and IntelliJ for Spark development