Improve your coding skills from beginner to expert with the largest online Java e-learning platform

Spark for Java Developers

Big Data with Java Lambdas!
  • Get started with the amazing Apache Spark parallel computing framework - this course is designed especially for Java Developers
  • Learn all of the fundamentals you need to understand the main operations you can perform in Spark Core.
  • Deploy to a live EMR hardware cluster.
  • Understand the internals of Spark and how it optimizes your execution plans.
  • Get some great practice with Java 8 Lambdas - our most "functional" course to date!
  • There will be a follow-on module covering SparkSQL later in the year.

Pre-requisites

You'll need to be familiar with Java. We'll be using Lambdas throughout, but this course is a good introduction to them if you're not familiar with them already.

Contents - The course is 6 hours long and would be equivalent to a 3-day training course.

 

Having problems? Check the errata for this course.

1

Introduction
16m 56s
A brief overview of Spark and some of the jargon terms you'll be encountering.

2

Getting Started
21m 35s
Let's get Spark "installed" - it's just a maven dependency.

3

Reduces
14m 19s
Reduces are fundamental transformations. Here we'll do a very basic reduce to establish the idea.
Update - problems with NotSerializableExceptions?
6m 28s
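If you hit a NotSerializableException in the next chapter on "Mapping" (or in any later chapter), it is because your CPU architecture is sophisticated enough for Spark to treat each CPU as a node in a cluster! Spark then needs to serialize the function you pass to it, and printing with System.out.println (typically via a method reference such as System.out::println) drags in System.out, which is not serializable, so the job crashes. See this video for a simple workaround.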
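To see why printing can trigger the NotSerializableException mentioned above, here is a minimal sketch in plain Java, with no Spark involved. It assumes the crash comes from a System.out::println method reference: that reference captures the System.out object (a java.io.PrintStream, which is not Serializable), while an equivalent lambda captures nothing. The SerializableConsumer interface and the SerializationDemo class are hypothetical stand-ins for Spark's serializable function types, and the workaround shown in the video may well be a different one.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.NotSerializableException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.function.Consumer;

public class SerializationDemo {

    // Hypothetical stand-in for a Spark function type:
    // a Consumer that is also required to be Serializable.
    interface SerializableConsumer<T> extends Consumer<T>, Serializable {}

    // Attempts standard Java serialization on the function and
    // reports whether it succeeded.
    static boolean isSerializable(SerializableConsumer<?> function) {
        try (ObjectOutputStream out = new ObjectOutputStream(new ByteArrayOutputStream())) {
            out.writeObject(function);
            return true;
        } catch (NotSerializableException e) {
            // Something captured by the function is not Serializable.
            return false;
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Bound method reference: captures the System.out PrintStream instance.
        SerializableConsumer<Object> methodReference = System.out::println;

        // Lambda: looks up System.out only when it runs, so it captures nothing.
        SerializableConsumer<Object> lambda = value -> System.out.println(value);

        System.out.println("method reference serializable: " + isSerializable(methodReference));
        System.out.println("lambda serializable: " + isSerializable(lambda));
    }
}
```

Running main should report that the method reference cannot be serialized while the lambda can. In a Spark job, another common way around the problem is to bring the results back to the driver first and print them there.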

4

Mapping
17m 45s
Mapping allows you to transform the RDD from one form to another.

5

Tuples
18m 12s
Commonly used in Scala, Tuples appear everywhere in the Spark Core API. We can use them in Java, but they are a bit awkward.

6

PairRDDs
41m 30s
A PairRDD is a key/value representation of a dataset.

7

FlatMap and Filtering
14m 46s
A FlatMap looks complicated, but it's a simple transformation. We'll also see how to filter.

8

Reading Files
13m 26s
We can read local files, or files from big data file systems such as S3 or HDFS.

9

Keyword Ranking
41m 47s
A major exercise: we'll automatically generate keywords for training courses based on their subtitle files.

10

Sorts and Coalesces
28m 44s
There are some common misunderstandings about sorts, and we'll address them here. We'll also look at what Coalesce is used for (and when it shouldn't be used).

11

Deploying to EMR
40m 42s
We'll now deploy to a live cluster. Spark can deploy to Hadoop YARN clusters, or you can build a standalone cluster. Here we use Amazon EMR. Even if you're not using EMR, do watch this chapter as there is a lot to learn from running on real hardware.

12

Joins
27m 27s
One last transformation type on the course - how to do Inner, Outer, Full and Cartesian Joins.

13

Big Data Big Exercise
51m 35s
A chance for you to practice everything - a real "course ranking" process we run here at VirtualPairProgrammers.

14

Performance
80m 8s
A deeper look into the internals of Spark.
