Spark for Java Developers

Big Data with Java Lambdas!

Get started with the amazing Apache Spark parallel computing framework - this course is designed especially for Java Developers.
Learn all of the fundamentals you need to understand the main operations you can perform in Spark Core.
Deploy to a live EMR hardware cluster.
Understand the internals of Spark and how it optimizes your execution plans.
Get some great practice with Java 8 Lambdas - our most "functional" course to date!
There will be a follow-on module covering SparkSQL later in the year.

Pre-requisites

You'll need to be familiar with Java. We'll be using Lambdas throughout, but this course is a good introduction to them if you're not familiar with them already.

Contents - The course is 6 hours long and would be equivalent to a 3-day training course.

Having problems? Check the errata for this course.

1	Introduction	Preview 16m 56s
A brief overview of Spark and some of the jargon terms you'll be encountering.
2	Getting Started	Preview 21m 35s
Let's get Spark "installed" - it's just a Maven dependency.
3	Reduces	Watch 14m 19s
Reduces are fundamental operations. Here we'll do a very basic reduce to establish the idea.
	Update - problems with NotSerializableExceptions?	Watch 6m 28s
If you hit a NotSerializableException in the next chapter on "Mapping" (or in any later chapter), it is because your CPU architecture is sophisticated enough for Spark to treat each CPU as a node in a cluster, and this causes a crash with System.out.println. See this video for a simple workaround.
4	Mapping	Watch 17m 45s
Mapping allows you to transform an RDD from one form to another.
5	Tuples	Watch 18m 12s
Commonly used in Scala, Tuples appear everywhere in the Spark Core API. We can use them in Java, but they are a bit awkward.
6	PairRDDs	Watch 41m 30s
A PairRDD is a key/value representation of a dataset.
7	FlatMap and Filtering	Watch 14m 46s
FlatMap looks complicated, but it's a simple transformation. We'll also see how to filter.
8	Reading Files	Watch 13m 26s
We can read from local files, or from big data stores such as S3 or HDFS.
9	Keyword Ranking	Watch 41m 47s
A major exercise: we'll automatically generate keywords for training courses based on their subtitle files.
10	Sorts and Coalesces	Watch 28m 44s
There are some common misunderstandings about sorts, and we'll address them here. We'll also look at what Coalesce is used for, and when it shouldn't be used.
11	Deploying to EMR	Watch 40m 42s
We'll now deploy to a live cluster. Spark can deploy to Hadoop YARN clusters, or you can build a standalone cluster. Here we use Amazon EMR. Even if you're not using EMR, do watch this chapter, as there is a lot to learn from running on real hardware.
12	Joins	Watch 27m 27s
One last transformation type on the course - how to do Inner, Outer, Full and Cartesian Joins.
13	Big Data Big Exercise	Watch 51m 35s
A chance for you to practice everything - a real "course ranking" process we run here at VirtualPairProgrammers.
14	Performance	Watch 80m 8s
A deeper look into the internals of Spark.
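
If you'd like a feel for the style of code before starting, here is a minimal sketch of the ideas behind chapters 3 to 7 (reduce, map, key/value counting, flatMap and filter). It is an assumption-laden illustration: it uses plain java.util.stream lambdas on in-memory lists rather than the Spark API, so nothing is distributed, and the class and method names are hypothetical, not taken from the course.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Hypothetical illustration only: plain Java streams, NOT Spark RDDs.
public class ChapterOperationsSketch {

    // Reduce idea: collapse many values into one with an associative function.
    public static int sum(List<Integer> values) {
        return values.stream().reduce(0, Integer::sum);
    }

    // Map idea: turn every element into a new element, one for one.
    public static List<Double> squareRoots(List<Integer> values) {
        return values.stream()
                .map(v -> Math.sqrt(v))
                .collect(Collectors.toList());
    }

    // FlatMap idea: each line yields zero or more words.
    // Filter idea: keep only the words that pass a test.
    public static List<String> longWords(List<String> lines, int minLength) {
        return lines.stream()
                .flatMap(line -> Arrays.stream(line.split(" ")))
                .filter(word -> word.length() >= minLength)
                .collect(Collectors.toList());
    }

    // Key/value idea: count how many times each key occurs.
    public static Map<String, Long> countByKey(List<String> keys) {
        return keys.stream()
                .collect(Collectors.groupingBy((String k) -> k, TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        System.out.println(sum(Arrays.asList(1, 2, 3, 4)));
        System.out.println(squareRoots(Arrays.asList(4, 9)));
        System.out.println(longWords(Arrays.asList("spark for java", "big data"), 4));
        System.out.println(countByKey(Arrays.asList("WARN", "ERROR", "WARN")));
    }
}
```

The Spark Core operations covered in the course take lambdas of a similar shape, but they run across the partitions of an RDD rather than over a local list.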
