You'll need to be familiar with Java. We'll be using lambdas throughout, but this course is a good introduction to them if you're not already familiar with them.
Having problems? Check the errata for this course.
Chapter | Title | Duration | Description
1 | Introduction | 16m 56s | A brief overview of Spark and some of the jargon you'll be encountering.
2 | Getting Started | 21m 35s | Let's get Spark "installed" - it's just a Maven dependency.
3 | Reduces | 14m 19s | Reduces are fundamental operations. Here we'll do a very basic reduce to establish the idea.
 | Update - problems with NotSerializableExceptions? | 6m 28s | If, in the next chapter on "Mapping" (or any later chapter), you experience a NotSerializableException, it is because your machine has multiple CPUs (or cores) and Spark treats each one as a node in a cluster, which causes a crash with System.out.println. See this video for a simple workaround.
4 | Mapping | 17m 45s | Mapping allows you to transform the RDD from one form to another.
5 | Tuples | 18m 12s | Commonly used in Scala, Tuples appear everywhere in the Spark Core API. We can use them in Java, but they are a bit awkward.
6 | PairRDDs | 41m 30s | A PairRDD is a key/value representation of a dataset.
7 | FlatMap and Filtering | 14m 46s | FlatMap looks complicated, but it's a simple transformation. We'll also see how to filter.
8 | Reading Files | 13m 26s | We can read from local files, or from big data file systems such as S3 or HDFS.
9 | Keyword Ranking | 41m 47s | A major exercise: we'll automatically generate keywords for training courses based on their subtitle files.
10 | Sorts and Coalesces | 28m 44s | There are some common misunderstandings about sorts, and we'll address them here. Also, what is coalesce used for, and when shouldn't it be used?
11 | Deploying to EMR | 40m 42s | We'll now deploy to a live cluster. Spark can deploy to Hadoop YARN clusters, or you can build a standalone cluster. Here we use Amazon EMR. Even if you're not using EMR, do watch this chapter, as there is a lot to learn from running on real hardware.
12 | Joins | 27m 27s | One last transformation type on the course - how to do Inner, Outer, Full and Cartesian Joins.
13 | Big Data Big Exercise | 51m 35s | A chance for you to practice everything - a real "course ranking" process we run here at VirtualPairProgrammers.
14 | Performance | 80m 8s | A deeper look into the internals of Spark.
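
To give a feel for the lambda style used in the chapters on reduces, mapping, PairRDDs, flatMap and filtering, here is a minimal sketch. It is an assumption-laden illustration, not course material: it uses plain Java streams from the JDK as a stand-in for Spark, so it runs without any Spark dependency, and the class and method names are hypothetical. In Spark, the corresponding operations would be called on a JavaRDD or JavaPairRDD (for example map, reduce, flatMap, filter and reduceByKey) rather than on a stream.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Hypothetical class: JDK streams standing in for Spark RDDs (assumption, not the course's code).
public class TransformationSketch {

    // Reduce: repeatedly combine two values into one until a single result is left.
    static double sum(List<Double> values) {
        return values.stream().reduce(0.0, (a, b) -> a + b);
    }

    // Map: turn every element into a new form (here, Integer to Double).
    static List<Double> squareRoots(List<Integer> values) {
        return values.stream()
                .map(v -> Math.sqrt(v))
                .collect(Collectors.toList());
    }

    // FlatMap then filter: each sentence yields zero or more words,
    // and only words of at least minLength characters are kept.
    static List<String> longWords(List<String> sentences, int minLength) {
        return sentences.stream()
                .flatMap(s -> Arrays.stream(s.split(" ")))
                .filter(w -> w.length() >= minLength)
                .collect(Collectors.toList());
    }

    // Key/value counting: take the text before the first ':' as the key
    // and count how many lines share each key (similar in spirit to a
    // pair dataset reduced by key).
    static Map<String, Long> countByLevel(List<String> logLines) {
        return logLines.stream()
                .map(line -> line.split(":")[0])
                .collect(Collectors.groupingBy(level -> level, TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        System.out.println(sum(Arrays.asList(1.0, 2.0, 3.0)));
        System.out.println(squareRoots(Arrays.asList(4, 9, 16)));
        System.out.println(longWords(Arrays.asList("spark is fast", "rdds are distributed"), 5));
        System.out.println(countByLevel(Arrays.asList("WARN: a", "ERROR: b", "WARN: c")));
    }
}
```

The shape of each lambda carries over to Spark, but the execution model does not: a stream runs inside one JVM, while an RDD is partitioned across the nodes of a cluster.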