Previous knowledge of RDDs in Spark is assumed - module 1 in the series covers this.
Having problems? Check the errata for this course.

1. Introduction (6m 29s)
What do DataFrames and SparkSQL offer compared to SparkCore (RDDs)?

2. Getting Started (20m 10s)
We'll read in a DataSet (DataFrame) to get started.
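
A minimal sketch of reading a DataFrame with the Java API; the file students.csv and its columns are placeholders, not the course's actual dataset:

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class GettingStarted {
        public static void main(String[] args) {
            // Local SparkSession using all available cores.
            SparkSession spark = SparkSession.builder()
                    .appName("SparkSQL Getting Started")
                    .master("local[*]")
                    .getOrCreate();

            // Placeholder path - substitute any CSV with a header row.
            Dataset<Row> dataset = spark.read()
                    .option("header", true)
                    .csv("src/main/resources/students.csv");

            dataset.show(10);  // first 10 rows
            System.out.println("Row count: " + dataset.count());

            spark.close();
        }
    }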

3. Working with DataSets (29m 3s)
For our first real task with SparkSQL, we'll see how to apply filters.
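
A sketch of the two filter styles, assuming the `spark` session and `dataset` from the Getting Started sketch, a hypothetical `subject` column, and `import static org.apache.spark.sql.functions.col;`:

    // Column-expression style
    Dataset<Row> modernArt = dataset.filter(col("subject").equalTo("Modern Art"));

    // SQL-expression style: the condition is passed as a string
    Dataset<Row> modernArtIn2007 = dataset.filter("subject = 'Modern Art' AND year = 2007");

    modernArt.show();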

4. Full SQL Syntax (13m 45s)
How to query Spark using the full SQL syntax.
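
A sketch of the full SQL route, again assuming the `dataset` loaded earlier and illustrative column names:

    // Register the DataFrame as a temporary view so it can be queried with SQL.
    dataset.createOrReplaceTempView("students");

    Dataset<Row> results = spark.sql(
            "select subject, year from students where subject = 'French'");
    results.show();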

5. In Memory Data (15m 4s)
In Module 1 we used parallelize to work with in-memory data - useful for unit tests. This is how to do the same with DataFrames.
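
A sketch of building a DataFrame from in-memory rows; the log-level/datetime columns are illustrative (imports needed: org.apache.spark.sql.RowFactory, org.apache.spark.sql.types.*, java.util.*):

    List<Row> inMemory = new ArrayList<>();
    inMemory.add(RowFactory.create("WARN", "2016-12-31 04:19:32"));
    inMemory.add(RowFactory.create("FATAL", "2016-12-31 03:22:34"));
    inMemory.add(RowFactory.create("WARN", "2016-12-31 03:21:21"));

    StructType schema = new StructType(new StructField[]{
            new StructField("level", DataTypes.StringType, false, Metadata.empty()),
            new StructField("datetime", DataTypes.StringType, false, Metadata.empty())
    });

    Dataset<Row> dataset = spark.createDataFrame(inMemory, schema);
    dataset.show();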

6. Grouping and Aggregating (12m 59s)
Understanding the Group By clause in SparkSQL.
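
A sketch of a simple grouping in both syntaxes, assuming the in-memory logging DataFrame above and the static import of col:

    // DataFrame API: count rows per log level.
    dataset.groupBy(col("level")).count().show();

    // SQL syntax equivalent.
    dataset.createOrReplaceTempView("logging_table");
    spark.sql("select level, count(1) as total from logging_table group by level").show();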

7. Date Formatting (6m 30s)
How to use the date_format function in SparkSQL.
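
A sketch of date_format in both syntaxes, assuming the logging_table view above; the 'MMMM' pattern gives the full month name:

    // SQL syntax
    spark.sql("select level, date_format(datetime, 'MMMM') as month from logging_table").show();

    // DataFrame API - requires import static org.apache.spark.sql.functions.*;
    dataset.select(col("level"), date_format(col("datetime"), "MMMM").alias("month")).show();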

8. Multiple Groupings (13m 59s)
What to do when you need more than one group-by column.
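
A sketch of grouping by two columns at once, reusing the derived month column from the previous sketch:

    // Derive a month column, then group by level and month together.
    Dataset<Row> byLevelAndMonth = dataset
            .withColumn("month", date_format(col("datetime"), "MMMM"))
            .groupBy(col("level"), col("month"))
            .count();
    byLevelAndMonth.show();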

9. Ordering (16m 36s)
How to use the order by clause.
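
A sketch of ordering in both syntaxes, on the same illustrative logging data:

    // SQL syntax
    spark.sql("select level, datetime from logging_table order by level asc, datetime desc").show();

    // DataFrame API equivalent
    dataset.orderBy(col("level").asc(), col("datetime").desc()).show();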

10. DataFrames API (28m 4s)
We've concentrated on the SQL syntax so far, but we can also use a Java API to do everything (and more) that SQL can.
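
A sketch of the same kind of report built entirely with the Java DataFrame API, with no SQL strings (same assumed columns as above):

    Dataset<Row> report = dataset
            .select(col("level"), date_format(col("datetime"), "MMMM").alias("month"))
            .groupBy(col("level"), col("month"))
            .count()
            .orderBy(col("level"));
    report.show();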

11. Pivot Tables (21m 21s)
In DataFrames, we can produce pivot tables just as with spreadsheets and databases - but for Big Data!
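
A sketch of a pivot on the illustrative logging data: rows are log levels, columns are months, cells hold counts:

    Dataset<Row> pivoted = dataset
            .withColumn("month", date_format(col("datetime"), "MMMM"))
            .groupBy(col("level"))
            .pivot("month")
            .count();
    pivoted.show();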

12. General Aggregations (18m 49s)
The agg method is the most flexible way to aggregate, so we'll see how to use it.
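
A sketch of agg applying several different aggregations in one pass (same assumed columns; requires the static import of functions):

    Dataset<Row> summary = dataset
            .groupBy(col("level"))
            .agg(count(col("datetime")).alias("total"),
                 max(col("datetime")).alias("latest"),
                 min(col("datetime")).alias("earliest"));
    summary.show();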

13. Practical Session (8m 12s)
A short exercise.

14. User Defined Functions (23m 55s)
How to use lambdas to add your own functions to the SQL syntax and the DataFrame API.
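
A sketch of a lambda-based UDF usable from both syntaxes; the names hasPassed and grade are purely illustrative (needs org.apache.spark.sql.api.java.UDF1 and org.apache.spark.sql.types.DataTypes):

    // Register a lambda under a name the SQL syntax can see.
    spark.udf().register("hasPassed",
            (UDF1<String, Boolean>) grade -> "A+".equals(grade) || "A".equals(grade),
            DataTypes.BooleanType);

    // From the SQL syntax...
    spark.sql("select subject, grade, hasPassed(grade) as pass from students").show();

    // ...and from the DataFrame API via callUDF.
    dataset.withColumn("pass", callUDF("hasPassed", col("grade"))).show();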

15. Performance (25m 56s)
Using the SparkUI to analyse tasks. We ask the question: is the SQL syntax slower than the DataFrame API? Answers will follow in the next video...
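
One practical detail worth sketching: a local job's UI disappears as soon as main() returns, so pausing before closing the session keeps the SparkUI (http://localhost:4040 when running locally) available for inspection:

    report.show();
    new java.util.Scanner(System.in).nextLine();  // press Enter when done browsing the UI
    spark.close();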

16. HashAggregation (39m 21s)
Spark has two strategies for grouping - HashAggregation is extremely efficient but can only be used in restricted circumstances. Find out how to make sure HashAggregation is used instead of the (usually) slower SortAggregate routine.
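
A sketch of how to check which strategy you are getting: explain() prints the physical plan, where the grouping step appears as either HashAggregate or SortAggregate (query assumes the logging_table view from earlier):

    Dataset<Row> grouped = spark.sql(
            "select level, count(1) as total from logging_table group by level");
    grouped.explain();  // look for HashAggregate vs SortAggregate in the plan
    grouped.show();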

17. SparkSQL vs SparkRDD (6m 55s)
Which performs "better"?

Update - Tuning the spark.sql.shuffle.partitions Property (8m 18s)
By default you will have a large number of partitions when shuffling (such as when grouping), which can kill performance on small jobs. This is how to fix the problem.
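
A sketch of the fix: the default of 200 shuffle partitions is far more than a small local job needs, so set spark.sql.shuffle.partitions when building the session (12 below is an arbitrary small value):

    SparkSession spark = SparkSession.builder()
            .appName("Tuned SparkSQL job")
            .master("local[*]")
            .config("spark.sql.shuffle.partitions", "12")
            .getOrCreate();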

18. Module Summary (2m 24s)
Coming up later in 2018 is a module on SparkML.