Previous knowledge of RDDs in Spark is assumed - module 1 in the series covers this.
Having problems? Check the errata for this course.

1. Introduction (6m 29s)
What do DataFrames and SparkSQL offer compared to SparkCore (RDDs)?

2. Getting Started (20m 10s)
We'll read in a DataSet (DataFrame) to get started.
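
A minimal sketch of reading a DataFrame with the Java API; the file students.csv and its columns are placeholders, not the course's actual dataset:

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class GettingStarted {
        public static void main(String[] args) {
            // Local SparkSession using all available cores.
            SparkSession spark = SparkSession.builder()
                    .appName("SparkSQL Getting Started")
                    .master("local[*]")
                    .getOrCreate();

            // Placeholder path - substitute any CSV with a header row.
            Dataset<Row> dataset = spark.read()
                    .option("header", true)
                    .csv("src/main/resources/students.csv");

            dataset.show(10);  // first 10 rows
            System.out.println("Row count: " + dataset.count());

            spark.close();
        }
    }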

3. Working with DataSets (29m 3s)
For our first real task with SparkSQL, we'll see how to apply filters.
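
A sketch of the two filter styles, assuming the `spark` session and `dataset` from the Getting Started sketch, a hypothetical `subject` column, and `import static org.apache.spark.sql.functions.col;`:

    // Column-expression style
    Dataset<Row> modernArt = dataset.filter(col("subject").equalTo("Modern Art"));

    // SQL-expression style: the condition is passed as a string
    Dataset<Row> modernArtIn2007 = dataset.filter("subject = 'Modern Art' AND year = 2007");

    modernArt.show();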

4. Full SQL Syntax (13m 45s)
How to query Spark using the full SQL syntax.
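
A sketch of the full SQL route, again assuming the `dataset` loaded earlier and illustrative column names:

    // Register the DataFrame as a temporary view so it can be queried with SQL.
    dataset.createOrReplaceTempView("students");

    Dataset<Row> results = spark.sql(
            "select subject, year from students where subject = 'French'");
    results.show();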

5. In Memory Data (15m 4s)
In Module 1 we used parallelize to work with in-memory data - useful for unit tests. This is how to do the same with DataFrames.
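
A sketch of building a DataFrame from in-memory rows; the log-level/datetime columns are illustrative (imports needed: org.apache.spark.sql.RowFactory, org.apache.spark.sql.types.*, java.util.*):

    List<Row> inMemory = new ArrayList<>();
    inMemory.add(RowFactory.create("WARN", "2016-12-31 04:19:32"));
    inMemory.add(RowFactory.create("FATAL", "2016-12-31 03:22:34"));
    inMemory.add(RowFactory.create("WARN", "2016-12-31 03:21:21"));

    StructType schema = new StructType(new StructField[]{
            new StructField("level", DataTypes.StringType, false, Metadata.empty()),
            new StructField("datetime", DataTypes.StringType, false, Metadata.empty())
    });

    Dataset<Row> dataset = spark.createDataFrame(inMemory, schema);
    dataset.show();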

6. Grouping and Aggregating (12m 59s)
Understanding the Group By clause in SparkSQL.
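
A sketch of a simple grouping in both syntaxes, assuming the in-memory logging DataFrame above and the static import of col:

    // DataFrame API: count rows per log level.
    dataset.groupBy(col("level")).count().show();

    // SQL syntax equivalent.
    dataset.createOrReplaceTempView("logging_table");
    spark.sql("select level, count(1) as total from logging_table group by level").show();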

7. Date Formatting (6m 30s)
How to use the date_format function in SparkSQL.
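
A sketch of date_format in both syntaxes, assuming the logging_table view above; the 'MMMM' pattern gives the full month name:

    // SQL syntax
    spark.sql("select level, date_format(datetime, 'MMMM') as month from logging_table").show();

    // DataFrame API - requires import static org.apache.spark.sql.functions.*;
    dataset.select(col("level"), date_format(col("datetime"), "MMMM").alias("month")).show();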

8. Multiple Groupings (13m 59s)
What to do when you need more than one group-by column.
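
A sketch of grouping by two columns at once, reusing the derived month column from the previous sketch:

    // Derive a month column, then group by level and month together.
    Dataset<Row> byLevelAndMonth = dataset
            .withColumn("month", date_format(col("datetime"), "MMMM"))
            .groupBy(col("level"), col("month"))
            .count();
    byLevelAndMonth.show();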

9. Ordering (16m 36s)
How to use the order by clause.
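
A sketch of ordering in both syntaxes, on the same illustrative logging data:

    // SQL syntax
    spark.sql("select level, datetime from logging_table order by level asc, datetime desc").show();

    // DataFrame API equivalent
    dataset.orderBy(col("level").asc(), col("datetime").desc()).show();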

10. DataFrames API (28m 4s)
We've concentrated on the SQL syntax so far, but we can also use a Java API to do everything (and more) that SQL can.
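
A sketch of the same kind of report built entirely with the Java DataFrame API, with no SQL strings (same assumed columns as above):

    Dataset<Row> report = dataset
            .select(col("level"), date_format(col("datetime"), "MMMM").alias("month"))
            .groupBy(col("level"), col("month"))
            .count()
            .orderBy(col("level"));
    report.show();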

11. Pivot Tables (21m 21s)
In DataFrames, we can produce pivot tables just as with spreadsheets and databases - but for Big Data!
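
A sketch of a pivot on the illustrative logging data: rows are log levels, columns are months, cells hold counts:

    Dataset<Row> pivoted = dataset
            .withColumn("month", date_format(col("datetime"), "MMMM"))
            .groupBy(col("level"))
            .pivot("month")
            .count();
    pivoted.show();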

12. General Aggregations (18m 49s)
The agg method is the most flexible way to aggregate, so we'll see how to use it.
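
A sketch of agg applying several different aggregations in one pass (same assumed columns; requires the static import of functions):

    Dataset<Row> summary = dataset
            .groupBy(col("level"))
            .agg(count(col("datetime")).alias("total"),
                 max(col("datetime")).alias("latest"),
                 min(col("datetime")).alias("earliest"));
    summary.show();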

13. Practical Session (8m 12s)
A short exercise.

14. User Defined Functions (23m 55s)
How to use lambdas to add your own functions to the SQL syntax and the DataFrame API.
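
A sketch of a lambda-based UDF usable from both syntaxes; the names hasPassed and grade are purely illustrative (needs org.apache.spark.sql.api.java.UDF1 and org.apache.spark.sql.types.DataTypes):

    // Register a lambda under a name the SQL syntax can see.
    spark.udf().register("hasPassed",
            (UDF1<String, Boolean>) grade -> "A+".equals(grade) || "A".equals(grade),
            DataTypes.BooleanType);

    // From the SQL syntax...
    spark.sql("select subject, grade, hasPassed(grade) as pass from students").show();

    // ...and from the DataFrame API via callUDF.
    dataset.withColumn("pass", callUDF("hasPassed", col("grade"))).show();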

15. Performance (25m 56s)
Using the SparkUI to analyse tasks. We ask the question: is the SQL syntax slower than the DataFrame API? Answers will follow in the next video...
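
One practical detail worth sketching: a local job's UI disappears as soon as main() returns, so pausing before closing the session keeps the SparkUI (http://localhost:4040 when running locally) available for inspection:

    report.show();
    new java.util.Scanner(System.in).nextLine();  // press Enter when done browsing the UI
    spark.close();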

16. HashAggregation (39m 21s)
Spark has two strategies for grouping - HashAggregation is extremely efficient but can only be used in restricted circumstances. Find out how to make sure HashAggregation is used instead of the (usually) slower SortAggregate routine.
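
A sketch of how to check which strategy you are getting: explain() prints the physical plan, where the grouping step appears as either HashAggregate or SortAggregate (query assumes the logging_table view from earlier):

    Dataset<Row> grouped = spark.sql(
            "select level, count(1) as total from logging_table group by level");
    grouped.explain();  // look for HashAggregate vs SortAggregate in the plan
    grouped.show();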

17. SparkSQL vs SparkRDD (6m 55s)
Which performs "better"?

Update - Tuning the spark.sql.shuffle.partitions Property (8m 18s)
By default you will have a large number of partitions when shuffling (such as when grouping), which can kill performance on small jobs. This is how to fix the problem.
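
A sketch of the fix: the default of 200 shuffle partitions is far more than a small local job needs, so set spark.sql.shuffle.partitions when building the session (12 below is an arbitrary small value):

    SparkSession spark = SparkSession.builder()
            .appName("Tuned SparkSQL job")
            .master("local[*]")
            .config("spark.sql.shuffle.partitions", "12")
            .getOrCreate();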

18. Module Summary (2m 24s)
Coming up later in 2018 is a module on SparkML.