Great Apache Spark tutorial videos on YouTube

This post collects some of the best Apache Spark videos available on YouTube.

Sameer Farooqui delivers a hands-on tutorial using Spark SQL and DataFrames to retrieve insights and visualizations from datasets published by the City of San Francisco. [Topics Indexed Below]

The labs are aimed at an audience with some general programming or SQL query experience, but little to no experience with Spark. Sameer begins with a brief lecture on Spark theory, then dives into several demos that visualize and analyze calls made to the San Francisco Fire Department on July 4th.

Follow Along:
+ Databricks Community Edition:
+ Labs:
+ Learning Material:

-----Jump to Topic-----
00:00:06 – Workshop Intro & Environment Setup
00:13:06 – Brief Intro to Spark
00:17:32 – Analysis Overview: SF Fire Department Calls for Service
00:23:22 – Analysis with PySpark DataFrames API
00:29:32 – Doing Date/Time Analysis
00:47:53 – Memory, Caching and Writing to Parquet
01:00:40 – SQL Queries
01:21:11 – Convert a Spark DataFrame to a Pandas DataFrame
-----Q & A-----
01:24:43 – Spark DataFrames vs. SQL: Pros and Cons?
01:26:57 – Workflow for Chaining Databricks notebooks into Pipeline?
01:30:27 – Is Spark 2.0 ready to use in production?

SPARK 2.0 TRAINING | NewCircle | Onsite & Public Classes
+ Programming for Spark 2.0 (3 days)
+ Spark 2.0 for Machine Learning & Data Science (3 days)
Learn more:…

++Code for San Francisco++…

++Learn more about Databricks++…

All Apache Spark courses from NewCircle Training:

Adam Breindel, lead Spark instructor at NewCircle, explains which APIs to use for modern Spark through a series of brief technical explanations and demos that highlight best practices, the latest APIs, and new features. (Topics Indexed Below)

We’ll look at how Dataset and DataFrame behave in Spark 2.0, Whole-Stage Code Generation, and go through a simple example of Spark 2.0 Structured Streaming (Streaming with DataFrames) that you can run in your own free instance of Databricks.

00:00:40 – Intro: What is “Modern Spark”
00:01:26 – DataFrame
00:05:07 – Why not use RDD?
00:09:15 – Intro to DataFrame and Dataset
00:10:13 – DataFrame versus Dataset
00:14:42 – Dataset Queries and Dataset with Scala classes
00:19:07 – Spark Query Optimizer
00:23:26 – Whole-Stage Codegen
00:27:21 – Hive integration
00:29:28 – Wrapping Up DataFrame/Dataset Benefits
00:30:54 – One More Thing – Structured Streaming
00:36:47 – Conclusion

Try the Examples:
+ Databricks Community Edition:
+ Get this Notebook:

“As Apache Spark becomes more widely adopted, we have focused on creating higher-level APIs that provide increased opportunities for automatic optimization. In this talk, I give an overview of some of the exciting new APIs available in Spark 2.0, namely Datasets and Structured Streaming. Together, these APIs are bringing the power of Catalyst, Spark SQL’s query optimizer, to all users of Spark. I’ll focus on specific examples of how developers can build their analyses more quickly and efficiently simply by providing Spark with more information about what they are trying to accomplish.” – Michael Armbrust


Databricks Blog: “Deep Dive into Spark SQL’s Catalyst Optimizer”…

// About the Presenter //
Michael Armbrust is the lead developer of the Spark SQL project at Databricks. He received his PhD from UC Berkeley in 2013, and was advised by Michael Franklin, David Patterson, and Armando Fox. His thesis focused on building systems that allow developers to rapidly build scalable interactive applications, and specifically defined the notion of scale independence. His interests broadly include distributed systems, large-scale structured storage and query optimization.
