Description

As part of this course, you will learn the Data Engineering essentials for building data pipelines using SQL and Python, with Hadoop, Hive, or Spark SQL as well as the PySpark Data Frame APIs. You will also understand the development and deployment lifecycle of Python applications using Docker, as well as of PySpark applications on multinode clusters. You will also gain basic knowledge of reviewing Spark jobs using the Spark UI.

About Data Engineering

Data Engineering is the processing of data to meet downstream needs. It involves building different kinds of pipelines, such as Batch Pipelines and Streaming Pipelines. All roles related to Data Processing are consolidated under Data Engineering; conventionally, they are known as ETL Development, Data Warehouse Development, etc.

Here are some of the challenges learners face when acquiring key Data Engineering Skills such as Python, SQL, and PySpark:

  • Having an appropriate environment with Apache Hadoop, Apache Spark, Apache Hive, etc., working together.

  • Good-quality content with proper support.

  • Enough tasks and exercises for practice.

This course is designed to address these key challenges so that professionals at all levels can acquire the required Data Engineering Skills (Python, SQL, and Apache Spark). It covers the following:

  • Set up an Environment to learn Data Engineering Essentials such as SQL (using Postgres), Python, etc.

  • Set up the required tables in Postgres to practice SQL

  • Writing basic SQL Queries with practical examples using WHERE, JOIN, GROUP BY, HAVING, ORDER BY, etc

  • Advanced SQL Queries with practical examples such as cumulative aggregations, ranking, etc

  • Scenarios covering troubleshooting and debugging related to Databases.

  • Performance Tuning of SQL Queries

  • Exercises and Solutions for SQL Queries.

  • Basics of Programming using Python as the Programming Language

  • Python Collections for Data Engineering

  • Data Processing or Data Engineering using Pandas

  • 2 Real Time Python Projects with explanations (File Format Converter and Database Loader)

  • Scenarios covering troubleshooting and debugging in Python Applications

  • Performance Tuning Scenarios related to Data Engineering Applications using Python

  • Getting Started with Google Cloud Platform to set up a Spark Environment using Databricks

  • Writing Basic Spark SQL Queries with practical examples using WHERE, JOIN, GROUP BY, HAVING, ORDER BY, etc

  • Creating Delta Tables in Spark SQL along with CRUD Operations such as INSERT, UPDATE, DELETE, MERGE, etc

  • Advanced Spark SQL Queries with practical examples such as ranking

  • Integration of Spark SQL and PySpark

  • In-depth coverage of Apache Spark Catalyst Optimizer for Performance Tuning

  • Reading Explain Plans of Spark SQL Queries or PySpark Data Frame APIs

  • In-depth coverage of columnar file formats and Performance tuning using Partitioning
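
To give a feel for the cumulative aggregation and ranking queries listed above, here is a minimal sketch. It is not taken from the course material: the table name `daily_revenue`, its columns, and the sample rows are hypothetical, and SQLite (via Python's built-in `sqlite3` module) is swapped in for Postgres only so that the example runs without a database server. It assumes the bundled SQLite version supports window functions (3.25 or later).

```python
import sqlite3

# Hypothetical sample data: (order_date, revenue). Not from the course data sets.
rows = [
    ("2024-01-01", 100.0),
    ("2024-01-02", 250.0),
    ("2024-01-03", 250.0),
    ("2024-01-04", 50.0),
]

# In-memory SQLite database used as a stand-in for Postgres.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE daily_revenue (order_date TEXT, revenue REAL)")
conn.executemany("INSERT INTO daily_revenue VALUES (?, ?)", rows)

# Running total by date, plus rank and dense rank by revenue (highest first).
query = """
SELECT order_date,
       revenue,
       SUM(revenue) OVER (ORDER BY order_date) AS cumulative_revenue,
       RANK() OVER (ORDER BY revenue DESC) AS rnk,
       DENSE_RANK() OVER (ORDER BY revenue DESC) AS dense_rnk
FROM daily_revenue
ORDER BY order_date
"""
results = conn.execute(query).fetchall()
for row in results:
    print(row)
conn.close()
```

In this sample, the two days with revenue 250.0 tie for rank 1. The next lower revenue gets rank 3 but dense rank 2, because `RANK` leaves a gap after a tie while `DENSE_RANK` does not.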

What you'll learn

Set up an Environment to learn SQL and Python essentials for Data Engineering

Database Essentials for Data Engineering using Postgres such as creating tables, indexes, running SQL Queries, using important pre-defined functions, etc.

Data Engineering Programming Essentials using Python such as basic programming constructs, collections, Pandas, Database Programming, etc.

Data Engineering using Spark Data Frame APIs (PySpark) on Databricks. Learn all important Spark Data Frame APIs such as select, filter, groupBy, orderBy, etc.

Data Engineering using Spark SQL (PySpark and Spark SQL). Learn how to write high-quality Spark SQL queries using SELECT, WHERE, GROUP BY, ORDER BY, etc.

Relevance of Spark Metastore and integration of Dataframes and Spark SQL

Ability to build Data Engineering Pipelines using Spark leveraging Python as Programming Language

Use of different file formats such as Parquet, JSON, CSV, etc., in building Data Engineering Pipelines

Set up a Hadoop and Spark Cluster on GCP using Dataproc

Understanding the Complete Spark Application Development Life Cycle to build Spark Applications using PySpark. Review the applications using Spark UI.
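
As an illustration of working with different file formats, the sketch below converts headerless CSV data into JSON records using column names supplied separately. It is a simplified stand-in, not the course's File Format Converter project: the column list, the sample data, and the function name `csv_to_json_lines` are all hypothetical, and it uses only the Python standard library rather than Pandas or PySpark.

```python
import csv
import io
import json

# Hypothetical schema: column names for a CSV data set that has no header row.
columns = ["order_id", "order_date", "order_status"]

# Hypothetical CSV content; a real pipeline would read this from a file.
csv_text = "1,2024-01-01,COMPLETE\n2,2024-01-02,PENDING\n"


def csv_to_json_lines(text, column_names):
    """Convert headerless CSV text into JSON Lines, one JSON object per record."""
    reader = csv.DictReader(io.StringIO(text), fieldnames=column_names)
    return "\n".join(json.dumps(record) for record in reader)


json_lines = csv_to_json_lines(csv_text, columns)
print(json_lines)
```

Every value comes out as a string here because the `csv` module does not infer data types; type handling is left out of this sketch.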

Requirements

  • Laptop with decent configuration (Minimum 4 GB RAM and Dual Core)
  • Sign up for GCP with the available credit or AWS Access
  • Set up a self-support lab on cloud platforms (you might have to pay the applicable cloud fee unless you have credit)
  • CS or IT degree or prior IT experience is highly desired

Course content

53 sections

Introduction to Data Engineering Essentials using SQL, Python, and PySpark

10 lectures
Introduction to Data Engineering Essentials Course
02:41
Overview of our support to Data Engineering Essentials course
04:40
Overview of SQL topics covered in the course
06:57
Overview of Python topics covered in the course
07:02
Overview of Getting Started with GCP related to the course
04:09
Overview of Spark and Databricks Environment related topics
04:46
Detailed outline of Spark SQL Topics in the course
06:35
Detailed outline of Pyspark Topics in the course
05:10
Detailed outline of ELT Data Pipelines on Databricks
02:27
Overview of Performance Tuning of Spark covered in the course
05:14

Getting Started with SQL for Data Engineering

7 lectures
Introduction to SQL for Data Engineering
01:36
Overview of Application Architecture and RDBMS
03:58
Overview of Database Technologies and relevance of SQL
06:16
Overview of Purpose Built Databases
06:00
Overview of Data Warehouse and Data Lake
03:13
Usage of RDBMS and Data Warehouse technologies
03:36
Differences and Similarities between RDBMS and Data Warehouse Technologies
03:52

Setup Tools for Data Engineering Essentials

10 lectures
Introduction to Setting up Tools for Data Engineering Essentials
02:29
Setup VS Code on Windows
03:22
Setup Python 3.9 on Windows
05:42
Configure Environment Variable PATH for Python on Windows
04:17
Overview of learning Python using Python CLI
03:22
Integrate VSCode with Python on Windows
05:01
Install Postgres 14 on Windows 11
05:13
Getting Started with pgAdmin on Windows
02:36
Getting Started with pgAdmin on Mac
05:30
Conclusion of Setting up Tools for Data Engineering Essentials
01:44

Setup Application Tables and Data in Postgres Database

8 lectures
Overview of Postgres Database Server and pgAdmin
08:07
Overview of Database Connection Details
07:03
Overview of Connecting to External Databases using pgAdmin
04:19
Create Application Database and User in Postgres Database Server
03:47
Clone Data Sets from Git Repository for Database Scripts
02:50
Register Server in pgAdmin using Application Database and User
03:58
Setup Application Tables and Data in Postgres Database
07:19
Overview of pgAdmin to write SQL Queries
11:41

Writing Basic SQL Queries

15 lectures
Review Data Model Diagram
04:32
Define Problem Statement for SQL Queries
01:42
Filtering Data using SQL Queries
06:27
Total Aggregations using SQL Queries
05:26
Group By Aggregations using SQL Queries
07:29
Order of Execution of SQL Queries
07:31
Rules and Restrictions to Group and Filter Data in SQL queries
05:19
Filter Data based on Aggregated Results using Group By and Having
04:31
Inner Joins using SQL Queries
05:14
Outer Joins using SQL Queries
07:12
Filter and Aggregate on Join Results using SQL
03:57
Overview of Database Views
06:02
Overview of Common Table Expressions or CTEs
04:03
Outer Join with Additional Conditions in SQL Queries
08:10
Explanation about Fix of SQL Queries with Filtering on Outer Join Results
04:20

Cumulative Aggregations and Ranking in SQL Queries

13 lectures
Introduction to Cumulative Aggregations and Ranking in SQL Queries
01:28
Overview of CTAS to create tables based on Query Results
04:26
Create Tables for Cumulative Aggregations and Ranking
02:11
Overview of OVER and PARTITION BY Clause in SQL Queries
06:58
Compute Total Aggregation using OVER and PARTITION BY in SQL Queries
01:47
Overview of Ranking in SQL
03:09
Compute Global Ranks using SQL
03:54
Compute Ranks based on key using SQL
04:15
Rules and Restrictions to Filter Data based on Ranks in SQL
02:45
Filtering based on Global Ranks using Nested Queries and CTEs in SQL
05:04
Filtering based on Ranks per Partition using Nested Queries and CTEs in SQL
04:51
Create Students table with Data for ranking using SQL
03:11
Difference between rank and dense rank using SQL
05:39

SQL Troubleshooting and Debugging Guide

15 lectures
Introduction to SQL Troubleshooting and Debugging Guide
01:48
Overview of Database Connectivity Issues
07:17
Validate and Setup Telnet on Mac or PC
04:11
Validate Connectivity to Database Server using telnet
04:27
Troubleshoot Database Connectivity Issue with Correct Host Details
06:12
Current Databases and Users in Postgres Database Server
04:38
Troubleshoot Database Credentials and Permissions Issues
07:55
Overview of Compilation of SQL Queries
05:31
Troubleshooting Syntax Errors in SQL Queries
04:03
Troubleshooting Semantic Errors in SQL Queries
04:56
Overview of Bugs in SQL Queries
02:56
Development Best Practices with tips to troubleshoot SQL bugs
03:47
Develop Initial Solution based on the requirement
04:14
Identify and Troubleshoot Bugs in SQL Queries
08:06
Develop Solution using Development Best Practices
04:39

Performance Tuning of SQL Queries

14 lectures
Introduction to Performance Tuning of SQL Queries
03:32
Overview of SQL Compilation Process and Explain Plans
04:26
Generate Explain Plans for SQL Queries
05:58
Review Tables used for Performance Tuning of SQL Queries
07:02
Review Data Storage Internals for Tables and Indexes
05:15
Review key terms used in Explain Plans for SQL Queries
03:42
Interpret Explain Plans for Basic SQL Queries
05:23
Review the Common Application Scenarios for Performance Tuning
03:01
Write SQL Queries for Customer Orders
05:36
Performance Testing of SQL Queries using Stored Procedure
04:04
Add Required Indexes to tune performance of SQL Queries
07:26
Guidelines on adding Indexes on Tables for SQL Queries
03:40
Interpreting the explain plan for SQL Queries using Indexes
05:38
Conclusion of Performance Tuning of SQL Queries
02:41

Exercises for Basic SQL Queries

2 lectures
Simple Exercises for Filtering and Aggregations
04:23
Exercises on Joins and Aggregations using SQL
02:50

Solutions for Basic SQL Queries

8 lectures
Solutions for Filtering and Aggregations
04:56
Solutions for Filtering and Aggregations
05:31
Validate Data and Review Data Model Diagram
03:25
Solution for Exercise 1 to get Customer Order Count
07:50
Solution for Exercise 2 to get Dormant Customers using Outer Join
10:55
Solution for Exercise 3 to get Revenue Per Customer using Outer Join
07:49
Solution for Exercise 4 to get Revenue Per Category
07:37
Solution for Exercise 5 to get Product Count Per Department
08:29

Getting Started with Python

13 lectures
Setup Visual Studio Workspace for Python Application Development
06:04
Setup Notebook Environment in VS Code Workspace
06:48
Overview of VS Code Notebook Environment
04:38
Overview of Cells in VS Code Notebook
06:06
Defining Functions in VS Code Notebooks
05:49
Run the Code in VS Code Notebook Cell by Line
05:18
Constants and Variables in Python
05:42
Overview of Python Data Types
04:10
Getting help on Python Variables and Functions
06:15
Pre-Defined String Manipulation Functions
06:37
Overview of Python Lists
06:42
Loops and Conditions in Python
08:49
User Defined Functions in Python
06:29

Python Collections for Data Engineering

16 lectures
Overview of File IO using Python
08:21
Read Data from CSV File into Python List
03:55
Overview of Python Collections
08:26
Getting Started with Processing Python Lists
03:17
Overview of Lambda Functions in Python
06:30
Usage of Lambda Functions
03:31
Filter Data in Python Lists using filter and lambda
05:42
Get unique values from list using map and set
05:23
Sort Python lists using key
07:42
Overview of JSON Strings and Files
03:54
Read JSON Strings to Python dicts or lists
06:34
Read JSON Schemas from file to Python dicts
06:18
Overview of Processing JSON Data using Python
03:03
Extract Details from Complex JSON Arrays using Python
03:39
Sort Data in JSON Arrays using Python
04:17
Create Function to get Column Details from Schemas JSON File
05:28

Data Processing using Pandas Dataframe APIs

12 lectures
Overview of Pandas for Data Processing
04:43
Overview of Reading CSV Data using Pandas
04:38
Read Data from CSV Files to Pandas Dataframes
04:43
Filter Data in Pandas Dataframe using query
07:06
Get Count by Status using Pandas Dataframe APIs
05:10
Get count by Month and Status using Pandas Dataframe APIs
04:32
Create Dataframes using dynamic column list on CSV Data
04:47
Performing Inner Join between Pandas Dataframes
04:21
Perform Aggregations on Join results
05:00
Sort Data in Pandas Dataframes
06:45
Overview of Writing Pandas Dataframes to Files
03:10
Write Pandas Dataframes to JSON Files
07:16

Project 1 - File Format Converter using Python

24 lectures
Project 1 - File Format Converter Handout
01:04
Get File Names to be processed using glob
05:53
Get Column Names using Schemas File
04:03
Get Data Set Names from File Names or Paths using regular expressions
05:34
Read CSV Data into Pandas Dataframe with Schema Dynamically
05:04
Generate File Paths for Target JSON Files Dynamically
04:01
Recap of Writing Pandas Dataframe to JSON File
02:24
Write Pandas Dataframe to JSON Files
04:15
Modularize File Format Converter for Dataset
05:18
Wrapper to Process all Data Sets
06:37
Setup Project for File Format Converter using Python
04:17
Install Dependencies for the Python Project using pip
03:30
Add Core Logic to Python Application
04:10
Overview of Run-time Arguments and Environment Variables
02:02
Using Run Time Arguments in Python Applications
05:32
Overview of Environment Variables
04:21
Setting Environment Variables on Windows or Mac or Linux
04:53
Use Environment Variables in Python Applications
05:34
Use Environment Variables in File Format Converter
09:15
Pass JSON Array as argument to Python Applications
05:51
Pass Data Sets as Run Time Arguments to File Format Converter
09:20
Exception Handling in Python Applications
03:35
Raising Exceptions in Python Applications
04:07
Exception Handling in File Format Converter Application
04:38

Project 2 - Files to Database Loader

7 lectures
Project 2 - Files To Database Loader Handout
01:15
Install Python Dependencies for Pandas and Database Integration
02:10
Run Queries from Notebook using SQL Magic
07:58
Validate Pandas and SQL Integration
05:21
Write CSV Data from File to Database Table
05:43
Write CSV Data from Files to Database Tables in Chunks
07:05
Overview of Deploying File to DB Loader Project
09:12

Troubleshooting and Debugging Python Issues

19 lectures
Introduction to Troubleshooting and Debugging Python issues
02:47
Guidelines for Troubleshooting and Debugging Python related Issues
01:47
Overview of Database Connectivity using Python Applications
07:32
Overview of Database Connectivity using Python
04:14
Troubleshoot Network Connectivity to the Database Server using telnet
03:44
Troubleshoot Module Related issues for Database Connectivity using Python
05:08
Troubleshoot Credentials Related issues for Database Connectivity using Python
03:08
Overview of Python process to run Python Applications
02:40
Troubleshooting Compilation Errors in Python
04:55
Troubleshooting Run Time Errors in Python
03:58
Overview of Software Development Life Cycle
03:31
Overview of Unit Testing or Validation of Applications
03:11
Overview of Debugging VS Code Notebooks using Debug Feature
09:14
Debug VS Code Notebooks using Debug Feature
09:46
Getting Started with Debugging of Python Programs using VS Code
02:26
Recap of running File Format Converter application
06:26
Debug Python Application using VS Code with breakpoints
06:46
Managing Breakpoints for Debugging in VS Code
02:56
Conclusion to Troubleshooting and Debugging Python Issues
02:02

Performance Tuning of Python Applications

18 lectures
Introduction to Performance of Python Applications
07:00
Setup Database Loader Python Application
05:42
Ensure Postgres Database is setup for file to db loader Python Application
09:22
Cleanup the tables to run file to db loader application
03:08
Run and Validate File to DB Loader Application
06:07
Fix the error message in file to db loader application
02:29
Overview of Execution of file to db loader application
07:15
Performance Tuning using Chunksize in Pandas
06:50
Review Pandas Data Frame API to load data into the target table
08:29
Overview of multi or batch insert into Database Tables
04:53
Develop application for multiprocessing
03:05
Getting Started with Multiprocessing using Python
05:14
Invoking User Defined Functions using multiprocessing in Python
06:21
Refactor File to Database Loader Application for Multiprocessing
04:37
Add Parallel Processing to file to db loader Python Application
04:41
Validate File to DB Loader Application with Multiprocessing
07:06
Understanding the concept of Multiprocessing in Python
08:00
Performance Tuning Scenarios of Python Applications
05:11

Getting Started with GCP

15 lectures
Introduction to Getting Started with GCP
00:59
Pre-requisite Skills to Sign up for course on GCP Data Analytics
02:03
Overview of Cloud Platforms
04:00
Overview of Google Cloud Platform or GCP
03:20
Overview of Signing up for GCP Account
01:42
Create New Google Account using Non Gmail Id
02:16
Sign up for GCP using Google Account
03:16
Overview of GCP Credits
03:36
Overview of GCP Project and Billing
02:11
Overview of Google Cloud Shell
03:28
Install Google Cloud SDK on Windows
04:38
Initialize gcloud CLI using GCP Project
03:25
Reinitialize Google Cloud Shell with Project id
03:04
Overview of Analytics Services on GCP
02:20
Conclusion to Get Started with GCP for Data Engineering
01:05

Overview of Big Data and Data Lakes

11 lectures
Different Types of Databases
01:59
Use Cases for Different Types of Databases
04:01
Technologies for Different Types of Databases
03:41
Volumes for Different Types of Databases
03:29
Overview of Big Data
04:19
Evolution of Big Data Technologies
02:33
Overview of Data Lake using Hadoop eco system
06:13
Limitations of Hadoop eco system
05:30
Overview of Modern Data Lakes on Cloud
02:36
Implementation of Modern Data Lakes on Cloud
03:23
Advantages of Modern Data Lakes on Cloud
04:11

Overview of Spark and Spark Architecture

13 lectures
Overview of Data Processing
05:42
Overview of Data Processing Libraries
01:56
Setup Environment to explore Pandas, Dask and Pyspark
04:19
Code Examples of Pandas, Dask and Pyspark
06:21
Differences between Pandas, Dask and Pyspark
06:13
Overview of Distributed Computing
05:16
Overview of Official Documentation of Apache Spark
03:37
Overview of Spark Key Features and Platforms
05:01
Overview of Spark Infrastructure
03:58
Overview of Spark Cluster using Databricks
06:14
Overview of Executors in Spark Cluster
03:03
Overview of Spark Glossary
01:44
Understand Spark Key Terms
06:42

Setup Databricks Environment using GCP

14 lectures
Overview of Databricks on GCP
01:08
Signing up for Databricks on GCP
04:21
Create Databricks Workspace on GCP
04:46
Getting Started with Databricks Clusters on GCP
03:17
Getting Started with Databricks Notebook
03:51
Overview of Databricks on GCP
07:11
High level architecture of Databricks
05:19
Setup Databricks CLI on Mac or Windows
05:07
Overview of Databricks CLI and other clients
05:20
Configure Databricks CLI on Mac or Windows
05:18
Troubleshoot issues to configure Databricks CLI
05:11
Overview of Databricks CLI Commands
05:20
Setup Data Repository for Data Sets
01:59
Setup Data Sets in DBFS using Databricks CLI Commands
04:50

Basic Transformations using Spark SQL

12 lectures
Process Data in DBFS using Databricks Spark SQL
05:03
Getting Started with Spark SQL Example using Databricks
04:44
Create Temporary Views using Spark SQL
06:34
Exercise to create temporary views using Spark SQL
01:27
Spark SQL Query to compute Daily Product Revenue
06:10
Save Query Result to DBFS using Spark SQL
04:25
Overview of Pyspark Examples on Databricks
01:04
Process Schema Details in JSON using Pyspark
07:32
Create Dataframe with Schema from JSON File using Pyspark
06:03
Transform Data using Spark APIs
04:13
Get Schema Details for all Data Sets using Pyspark
04:08
Convert CSV to Parquet with Schema using Pyspark
05:01

Create Delta Tables using Spark SQL

14 lectures
Introduction to Creating Delta Tables using Spark SQL
01:34
Overview of Supported Providers for Spark Metastore Tables
02:06
Create Database and Review the Details
05:58
Create and Review Managed Spark Metastore Table using Delta Format
04:31
Copy Data into Spark Metastore Managed Table
04:41
Validate Data in Spark Metastore Managed Table
03:17
Create and Review External Spark Metastore Table using Delta Format
04:08
Insert Data into Spark Metastore External Table
03:13
Validate Data in Spark Metastore External Table
03:00
Overview of Spark Metastore
04:36
Difference Between Managed and External Spark Metastore Tables
06:37
Perform CRUD Operations on Delta Tables in Spark Metastore
06:14
Using Merge to Update and Insert into Delta Tables in Spark Metastore
09:30
Conclusion of Creating Delta Tables using Spark SQL
02:15

Pre-Defined Functions in Spark SQL

31 lectures
Overview of Functions in Spark SQL
03:08
Validate Functions in Spark SQL
04:01
Overview of String Manipulation Functions in Spark SQL
01:54
Case Conversion and Length of Strings using Spark SQL
03:51
Extract Substring using substr in Spark SQL
05:19
Extract Substrings from Delimited Strings using split in Spark SQL
04:33
Trimming Characters or Strings using Spark SQL
09:32
Padding Characters to Strings using Spark SQL
07:32
Reverse and Concatenate Strings using Spark SQL
07:06
Overview of Date Manipulation Functions in Spark SQL
01:37
Overview of Standard Date and Timestamp in Spark SQL
03:27
Date Arithmetic using Spark SQL Functions
04:40
Overview of trunc and date_trunc in Spark SQL
05:36
Extract Information from Date or Time using Spark SQL
08:23
Convert Non Standard Dates or Timestamps to Standard Ones using Spark SQL
07:39
Extract Information using Calendar Functions from Date or Timestamp using Spark
02:51
Dealing with Unix Timestamp using Spark SQL
08:54
Overview of Numeric Functions in Spark SQL
12:15
Data Type Conversion using Spark SQL
06:15
Overview of Handling Null Values using Spark SQL
03:51
Replace Null Values with default values using nvl and coalesce in Spark SQL
07:15
Conditional Logic on Null Values using nvl2 and case in Spark SQL
06:08
Overview of Case and When in Spark SQL
03:07
Using CASE and WHEN for conditional logic in Spark SQL
06:08
Aggregate using CASE and WHEN in GROUP BY in Spark SQL
03:59
Word Count Query using Pre-defined Functions in Spark SQL
06:34
Exercises for Pre-defined functions in Spark SQL
02:02
Solutions for Exercises 1 and 2 on Pre-defined Functions in Spark SQL
08:27
Solutions for Exercises 3 and 4 on Pre-defined Functions in Spark SQL
05:49
Solutions for Exercises 5 and 6 on Pre-defined Functions in Spark SQL
09:59
Solutions for Exercises 7 and 8 on Pre-defined Functions in Spark SQL
10:11

Setup Spark Metastore Tables for Basic Transformations

3 lectures
Introduction to Basic Transformations using Spark SQL
01:45
Prepare Spark Metastore Tables for Basic Transformations
03:47
Projecting Data using Spark SQL
06:01

Filtering Data using Spark SQL Queries

5 lectures
Filtering Data using Equal Condition in Spark SQL
04:56
Using IN, LIKE and BETWEEN in Spark SQL Queries
04:48
Filter Data using Boolean AND in Spark SQL Queries
04:25
Filter Data using Boolean OR in Spark SQL Queries
04:18
Dealing with NULLS while Filtering Data in Spark SQL Queries
03:14

Aggregations using Spark SQL Queries

5 lectures
Perform Total Aggregations using Spark SQL Queries
06:45
Overview of Aggregations using GROUP BY in Spark SQL Queries
03:59
GROUP BY Examples using Spark SQL Queries
04:45
Order of Execution of Spark SQL Queries
06:07
Filter Data based on Aggregate Results using HAVING in Spark SQL Queries
06:22

Joins using Spark SQL Queries

8 lectures
Overview of Joins in Spark SQL Queries
04:20
Inner Join using Spark SQL Queries
05:47
Concepts Behind Inner Joins in Spark SQL
05:03
Outer Joins using Spark SQL Queries
08:39
Example - Inner Join along with GROUP BY using Spark SQL Queries
03:36
Example - Outer Join along with GROUP BY using Spark SQL Queries
03:29
Example - Filtering and Outer Joins along with GROUP BY in Spark SQL Queries
03:33
Example - Filtering and Outer Joins along with GROUP BY in Spark SQL Queries
03:58

Sorting using Spark SQL Queries

2 lectures
Sorting Data using Spark SQL Queries
04:01
Dealing with Nulls while Sorting Data using Spark SQL Queries
03:28

Copy Query Results into Spark Metastore Tables

6 lectures
Overview of Copying Query Results into Spark Metastore Tables
02:09
Query to Compute Daily Revenue using Spark SQL
02:16
Copy Query Results into Spark Metastore Tables using CTAS
04:25
Copy Query Results into Spark Metastore Tables using INSERT
03:38
Design Pipeline using CTAS and INSERT in Spark SQL
09:02
Copy Query Results into Spark Metastore Tables using MERGE
07:14

Ranking using Spark SQL Windowing Functions

6 lectures
Ranking using Spark SQL Windowing Functions
01:31
Create Temporary View for ranking using Spark SQL Windowing Functions
01:36
Compute Global Rank using Spark SQL Windowing Functions
05:37
Compute Ranks Per Key using Spark SQL Windowing Functions
03:55
Difference Between rank and dense_rank
04:10
Filter on Ranks using Spark SQL Windowing Functions
06:59

Processing JSON like Data using Spark SQL

11 lectures
Overview of JSON
04:41
Creating Spark Metastore Tables with Array Type Columns
04:01
Dealing with Array Type Columns using Spark SQL Queries
04:42
Creating Spark Metastore Tables with Struct Type Columns
03:03
Projecting Data From Struct Type Fields in Spark SQL
03:54
Creating Spark Metastore Tables with Array of Struct Column
03:48
Dealing with Array of Struct Type Columns using Spark SQL Queries
04:30
Overview of Important Functions to Process JSON Data in Spark SQL
03:31
Generate Array Type Columns from Regular Columns in Spark SQL
04:35
Generate Array of Struct Type Columns from Regular Columns in Spark SQL
03:53
Processing Delimited Strings using Spark SQL Queries
04:08

Getting Started with Pyspark Data Frame APIs

6 lectures
Overview of Pyspark Examples on Databricks
01:04
Process Schema Details in JSON using Pyspark
07:32
Create Dataframe with Schema from JSON File using Pyspark
06:03
Transform Data using Spark APIs
04:13
Get Schema Details for all Data Sets using Pyspark
04:08
Convert CSV to Parquet with Schema using Pyspark
05:01

Create Spark Data Frames using Pyspark Data Frame APIs

11 lectures
Create Spark Data Frame using Pyspark Data Frame APIs
04:26
Introduction to Processing JSON like Data using Spark SQL
01:03
Overview of Data Processing using Conventional loops
06:50
Overview of Data Frame Concepts
05:43
Advantages of Pyspark Data Frames
03:40
Overview of Spark Data Frames and their Characteristics
05:13
Projecting Data in Spark Data Frames using Select
03:57
Using drop to drop columns from Spark Data Frame
02:28
Applying functions on Spark Data Frame Columns
05:55
Using withColumn to apply transformations on Spark Data Frames
03:19
Overview of Writing Data in Data Frame to Delta Files
04:49

Basic Transformations using Pyspark Data Frame APIs

13 lectures
Overview of Basic Transformations using Pyspark Data Frame APIs
03:00
Overview of Row Level Transformations
05:15
Apply Row Level transformations using Pyspark Data Frame APIs
07:27
Filtering Data using Pyspark Data Frame APIs
04:39
Filtering Data with Multiple Conditions using Pyspark Data Frame APIs
05:54
Perform Aggregations by Key using Spark Data Frame APIs
04:31
Perform Aggregations by Key using Spark Data Frame APIs
06:03
Perform Aggregations by Key using Spark Data Frame APIs
03:57
Sorting Data using Spark Data Frame APIs
05:20
Composite Sorting using Spark Data Frame APIs
05:56
Develop Spark SQL Queries for Sorting Data
02:50
Review Data Set with Nulls for Sorting using Spark Data Frame APIs
05:26
Dealing with Nulls while Sorting the Data in Spark Data Frames
04:09

Joining Data using Spark Data Frame APIs

9 lectures
Introduction to Joining Data using Spark Data Frame APIs
01:41
Create Data Frames to Join using Spark Data Frame APIs
05:04
Review Syntax for join using Spark Data Frame APIs
02:44
Inner Join using Spark Data Frame APIs
05:31
Join and other Spark Data Frame APIs to process the data
06:56
Analyze Data for outer joins using Spark Data Frame APIs
03:26
Left Outer Join using Spark Data Frame APIs
06:07
Right Outer Join using Spark Data Frame APIs
04:19
Equivalent Spark SQL Queries for Joins
05:05

Ranking using Pyspark Data Frame APIs

7 lectures
Introduction to Ranking using Spark Data Frame APIs
05:52
Syntax for ranking using Spark Data Frame APIs
02:59
Compute Global Ranks using Spark Data Frame APIs
05:02
Filter Based on Global Ranks using Spark Data Frame APIs
04:19
Compute Ranks per Partition using Spark Data Frame APIs
03:30
Filter Based on Ranks Per Partition using Spark Data Frame APIs
03:46
Difference Between rank and dense_rank
07:27

Integration of Spark SQL and Pyspark Data Frame APIs

7 lectures
Introduction to Integration of Spark SQL and Pyspark Data Frame APIs
01:42
Run Spark SQL Queries on Spark Data Frames
05:18
Create Spark Metastore Tables using Data Frames
04:59
Insert into Spark Metastore Tables using Data Frames
04:02
Read Data from Spark Metastore Table to Data Frames
03:33
Process Data in Spark Metastore Tables using Data Frame APIs
05:37
Manage Spark Metastore Database Objects using Spark APIs
05:07

ELT Data Pipelines using Databricks

13 lectures
Overview of Databricks Workflows
03:10
Pass Arguments to Databricks Python Notebooks
07:36
Pass Arguments to Databricks SQL Notebooks
03:16
Create and Run First Databricks Job
07:31
Run Databricks Jobs and Tasks with Parameters
05:40
Create and Run Orchestrated Pipeline using Databricks Job
06:53
Import ELT Data Pipeline Applications into Databricks Environment
02:56
Spark SQL Application to Cleanup Database and Datasets
03:52
Review File Format Converter Pyspark Code
05:11
Review Databricks SQL Notebooks for Tables and Final Results
03:57
Validate Applications for ELT Pipeline using Databricks
07:36
Build ELT Pipeline using Databricks Job in Workflows
09:22
Run and Review Execution details of ELT Data Pipeline using Databricks Job
05:00

Performance Tuning of Spark - Catalyst Optimizer

10 lectures
Getting Started with Performance Tuning using Spark on Databricks
03:12
Overview of Spark Catalyst Optimizer
04:15
Review Explain Plan for Spark Dataframe logic using Spark UI
03:41
Review Explain Plan for Spark SQL logic using Spark UI
03:03
Generate Explain Plans on Spark Dataframes using explain function
03:47
Generate Explain Plans on Spark SQL Queries using explain command
02:33
Interpreting Explain Plan for Spark SQL Query
04:57
Overview of Spark Architecture
03:27
Understand Filter and Broadcast of Orders Data
05:31
Understand Join and Aggregation for Daily Product Revenue
04:18

Performance Tuning of Spark - Cluster Configuration

11 lectures
Introduction to Databricks Cluster Configuration
01:58
Difference Between All Purpose and Jobs Clusters
05:17
Setting up All Purpose Databricks Compute Clusters
04:46
Understand the size of the data using dbutils
06:03
Create Multinode Databricks Cluster with Auto Scaling
02:58
Overview of Auto Scaling of Databricks Clusters
03:21
Performance Tuning of Cluster using Auto Scaling
04:40
Analyze Airlines Data using Spark SQL
02:58
Setup Databricks Job Compute Clusters using Workflows
04:27
Review Running Job Details using Spark UI
03:03
Review Completed Job Details using Spark UI
03:20

Performance Tuning while inferring schema from CSV or JSON files

5 lectures
Overview of Inferring Schema using CSV or JSON Files
01:45
Steps to convert CSV or JSON Files to Parquet or Delta Files
03:50
Overview of CSV or JSON Files
05:08
Overview of overhead for inferring schema
05:47
Performance Tuning to infer schema of Spark Dataframe
04:58

Performance Tuning using Columnar File Format and Partitioning Strategy

18 lectures
Introduction to Performance Tuning while storing data in Data Lake
01:15
Side effects of using CSV Files in Data Lake
04:28
Review the side effects of using CSV Files in Data Lake
03:02
Restructure CSV Data to Columnar Format using Pyspark
04:55
Compute Size of restructured data using Parquet File Format
03:53
Run Operations on Partitioned Parquet Data
04:21
Overview of Performance Assessment of Spark Jobs
03:43
Review Performance Details of Spark Operations on Parquet Files
07:54
Overview of Columnar File Formats in Spark and Databricks
06:49
Overview of Folder Structure for Partitioned Data in Spark
04:48
Evaluate Requirements against Partition Pruning
06:32
Solutions on Airlines Data for Performance Tuning using Partitioning
07:00
Review Execution Details for Performance Tuning using Partitioning
10:17
Get Airlines Data using Date Range without Partition Pruning
06:23
Parameterize Spark SQL Solution for Partition Pruning
03:42
Add Condition for Partition Pruning
04:41
Redesign Partition Strategy to tune the performance
05:21
Recap of Spark Performance Tuning Scenarios
04:51

Setup Hadoop and Spark Cluster using Dataproc

14 lectures
Introduction to Setup Hadoop and Spark Cluster using Dataproc
03:18
Overview of different Spark Platforms on Cloud
06:42
Overview of Hadoop and Spark Cluster Types and Architecture
10:46
Setup Single Node Hadoop and Spark Cluster using Dataproc
03:42
Validate SSH Connectivity to the Dataproc Cluster
06:28
Convert IP Address to Static for Dataproc Cluster
03:58
Setup Project using VS Code Remote Development
15:46
Setup Local Data Sets on Hadoop and Spark Cluster
02:18
Getting Started with HDFS Commands to Manage Files
07:30
Getting Started with Spark CLI using Python
04:52
Getting Started with Spark CLI using Scala
03:59
Getting Started with Spark CLI using SQL
05:00
Stopping the Cluster and Understanding the costs
05:36
Download VS Code Workspace and Delete Cluster
04:17

Recap of important Linux Commands for Data Engineering

16 lectures
Introduction to Linux Commands and Scripts for Data Engineers
02:56
Overview of SSH to connect to remote Servers
21:33
Overview of Profile in Linux Shell
07:15
Overview of Environment Variables in Linux
06:18
Understanding PATH Environment Variable
18:13
Creating Folders in Linux using mkdir
08:54
Copy Files and Folders in Linux using cp command
09:39
Move Files and Folders in Linux using mv command
06:31
Delete Files and Folders in Linux using rm command
08:37
Listing Files and Folders using ls command
10:11
Searching for files using find command in Linux
06:15
Standard Directories in Linux
08:44
Troubleshooting issues in Linux using grep command
13:46
Overview of Shell Scripts
10:39
Running and Debugging Shell Scripts with Arguments
09:26
Overview of Hadoop and Spark Executables
07:44

Mastering Hadoop HDFS Commands and Concepts

18 lectures
Introduction to Mastering Hadoop HDFS Commands and Concepts
02:20
Start the Hadoop and Spark Cluster using Dataproc
02:22
Create Hadoop and Spark Cluster and Setup VS Code Workspace
07:12
Overview of important HDFS Commands
02:51
Create Folder and Copy Files into HDFS using Commands
08:19
Programmatically Copy files into HDFS using Python
11:21
Overview of Multinode Hadoop Cluster
12:19
Review Important Properties of HDFS
06:20
Review HDFS Properties on Dataproc Cluster using VS Code
04:04
Overview of local storage of files
09:35
Setup Data Sets to understand HDFS Concepts
05:51
Overview of Distributed Storage of files in HDFS
06:39
Determining Number of Blocks for each file
09:31
Overview of Blocks related to files in HDFS
09:11
Overview of Replication related to files in HDFS
08:22
Physical Storage of HDFS File Blocks
06:52
Overview of HDFS Namenode for HDFS File Metadata
08:40
Recap of HDFS on Dataproc Cluster
11:32

Build Hive Applications in Hadoop and Spark Clusters

20 lectures
Introduction to Building Hive Applications
02:11
Getting Started with Hive
15:02
Overview of Hive Architecture
08:38
Overview of Integrating Hive Commands with Shell Scripts
07:16
Run Hive Commands using Scripts
08:52
Override Run time Hive Configuration Properties and Variables
11:09
Getting Started with Data Loader for NYSE Data using Hive
07:24
Design NYSE Data Loader Application
05:52
Create Partitioned Parquet Table for NYSE Data
11:30
Populate Data into Partitioned NYSE table from Stage Table
10:03
Process NYSE Data and load into partitioned table
05:45
Develop Hive QL Script to Load NYSE Data
05:20
Run Hive QL Commands using Script for NYSE Loader
07:33
Redesign the Solution using HDFS to stage files
12:50
Validate Hive Application to Convert NYSE Data
09:13
Deploy Hive Application in HDFS
05:47
Develop Shell Wrapper to run Hive Application
07:50
Overview of Scheduling and Crontab
09:36
Schedule Hive Applications using Cron
13:11
Recap of Application Development Life Cycle using Hive
05:07

Getting Started with Spark SQL on Hadoop and Spark Cluster

8 lectures
Getting Started with Spark SQL on Hadoop and Spark Cluster
01:45
Getting Started with Data Sets and Spark SQL CLI
05:16
Spark SQL Metastore Architecture
11:14
Overview of Spark Metastore Warehouse Directory
05:14
Launch Spark SQL CLI with Delta Lake Packages
04:58
Populate Data into Delta Lake Tables using Spark SQL
14:31
Run Individual Spark SQL Commands
10:06
Run Spark SQL Scripts using Spark SQL CLI
08:14

Build Real Time Applications using Spark SQL with Shell Wrapper

10 lectures
Introduction to Application Development Life Cycle of Spark SQL Applications
01:11
Review the Requirements and Datasets for NYSE Data
05:33
Design NYSE Converter Application using Spark SQL and Delta
04:22
Copy Files into HDFS for NYSE Converter
04:28
Create External Stage Table for NYSE CSV Files
07:30
Create Target Table for NYSE Data using Delta Format
08:44
Populate Data for Additional Years into Delta NYSE Table
15:30
Develop Spark SQL Application for NYSE Data Conversion
05:15
Validate Spark SQL Application for NYSE Data Conversion
06:56
Develop Shell Wrapper for Spark SQL Application
06:54

Getting Started with Pyspark on Hadoop and Spark Cluster

8 lectures
Introduction to Getting Started with Pyspark
01:31
Launching or Getting Started with Pyspark CLI
09:03
Overview of Spark Properties Files
04:56
Review Data Sets to explore Pyspark APIs
04:34
Getting Started with Pyspark for Data Processing
06:13
Read Orders and Order Items Data into Spark Data Frames
05:28
Process Data using Pyspark Dataframe APIs
10:06
Overview of Spark Submit Command
04:09

Submitting Python based Spark Applications

13 lectures
Introduction to Submitting Python based Spark Applications
01:03
Develop Pyspark Application for Daily Revenue
05:52
Run Spark Application using spark-submit
06:05
Specify Paths using Environment Variables in Spark Applications
06:25
Run Spark Application with Environment Variables in Client Mode
04:07
Run Spark Application with Environment Variables in Cluster Mode
05:07
Review Spark Application Details using Spark UI
03:11
Review YARN Logs for Spark Applications in Cluster Mode
06:51
Overview of Execution Process of Spark Applications
07:27
Deep Dive into Spark Deploy Modes
13:29
Submit Spark Applications with dependencies as packages
12:09
Submit Spark Applications with dependencies as jars
16:19
Develop Shell Wrappers to submit Spark Applications
07:49

Logging in Python based Spark Applications

8 lectures
Introduction to Logging in Python based Spark Applications
01:06
Run Application without logging
06:33
Overview of Logging Concepts such as Log Levels
04:49
Getting Started with logging using Python
06:52
Changing the Log Message Format using logging
06:28
Add logging to Python based Spark Applications
03:56
Validate Logging of Spark Application using Client Mode
03:49
Validate Logging of Spark Application using Cluster Mode
05:16

Performance Tuning of Spark Applications on Hadoop and Spark

31 lectures
Introduction to Performance Tuning of Spark Applications on Hadoop and Spark Clu
02:03
Delete Single Node Hadoop and Spark Cluster using Dataproc
02:12
Increase GCP VM Quotas for Multinode Hadoop and Spark Cluster
03:42
Review Quotas to setup Multinode Hadoop and Spark Cluster
02:16
Setup Multinode Hadoop and Spark Cluster using GCP Dataproc
03:47
Setup SSH Connectivity and VS Code Workspace using Master Node
07:19
Review Multinode Hadoop and Spark Clusters using Web Interfaces
05:13
Overview of Multinode Hadoop and Spark Cluster Topology
12:07
Computing Overall Capacity of Multinode Hadoop and Spark Clusters
06:06
Determine Overall YARN Capacity
05:24
Overview of Spark History Server UI
04:53
Generate Test Data for Spark Performance Tuning
04:23
Develop Word Count Application using Spark
11:57
Develop and Validate Shell Script for Word Count
05:05
Overview of Jobs related to Spark Applications using Spark UI
08:14
Review Environment Properties and Disabling Dynamic Allocation
07:42
Overriding Spark Executor Instances to tune the performance
07:14
Determine Maximum Capacity to submit a Spark Application
03:35
Overview of Adaptive Query Execution
07:03
Overview of Shuffling - Part 1
08:51
Overview of Shuffling - Part 2
12:21
Overview of Spark Application
04:44
Overview of Lazy Evaluation
08:40
Review the code of Word Count Application
03:22
Run Spark Application without Adaptive Query Execution
10:12
Run Spark Application using Adaptive Query Execution
06:20
Overview of Spark Dynamic Allocation
09:27
Demo on Spark Dynamic Allocation
10:41
Running Spark Application using Dynamic Allocation
06:46
Overview of number of Spark Partitions
08:50
Delete Multinode Hadoop and Spark Cluster
03:09

