Description

In this course, you will learn to build streaming pipelines by integrating Kafka and Spark Structured Streaming. Here is what the course covers.

  • First, you need a proper environment to build streaming pipelines using Kafka and Spark Structured Streaming on top of Hadoop or any other distributed file system. You will start by setting up a self-support lab with all the key components, such as Hadoop, Hive, Spark, and Kafka, on a single-node Linux-based system.

  • Once the environment is set up, you will get started with Kafka: you will create a Kafka topic, produce messages into the topic, and consume messages from it.

  • You will also learn how to use Kafka Connect to ingest data from web server logs into a Kafka topic, and to write data from a Kafka topic to HDFS as a sink.

  • Once you understand Kafka from the perspective of data ingestion, you will get an overview of some of the key concepts of Spark Structured Streaming.

  • After learning Kafka and Spark Structured Streaming separately, you will build a streaming pipeline that consumes data from a Kafka topic using Spark Structured Streaming, processes it, and writes it to different targets.

  • You will also learn how to take care of incremental data processing using Spark Structured Streaming.
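The Kafka-to-Spark pipeline described in the bullets above can be sketched roughly as follows. This is a minimal illustration, not the course's own code: the helper names `kafka_source_options` and `start_pipeline` are hypothetical, and the broker address, topic name, and paths in the usage note are placeholders. It assumes a `SparkSession` started with the `spark-sql-kafka` connector package available.

```python
# Minimal sketch: Kafka topic -> Spark Structured Streaming -> text files.
# Helper names are hypothetical; they are not part of Spark or of the course material.


def kafka_source_options(bootstrap_servers, topic, starting_offsets="latest"):
    """Build the option map for Spark's built-in Kafka source."""
    return {
        "kafka.bootstrap.servers": bootstrap_servers,
        "subscribe": topic,
        "startingOffsets": starting_offsets,
    }


def start_pipeline(spark, source_options, target_dir, checkpoint_dir):
    """Read from Kafka, decode the message value, and append it to text files.

    `spark` is an existing SparkSession. The returned object is the running
    streaming query, which the caller can wait on or stop.
    """
    raw = spark.readStream.format("kafka").options(**source_options).load()
    # Kafka delivers key and value as binary; cast the value to a string.
    lines = raw.selectExpr("CAST(value AS STRING) AS value")
    return (
        lines.writeStream
        .format("text")
        .outputMode("append")  # file sinks support append mode only
        .option("path", target_dir)
        .option("checkpointLocation", checkpoint_dir)  # tracks consumed offsets
        .trigger(processingTime="30 seconds")
        .start()
    )
```

With a running session, a call might look like `start_pipeline(spark, kafka_source_options("localhost:9092", "retail_logs", "earliest"), "/tmp/retail_logs/data", "/tmp/retail_logs/checkpoint")`, followed by `awaitTermination()` on the returned query. The checkpoint directory is what lets the query resume from the last processed Kafka offsets after a restart.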

Course Outline

Here is a brief outline of the course. You can choose either Cloud9 or GCP to provision a server to set up the environment.

  • Setting up Environment using AWS Cloud9 or GCP

  • Setup Single Node Hadoop Cluster

  • Setup Hive and Spark on top of Single Node Hadoop Cluster

  • Setup Single Node Kafka Cluster on top of Single Node Hadoop Cluster

  • Getting Started with Kafka

  • Data Ingestion using Kafka Connect - Web server log files as a source to Kafka Topic

  • Data Ingestion using Kafka Connect - Kafka Topic to HDFS as a sink

  • Overview of Spark Structured Streaming

  • Kafka and Spark Structured Streaming Integration

  • Incremental Loads using Spark Structured Streaming

Udemy-based Support

If you run into technical challenges while taking the course, feel free to raise your concerns using Udemy Messenger. We will make sure the issue is resolved within 48 hours.

What you will learn

Setting up a self-support lab with Hadoop (HDFS and YARN), Hive, Spark, and Kafka

Overview of Kafka to build streaming pipelines

Data Ingestion to Kafka topics with Kafka Connect using the File Source

Data Ingestion to HDFS with Kafka Connect using the HDFS 3 Connector Plugin

Overview of Spark Structured Streaming to process data as part of Streaming Pipelines

Incremental Data Processing with Spark Structured Streaming using File Source and File Target

Integration of Kafka and Spark Structured Streaming - Reading Data from Kafka Topics

Requirements

  • Laptop with decent configuration
  • Decent internet speed to watch the lessons
  • Self-support lab (instructions will be provided as part of the course) or ITVersity labs
  • Knowledge about Functional Programming (preferably Python or Scala)
  • Knowledge or experience using Spark

Course content

11 sections

Introduction

5 lectures
Introduction to Data Engineering using Kafka and Spark Structured Streaming
03:55
Important Note for first time Data Engineering Customers
02:55
Important Note for Data Engineering Essentials (Python and Spark) Customers
02:53
How to get 30 days complimentary lab access?
02:35
How to access material used for this course?
00:32

Getting Started with Kafka

10 lectures
Overview of Kafka
05:50
Managing Topics using Kafka CLI
07:07
Produce and Consume Messages using CLI
04:45
Validate Generation of Web Server Logs
02:29
Create Web Server using nc
04:42
Produce retail logs to Kafka Topic
02:24
Consume retail logs from Kafka Topic
01:46
Clean up Kafka CLI Sessions to produce and consume messages
01:28
Define Kafka Connect to produce
08:08
Validate Kafka Connect to produce
07:46

Data Ingestion using Kafka Connect

10 lectures
Overview of Kafka Connect
01:33
Define Kafka Connect to Produce Messages
09:12
Validate Kafka Connect to produce messages
07:55
Cleanup Kafka Connect to produce messages
02:45
Write Data to HDFS using Kafka Connect
01:24
Setup HDFS 3 Sink Connector Plugin
07:36
Overview of Kafka Consumer Groups
08:00
Configure HDFS 3 Sink Properties
06:30
Run and Validate HDFS 3 Sink
06:51
Cleanup Kafka Connect to consume messages
03:50

Overview of Spark Structured Streaming

12 lectures
Understanding Streaming Context
05:10
Validate Log Data for Streaming
01:56
Push log messages to Netcat Webserver
04:37
Overview of built-in Input Sources
02:09
Reading Web Server logs using Spark Structured Streaming
12:33
Overview of Output Modes
01:18
Using append as Output Mode
08:21
Using complete as Output Mode
07:34
Using update as Output Mode
04:04
Overview of Triggers in Spark Structured Streaming
05:39
Overview of built-in Output Sinks
02:32
Previewing the Streaming Data
07:37

Kafka and Spark Structured Streaming Integration

9 lectures
Create Kafka Topic
03:09
Read Data from Kafka Topic
05:32
Preview data using console
05:18
Preview data using memory
03:23
Transform Data using Spark APIs
09:26
Write Data to HDFS using Spark
07:57
Validate Data in HDFS using Spark
07:12
Write Data to HDFS using Spark using Header
07:06
Cleanup Kafka Connect and Files in HDFS
03:01

Incremental Loads using Spark Structured Streaming

18 lectures
Overview of Spark Structured Streaming Triggers
01:36
Steps for Incremental Data Processing
03:16
Create Working Directory in HDFS
03:21
Logic to Upload GHArchive Files
12:22
Upload GHArchive Files to HDFS
12:58
Add new GHActivity JSON Files
03:40
Read JSON Data using Spark Structured Streaming
06:11
Write in Parquet File Format
11:05
Analyze GHArchive Data in Parquet files using Spark
07:52
Add New GHActivity JSON files
01:36
Load Data Incrementally to Target Table
07:39
Validate Incremental Load
02:56
Add New GHActivity JSON files
01:12
Using maxFilesPerTrigger and latestFirst
10:42
Validate Incremental Load
03:30
Add New GHActivity JSON files
01:29
Incremental Load using Archival Process
10:18
Validate Incremental Load
03:09
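
The incremental-load lectures in this section revolve around reading newly arrived JSON files and writing them as Parquet, with `maxFilesPerTrigger` and `latestFirst` controlling how files are picked up. The sketch below illustrates that idea under stated assumptions: the helper names `file_source_options` and `run_incremental_load` are hypothetical, the schema and directories are supplied by the caller, and `trigger(availableNow=True)` assumes Spark 3.3 or later. It is not the course's own implementation.

```python
# Minimal sketch: incremental load of JSON files into Parquet using the
# Structured Streaming file source. Helper names are hypothetical.


def file_source_options(max_files_per_trigger=8, latest_first=False):
    """Build file source options that limit and order the files read per micro-batch."""
    return {
        "maxFilesPerTrigger": str(max_files_per_trigger),
        "latestFirst": str(latest_first).lower(),
    }


def run_incremental_load(spark, schema, source_dir, target_dir, checkpoint_dir,
                         source_options):
    """Process only the files not yet recorded in the checkpoint, then stop.

    `spark` is an existing SparkSession and `schema` is the schema of the JSON
    files (streaming file sources expect an explicit schema by default).
    """
    events = (
        spark.readStream
        .schema(schema)
        .options(**source_options)
        .json(source_dir)
    )
    query = (
        events.writeStream
        .format("parquet")
        .outputMode("append")
        .option("path", target_dir)
        .option("checkpointLocation", checkpoint_dir)  # remembers processed files
        # Assumption: Spark 3.3+. availableNow drains the backlog in rate-limited
        # batches and then stops; trigger(once=True) would ignore maxFilesPerTrigger.
        .trigger(availableNow=True)
        .start()
    )
    query.awaitTermination()
    return query
```

Rerunning the same function after new files land in the source directory picks up only those files, because the checkpoint directory records which ones were already processed.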

Setting up Environment using AWS Cloud9

9 lectures
Getting Started with Cloud9
02:50
Creating Cloud9 Environment
04:36
Warming up with Cloud9 IDE
03:19
Overview of EC2 related to Cloud9
01:31
Opening ports for Cloud9 Instance
03:12
Associating Elastic IPs to Cloud9 Instance
04:13
Increase EBS Volume Size of Cloud9 Instance
03:15
Setup Jupyter Lab on Cloud9
07:06
[Commands] Setup Jupyter Lab on Cloud9
00:15

Setting up Environment - Overview of GCP and Provision Ubuntu VM

8 lectures
Signing up for GCP
03:25
Overview of GCP Web Console
02:01
Overview of GCP Pricing
06:10
Provision Ubuntu VM from GCP
07:41
Setup Docker
08:16
Validating Python
04:05
Setup Jupyter Lab
10:29
Setup Jupyter Lab locally on Mac
05:59

Setup Single Node Hadoop Cluster

10 lectures
Introduction to Single Node Hadoop Cluster
02:45
Material related to setting up the environment
00:20
Setup Prerequisites
04:55
Setup Passwordless Login
03:36
Download and Install Hadoop
04:14
Configure Hadoop HDFS
06:50
Start and Validate HDFS
05:59
Configure Hadoop YARN
01:19
Start and Validate YARN
02:03
Managing Single Node Hadoop
03:49

Setup Hive and Spark

15 lectures
Setup Data Sets for Practice
05:02
Download and Install Hive
02:17
Setup Database for Hive Metastore
07:13
Configure and Setup Hive Metastore
07:10
Launch and Validate Hive
04:57
Scripts to Manage Single Node Cluster
04:40
Download and Install Spark 2
03:28
Configure Spark 2
09:02
Validate Spark 2 using CLIs
08:23
Validate Jupyter Lab Setup
09:45
Integrate Spark 2 with Jupyter Lab
06:08
Download and Install Spark 3
01:55
Configure Spark 3
06:46
Validate Spark 3 using CLIs
07:54
Integrate Spark 3 with Jupyter Lab
04:19

Setup Single Node Kafka Cluster

7 lectures
Download and Install Kafka
05:03
Configure and Start Zookeeper
03:28
Configure and Start Kafka Broker
05:12
Scripts to manage single node cluster
05:00
Overview of Kafka CLI
08:40
Setup Retail log Generator
03:58
Redirecting logs to Kafka
05:34
