Description

  • End-to-End PySpark Real-Time Project Implementation.

  • The project uses all the latest technologies: Spark, Python, PyCharm, HDFS, YARN, Google Cloud, AWS, Azure, Hive, and PostgreSQL.

  • Learn a PySpark coding framework and how to structure code following industry-standard best practices.

  • Install a single-node cluster on Google Cloud and integrate the cluster with Spark.

  • Install Spark in standalone mode on Windows.

  • Integrate Spark with the PyCharm IDE.

  • Includes a detailed HDFS course.

  • Includes a Python crash course.

  • Understand the business model and project flow of a US healthcare project.

  • Create a data pipeline covering data ingestion, preprocessing, transformation, storage, persistence, and finally data transfer (see the sketch after this list).

  • Learn how to add a robust logging configuration to a PySpark project.

  • Learn how to add an error-handling mechanism to a PySpark project.

  • Learn how to transfer files to AWS S3.

  • Learn how to transfer files to Azure Blob Storage.

  • The project is developed in such a way that it can run fully automated.

  • Learn how to persist data in Apache Hive for future use and audit.

  • Learn how to persist data in PostgreSQL for future use and audit.

  • Full Integration Test.

  • Unit Test.
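
For orientation, the pipeline bullet above is summarized here as a minimal, hypothetical PySpark sketch; the file paths, column names, and app name are illustrative assumptions, not the course's actual code.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("HealthcarePipeline").getOrCreate()

    # Ingest: read a raw vendor file (illustrative path).
    df = spark.read.csv("staging/city.csv", header=True, inferSchema=True)

    # Preprocess: drop duplicates and rows missing a city value.
    df_clean = df.dropDuplicates().na.drop(subset=["city"])

    # Transform: build a simple city-count report per state.
    report = df_clean.groupBy("state_name").count()

    # Store: write the report; later stages persist to Hive/PostgreSQL
    # and transfer the output files to S3/Azure.
    report.write.mode("overwrite").parquet("output/city_report")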



Requirements

  • Basic knowledge of PySpark. You may brush up your knowledge with my other course, 'Complete PySpark Developer Course'.
  • Basic knowledge of HDFS (a detailed HDFS course is included in this course).
  • Basic knowledge of Python (a Python crash course is included in this course).

Course Content

27 sections

Preview the Course

1 lecture
Preview
08:10

Project Description

6 lectures
Project Slides
00:01
Project Flow
03:15
High Level Functional Specification
03:02
Project Flow (Code Level)
02:47
Project Parts
00:20
Approach
01:07

Download Input Files

1 lecture
Download Input Vendor Files - City and Prescriber
00:52

Single Node Cluster Installation (Spark 2.x/3.x, Hive, HDFS, PostgreSQL, Docker)

12 lectures
Introduction and Installation Flow Chart
06:02
Resources
00:01
Register Free at Google Cloud (GCP) and Launch an Ubuntu based Virtual Machine
07:27
Set Up Python and Java
04:42
Set up Secure Connect to Localhost
03:03
Set up Hadoop tar, HDFS, YARN and manage Cluster Services
16:16
Set Up Docker, PostgreSQL, Hive Part 1
15:29
Set up Docker, PostgreSQL, Hive, Metastore Part 2
06:26
Set up Spark 2.x and Spark 3.x Part 1
16:30
Set up Spark 2.x and Spark 3.x Part 2
10:57
Set up Web UI and ports for Cluster and Application History
12:30
Manage the Cluster - Start & Stop the Cluster
02:21

Spark Installation - Set Up Standalone (Windows)

11 lectures
Resources
00:01
Minimum Supported Versions/Prerequisites
01:59
Java Installation
10:51
Python Installation
04:19
Spark Installation
10:10
WinUtils Set up
06:07
PyCharm Installation
05:09
PyCharm Basics
09:04
PyCharm Run Time Arguments
04:31
PyCharm Integrate Python and PySpark
06:08
How to debug Python Applications using PyCharm
12:55

HDFS Course

19 lectures
What is HDFS and Why Use HDFS
03:38
Resources
00:01
HDFS Components and Metadata
03:54
Data Blocks and Replication
05:07
Rack Awareness
01:49
HDFS Read Mechanism Architecture
02:35
Exercise - HDFS CLI Help Commands
02:14
Exercise - Bring Data from GitHub to Local to HDFS
02:32
Exercise - Listing and Sorting Files and Directories in HDFS
02:54
Exercise - Create or Remove Directories in HDFS
09:04
Exercise - Copy Data from HDFS to Local
05:08
Exercise - Copy data from Local to HDFS
06:54
Exercise - Preview Data in HDFS
03:51
Exercise - Knowing Statistics in HDFS
03:42
Exercise - Knowing Storage in HDFS File System
03:07
Exercise - Metadata in HDFS
04:54
Exercise - Managing File Permissions in HDFS
05:28
Exercise - Update Properties in HDFS
08:19
Note
00:08

Python Crash Course

37 lectures
Introduction and Installation
06:15
Main Features of Python
05:13
Python Basics
08:52
Python Variables
04:27
print(), dir(), help()
05:15
Python Operators
06:55
Modules
06:12
Python Datatypes - Numeric Types
13:31
String
19:01
Python Datatypes - List Part 1
14:45
Python Datatypes - List Part 2
12:14
Tuple
04:00
Set
08:47
Dictionary
12:06
Date and Time
06:10
Conditional Statements (if ... else)
03:56
For Loop
05:41
While Loop
04:25
User Defined Functions
06:49
Lambda Functions
03:43
Map Function
05:04
Filter Function
01:30
Reduce Function
02:31
File Handling
09:31
OOPs Basics Part 1
09:53
OOPs Basics Part 2
05:21
OOPs Basics - Exercise
10:26
OOPS Basics - Class Attributes
04:40
Python Special Variable : __name__
04:23
Work with Environment Variables
04:38
Exception Handling in Python
12:54
How to Traceback Exceptions in Python
09:45
Logging in Python - Download Slides
00:01
Logging in Python - Introduction
07:40
Logging in Python - Integrate with Exception Stack traces
02:28
Logging in Python - Custom Logger
07:12
Logging in Python - Using Configuration File
15:20
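
As a taste of the logging lectures, here is a minimal sketch of file-driven logging with Python's standard library; the config file name and logger name are assumptions, not the course's files.

    import logging
    import logging.config

    # Load handlers, formatters, and levels from an external config file.
    logging.config.fileConfig("logging.conf")   # file name is illustrative
    logger = logging.getLogger("project")       # logger name is illustrative

    try:
        1 / 0
    except ZeroDivisionError:
        # logger.exception() records the message plus the stack trace.
        logger.exception("Demo operation failed")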

Project Set up

4 lectures
Project Folders Set up at PyCharm
04:04
Project Integration with PySpark
02:34
Understand input and output File Layouts
05:25
Move Input Files to Project Staging Folder
00:31

Part 1 Introduction

1 lecture
Part 1 Introduction
04:26

Declare Variables

1 lecture
Write Script to declare All Variables
09:05

Create Objects

6 lectures
Create Objects - Spark Object
04:39
Create Objects - Validate Spark Object
04:43
Create Objects - Integrate Exception Handling
07:38
Create Objects - Implement Logging
11:20
Create Objects - Integrate Logging with Exception
06:32
Create Objects - Add Custom Logger
07:46
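
A minimal sketch of what creating and validating a Spark object with logging and error handling can look like; the function, logger, and app names are illustrative assumptions, not the course's code.

    import logging
    from pyspark.sql import SparkSession

    logger = logging.getLogger("create_objects")   # illustrative logger name

    def get_spark_object(app_name):
        """Create (or reuse) a SparkSession, logging any failure."""
        try:
            spark = SparkSession.builder.appName(app_name).getOrCreate()
        except Exception:
            logger.exception("Unable to create the Spark object")
            raise
        logger.info("Spark object created, version %s", spark.version)
        return spark

    spark = get_spark_object("PrescPipeline")      # illustrative app name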

Data Ingestion

3 lectures
Data Ingestion - Load City Dimension File Part 1
13:34
Data Ingestion - Load City Dimension File Part 2
12:14
Data Ingestion - Load Prescriber Fact File
05:48
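
A minimal ingestion sketch, assuming a delimited city file with a header row; the path is illustrative.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("ingest").getOrCreate()
    df_city = (spark.read
        .option("header", True)
        .option("inferSchema", True)
        .csv("staging/us_cities_dimension.csv"))   # illustrative path
    df_city.printSchema()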

Data Preprocessing

8 lectures
Data Preprocessing - City Dimension
11:25
Data Preprocessing - Prescriber DataFrame Part 1
02:59
Data Preprocessing - Prescriber DataFrame Part 2
06:18
Validation - Print Schema for any DataFrame
04:56
Data Preprocessing - Prescriber Dimension Part 3
07:33
Data Preprocessing - Prescriber DataFrame Part 4
04:02
Data Preprocessing - Prescriber DataFrame Part 5
03:09
Data Preprocessing - Prescriber DataFrame Part 6
11:59
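
A minimal preprocessing sketch using made-up sample rows; the column names are assumptions, not the course's file layouts.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, upper

    spark = SparkSession.builder.appName("preprocess").getOrCreate()
    df = spark.createDataFrame(
        [("houston", "TX", 2300000), (None, "TX", 0)],
        ["city", "state", "population"])
    df_clean = (df
        .na.drop(subset=["city"])                # drop rows missing a city
        .withColumn("city", upper(col("city")))  # normalize case
        .dropDuplicates())
    df_clean.show()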

Data Transform

10 lectures
Data Transform - City Report
12:38
Data Transform - Prescriber Report
11:00
Quick Note to Copy Code
00:28
Quick Note on Connecting PyCharm to GCP
00:35
Copy developed codes from Windows to GCP
01:27
Note to Install Pandas Latest Version
00:14
Create HDFS Folders to keep input city and Fact Files
02:30
Write and Execute Unix Shell Script to Copy data into HDFS
09:30
Code Changes in the scripts to accommodate HDFS Paths
08:39
Perform a Test run using spark-submit at Cluster
04:01
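
A minimal transform sketch with made-up sample rows; the columns and report logic are illustrative. On the cluster, such a script is launched with spark-submit (e.g. spark-submit --master yarn report.py).

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import countDistinct

    spark = SparkSession.builder.appName("transform").getOrCreate()
    df_clean = spark.createDataFrame(
        [("HOUSTON", "TX"), ("DALLAS", "TX"), ("MIAMI", "FL")],
        ["city", "state"])
    # Report: number of distinct cities per state.
    df_report = (df_clean.groupBy("state")
        .agg(countDistinct("city").alias("city_count")))
    df_report.show()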

Data Extraction

2 lectures
File Extraction - City and Prescriber Report
17:48
Validations - City and Prescriber Reports
02:33
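
A minimal extraction sketch: write a report out as a single CSV file. The sample rows and output path are illustrative.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("extract").getOrCreate()
    df_report = spark.createDataFrame([("TX", 2), ("FL", 1)],
                                      ["state", "city_count"])
    (df_report.coalesce(1)            # single output file for the report
        .write.mode("overwrite")
        .option("header", True)
        .csv("output/city_report"))   # illustrative path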

Wrap up Part 1

1 lecture
Part 1 - Combine all scripts into one
02:59

Part 2 Introduction

1 lecture
Part 2 - Introduction
00:42

Copy Files from HDFS to Local

1 lecture
Copy final City and Presc files from HDFS to the Local Server
04:51

Copy Files to AWS S3

4 lectures
Prepare for S3 Transfer
01:09
Set up Free Tier AWS Account and Create a S3 Bucket
03:47
Set up AWS CLI Client, Create Profile and Access S3 Bucket
05:23
Push Files to S3
06:38
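
The course pushes files with the AWS CLI; as an equivalent Python sketch, boto3 can do the same upload. The bucket, key, and file names are illustrative assumptions.

    import boto3

    # Uses credentials from the configured AWS profile/environment.
    s3 = boto3.client("s3")
    s3.upload_file("output/city_report.csv",   # local file (illustrative)
                   "my-presc-bucket",          # bucket name (illustrative)
                   "reports/city_report.csv")  # object key (illustrative)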

Copy Files to Azure Blob

3 lectures
Set up Free Microsoft Azure Account and Create Containers
04:16
Install azcopy at our Local Server
02:15
Push Files to Azure Blobs
10:36
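
The course pushes files with azcopy; as an equivalent Python sketch, the azure-storage-blob SDK can upload the same files. The connection string, container, and blob names are illustrative assumptions.

    from azure.storage.blob import BlobServiceClient

    service = BlobServiceClient.from_connection_string("<connection-string>")
    blob = service.get_blob_client(container="reports",      # illustrative
                                   blob="city_report.csv")   # illustrative
    with open("output/city_report.csv", "rb") as f:
        blob.upload_blob(f, overwrite=True)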

Wrap Up Part 2

1 lecture
Wrap up Part 2 and add the Part 2 scripts to the main script
01:30

Part 3 Introduction

1 lecture
Part 3 Introduction
00:50

Data Persist at Hive

4 lectures
Persist Data into Hive Part 1
02:20
Persist Data into Hive Part 2
07:58
Persist Data into Hive Part 3
06:12
Persist data into Hive Part 4
08:14
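
A minimal Hive persistence sketch, assuming a configured Hive metastore; the database and table names are illustrative.

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder.appName("persist-hive")
        .enableHiveSupport()           # requires a reachable Hive metastore
        .getOrCreate())
    df_report = spark.createDataFrame([("TX", 2)], ["state", "city_count"])
    df_report.write.mode("append").saveAsTable("presc_db.city_report")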

Data Persist at PostgreSQL

5 lectures
Persist data at PostgreSQL Introduction
00:27
Persist Data at PostgreSQL Part 1
02:22
Persist Data at PostgreSQL Part 2
06:07
Persist Data at PostgreSQL Part 3
04:58
Persist Data at PostgreSQL Part 4
02:29
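
A minimal JDBC persistence sketch; the URL, table, and credentials are illustrative, and the PostgreSQL JDBC driver jar must be on the Spark classpath (e.g. via --jars).

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("persist-pg").getOrCreate()
    df_report = spark.createDataFrame([("TX", 2)], ["state", "city_count"])
    (df_report.write.format("jdbc")
        .option("url", "jdbc:postgresql://localhost:5432/presc_db")
        .option("dbtable", "city_report")
        .option("user", "postgres")
        .option("password", "<password>")
        .mode("append")
        .save())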

Wrap up Part 3

1 lecture
Wrap up Part 3
00:55

Full Integration Test

4 lectures
Full Integration Test Introduction
00:44
Quick Note - Add New Lines to the Logger Statements to make log files readable
00:49
Create Master script for final Integration
07:12
Full Integration Test
03:30

Unit Test

6 lectures
Introduction to Unit Testing
01:17
Why do we need Unit Tests?
02:30
Basic Structure of Unit Test in Python
04:28
Sample Unit Tests
04:45
How to get Help for Unit Test Functions
04:13
Unit Test for our Project
09:28
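
A minimal sketch of the unit-test structure with Python's unittest and a local SparkSession; the test case itself is an illustrative assumption, not the course's test suite.

    import unittest
    from pyspark.sql import SparkSession

    class TestTransforms(unittest.TestCase):
        @classmethod
        def setUpClass(cls):
            cls.spark = (SparkSession.builder.master("local[1]")
                .appName("unit-tests").getOrCreate())

        def test_row_count(self):
            df = self.spark.createDataFrame([("ny",), ("la",)], ["city"])
            self.assertEqual(df.count(), 2)

        @classmethod
        def tearDownClass(cls):
            cls.spark.stop()

    if __name__ == "__main__":
        unittest.main()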
