Description

This course shows each step of writing an ETL pipeline in Python, from scratch to production, using the necessary tools: Python 3.9, Jupyter Notebook, Git and GitHub, Visual Studio Code, Docker and Docker Hub, and the Python packages Pandas, boto3, pyyaml, awscli, jupyter, pylint, moto, coverage and memory-profiler.

Two different approaches to coding in the data engineering field are introduced and applied: functional and object-oriented programming.
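
To make the contrast concrete, here is a minimal sketch of one small transformation step written in both styles. It is not taken from the course material; the names filter_by_min_volume and VolumeFilter and the row layout are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


def filter_by_min_volume(rows, min_volume):
    """Functional style: a pure function, every input is passed as an argument."""
    return [row for row in rows if row["volume"] >= min_volume]


@dataclass
class VolumeFilter:
    """Object-oriented style: the configuration lives on the instance."""

    min_volume: int

    def apply(self, rows):
        return [row for row in rows if row["volume"] >= self.min_volume]


# Hypothetical sample rows, not real trading data.
rows = [{"isin": "ISIN_A", "volume": 10}, {"isin": "ISIN_B", "volume": 200}]

functional_result = filter_by_min_volume(rows, 100)
oop_result = VolumeFilter(min_volume=100).apply(rows)
```

Both calls return the same rows. The difference is where the configuration is held: in the argument list or on the object.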

Best practices in developing Python code will be introduced and applied:

  • design principles

  • clean coding

  • virtual environments

  • project/folder setup

  • configuration

  • logging

  • exception handling

  • linting

  • dependency management

  • performance tuning with profiling

  • unit testing

  • integration testing

  • dockerization
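
Two items from the list above, logging and exception handling, can be sketched together as follows. This is a minimal sketch using only the standard library. The exception name WrongFormatError, the function check_format and the list of supported formats are assumptions made for illustration, not the course's actual code.

```python
import logging

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s - %(name)s - %(levelname)s - %(message)s",
)
logger = logging.getLogger("etl_sketch")

# Assumed list of output formats; the real job may support others.
SUPPORTED_FORMATS = ("csv", "parquet")


class WrongFormatError(Exception):
    """Hypothetical custom exception for an unsupported output format."""


def check_format(file_format):
    """Return the format if it is supported, otherwise log and raise."""
    if file_format not in SUPPORTED_FORMATS:
        logger.error("The file format %s is not supported.", file_format)
        raise WrongFormatError(f"Unsupported file format: {file_format}")
    logger.info("Writing output as %s.", file_format)
    return file_format
```

A custom exception class lets the caller distinguish an expected configuration problem from an unexpected failure, while the log entry records what happened before the job stops.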


What is the goal of this course?

In the course we are going to use the Xetra dataset. Xetra stands for Exchange Electronic Trading and it is the trading platform of the Deutsche Börse Group. This dataset is derived near-time on a minute-by-minute basis from Deutsche Börse’s trading system and saved in an AWS S3 bucket available to the public for free.

The ETL Pipeline we are going to create will extract the Xetra dataset from the AWS S3 source bucket on a scheduled basis, create a report using transformations and load the transformed data to another AWS S3 target bucket.
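
The extract, transform and load steps described above can be sketched roughly as follows. To stay self-contained, this sketch uses the standard library and plain dictionaries as stand-ins for the S3 source and target buckets, whereas the course itself works with Pandas and boto3. The object keys, column names, sample values and report fields are assumptions for illustration only.

```python
import csv
import io
from collections import defaultdict

HEADER = "ISIN,Date,Time,StartPrice,EndPrice,TradedVolume\n"

# Hypothetical in-memory stand-in for the source bucket: key -> CSV content.
SOURCE_BUCKET = {
    "2021-04-01/09.csv": HEADER
    + "ISIN_A,2021-04-01,09:00,10.0,10.5,100\n"
    + "ISIN_A,2021-04-01,09:01,10.5,10.2,50\n",
    "2021-04-02/09.csv": HEADER + "ISIN_A,2021-04-02,09:00,10.2,10.4,70\n",
}

REPORT_FIELDS = ["ISIN", "Date", "opening_price", "closing_price", "daily_traded_volume"]


def extract(bucket, date_prefix):
    """Read every CSV object whose key starts with the given date prefix."""
    rows = []
    for key in sorted(bucket):
        if key.startswith(date_prefix):
            rows.extend(csv.DictReader(io.StringIO(bucket[key])))
    return rows


def transform(rows):
    """Aggregate minute-level rows into one report row per ISIN and day."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[(row["ISIN"], row["Date"])].append(row)
    report = []
    for (isin, day), group in sorted(grouped.items()):
        group.sort(key=lambda r: r["Time"])
        report.append(
            {
                "ISIN": isin,
                "Date": day,
                "opening_price": float(group[0]["StartPrice"]),
                "closing_price": float(group[-1]["EndPrice"]),
                "daily_traded_volume": sum(int(r["TradedVolume"]) for r in group),
            }
        )
    return report


def load(bucket, key, report):
    """Write the report as CSV text into the target bucket stand-in."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=REPORT_FIELDS)
    writer.writeheader()
    writer.writerows(report)
    bucket[key] = buffer.getvalue()
```

A run for a single day would chain the three steps, for example load(target, "report.csv", transform(extract(SOURCE_BUCKET, "2021-04-01"))) with target being an empty dictionary.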

The pipeline will be written so that it can be deployed easily to almost any production environment that can handle containerized applications. The production environment we are going to write the ETL pipeline for consists of a GitHub code repository, a Docker Hub image repository, an execution platform such as Kubernetes, and an orchestration tool such as Apache Airflow or Argo Workflows, the container-native workflow engine for Kubernetes.


So what can you expect in the course?

You will receive primarily practical, interactive lessons in which you code and implement the pipeline yourself, with theory lessons where needed. Furthermore, you will get the Python code for each lesson in the course material, the whole project on GitHub, and the ready-to-use Docker image with the application code on Docker Hub.

PowerPoint slides are available for download for each theoretical lesson, along with useful links for each topic and step where you can find more information and dive deeper.


What you will learn

How to write professional ETL pipelines in Python.

Steps to write production-level Python code.

How to apply functional programming in Data Engineering.

How to create a proper object-oriented code design.

How to use a meta file for job control.

Coding best practices for Python in ETL/Data Engineering.

How to implement a pipeline in Python extracting data from an AWS S3 source, transforming and loading the data to another AWS S3 target.
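
The meta file for job control mentioned above can be sketched as follows: the job records which source dates it has already processed and, on the next run, only picks up the missing ones. The function names return_date_list and update_meta_file appear in the course outline, but the signatures, the meta file layout (a CSV with the columns source_date and processed_at) and the logic shown here are assumptions made for illustration.

```python
import csv
import io
from datetime import date, timedelta

# Assumed meta file layout: one row per processed source date.
META_HEADER = "source_date,processed_at\n"


def return_date_list(first_date, today, meta_csv):
    """Return the ISO dates between first_date and today not yet in the meta file."""
    processed = {row["source_date"] for row in csv.DictReader(io.StringIO(meta_csv))}
    days = (today - first_date).days + 1
    candidates = [first_date + timedelta(days=offset) for offset in range(days)]
    return [d.isoformat() for d in candidates if d.isoformat() not in processed]


def update_meta_file(meta_csv, new_dates, processed_at):
    """Return the meta file content with one appended row per newly processed date."""
    content = meta_csv or META_HEADER
    if not content.endswith("\n"):
        content += "\n"
    for source_date in new_dates:
        content += f"{source_date},{processed_at}\n"
    return content


# Hypothetical meta file content after one earlier run.
meta = META_HEADER + "2021-04-01,2021-04-01 12:00:00\n"
todo = return_date_list(date(2021, 4, 1), date(2021, 4, 3), meta)
meta_after_run = update_meta_file(meta, todo, "2021-04-03 08:00:00")
```

In this sketch the meta file is passed around as a string. In a real job it would be read from and written back to storage, so that a rerun after a failure skips dates that were already loaded.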

Requirements

  • Basic Python and Pandas knowledge is desirable.
  • Basic ETL and AWS S3 knowledge is desirable.

Course content

8 sections

Introduction

5 lectures
Course Introduction
03:26
[Important!] Updates
00:20
Task Description
04:01
Production Environment
02:03
Task Steps
03:49

Quick and Dirty Solution

9 lectures
Why to use a virtual environment?
04:02
Virtual Environment Setup
06:10
AWS Setup
06:53
Understanding the source data
10:10
Quick and Dirty: Read multiple files
12:27
Quick and Dirty: Transformations
15:48
Quick and Dirty: Argument Date
09:57
Quick and Dirty: Save to S3
08:41
Quick and Dirty: Code Improvements
08:27

Functional Approach

8 lectures
Why a code design is needed?
02:42
Functional vs. Object Oriented Programming
06:29
Why Software Testing?
04:32
Quick and Dirty to Functions: Architecture Design
00:57
Quick and Dirty to Functions: Restructure Part 1
15:38
Quick and Dirty to Functions: Restructure Part 2
12:32
Restructure get_objects Intro
01:50
Restructure get_objects Implementation
11:35

Object Oriented Approach

7 lectures
Design Principles OOP
04:22
More Requirements - Configuration, Meta Data, Logging, Exceptions, Entrypoint
11:30
Meta Data: return_date_list Quick and Dirty
17:50
Meta Data: return_date_list Function
14:17
Meta Data: update_meta_file
12:12
Code Design - Class design, methods, attributes, arguments
13:41
Comparison Functional Programming and OOP
01:04

Setup and Class Frame Implementation

14 lectures
Setting up Git Repository
04:54
Setting up Python Project - Folder Structure
04:40
Installation Visual Studio Code
02:31
Setting up class frame - Task Description
01:54
Setting up class frame - Solution S3
11:57
Setting up class frame - Solution meta_process
01:01
Setting up class frame - Solution constants
01:00
Setting up class frame - Solution custom_exceptions
00:23
Setting up class frame - Solution xetra_transformer
02:54
Setting up class frame - Solution run
00:34
Logging in Python - Intro
01:28
Logging in Python - Implementation
12:50
Create Pythonpath
06:00
Python Clean Coding
03:34

Code Implementation

29 lectures
list_files_in_prefix - Thoughts
04:11
list_files_in_prefix - Implementation
02:21
list_files_in_prefix - Linting Intro
01:15
list_files_in_prefix - Pylint
04:46
list_files_in_prefix - Unit Testing Intro
02:58
list_files_in_prefix - Unit Test Specification
03:12
list_files_in_prefix - Unit Test Implementation 1
14:49
list_files_in_prefix - Unit Test Implementation 2
14:30
Task Description - Writing Methods
02:02
Solution - read_csv_to_df - Implementation
00:38
Solution - read_csv_to_df - Unit Test Implementation
02:15
Solution - write_df_to_s3 - Implementation
02:41
Solution - write_df_to_s3 - Unit Test Implementation
03:38
Solution - update_meta_file - Implementation
02:01
Solution - update_meta_file - Unit Test Implementation
06:29
Solution - return_date_list - Implementation
00:24
Solution - return_date_list - Unit Test Implementation
04:50
Solution - extract - Implementation
01:27
Solution - extract - Unit Test Implementation
04:30
Solution - transform_report1 - Implementation
00:35
Solution - transform_report1 - Unit Test Implementation
02:20
Solution - load - Implementation
00:43
Solution - load - Unit Test Implementation
01:32
Solution - etl_report1 - Implementation
00:34
Solution - etl_report1 - Unit Test Implementation
01:48
Integration Tests - Intro
00:28
Integration Tests - Test Specification
01:04
Integration Tests - Implementation
12:09
Entrypoint run - Implementation
12:22

Finalizing the ETL Job

6 lectures
Dependency Management - Intro
07:34
pipenv Implementation
04:12
Profiling and Timing - Intro
02:05
Mem-Profiler
03:07
Dockerization
04:54
Run in Production Environment
04:31

Summary

1 lecture
Summary
01:32
