Description

Welcome to the Ultimate Web Scraping With Python Bootcamp, the only course you need to go from a complete beginner in Python to a very competent web scraper.

Web scraping is the process of programmatically extracting data from the web. Scraping agents visit a web resource, extract content from it, and then process the resulting data in order to parse some specific information of interest.

Scraping is the kind of programming skill that offers immediate feedback, and can be used to automate a wide variety of data collection and processing tasks.

Over the next 17+ hours, we will methodically cover everything you need to know to write web scraping agents in Python.

This bootcamp is organized in three parts of increasing difficulty, designed to help you progressively build your skills.


Part I - Begin

We'll start by understanding how the web works by taking a closer look at HTTP, the key application layer communication protocol of the modern web. Next, we'll explore HTML, CSS, and JavaScript from first principles to get a deeper understanding of how websites are built. Finally, we'll learn how to use Python to send HTTP requests and parse the resulting HTML, CSS, and JavaScript to extract the data we need. Our goal in the first part of the course is to build a solid foundation in both web scraping and Python, and put those skills into practice by building functional web scrapers from scratch. Selected topics include:


  • a detailed overview of the request-response cycle

  • understanding user-agents, HTTP verbs, headers and statuses

  • understanding why custom headers can often be used to bypass paywalls

  • mastering the requests library to work with HTTP in Python

  • what stateless means and how cookies work

  • exploring the role of proxies in modern web architectures

  • mastering beautifulsoup for parsing and data extraction
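
To give a feel for the parse-and-extract step, here is a minimal sketch that pulls links out of a page using only the standard library's html.parser. The course itself uses requests and beautifulsoup for this; the sample HTML, the LinkExtractor class, and the extract_links helper below are hypothetical stand-ins for a fetched page and are not taken from the course material.

```python
from html.parser import HTMLParser

# Hypothetical page content standing in for the body of an HTTP response.
SAMPLE_HTML = """
<html><body>
  <a href="/games/1">First game</a>
  <a href="/games/2">Second game</a>
  <p>No link here</p>
</body></html>
"""


class LinkExtractor(HTMLParser):
    """Collects (href, text) pairs for every <a> tag that has an href."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._current_href = None
        self._text_parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            # attrs is a list of (name, value) pairs.
            self._current_href = dict(attrs).get("href")
            self._text_parts = []

    def handle_data(self, data):
        # Only keep text that appears inside an open <a href=...> tag.
        if self._current_href is not None:
            self._text_parts.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._current_href is not None:
            text = "".join(self._text_parts).strip()
            self.links.append((self._current_href, text))
            self._current_href = None


def extract_links(html):
    """Return a list of (href, link text) tuples found in the HTML string."""
    parser = LinkExtractor()
    parser.feed(html)
    return parser.links


if __name__ == "__main__":
    for href, text in extract_links(SAMPLE_HTML):
        print(href, text)
```

With beautifulsoup the same extraction typically takes a few lines, which is one reason the course moves to a dedicated parsing library.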


Part II - Refine

In the second part of the course, we'll build on the foundation we've already laid to explore more advanced topics in web scraping. We'll learn how to scrape dynamic websites that use JavaScript to render their content, by setting up Microsoft Playwright as a headless browser to automate this process. We'll also learn how to identify and emulate API calls to scrape data from websites that don't have formally public APIs. Our projects in this section will include an image scraper that can download a set number of high-resolution images given some keyword, as well as another scraping agent that extracts price and content of discounted video games from a dynamically rendered website. Topics include:


  • identifying and using hidden APIs and understanding the benefits they offer

  • emulating headers, cookies, and body content with ease

  • automatically generating Python code from intercepted API requests using Postman and HTTPie

  • working with the highly performant selectolax parsing library

  • mastering CSS selectors

  • introducing Microsoft Playwright for headless browsing and dynamic rendering
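
Emulating a hidden API call mostly comes down to reproducing the headers and body the browser sends, then reading the JSON that comes back. The sketch below builds such a request with the standard library's urllib.request without sending it, and parses a sample response. The endpoint URL, the header values, the payload, and the "stores" response shape are all assumptions made up for illustration; a real site's API will differ, and the course works with requests rather than urllib for this.

```python
import json
from urllib.request import Request


def build_api_request(url, payload, extra_headers=None):
    """Build (but do not send) a POST request that mimics a browser API call."""
    headers = {
        # Hypothetical values; in practice these are copied from the
        # request seen in the browser's network tab.
        "User-Agent": "Mozilla/5.0 (compatible; example-scraper)",
        "Accept": "application/json",
        "Content-Type": "application/json",
    }
    headers.update(extra_headers or {})
    body = json.dumps(payload).encode("utf-8")
    return Request(url, data=body, headers=headers, method="POST")


def parse_api_response(raw_bytes):
    """Extract (name, city) pairs from a JSON body with an assumed shape."""
    data = json.loads(raw_bytes.decode("utf-8"))
    return [(store["name"], store["city"]) for store in data["stores"]]


if __name__ == "__main__":
    request = build_api_request(
        "https://example.com/api/locations",  # hypothetical endpoint
        {"city": "Toronto"},
    )
    print(request.get_method(), request.full_url)

    # Hypothetical response body, standing in for what the API would return.
    sample = b'{"stores": [{"name": "Main St", "city": "Toronto"}]}'
    print(parse_api_response(sample))
```

Because the response is already structured JSON, there is no HTML to parse, which is one of the benefits hidden APIs offer.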


Part III - Master

In the final part of the course, we'll introduce scrapy. This will give us an excellent, time-tested framework for building more complex and robust web scrapers. We'll learn how to set up scrapy within a virtual environment and how to create spiders and pipelines to extract data from websites in a variety of formats. Having learned how to use scrapy, we'll then explore how to integrate it with Playwright so that we can tackle the challenge of scraping dynamic websites from right within scrapy. We'll conclude this section by building a scraping agent that executes custom JavaScript code before returning the resulting HTML to scrapy. Some topics from this section:


  • setting up scrapy and exploring its command line interface ("the scrapy tool")

  • dynamically exploring response objects using scrapy shell

  • understanding and defining item schemas, and loading data using itemloaders and input/output processors

  • integrating Playwright into scrapy to tackle dynamically rendered JavaScript sites

  • writing PageMethods to give the headless browser precise instructions from right within scrapy

  • defining custom pipelines for saving into SQL databases and highly customized output formats
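
As a rough idea of what a custom pipeline for saving into a SQL database can look like, here is a minimal sketch built on the standard library's sqlite3. A scrapy item pipeline is a plain Python class, and open_spider, process_item, and close_spider are the hook names scrapy looks for. The class name SQLitePipeline, the games table, and the title and price fields are hypothetical choices for this example, not the course's own code.

```python
import sqlite3


class SQLitePipeline:
    """Stores each scraped item as a row in a SQLite table."""

    def __init__(self, db_path=":memory:"):
        # An in-memory database keeps the sketch self-contained;
        # a real project would point this at a file on disk.
        self.db_path = db_path
        self.connection = None

    def open_spider(self, spider):
        # Called once when the spider starts: connect and create the table.
        self.connection = sqlite3.connect(self.db_path)
        self.connection.execute(
            "CREATE TABLE IF NOT EXISTS games (title TEXT PRIMARY KEY, price REAL)"
        )

    def process_item(self, item, spider):
        # Called for every item; re-scraped titles overwrite the old row.
        self.connection.execute(
            "INSERT OR REPLACE INTO games (title, price) VALUES (?, ?)",
            (item["title"], item["price"]),
        )
        self.connection.commit()
        return item  # pass the item on to any later pipeline

    def close_spider(self, spider):
        # Called once when the spider finishes.
        self.connection.close()


if __name__ == "__main__":
    # Drive the hooks by hand, the way scrapy would during a crawl.
    pipeline = SQLitePipeline()
    pipeline.open_spider(spider=None)
    pipeline.process_item({"title": "Example Game", "price": 9.99}, spider=None)
    print(pipeline.connection.execute("SELECT * FROM games").fetchall())
    pipeline.close_spider(spider=None)
```

Inside a scrapy project, a class like this is enabled by listing it in the ITEM_PIPELINES setting.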


In this bootcamp, I will take you step-by-step through engaging video lectures and teach you everything you need to know to get started with web scraping in Python.

By the end of this course, you will have a complete toolset to conceptualize and implement scraping agents for any website you can imagine.


See you inside!

What you'll learn

Understand the fundamentals of web scraping in Python from absolute scratch

Scrape information from static and dynamic websites and extract it to a variety of formats

Intercept and emulate hidden APIs, which are often a more productive way of getting your data

Master the requests library for working with HTTP

Parse and extract content from HTML using beautifulsoup, selectolax, and Microsoft Playwright

Master complex CSS selectors including descendant, child, and sibling combinators

Understand how the web works, including HTTP, HTML, CSS, and JavaScript

Create scrapy crawlers and practice using items, itemloaders, and custom pipelines

Integrate scrapy with Playwright for highly performant, fine-tuned dynamic website crawling

Practice processing and exporting data to a variety of formats including CSV, JSON, XML, and SQL
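
For a sense of the export step, the sketch below writes the same scraped records to CSV and JSON using only the standard library. The to_csv and to_json helpers and the title and price fields are hypothetical names chosen for this example.

```python
import csv
import io
import json


def to_csv(records, fieldnames):
    """Serialize a list of dicts to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


def to_json(records):
    """Serialize a list of dicts to an indented JSON string."""
    return json.dumps(records, indent=2)


if __name__ == "__main__":
    # Hypothetical scraped records.
    records = [
        {"title": "Example Game", "price": 9.99},
        {"title": "Another Game", "price": 4.5},
    ]
    print(to_csv(records, ["title", "price"]))
    print(to_json(records))
```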

Requirements

  • No programming experience needed - I'll teach you everything you need to know
  • No paid software required - we'll be using open-source Python libraries
  • A computer with access to the internet
  • A readiness to learn real skills you can put into practice right away

Course content

16 sections

Introduction

3 lectures
Prerequisites
01:19
A Useful Mental Model
03:39
All Code Resources
00:20

The HTTP Protocol

9 lectures
What Is HTTP?
02:46
The Request-Response Cycle
03:28
Extra: But, This Website Remembers Me
05:20
User-Agents
03:16
HTTP Verbs
02:38
Status Codes
06:13
Headers
03:35
Extra: Headers Do Lie
05:10
Proxies
05:45

HTML, CSS, And JavaScript

10 lectures
The Ingredients
05:40
Markup
08:32
Attributes
06:00
Presentation
04:42
Some More Rules
04:43
Behaviour
08:03
More JavaScript
04:28
JavaScript In Web Scraping
07:21
Comments
04:39
Embedded
05:15

Web Requests In Python

7 lectures
Urllib
05:36
Requests
05:36
Setting Headers
07:41
Query Parameters
11:13
Authentication And Authorization
07:05
Aside From GET
04:21
POSTing Data
06:39

Parsing And Extraction

15 lectures
BeautifulSoup
07:54
Tags
05:50
Parents, Children, And Descendants
08:13
Siblings
02:25
Extracting Text
06:35
All Strings
03:19
Search
11:15
Challenge
01:30
Solution
09:32
Solution Refinement
12:04
An Extra: pandas
11:12
Functional Search Patterns
08:23
Text Search
08:58
Searching By CSS
07:21
Just One Tag
03:09

Project 1 - Portfolio Valuation With Google Finance

7 lectures
Scope Statement
03:04
An Extra: Some Finance Concepts
04:31
Parsing Price
12:41
Non-USD Prices
08:44
Adding Structure With Dataclasses
09:02
Position And Portfolio
09:00
Tabular Display
12:15

APIs: The Hidden Gems

10 lectures
Befriend The Network Tab
05:38
Case Study: Coffee Shop Locations
08:33
The Advantages Of APIs
07:02
Full Header Emulation
06:01
An Extra: Postman
03:53
Code Generation
06:38
Challenge
03:13
Solution: Interacting With The API
06:48
Solution: Processing The Data
06:43
Solution: Adding Geocode
09:56

Selectolax And Advanced CSS Selectors

5 lectures
Introduction
01:36
What Is selectolax?
09:10
CSS Combinators
08:46
Sibling Combinators
07:37
Selector Types
08:03

Project 2 - Image Scraper

12 lectures
Scope Statement
03:34
Prospecting
07:47
NOTE: Quick Correction To CSS Selector
00:24
Scraping HTML
07:34
Filtering Relevant URLs
09:17
Extracting High-Res Image URLs
11:20
Saving The Images
06:54
Stepping It Up With Logging
08:40
Back To The API
05:54
Filtered Canonical URLs
07:33
Pagination Prospecting
04:29
Wrapping Up
12:41

Tackling JavaScript With Microsoft Playwright

4 lectures
What You See vs. What You Get
09:53
Rendering JavaScript
05:24
Playwright Over Selenium
04:53
Case Study: Show Me The Money
10:39

Project 3 - Building A Configurable Scraping Pipeline

20 lectures
Scope Statement
06:41
Initial Setup
05:25
Fully Loaded Site
04:23
Selecting Game Containers
07:00
More Robust Render Thresholds
02:39
Extracting Title And Thumbnail
05:45
Game Category Tags
04:31
Release Date And Reviews
05:43
Original And Discount Price
05:51
Refactoring
05:19
Introducing Config
06:22
Configuration Integrated
06:58
Parsing Pipeline
12:03
Parameterized Extraction
10:12
Functional Post-Processing
11:28
Date Formatting
09:18
Regular Expressions
11:02
Saving To Disk
07:00
Integrating HTMLParser With The Generic Parser
07:46
Finishing Touches
05:22

The Scrapy Framework

15 lectures
Introduction
02:02
Virtual Environments And Scrapy
06:21
First Project And Spider
04:40
Scraping Elements
09:04
Extracting Specific Attributes
08:29
An Extra: Scrapy Shell
04:24
Rewriting Using XPath Selectors
10:28
Outputting Data
06:35
Defining Scrapy Items
06:40
Introducing Itemloaders
10:18
Fine-Tuned Post-Processing
10:07
Pipelined Data Validation
08:30
Saving To Databases
12:02
Challenge
03:54
Solution: Defining NoDuplicateCountryPipeline
07:13

Boosting Scrapy With scrapy-playwright

7 lectures
The JavaScript Wrench In The Works
09:44
Integrating scrapy-playwright
07:30
PageMethods
03:51
Pagination And Infinite Scroll
03:50
Playwright, Do This
08:46
Improved Snippet As PageMethod
07:25
Scraping Location, Department, And Posted Date
04:33

Project 4 - Scraping Dynamic Sites With Scrapy And Playwright

6 lectures
Scope Statement
04:00
New Project And Spider
04:04
Item And Itemloading
12:31
Pipelining To Database
09:00
Quick Fix
02:28
Grouped Elements JSON Export
09:42

Closing Thoughts

3 lectures
Try To Respect robots.txt
02:57
Thank You
00:28
My Other Courses
00:15

Appendix - Python Fundamentals

28 lectures
A Quick Note + Section Resources
00:19
Data Types
02:35
Variables
08:27
Arithmetic And Augmented Assignment Operators
07:16
Ints And Floats
08:54
Booleans And Comparison Operators
05:12
Strings
07:52
Methods
06:29
Containers I - Lists
06:08
Lists vs. Strings
06:53
List Methods And Functions
07:54
Containers II - Tuples
04:43
Containers III - Sets
10:32
Containers IV - Dictionaries
05:15
Dictionary Keys And Values
08:14
Membership Operators
04:28
Controlling Flow With if, else, And elif
08:21
Truth Value Of Non-Booleans
03:28
For Loops
05:05
The range() Immutable Sequence
05:10
While Loops
05:55
Break And Continue
04:15
Zipping Iterables
03:39
List Comprehensions
07:47
Defining Functions
10:18
Function Arguments: Positional vs Keyword
06:54
Lambdas
05:28
Importing Modules
05:38
