Web scraping is a technique for automatically accessing and extracting large amounts of information from a website, which can save a huge amount of time and effort.
Web scraping software may access the World Wide Web directly using the Hypertext Transfer Protocol or through a web browser. It is a form of copying in which specific data is gathered and copied from the web, typically into a central local database or spreadsheet, for later retrieval or analysis.
Keep two rules in mind before scraping a site:
1- Understand how you can legally use the data.
2- Do not download data at too rapid a rate, because this may break the website.
First, figure out where the links to the files we want to download are located inside the multiple levels of HTML tags.
Libraries used:
import requests
import urllib.request
import time
from bs4 import BeautifulSoup
1- Download the libraries.
2- Set the URL to the website and access the site with the requests library.
3- Parse the HTML with BeautifulSoup so that we can work with a nicer, nested BeautifulSoup data structure (see the sketch after this list).
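A minimal sketch of these three steps, assuming a hypothetical page that lists the files we want as ordinary `<a>` links (the URL and the `.csv` filter are placeholders):

```python
import requests
from bs4 import BeautifulSoup

# Hypothetical page that lists the files we want to download.
url = "https://example.com/downloads"

# Step 2: access the site with the requests library.
response = requests.get(url)
response.raise_for_status()

# Step 3: parse the HTML into a nested BeautifulSoup data structure.
soup = BeautifulSoup(response.text, "html.parser")

# Locate the links to the files inside the multiple levels of HTML tags.
for link in soup.find_all("a"):
    href = link.get("href")
    if href and href.endswith(".csv"):  # keep only the files we care about
        print(href)
```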
The simplest form of web scraping is manually copying and pasting data from a web page into a text file or spreadsheet.
A simple yet powerful approach to extracting information from web pages is based on the UNIX grep command or the regular-expression-matching facilities of programming languages.
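For instance, a grep-style sketch that pulls every email-like string out of a fetched page with Python's re module (the URL and the pattern are purely illustrative):

```python
import re
import urllib.request

# Fetch the raw HTML of a page (URL is a placeholder).
with urllib.request.urlopen("https://example.com") as resp:
    html = resp.read().decode("utf-8", errors="replace")

# Grep-style extraction: match every email-like string in the page.
email_pattern = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
for match in email_pattern.findall(html):
    print(match)
```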
Static and dynamic web pages can be retrieved by posting HTTP requests to the remote web server using socket programming.
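As a bare-bones illustration of the socket approach, the sketch below sends a minimal HTTP/1.1 GET over a raw TCP connection (plain HTTP on port 80, placeholder host); in practice the requests library does this work for you:

```python
import socket

host = "example.com"  # placeholder host

# Open a TCP connection to the web server and send a minimal HTTP/1.1 request.
with socket.create_connection((host, 80)) as sock:
    request = (
        "GET / HTTP/1.1\r\n"
        f"Host: {host}\r\n"
        "Connection: close\r\n"
        "\r\n"
    )
    sock.sendall(request.encode("ascii"))

    # Read the response until the server closes the connection.
    chunks = []
    while True:
        data = sock.recv(4096)
        if not data:
            break
        chunks.append(data)

raw_response = b"".join(chunks)
print(raw_response[:200])  # the status line and headers come first
```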
Many websites have large collections of pages generated dynamically from an underlying structured source like a database; pages of the same category are usually rendered from a common template, which a scraper can exploit when parsing the HTML.
DOM parsing: a page can be loaded into an HTML/DOM parser (or a full browser), and specific elements can then be selected out of the resulting DOM tree.
Vertical aggregation: the preparation involves establishing the knowledge base for the entire vertical, and then the platform creates the bots automatically.
Semantic annotations or metadata embedded in a page can be used to locate specific data snippets.
There are also approaches using machine learning and computer vision that attempt to identify and extract information from web pages by interpreting them visually, as a human being might.
## Web Scraping best practices
### + Respect Robots.txt
What if you need data that is disallowed by robots.txt? You could still scrape it, but keep in mind that most anti-scraping tools will block your scraper when it requests pages that robots.txt does not allow.
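Python's built-in urllib.robotparser makes it easy to check what robots.txt allows before you fetch a page; the URLs and user-agent string below are placeholders:

```python
from urllib import robotparser

# Download and parse the site's robots.txt (URL is a placeholder).
rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

user_agent = "my-scraper"          # placeholder user-agent string
page = "https://example.com/data"  # page we would like to scrape

if rp.can_fetch(user_agent, page):
    print("Allowed by robots.txt")
else:
    print("Disallowed by robots.txt - skip this page")
```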
Use an auto-throttling mechanism that automatically adjusts the crawling speed based on the load on both the spider and the website you are crawling. Adjust the spider to an optimal crawling speed after a few trial runs, and repeat this periodically, because the environment changes over time.
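For example, if your spider is built with Scrapy (an assumption, since the snippets above use plain requests), its AutoThrottle extension implements this idea; the values below are illustrative starting points, not tuned recommendations:

```python
# settings.py in a Scrapy project -- illustrative values only
AUTOTHROTTLE_ENABLED = True            # adapt delays to observed response times
AUTOTHROTTLE_START_DELAY = 5           # initial download delay (seconds)
AUTOTHROTTLE_MAX_DELAY = 60            # cap the delay when the site is slow
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0  # average parallel requests per server
DOWNLOAD_DELAY = 2                     # baseline delay between requests (seconds)
```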
Humans generally do not perform repetitive tasks as they browse through a site; they pause and act somewhat randomly, so a scraper that follows a fixed, machine-like pattern is easy to spot.
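One simple way to mimic that irregular pace is to sleep for a random interval between requests. A minimal sketch using requests plus the standard random and time modules (the URLs and the 2-10 second range are placeholders):

```python
import random
import time

import requests

# Placeholder URLs to crawl.
urls = [
    "https://example.com/page/1",
    "https://example.com/page/2",
    "https://example.com/page/3",
]

for url in urls:
    response = requests.get(url)
    print(url, response.status_code)

    # Wait a random 2-10 seconds so requests do not arrive at a fixed rhythm.
    time.sleep(random.uniform(2, 10))
```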
There are several methods to change your outgoing IP address, such as routing requests through rotating proxy servers or a VPN.
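For example, the requests library accepts a proxies argument, so the website sees the proxy's IP address instead of yours; the proxy address below is a placeholder, and rotating through a pool of such proxies is a common extension:

```python
import requests

# Placeholder proxy address -- substitute a proxy (or proxy pool) you control.
proxies = {
    "http": "http://203.0.113.10:8080",
    "https": "http://203.0.113.10:8080",
}

# The target site now sees the proxy's IP instead of ours.
response = requests.get("https://example.com", proxies=proxies, timeout=10)
print(response.status_code)
```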
Honeypots are systems set up to lure hackers and detect hacking or scraping attempts that try to gain information; on websites they often take the form of links that are invisible to human visitors but followed by bots.
Logging in is essentially asking for permission to access web pages, and some websites, such as Indeed, do not grant that permission to scrapers. If you scrape while logged in, the site can tie the activity to your account: it could revoke your credentials or block the account, which can, in turn, lead to your web scraping efforts being blocked.
Websites detect scrapers through signals such as unusual traffic or a high download rate from a single client or IP address within a short time span.
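When a site flags this kind of traffic it often answers with HTTP 429 (Too Many Requests) or 503, so a scraper can watch for those codes and back off; a hedged sketch (URL, retry count, and delays are illustrative):

```python
import time

import requests

def fetch_with_backoff(url, max_retries=5):
    """Retry a request with exponential backoff when the server signals overload."""
    delay = 5  # seconds; illustrative starting value
    response = None
    for _ in range(max_retries):
        response = requests.get(url)
        if response.status_code not in (429, 503):
            return response
        # The server is pushing back -- wait longer before trying again.
        time.sleep(delay)
        delay *= 2
    return response

resp = fetch_with_backoff("https://example.com/data")  # placeholder URL
print(resp.status_code)
```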