# Reading-notes

## Web Scraping

Web scraping is a technique for automatically accessing and extracting large amounts of information from a website, which can save a huge amount of time and effort.

Web scraping software may directly access the World Wide Web using the Hypertext Transfer Protocol or a web browser. It is a form of copying in which specific data is gathered and copied from the web, typically into a central local database or spreadsheet, for later retrieval or analysis.

Important notes about web scraping:

1. Understand how you can legally use the data.

2. Do not download data at too rapid a rate, because this may break the website.

## Inspecting the Website

The first step is to figure out where the links to the files we want to download are located inside the multiple levels of HTML tags.

## Python Code

Libraries used:

```python
import requests                 # HTTP requests to the website
import urllib.request           # downloading files from URLs
import time                     # pausing between requests
from bs4 import BeautifulSoup   # parsing the HTML
```

1. Download and import the libraries.

2. Set the URL to the website and access the site with our requests library.

3. Parse the HTML with BeautifulSoup so that we can work with a nicer, nested BeautifulSoup data structure.
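
A minimal sketch of these three steps, assuming a placeholder URL and that the data we want sits in ordinary `<a>` links:

```python
import requests
from bs4 import BeautifulSoup

# Step 2: set the URL and access the site with requests
# (https://example.com is a placeholder, not a real target)
url = "https://example.com/downloads"
response = requests.get(url)

# Step 3: parse the HTML into a nested BeautifulSoup structure
soup = BeautifulSoup(response.text, "html.parser")

# Now the tree can be searched, e.g. list every link on the page
for link in soup.find_all("a"):
    print(link.get("href"))
```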

## Techniques

The simplest form of web scraping is manually copying and pasting data from a web page into a text file or spreadsheet.

A simple yet powerful approach to extracting information from web pages is based on the UNIX grep command or the regular-expression-matching facilities of programming languages.
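
For example, a minimal regex-based sketch in Python (the URL and pattern are placeholders; a real pattern would be tailored to the page):

```python
import re
import requests

# A grep-style approach: fetch the raw HTML and pull out matches
# with a regular expression instead of parsing the document tree.
html = requests.get("https://example.com").text

# Find every href value in the raw text
links = re.findall(r'href="([^"]+)"', html)
print(links)
```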

Static and dynamic web pages can be retrieved by posting HTTP requests to the remote web server using socket programming.
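
A rough sketch of what that looks like at the socket level in Python (example.com is a placeholder host):

```python
import socket

# Retrieve a page by speaking HTTP directly over a socket,
# without any HTTP library.
host = "example.com"
request = f"GET / HTTP/1.1\r\nHost: {host}\r\nConnection: close\r\n\r\n"

with socket.create_connection((host, 80)) as sock:
    sock.sendall(request.encode())
    response = b""
    while chunk := sock.recv(4096):
        response += chunk

print(response.decode(errors="replace")[:500])  # headers + start of body
```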

Many websites have large collections of pages generated dynamically from an underlying structured source like a database.

On vertical aggregation platforms, the preparation involves establishing the knowledge base for the entire vertical, and then the platform creates the bots automatically.

The pages being scraped may also embed semantic markup or metadata, which can be used to locate specific data snippets.

There are also approaches using machine learning and computer vision that attempt to identify and extract information from web pages by interpreting them visually as a human being might.

## Web Scraping best practices

### + Respect Robots.txt

What if you need some data that is forbidden by Robots.txt? You could still scrape it, but most anti-scraping tools block web scraping when you are scraping pages that are not allowed by Robots.txt.
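
The standard library can check robots.txt before fetching a page; a small sketch, with a placeholder URL:

```python
from urllib import robotparser

# Check robots.txt before scraping
rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

if rp.can_fetch("*", "https://example.com/some/page"):
    print("Allowed to fetch this page")
else:
    print("Disallowed by robots.txt - skip it")
```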

### + Make the crawling slower, do not slam the server, treat websites nicely

Use auto-throttling mechanisms that automatically adjust the crawling speed based on the load on both the spider and the website you are crawling. Adjust the spider to an optimum crawling speed after a few trial runs, and repeat this periodically because the environment does change over time.
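
A minimal sketch of a self-throttling loop, assuming placeholder URLs and a simple back-off rule (slow responses are treated as a sign of server load):

```python
import time
import requests

urls = ["https://example.com/page1", "https://example.com/page2"]

delay = 1.0  # seconds between requests
for url in urls:
    start = time.time()
    response = requests.get(url)
    elapsed = time.time() - start

    # back off when responses get slow, never go below one second
    delay = max(1.0, elapsed * 2)
    time.sleep(delay)
```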

### + Do not follow the same crawling pattern

Humans generally do not perform repetitive tasks as they browse through a site; they mix in random actions, so a scraper that hits pages in a fixed, mechanical pattern is easy to spot.
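
One simple way to break up the pattern is to shuffle the page order and sleep a random interval between requests (placeholder URLs):

```python
import random
import time
import requests

urls = [f"https://example.com/page{i}" for i in range(1, 6)]
random.shuffle(urls)  # avoid visiting pages in a predictable order

for url in urls:
    requests.get(url)
    time.sleep(random.uniform(2, 8))  # random, human-like pause
```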

### + Make requests through Proxies and rotate them as needed

There are several methods that can change your outgoing IP, such as VPNs, TOR, or rotating proxy services.
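
A sketch of rotating requests through a proxy pool with the requests library; the proxy addresses are placeholders:

```python
import itertools
import requests

# Cycle through a pool of proxies; a real pool would come from a
# proxy provider or your own servers.
proxy_pool = itertools.cycle([
    "http://proxy1.example.com:8080",
    "http://proxy2.example.com:8080",
])

for url in ["https://example.com/a", "https://example.com/b"]:
    proxy = next(proxy_pool)
    response = requests.get(url, proxies={"http": proxy, "https": proxy})
    print(url, response.status_code)
```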

### + Beware of Honey Pot Traps

Honeypots are systems set up to lure hackers and detect any hacking attempts that try to gain information. On web pages, a honeypot is often a link that human visitors never see, for example one hidden with CSS, so a crawler that follows it exposes itself as a bot.
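
A crude check for one common trap, links hidden with CSS, sketched with BeautifulSoup (real honeypots may be disguised in other ways):

```python
from bs4 import BeautifulSoup

# Toy HTML: one visible link and one hidden trap link
html = '<a href="/real">ok</a><a href="/trap" style="display:none">x</a>'
soup = BeautifulSoup(html, "html.parser")

for link in soup.find_all("a"):
    style = (link.get("style") or "").replace(" ", "")
    if "display:none" in style or "visibility:hidden" in style:
        continue  # likely a honeypot - do not follow
    print("safe to follow:", link["href"])
```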

### + Avoid scraping data behind a login

Logging in is basically asking for permission to access web pages, and some websites, like Indeed, do not grant that permission for scraping.

They could take away your credentials or block your account, which can, in turn, lead to your web scraping efforts being blocked.

## How can websites detect and block web scraping?

## How do you find out if a website has blocked or banned you?