Great post and well-explained tutorial.
As an addition to the web crawling tutorial: some websites will block you from scraping them when they detect that you are using a crawler. By default, Python's requests module sends each HTTP request with the user agent "python-requests/{version_of_module}", so to get around this you just need to send the request with a custom user agent: "requests.get(URL, headers={'user-agent': 'YOUR USER AGENT'})"
user_agent = the identity of your browser and the operating system that you are using
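To make the suggestion above concrete, here is a minimal sketch of it, assuming the requests package is installed. The URL, the user agent string, and the helper names prepare_get and fetch are hypothetical and only there for illustration. The sketch builds the request without sending it, so the header can be inspected before any server is contacted.

```python
import requests
from requests.utils import default_user_agent

# Hypothetical values for illustration; substitute your own.
URL = "https://example.com/"
CUSTOM_USER_AGENT = "Mozilla/5.0 (X11; Linux x86_64) my-crawler/0.1"


def prepare_get(url, user_agent):
    """Build (but do not send) a GET request carrying a custom User-Agent."""
    request = requests.Request("GET", url, headers={"User-Agent": user_agent})
    return request.prepare()


def fetch(url, user_agent, timeout=10):
    """Send the GET request with the custom User-Agent header."""
    return requests.get(url, headers={"User-Agent": user_agent}, timeout=timeout)


if __name__ == "__main__":
    # What requests would send if no header were set: "python-requests/<version>"
    print("Default:", default_user_agent())
    prepared = prepare_get(URL, CUSTOM_USER_AGENT)
    print("Custom: ", prepared.headers["User-Agent"])
```

Calling fetch(URL, CUSTOM_USER_AGENT) would send the request for real. Header names are case-insensitive, so 'user-agent' and 'User-Agent' behave the same way here.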
I haven't (deliberately!) explained how to use all the Requests attributes and methods, including setting a user agent. But please know that I am well aware of the (temporary) web crawler blocking that some web servers apply when they detect crawls from a certain IP. Please also know that in those types of cases, setting only a self-defined user agent adds zero value: you will get blocked nonetheless.
There are several workarounds for that (e.g. block detection combined with using a multitude of VPNs and/or Onion IPs, and/or even IP spoofing). But since I'm an ethical person, in situations such as those, I ask myself "would it be OK if I used those techniques on this webserver?" And my answer to that is: "No, let's look somewhere else for the data I need, or contact the web admin of that webserver to discuss if they are willing to voluntarily provide me with the data I'm looking for."
As a rule of thumb I always try to treat people like I'd like them to treat me. That's not always possible if the other party thinks and feels differently, but in such cases I prefer to interact with people that do feel the same as I do.
Google's motto is (was?) "Don't be evil". Personally, I'm taking it one step further: "Be Me, always, which equals: Be a Good Person".
@scipio