In this post, which can be read as a follow-up to our guide about web scraping without getting blocked, we will cover almost all of the tools Python offers to scrape the web. We will go from the basic ones to the more advanced ones, covering the pros and cons of each. Of course, we won't be able to cover every aspect of every tool we discuss, but this post should give you a good idea of what each tool does and when to use it.

Note: When I talk about Python in this blog post, you should assume that I mean Python 3.

The internet is complex: many underlying technologies and concepts are involved in displaying even a simple web page in your browser. I won't pretend to explain everything, but I will cover the most important things you need to understand to extract data from the web.

HyperText Transfer Protocol (HTTP) uses a client/server model. An HTTP client (a browser, your Python program, cURL, Requests…) opens a connection and sends a message (“I want to see that page: /product”) to an HTTP server (Nginx, Apache…). Then the server answers with a response (the HTML code, for example) and closes the connection.

HTTP is called a stateless protocol because each transaction (request/response) is independent. FTP, for example, is stateful.

Basically, when you type a website address in your browser, the HTTP request looks like this:

Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/web \
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit \

In the first line of this request, you can see the following:

The GET verb or method: This means we request data from the specific path: /product/. There are other HTTP verbs, and you can see the full list here.
The version of the HTTP protocol: In this tutorial we will focus on HTTP 1.
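To make the request format a bit more concrete, here is a minimal sketch that builds the raw text of a GET request for /product/ and splits its first line back into verb, path, and version. The helper names `build_get_request` and `parse_request_line`, the host `example.com`, the header values, and the exact version string `HTTP/1.1` are my own assumptions for illustration. Nothing is sent over the network.

```python
def build_get_request(path, host, headers=None, version="HTTP/1.1"):
    """Return the raw text of a GET request (illustrative, not a full client)."""
    # The request line comes first: verb, path, protocol version.
    lines = [f"GET {path} {version}", f"Host: {host}"]
    for name, value in (headers or {}).items():
        lines.append(f"{name}: {value}")
    # Each line ends with CRLF; an empty line marks the end of the headers.
    return "\r\n".join(lines) + "\r\n\r\n"


def parse_request_line(raw_request):
    """Split the first line of a request into (verb, path, version)."""
    first_line = raw_request.split("\r\n", 1)[0]
    verb, path, version = first_line.split(" ", 2)
    return verb, path, version


request = build_get_request(
    "/product/",
    "example.com",  # hypothetical host, chosen for the example
    {"Accept": "text/html", "Connection": "close"},  # assumed header values
)
print(request)
print(parse_request_line(request))
```

A real client such as a browser or Requests adds more headers and handles the connection for you, but the text on the wire has the same overall shape: a request line, then headers, then a blank line.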