Hi! Today we will install Scrapy and run a simple spider. You can find many articles about installing it in a virtualenv, but here we will install Scrapy in a Docker container. First of all, install Docker. I will write the commands for Ubuntu; if there is interest, I can repeat this article for Windows. Although I already have Docker installed, these steps come from the Docker documentation:
- Install packages to allow apt to use a repository over HTTPS:
$ sudo apt-get install apt-transport-https ca-certificates curl software-properties-common
- Add Docker’s official GPG key:
$ curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -
- Use the following command to set up the stable repository. You always need the stable repository, even if you want to install edge builds as well.
$ sudo add-apt-repository "deb [arch=amd64] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable"
- Update the apt package index.
$ sudo apt-get update
- Install the latest version of Docker
$ sudo apt-get install docker-ce
- Verify that Docker CE is installed correctly by running the hello-world image.
$ sudo docker run hello-world
This command downloads a test image and runs it in a container. When the container runs, it prints an informational message and exits.
Now let’s make a simple Scrapy container.
- Make a container directory in your home directory
$ mkdir ~/Scrapy
$ cd ~/Scrapy
- Write the instructions to build the container
$ nano Dockerfile
# Version: 0.0.1
FROM python
RUN apt-get update && apt-get upgrade -y
RUN pip install --upgrade pip
RUN pip install scrapy
Now we create a directory that will be mounted into the container as a shared volume, and place a test spider in it:
$ mkdir ~/Scrapy/scrapy-data
$ cd ~/Scrapy/scrapy-data
$ nano quotes_spider.py
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/tag/humor/',
    ]

    def parse(self, response):
        # Yield one item per quote block on the page
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.xpath('span/small/text()').extract_first(),
            }
        # Follow the "Next" link, if there is one, and parse it the same way
        next_page = response.css('li.next a::attr("href")').extract_first()
        if next_page is not None:
            yield response.follow(next_page, self.parse)
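To see what the pagination logic in parse() does, here is a minimal stdlib-only sketch that simulates following next_page links over a hypothetical in-memory "site" (the page data and the crawl loop are illustrative, not part of Scrapy):

```python
# Simulated "site": each URL maps to (quotes on that page, next page URL or None).
# The contents are made up for illustration only.
pages = {
    "/tag/humor/": (["quote 1", "quote 2"], "/tag/humor/page/2/"),
    "/tag/humor/page/2/": (["quote 3"], None),
}

def crawl(start):
    """Mimic QuotesSpider.parse: yield items from a page, then follow next_page until None."""
    url = start
    while url is not None:
        quotes, next_page = pages[url]
        for q in quotes:
            yield q
        url = next_page

print(list(crawl("/tag/humor/")))  # every quote from every page, in order
```

This is exactly the pattern the spider uses: `response.follow(next_page, self.parse)` re-enters parse() on the next page until there is no `li.next` link left.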
- Build the image
$ sudo docker build -t scrapy .
- And run it
$ sudo docker run -v ~/Scrapy/scrapy-data:/scrapy scrapy scrapy runspider /scrapy/quotes_spider.py -o /scrapy/quotes.json
-v ~/Scrapy/scrapy-data:/scrapy
mounts the host directory ~/Scrapy/scrapy-data into the container at /scrapy, so files written there are shared between the host and the container.
scrapy runspider /scrapy/quotes_spider.py -o /scrapy/quotes.json
is the command that will be run inside the container.
The file ~/Scrapy/scrapy-data/quotes.json
will contain the result of running our spider. As a result, we now have an environment for writing our Steemit spiders, which will be the topic of the next article.
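With the -o option, Scrapy exports the scraped items as a JSON array of objects carrying the 'text' and 'author' keys that parse() yields. A quick stdlib sketch of reading such a file (the sample data here is illustrative; real output will hold quotes from quotes.toscrape.com):

```python
import json

# Illustrative sample of the quotes.json structure; with a real file you
# would use: quotes = json.load(open("quotes.json"))
sample = '[{"text": "Some quote", "author": "Some Author"}]'

quotes = json.loads(sample)
for q in quotes:
    print(q['author'], '-', q['text'])
```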
Thank you.
Nice HOWTO but you haven't told the readers what Scrapy is or does. You possibly could have quickly explained what Containers are or even what Docker is as well.
Of course you could have just simply linked to the explanations. ;-)
It's your work but I always write as if the reader knows nothing about the subject, which for Steemit is probably 95% of your audience. :-)
Pete
Hi! Thank you, I will take your advice into account in further articles. My last article was about Scrapy: https://steemit.com/howto/@ertinfagor/python-scrapy :) I should have added a link to it :)
As I say, it's your work, not mine. Just an observation, hope you're not offended.
It's ok :)