
Scrapy project — A simple guide

A simple guide to Scrapy covering project structure, persistence, deployment, distributed crawling, and more.

Install

pip install scrapy

If you run into network issues, try using another package index by passing the --index-url flag to pip install.

pip install scrapy --index-url https://pypi.tuna.tsinghua.edu.cn/simple

This uses the Tsinghua mirror only for this installation of Scrapy. Keep in mind that it is temporary: you will need to pass the --index-url flag every time.

Create a project

To create a Scrapy project, use the scrapy startproject command.

scrapy startproject project_name

This will create a new directory called project_name with the basic structure of a Scrapy project. From there, you can start adding spiders, items, and pipelines to your project.

Here’s the basic structure of a Scrapy project:

project_name/
    scrapy.cfg
    project_name/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            spider1.py
            spider2.py
            ...
  • scrapy.cfg is the configuration file for the project.
  • items.py defines the data structure for the scraped items.
  • middlewares.py contains middleware classes that can be used to modify requests and responses.
  • pipelines.py contains pipeline classes that process scraped items.
  • settings.py contains project settings.
  • spiders/ is a directory where you can put your spiders.

Spiders

Each spider is defined in its own module under the spiders/ directory. The spider module should define a class that extends scrapy.Spider and provides the spider’s name and start URL(s). The spider’s logic should be implemented in the parse method.

import scrapy

class ExampleSpider(scrapy.Spider):
    name = "example"
    start_urls = [
        'https://example.toscrape.com/page/1/',
        'https://example.toscrape.com/page/2/',
    ]

    def parse(self, response):
        # yield items and/or follow-up requests here
        pass

In the parse method, we can yield item data structured like what we define in items.py, or yield further scrapy.Request objects.
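
As a minimal sketch of such a parse method (the CSS selectors and the next-page link are assumptions about the target pages, not part of the spider above):

def parse(self, response):
    # Yield one item per quote block; the selectors are placeholders.
    for quote in response.css("div.quote"):
        yield {
            "text": quote.css("span.text::text").get(),
            "author": quote.css("small.author::text").get(),
        }

    # Follow the pagination link, if there is one, with the same callback.
    next_page = response.css("li.next a::attr(href)").get()
    if next_page is not None:
        yield response.follow(next_page, callback=self.parse)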

Items

Define the data structure for the scraped items:

Using scrapy.Item:

import scrapy

class ProductItem(scrapy.Item):
    name = scrapy.Field()
    price = scrapy.Field()
    description = scrapy.Field()
    brand = scrapy.Field()

We define a ProductItem class that extends scrapy.Item. It has four fields: name, price, description, and brand. These fields will be used to store the scraped data.
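
For example, a spider's parse callback could populate and yield this item as follows (the CSS selectors are placeholders, not taken from a real page):

def parse(self, response):
    item = ProductItem()
    # scrapy.Item instances are populated like dictionaries.
    item['name'] = response.css('h1.product-name::text').get()
    item['price'] = response.css('span.price::text').get()
    item['description'] = response.css('div.description::text').get()
    item['brand'] = response.css('div.brand::text').get()
    yield item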

Using dataclass:

from dataclasses import dataclass

@dataclass
class Product:
    name: str
    price: float
    description: str
    brand: str

Alternatively, we can define the item as a dataclass, with the field types specified using type hints.

Both of these approaches will work in a Scrapy project. The choice between them depends on personal preference and project requirements.
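
Whichever style you pick, code that needs to handle items generically (for example, inside pipelines) can use the itemadapter package, which Scrapy depends on, to read fields the same way for dicts, Item subclasses, and dataclasses. A minimal sketch with a hypothetical describe helper:

from itemadapter import ItemAdapter

def describe(item):
    # ItemAdapter wraps dicts, scrapy.Item subclasses, dataclasses, and attrs
    # classes behind one dict-like interface.
    adapter = ItemAdapter(item)
    return f"{adapter.get('brand')} {adapter.get('name')}: {adapter.get('price')}"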

Pipelines

MongoDBPipeline

To store data in MongoDB with Scrapy, we need to define an item pipeline: a plain class that implements process_item (and, optionally, open_spider, close_spider, and from_crawler).

import pymongo

class MongoDBPipeline:
    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DATABASE')
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        collection_name = spider.name
        # insert_one is the current PyMongo API (Collection.insert was removed in PyMongo 4).
        self.db[collection_name].insert_one(dict(item))
        return item

We define a MongoDBPipeline class that uses the pymongo library to connect to a MongoDB database and store the scraped data.

The __init__ method takes two arguments: mongo_uri and mongo_db. These arguments are used to configure the MongoDB connection.

The from_crawler method is a class method that creates a new instance of the pipeline class. It reads the MONGO_URI and MONGO_DATABASE settings from the Scrapy settings.
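
For from_crawler to find these values, settings.py must define them; a hedged example (the URI and database name are placeholders):

MONGO_URI = 'mongodb://localhost:27017'
MONGO_DATABASE = 'scrapy_data'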

Alternatively, we can use the get_project_settings function from the scrapy.utils.project module to read the MONGO_URI and MONGO_DATABASE settings.

from scrapy.utils.project import get_project_settings

settings = get_project_settings()  # returns a Settings object with dict-like access
mongo_uri = settings.get('MONGO_URI')
mongo_db = settings.get('MONGO_DATABASE')

The open_spider method is called when the spider is opened. It creates a new MongoDB client and connects to the specified database.

The close_spider method is called when the spider is closed. It closes the MongoDB client.

The process_item method is called for each item that is scraped. It inserts the item into the appropriate collection in the database.

We can also modify the process_item method in the MongoDBPipeline class, e.g. to upsert documents and stamp them with the time they were first inserted:

import datetime

def process_item(self, item, spider):
    collection_name = spider.name
    timestamp = datetime.datetime.utcnow()
    self.db[collection_name].update_one(
        {'_id': item['_id']},
        {'$setOnInsert': {'timestamp': timestamp}, '$set': dict(item)},
        upsert=True,
    )
    return item

Here, I use the update_one method to update the document if it already exists or insert it if it doesn’t. The $setOnInsert operator sets the timestamp field only on the first insert. The $set operator sets all other fields in the document.

Note that the _id field must be present in the item for $setOnInsert to work correctly. If the _id field is not present, MongoDB will generate a new ObjectId for the document.

Now, each document that is inserted into the database will have a timestamp field that indicates when it was first inserted, but the field will not be updated on subsequent updates to the document.
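
Since the upsert keys on _id, the item has to carry one. One hedged way to do that (an assumption, not part of the pipeline above) is to derive a stable _id from the page URL in the spider:

import hashlib

def parse(self, response):
    # Hash the page URL so re-crawling the same page updates the same document.
    yield {
        '_id': hashlib.sha1(response.url.encode('utf-8')).hexdigest(),
        'url': response.url,
        'title': response.css('title::text').get(),
    }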

MinioFilesPipeline

To store downloaded files on a MinIO server, we can build on Scrapy's files pipeline.

First, pip install botocore (Scrapy's S3-compatible file storage depends on it).

Then we create a new pipeline class that extends scrapy.pipelines.files.FilesPipeline, e.g.:

import scrapy
from itemadapter import ItemAdapter
from scrapy.pipelines.files import FilesPipeline


class MinioFilesPipeline(FilesPipeline):

    def get_media_requests(self, item, info):
        adapter = ItemAdapter(item)
        for file_url in adapter['file_urls']:
            # Carry the desired object name in the request meta so that
            # file_path() can use it; here it is derived from the URL.
            yield scrapy.Request(file_url, meta={'filename': file_url.split('/')[-1]})

    def file_path(self, request, response=None, info=None, *, item=None):
        return request.meta['filename']

    def item_completed(self, results, item, info):
        return item

We also need to configure the settings for the MinioFilesPipeline in the settings.py file. Set the target storage setting FILES_STORE to a valid value that will be used for storing the downloaded files; otherwise the pipeline stays disabled, even if you include it in the ITEM_PIPELINES setting.

FILES_STORE = 's3://your-bucket-name'

# Scrapy's S3-compatible storage reads these AWS_* settings; point the endpoint
# at the MinIO server and supply the MinIO credentials.
AWS_ENDPOINT_URL = 'http://localhost:9000'
AWS_ACCESS_KEY_ID = 'access_key'
AWS_SECRET_ACCESS_KEY = 'secret_key'
AWS_USE_SSL = False

To use those pipelines in our Scrapy project, we need to add them to the ITEM_PIPELINES setting.

ITEM_PIPELINES = {
    'myproject.pipelines.MongoDBPipeline': 300,
    'myproject.pipelines.MinioFilesPipeline': 400,
}
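
Note that the files pipeline only acts on items that expose file URLs: by default it reads a file_urls field and writes the download results into a files field. A hedged sketch of such an item:

import scrapy

class DocumentItem(scrapy.Item):
    title = scrapy.Field()
    file_urls = scrapy.Field()  # input: list of URLs for the files pipeline to download
    files = scrapy.Field()      # output: filled in by the pipeline with download results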

Log

To configure logging, modify the LOG_* settings (such as LOG_ENABLED, LOG_LEVEL, and LOG_FILE) in our settings.py file:

# Logging settings
import os.path
from datetime import datetime
import pathlib

now = datetime.now()
log_file_path = f"logs/log_{now.year}_{now.month}_{now.day}_{now.hour}_{now.minute}_{now.second}.log"

# If you store log files in a subdirectory, make sure the path exists first.
base_path = pathlib.Path().resolve()  # resolves to the directory Scrapy is run from, e.g. "path/to/project/"
if not os.path.exists(f"{base_path}/logs/"):
    os.mkdir(f"{base_path}/logs/")
# Use os.makedirs() instead if you need several levels of nested directories,
# e.g. 'path/to/project/project/logs/'.

LOG_FORMATTER = "project.util.CustomLogFormatter"
LOG_ENABLED = True
LOG_ENCODING = "utf-8"
LOG_FILE = log_file_path
LOG_LEVEL = "DEBUG"
LOG_STDOUT = False

Suppose we do not want scraped items written to the log while keeping LOG_LEVEL = "DEBUG". We can use a custom log formatter to exclude items from the logs. Remember to point the LOG_FORMATTER value in settings.py at it.

from scrapy.logformatter import LogFormatter


class CustomLogFormatter(LogFormatter):

    def scraped(self, item, response, spider):
        # Only log scraped items when the LOG_SCRAPED_ITEMS setting is enabled.
        return (
            super().scraped(item, response, spider)
            if spider.settings.getbool("LOG_SCRAPED_ITEMS")
            else None
        )
# The custom LOG_SCRAPED_ITEMS setting controls whether scraped items are logged.
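
With the formatter in place, the setting it reads can be toggled in settings.py; LOG_SCRAPED_ITEMS is our own setting name, not a Scrapy built-in:

LOG_SCRAPED_ITEMS = False  # read by CustomLogFormatter; set True to log scraped items again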

We can reduce logging noise from botocore (used by the S3-compatible file storage) by adding this code to the Scrapy project's spider module:

import logging

# Raise the botocore logger from DEBUG to INFO to quiet its output
logging.getLogger('botocore').setLevel(logging.INFO)

Note that Python loggers are global, so this takes effect for the whole process once the spider module is imported; it is not limited to that one spider file.

Deploy

To deploy a Scrapy project using Scrapyd and Scrapyd-client:

  1. Install Scrapyd and Scrapyd-client.
  2. Create one or more deployment targets in the scrapy.cfg file.
  3. Deploy the project to Scrapyd.

Install both packages:

pip install scrapyd
pip install scrapyd-client

Define the deployment targets in scrapy.cfg:

[deploy:machine1]
url = http://192.168.1.10:6800/
project = myproject

[deploy:machine2]
url = http://192.168.1.11:6800/
project = myproject

Then deploy to every target at once:

scrapyd-deploy -a -p myproject # -a deploys to all targets defined in scrapy.cfg
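
Once the project is deployed, Scrapyd's HTTP JSON API can schedule crawls. A hedged sketch using the requests library (assumed to be installed; host, project, and spider names are placeholders):

import requests

# schedule.json starts a spider run on the given Scrapyd instance.
resp = requests.post(
    'http://192.168.1.10:6800/schedule.json',
    data={'project': 'myproject', 'spider': 'myspider'},
)
print(resp.json())  # e.g. {'status': 'ok', 'jobid': '...'}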

Distributed

To use scrapy-redis for distributed Scrapy projects:

  1. Install scrapy-redis by running pip install scrapy-redis.
  2. Update the settings.py file to include scrapy-redis specific settings, as shown below.

# Enables scheduling storing requests queue in redis.
SCHEDULER = "scrapy_redis.scheduler.Scheduler"

# Ensure all spiders share same duplicates filter through redis.
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

# Specify the host and port to use when connecting to Redis (optional).
REDIS_HOST = 'localhost'
REDIS_PORT = 6379

# Specify the Redis database to use (optional).
REDIS_DB = 0

# Specify the Redis password to use (optional).
REDIS_PARAMS = {
    'password': 'your_password_here'
}

# Enables stats shared based on Redis
STATS_CLASS = 'scrapy_redis.stats.RedisStatsCollector'

Instead of hard-coding start_urls, a RedisSpider reads them from a Redis list and uses the make_request_from_data method to turn each value it pops into a request. That means we need to push some start URLs to Redis (see the example after the spider below).

If redis_key is not defined, RedisSpider falls back to <spider_name>:start_urls as the key from which to read start URLs.

import scrapy
from scrapy_redis.spiders import RedisSpider

class MySpider(RedisSpider):
    name = 'myspider'
    redis_key = 'myspider:start_urls'

    def make_request_from_data(self, data):
        url = data.decode('utf-8')
        return scrapy.Request(url, dont_filter=True)
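
To feed the spider, push URLs onto that Redis list. A hedged sketch using the redis-py client (installed as a scrapy-redis dependency); host, port, and URL are placeholders:

import redis

r = redis.Redis(host='localhost', port=6379, db=0)
# Each value pushed onto the list becomes one request via make_request_from_data.
r.lpush('myspider:start_urls', 'https://example.toscrape.com/page/1/')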

Every machine that runs the project needs to have the same Scrapy and project packages installed, such as scrapy, scrapyd, scrapy-redis, and pymongo, among others.

Now, deploy the project to all machines and run the spiders.