Scrapy project — A simple guide
A simple guide to Scrapy covering project structure, persistence, deployment, distributed crawling, and more.
Install
pip install scrapy
If you run into network issues, try using a mirror by passing the --index-url flag to the pip install command.
pip install scrapy --index-url https://pypi.tuna.tsinghua.edu.cn/simple
This uses the Tsinghua mirror for this installation only; keep in mind you will need to pass the --index-url flag every time you want to use it.
Create a project
To create a Scrapy project, use the scrapy startproject command.
scrapy startproject project_name
This will create a new directory called project_name with the basic structure of a Scrapy project. From there, you can start adding spiders, items, and pipelines to your project.
Here’s the basic structure of a Scrapy project:
project_name/
    scrapy.cfg
    project_name/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            spider1.py
            spider2.py
            ...
- scrapy.cfg is the configuration file for the project.
- items.py defines the data structure for the scraped items.
- middlewares.py contains middleware classes that can be used to modify requests and responses.
- pipelines.py contains pipeline classes that process scraped items.
- settings.py contains project settings.
- spiders/ is a directory where you can put your spiders.
Spiders
Each spider is defined in its own module under the spiders/ directory. The spider module should define a class that extends scrapy.Spider and provides the spider’s name and start URL(s). The spider’s logic should be implemented in the parse method.
import scrapy

class ExampleSpider(scrapy.Spider):
    name = "example"
    start_urls = [
        'https://example.toscrape.com/page/1/',
        'https://example.toscrape.com/page/2/',
    ]

    def parse(self, response):
        # yield items or further requests here
        pass
In the parse method, we can yield item data structured like what we define in items.py, or yield further scrapy.Request objects to follow links.
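For example, here is a minimal sketch against the quotes.toscrape.com demo site (the CSS selectors are illustrative and tied to that site's markup):

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ['https://quotes.toscrape.com/page/1/']

    def parse(self, response):
        # yield one item (here a plain dict) per quote block
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
            }
        # yield a Request to follow the pagination link
        next_page = response.css('li.next a::attr(href)').get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)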
Items
Define the data structure for the scraped items:
Using scrapy.Item:
import scrapy

class ProductItem(scrapy.Item):
    name = scrapy.Field()
    price = scrapy.Field()
    description = scrapy.Field()
    brand = scrapy.Field()
We define a ProductItem class that extends scrapy.Item. It has four fields: name, price, description, and brand. These fields will be used to store the scraped data.
Using dataclass:
from dataclasses import dataclass

@dataclass
class Product:
    name: str
    price: float
    description: str
    brand: str
Alternatively, we can define the item as a dataclass, with the field types specified using type hints.
Both of these approaches will work in a Scrapy project. The choice between them depends on personal preference and project requirements.
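Whichever form you choose, pipelines can access the fields uniformly through itemadapter, which ships as a Scrapy dependency; a small sketch:

from dataclasses import dataclass
from itemadapter import ItemAdapter

@dataclass
class Product:
    name: str
    price: float

# ItemAdapter gives dict-like access regardless of the underlying item type
adapter = ItemAdapter(Product(name="Widget", price=9.99))
print(adapter["name"], adapter.get("price"))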
Pipelines
MongoDBPipeline
To store data in MongoDB with Scrapy, we define an item pipeline class. An item pipeline is a plain Python class that implements a process_item method; no base class is required.
import pymongo

class MongoDBPipeline:
    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DATABASE')
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        collection_name = spider.name
        # insert() is deprecated in pymongo; use insert_one()
        self.db[collection_name].insert_one(dict(item))
        return item
Our MongoDBPipeline class uses the pymongo library to connect to a MongoDB database and store the scraped data.
The __init__ method takes two arguments: mongo_uri and mongo_db. These arguments are used to configure the MongoDB connection.
The from_crawler method is a class method that creates a new instance of the pipeline class. It reads the MONGO_URI and MONGO_DATABASE settings from the Scrapy settings.
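MONGO_URI and MONGO_DATABASE are not built-in Scrapy settings, so define them yourself in settings.py (the values below are placeholders):

# settings.py
MONGO_URI = 'mongodb://localhost:27017'
MONGO_DATABASE = 'scrapy_data'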
Alternatively, we can use the get_project_settings function from scrapy.utils.project to read the MONGO_URI and MONGO_DATABASE settings.

from scrapy.utils.project import get_project_settings

settings = get_project_settings()  # returns a Settings object with dict-like access
mongo_uri = settings.get('MONGO_URI')
mongo_db = settings.get('MONGO_DATABASE')
The open_spider method is called when the spider is opened. It creates a new MongoDB client and connects to the specified database.
The close_spider method is called when the spider is closed. It closes the MongoDB client.
The process_item method is called for each item that is scraped. It inserts the item into the appropriate collection in the database.
If you would rather upsert documents and record when each one was first seen, modify the process_item method in the MongoDBPipeline class, e.g.:
import datetime

def process_item(self, item, spider):
    collection_name = spider.name
    timestamp = datetime.datetime.utcnow()
    doc = dict(item)
    # keep _id out of $set: it is immutable and is already used in the filter
    doc_id = doc.pop('_id')
    self.db[collection_name].update_one(
        {'_id': doc_id},
        {'$setOnInsert': {'timestamp': timestamp}, '$set': doc},
        upsert=True
    )
    return item
Here, I use the update_one method to update the document if it already exists or insert it if it doesn’t. The $setOnInsert operator sets the timestamp field only on the first insert. The $set operator sets all other fields in the document.
Note that the item must carry an _id field, since that is what the filter matches on; if your items have no natural _id, stick with insert_one and let MongoDB generate an ObjectId for each document.
Now, each document that is inserted into the database will have a timestamp field that indicates when it was first inserted, but the field will not be updated on subsequent updates to the document.
MinioFilesPipeline
To store downloaded files on a MinIO server, we can reuse Scrapy's S3 file storage, since MinIO speaks the S3 API.
First, pip install botocore, which the S3 storage backend requires.
Then we create a new pipeline class that extends scrapy.pipelines.files.FilesPipeline, e.g.:
import scrapy
from itemadapter import ItemAdapter
from scrapy.pipelines.files import FilesPipeline

class MinioFilesPipeline(FilesPipeline):
    def get_media_requests(self, item, info):
        adapter = ItemAdapter(item)
        for file_url in adapter['file_urls']:
            # carry the desired object name in meta so file_path can use it
            yield scrapy.Request(file_url, meta={'filename': file_url.split('/')[-1]})

    def file_path(self, request, response=None, info=None, *, item=None):
        return request.meta['filename']

    def item_completed(self, results, item, info):
        return item
We also need to configure the settings for the MinioFilesPipeline in the settings.py file. Configure the target storage setting FILES_STORE to a valid value that will be used for storing the downloaded file. Otherwise, the pipeline will remain disabled, even if you include it in the ITEM_PIPELINES setting.
FILES_STORE = 's3://your-bucket-name'
MINIO_ENDPOINT = 'localhost:9000'
MINIO_ACCESS_KEY = 'access_key'
MINIO_SECRET_KEY = 'secret_key'
MINIO_SECURE = False
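Note that the MINIO_* keys above are custom settings of this project. Scrapy's built-in S3 storage, which FilesPipeline uses when FILES_STORE starts with s3://, reads its endpoint and credentials from the AWS_* settings instead, so a configuration relying on the built-in backend might look like this (values are placeholders):

# settings.py -- sketch assuming the built-in S3 storage backend
FILES_STORE = 's3://your-bucket-name'
AWS_ENDPOINT_URL = 'http://localhost:9000'  # MinIO endpoint
AWS_ACCESS_KEY_ID = 'access_key'
AWS_SECRET_ACCESS_KEY = 'secret_key'
AWS_USE_SSL = False
AWS_VERIFY = False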
To use those pipelines in our Scrapy project, we need to add them to the ITEM_PIPELINES setting.
ITEM_PIPELINES = {
    'myproject.pipelines.MongoDBPipeline': 300,
    'myproject.pipelines.MinioFilesPipeline': 400,
}
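For FilesPipeline (and therefore MinioFilesPipeline) to pick anything up, the item must carry a file_urls field with the URLs to download; the pipeline writes the download results into a files field. A minimal item sketch:

import scrapy

class DownloadItem(scrapy.Item):
    file_urls = scrapy.Field()  # input: list of URLs to download
    files = scrapy.Field()      # output: populated by the pipeline with download results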
Log
Configure logging by modifying the LOG_* settings (such as LOG_ENABLED, LOG_LEVEL, and LOG_FILE) in our settings.py file.
# Logging settings
import os
import pathlib
from datetime import datetime

now = datetime.now()
log_file_path = f"logs/log_{now.year}_{now.month}_{now.day}_{now.hour}_{now.minute}_{now.second}.log"

# If you store log files in a directory, check that the directory exists first.
base_path = pathlib.Path().resolve()  # it should be like "path/to/project/"
if not os.path.exists(f"{base_path}/logs/"):
    os.mkdir(f"{base_path}/logs/")
# Use os.makedirs() to create multiple levels of nested directories if needed,
# e.g. 'path/to/project/project/logs/'.

LOG_FORMATTER = "project.util.CustomLogFormatter"
LOG_ENABLED = True
LOG_ENCODING = "utf-8"
LOG_FILE = log_file_path
LOG_LEVEL = "DEBUG"
LOG_STDOUT = False
I want to keep LOG_LEVEL = "DEBUG" but stop scraped items from being written to the log, so I use a custom log formatter to exclude them. Remember to point the LOG_FORMATTER value in settings.py at this class.
from scrapy.utils.log import LogFormatter

class CustomLogFormatter(LogFormatter):
    def scraped(self, item, response, spider):
        # Use the LOG_SCRAPED_ITEMS setting to control whether scraped items are logged.
        return (
            super().scraped(item, response, spider)
            if spider.settings.getbool("LOG_SCRAPED_ITEMS")
            else None
        )
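LOG_SCRAPED_ITEMS is not a built-in Scrapy setting, so define it yourself in settings.py; getbool returns False when the key is absent, so items are suppressed by default:

# settings.py
LOG_SCRAPED_ITEMS = False  # custom flag read by CustomLogFormatter above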
Reduce logging noise from boto3 and botocore (used by the S3 storage behind MinioFilesPipeline) by adding this code to the spider module:

import logging

# Raise the boto3/botocore log level to INFO to silence their DEBUG output
logging.getLogger('boto3').setLevel(logging.INFO)
logging.getLogger('botocore').setLevel(logging.INFO)

Note that this takes effect only once the module containing it is imported; to apply it across the whole project, put it somewhere that always runs, such as settings.py or a custom extension.
Deploy
To deploy a Scrapy project using Scrapyd and Scrapyd-client:
- Install Scrapyd and Scrapyd-client.
- Create one or more deployment targets in the scrapy.cfg file.
- Deploy the project to Scrapyd.
pip install scrapyd
pip install scrapyd-client
[deploy:machine1]
url = http://192.168.1.10:6800/
project = myproject
[deploy:machine2]
url = http://192.168.1.11:6800/
project = myproject
scrapyd-deploy machine1    # deploy to a single target
scrapyd-deploy -a          # deploy to all targets defined in scrapy.cfg
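Once deployed, spiders are started through Scrapyd's HTTP JSON API. A minimal sketch using only the standard library (the host, project, and spider names are the placeholders used above):

import urllib.parse
import urllib.request

# Schedule a run of the 'example' spider from project 'myproject' on one Scrapyd node
data = urllib.parse.urlencode({'project': 'myproject', 'spider': 'example'}).encode()
with urllib.request.urlopen('http://192.168.1.10:6800/schedule.json', data=data) as resp:
    print(resp.read().decode())  # JSON response containing the job id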
Distributed
To use scrapy-redis for distributed Scrapy projects:
- Install scrapy-redis by running pip install scrapy-redis.
- Update the settings.py file to include scrapy-redis specific settings.
# Enables scheduling storing requests queue in redis.
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
# Ensure all spiders share same duplicates filter through redis.
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
# Specify the host and port to use when connecting to Redis (optional).
REDIS_HOST = 'localhost'
REDIS_PORT = 6379
# Specify the Redis database to use (optional).
REDIS_DB = 0
# Specify the Redis password to use (optional).
REDIS_PARAMS = {
    'password': 'your_password_here'
}
# Enables stats shared based on Redis
STATS_CLASS = 'scrapy_redis.stats.RedisStatsCollector'
The spider turns each raw entry popped from Redis into a request via the make_request_from_data method, so we also need to push some start URLs to Redis (see the sketch after the spider below).
If redis_key is not defined, RedisSpider falls back to spider_name:start_urls as the Redis key to read start URLs from.
import scrapy
from scrapy_redis.spiders import RedisSpider

class MySpider(RedisSpider):
    name = 'myspider'
    redis_key = 'myspider:start_urls'

    def make_request_from_data(self, data):
        # data is the raw bytes value popped from the Redis list
        url = data.decode('utf-8')
        return scrapy.Request(url, dont_filter=True)
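To kick off the crawl, push one or more start URLs onto that Redis key, for example with redis-py (which scrapy-redis depends on); the host, key, and URL below are placeholders:

import redis

r = redis.Redis(host='localhost', port=6379, db=0, password='your_password_here')
# Each entry pushed to myspider:start_urls is handed to make_request_from_data
# on whichever worker pops it.
r.lpush('myspider:start_urls', 'https://example.com/page/1/')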
Every machine that runs the project needs to have the same Scrapy and project packages installed, such as scrapy, scrapyd, scrapy-redis, and pymongo, among others.
Now, deploy the project to all machines and run the spiders.