Scrapy project — A simple guide
A simple guide to Scrapy, covering project structure, persistence, deployment, distributed crawling, and more.
Install
pip install scrapy
If you run into network issues, try using another package index by passing the --index-url flag to the pip install command:
pip install scrapy --index-url https://pypi.tuna.tsinghua.edu.cn/simple
This uses the Tsinghua mirror for this installation only; you will need to pass the --index-url flag every time.
Create a project
To create a Scrapy project, use the scrapy startproject command:
scrapy startproject project_name
This will create a new directory called project_name with the basic structure of a Scrapy project. From there, you can start adding spiders, items, and pipelines to your project.
Here’s the basic structure of a Scrapy project:
project_name/
    scrapy.cfg
    project_name/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            spider1.py
            spider2.py
            ...
- scrapy.cfg is the configuration file for the project.
- items.py defines the data structure for the scraped items.
- middlewares.py contains middleware classes that can be used to modify requests and responses.
- pipelines.py contains pipeline classes that process scraped items.
- settings.py contains the project settings.
- spiders/ is the directory where you put your spiders.
Spiders
Each spider is defined in its own module under the spiders/ directory. The spider module should define a class that extends scrapy.Spider and provides the spider’s name and start URL(s). The spider’s logic should be implemented in the parse method.
import scrapy


class ExampleSpider(scrapy.Spider):
    name = "example"
    start_urls = [
        'https://example.toscrape.com/page/1/',
        'https://example.toscrape.com/page/2/',
    ]

    def parse(self, response):
        # yield items or follow-up requests here
        pass
In the parse method, we can yield item data structured like what we define in items.py, or yield further scrapy.Request objects.
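For example, a parse method might look like this (a minimal sketch; the CSS selectors and the pagination link are assumptions about the target page):
def parse(self, response):
    for product in response.css('div.product'):
        # yield structured data whose keys match the fields defined in items.py
        yield {
            'name': product.css('h2::text').get(),
            'price': product.css('span.price::text').get(),
        }
    # follow pagination by yielding another Request
    next_page = response.css('a.next::attr(href)').get()
    if next_page is not None:
        yield response.follow(next_page, callback=self.parse)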
Items
Define the data structure for the scraped items:
Using scrapy.Item:
import scrapy


class ProductItem(scrapy.Item):
    name = scrapy.Field()
    price = scrapy.Field()
    description = scrapy.Field()
    brand = scrapy.Field()
We define a ProductItem class that extends scrapy.Item. It has four fields: name, price, description, and brand. These fields will be used to store the scraped data.
Using dataclass:
from dataclasses import dataclass


@dataclass
class Product:
    name: str
    price: float
    description: str
    brand: str
Alternatively, we can define the item as a dataclass, with the field types specified using type hints.
Both of these approaches will work in a Scrapy project. The choice between them depends on personal preference and project requirements.
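For illustration, either form can be populated and yielded from a spider’s parse method (a sketch that reuses the ProductItem and Product classes above; the selectors and attributes are placeholders):
def parse(self, response):
    # scrapy.Item subclass: fields are assigned like dictionary keys
    item = ProductItem()
    item['name'] = response.css('h1::text').get()
    item['price'] = response.css('span.price::text').get()
    item['description'] = response.css('div.description::text').get()
    item['brand'] = response.css('span.brand::text').get()
    yield item

    # dataclass: fields are passed to the constructor
    yield Product(
        name=response.css('h1::text').get() or '',
        price=float(response.css('span.price::attr(data-value)').get() or 0),
        description=response.css('div.description::text').get() or '',
        brand=response.css('span.brand::text').get() or '',
    )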
Pipelines
MongoDBPipeline
To store data in MongoDB with a Scrapy pipeline, we define an item pipeline class. An item pipeline is a plain Python class that implements process_item; it does not need to extend any base class.
import pymongo


class MongoDBPipeline:
    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DATABASE')
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        collection_name = spider.name
        # insert_one replaces the deprecated Collection.insert
        self.db[collection_name].insert_one(dict(item))
        return item
We define a MongoDBPipeline class that uses the pymongo library to connect to a MongoDB database and store the scraped data.
The __init__ method takes two arguments, mongo_uri and mongo_db, which are used to configure the MongoDB connection.
The from_crawler class method creates a new instance of the pipeline class. It reads the MONGO_URI and MONGO_DATABASE settings from the Scrapy settings.
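These settings live in settings.py; the values below are placeholders for a local MongoDB instance:
MONGO_URI = 'mongodb://localhost:27017'
MONGO_DATABASE = 'scrapy_data'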
Alternatively, we can use the get_project_settings function from the scrapy.utils.project module to read the MONGO_URI and MONGO_DATABASE settings:
from scrapy.utils.project import get_project_settings

settings = get_project_settings()  # returns a Settings object that behaves like a dict
mongo_uri = settings.get('MONGO_URI')
mongo_db = settings.get('MONGO_DATABASE')
The open_spider method is called when the spider is opened. It creates a new MongoDB client and connects to the specified database.
The close_spider method is called when the spider is closed. It closes the MongoDB client.
The process_item method is called for each item that is scraped. It inserts the item into the appropriate collection in the database.
If you want to upsert documents and record when each one was first inserted, modify the process_item method in the MongoDBPipeline class, e.g.:
import datetime

def process_item(self, item, spider):
    collection_name = spider.name
    timestamp = datetime.datetime.utcnow()
    item_dict = dict(item)
    # keep _id out of $set; the _id field is immutable in MongoDB
    doc_id = item_dict.pop('_id')
    self.db[collection_name].update_one(
        {'_id': doc_id},
        {'$setOnInsert': {'timestamp': timestamp}, '$set': item_dict},
        upsert=True
    )
    return item
Here, I use the update_one method to update the document if it already exists or insert it if it doesn’t. The $setOnInsert operator sets the timestamp field only on the first insert. The $set operator sets the remaining fields of the document.
Note that the item must carry a stable _id value, since both the filter and the upsert rely on it; without one, repeated crawls cannot be matched to the same document.
Now each document inserted into the database has a timestamp field recording when it was first inserted, and that field is not changed on subsequent updates.
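One way to guarantee a stable _id is to derive it from a unique attribute of the item before yielding it, for example a hash of the page URL (a sketch; the url and name fields are illustrative):
import hashlib

def parse(self, response):
    item = {
        'url': response.url,
        'name': response.css('h1::text').get(),
    }
    # a stable _id means repeated crawls upsert the same document
    item['_id'] = hashlib.sha1(response.url.encode('utf-8')).hexdigest()
    yield item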
MinioFilesPipeline
To store files on a MinIO server using a pipeline in Scrapy, first install botocore:
pip install botocore
Then create a pipeline class that extends scrapy.pipelines.files.FilesPipeline, e.g.:
import scrapy
from itemadapter import ItemAdapter
from scrapy.pipelines.files import FilesPipeline


class MinioFilesPipeline(FilesPipeline):
    def get_media_requests(self, item, info):
        adapter = ItemAdapter(item)
        for file_url in adapter['file_urls']:
            # carry the desired file name in the request meta so file_path() can use it
            yield scrapy.Request(file_url, meta={'filename': file_url.split('/')[-1]})

    def file_path(self, request, response=None, info=None, *, item=None):
        return request.meta['filename']

    def item_completed(self, results, item, info):
        # keep only successful downloads and expose them on the item's files field
        ItemAdapter(item)['files'] = [file_info for ok, file_info in results if ok]
        return item
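FilesPipeline expects each item to carry a file_urls field with the URLs to download and a files field where the download results are stored, for example:
import scrapy

class FileItem(scrapy.Item):
    # FilesPipeline reads URLs from file_urls and writes download results to files
    file_urls = scrapy.Field()
    files = scrapy.Field()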
We also need to configure the settings for the MinioFilesPipeline in the settings.py file. Set the target storage setting FILES_STORE to a valid value that will be used for storing the downloaded files; otherwise the pipeline will remain disabled, even if you include it in the ITEM_PIPELINES setting.
FILES_STORE = 's3://your-bucket-name/'
# Scrapy's built-in S3 storage (used for s3:// URIs) reads the AWS_* settings,
# so point them at the MinIO server instead of AWS:
AWS_ENDPOINT_URL = 'http://localhost:9000'
AWS_ACCESS_KEY_ID = 'access_key'
AWS_SECRET_ACCESS_KEY = 'secret_key'
To use these pipelines in our Scrapy project, we need to add them to the ITEM_PIPELINES setting:
ITEM_PIPELINES = {
    'myproject.pipelines.MongoDBPipeline': 300,
    'myproject.pipelines.MinioFilesPipeline': 400,
}
Log
Modify the LOG_ENABLED, LOG_LEVEL, and LOG_FILE settings in the settings.py file:
# Logging settings
import os.path
import pathlib
from datetime import datetime

now = datetime.now()
log_file_path = f"logs/log_{now.year}_{now.month}_{now.day}_{now.hour}_{now.minute}_{now.second}.log"

# If you store log files in a directory, make sure the directory exists first.
base_path = pathlib.Path().resolve()  # should be something like "path/to/project/"
if not os.path.exists(f"{base_path}/logs/"):
    os.mkdir(f"{base_path}/logs/")
# Use os.makedirs() instead if you need several nested levels,
# e.g. 'path/to/project/project/logs/'

LOG_FORMATTER = "project.util.CustomLogFormatter"
LOG_ENABLED = True
LOG_ENCODING = "utf-8"
LOG_FILE = log_file_path
LOG_LEVEL = "DEBUG"
LOG_STDOUT = False
To keep LOG_LEVEL = "DEBUG" but stop scraped items from being written to the log, I use a custom log formatter that excludes them. Remember to point the LOG_FORMATTER setting in settings.py at this class:
from scrapy.logformatter import LogFormatter


class CustomLogFormatter(LogFormatter):
    def scraped(self, item, response, spider):
        # LOG_SCRAPED_ITEMS is a custom boolean setting that controls whether
        # scraped items are logged; returning None skips the log entry
        return (
            super().scraped(item, response, spider)
            if spider.settings.getbool("LOG_SCRAPED_ITEMS")
            else None
        )
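In settings.py, LOG_SCRAPED_ITEMS is then just a custom boolean setting, e.g.:
LOG_FORMATTER = "project.util.CustomLogFormatter"  # adjust the module path to your own project
LOG_SCRAPED_ITEMS = False  # custom setting read by CustomLogFormatter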
Reduce the logging output from boto3 (used by the MinioFilesPipeline) by adding this code to a spider module in your Scrapy project:
import logging

# Raise the boto3 log level to INFO so its DEBUG messages are suppressed
logging.getLogger('boto3').setLevel(logging.INFO)
Note that this only takes effect when that spider module is loaded, so it does not automatically apply to the rest of the Scrapy project.
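If you want the same effect for every run of a spider rather than relying on a module-level statement, one option (a sketch that also quiets the related botocore and urllib3 loggers) is to set the levels in the spider's __init__:
import logging
import scrapy


class ExampleSpider(scrapy.Spider):
    name = "example"

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # logger levels are process-wide, so this quiets these libraries
        # for the whole crawl once the spider is created
        for noisy in ('boto3', 'botocore', 'urllib3'):
            logging.getLogger(noisy).setLevel(logging.INFO)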
Deploy
To deploy a Scrapy project using Scrapyd and Scrapyd-client:
- Install Scrapyd and Scrapyd-client.
- Create deployment targets in the scrapy.cfg file.
- Deploy the project to Scrapyd.
pip install scrapyd
pip install scrapyd-client

[deploy:machine1]
url = http://192.168.1.10:6800/
project = myproject

[deploy:machine2]
url = http://192.168.1.11:6800/
project = myproject
scrapyd-deploy -a  # deploy the project to all targets defined in scrapy.cfg
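Once deployed, spiders can be started through Scrapyd’s schedule.json endpoint. A minimal sketch using the requests package (the host, project, and spider names are the placeholders used above):
import requests

# schedule a run of "myspider" from the "myproject" project on one Scrapyd host
response = requests.post(
    'http://192.168.1.10:6800/schedule.json',
    data={'project': 'myproject', 'spider': 'myspider'},
)
print(response.json())  # e.g. {'status': 'ok', 'jobid': '...'}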
Distributed
To use scrapy-redis for distributed Scrapy projects:
- Install scrapy-redis by running pip install scrapy-redis.
- Update the settings.py file to include the scrapy-redis specific settings.
# Enables scheduling storing requests queue in redis.
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
# Ensure all spiders share same duplicates filter through redis.
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
# Specify the host and port to use when connecting to Redis (optional).
REDIS_HOST = 'localhost'
REDIS_PORT = 6379
# Specify the Redis database to use (optional).
REDIS_DB = 0
# Specify the Redis password to use (optional).
REDIS_PARAMS = {
    'password': 'your_password_here'
}
# Enables stats shared based on Redis
STATS_CLASS = 'scrapy_redis.stats.RedisStatsCollector'
We use the make_request_from_data method to turn the data popped from Redis into requests, so we need to push some start_urls to Redis.
If redis_key is not defined, RedisSpider falls back to spider_name:start_urls as the key it reads start_urls from.
import scrapy
from scrapy_redis.spiders import RedisSpider


class MySpider(RedisSpider):
    name = 'myspider'
    redis_key = 'myspider:start_urls'

    def make_request_from_data(self, data):
        # data is the raw bytes value popped from the Redis list
        url = data.decode('utf-8')
        return scrapy.Request(url, dont_filter=True)
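To feed the crawl, push start URLs onto that Redis key. A minimal sketch using the redis package (connection details match the settings above; the URL is a placeholder):
import redis

r = redis.Redis(host='localhost', port=6379, db=0, password='your_password_here')
# RedisSpider pops entries from this list and passes them to make_request_from_data
r.lpush('myspider:start_urls', 'https://example.toscrape.com/page/1/')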
Every machine that runs the project needs to have the same Scrapy and project packages installed, such as scrapy, scrapyd, scrapy-redis, and pymongo, among others.
Now, deploy the project to all machines and run the spiders.