Python Scrapy Example (web crawling framework)

scrapy install

install

Python 3.11 has issues; Python 3.10 works fine.

# create the virtualenv myenv10_scrapy (run once)
rem cd \app\python_env\
rem py -3.10 -m virtualenv myenv10_scrapy
# install packages
pip install scrapy
pip install pylint
pip install autopep8
pip install ipython
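
A quick sanity check that the environment works (prints the installed Scrapy version):

python -c "import scrapy; print(scrapy.__version__)"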

scrapy version & help

# show version & help
(myenv10_scrapy) D:\work\git\python_crawler>scrapy
Scrapy 2.7.1 - no active project

Usage:
scrapy <command> [options] [args]

Available commands:
bench Run quick benchmark test
commands
fetch Fetch a URL using the Scrapy downloader
genspider Generate new spider using pre-defined templates
runspider Run a self-contained spider (without creating a project)
settings Get settings values
shell Interactive scraping console
startproject Create new project
version Print Scrapy version
view Open URL in browser, as seen by Scrapy

[ more ] More commands available when run from project directory

Use "scrapy <command> -h" to see more info about a command

(myenv10_scrapy) D:\work\git\python_crawler>

scrapy bench (simple test)

(myenv10_scrapy) D:\work\git\python_crawler>scrapy bench
2022-12-02 22:22:22 [scrapy.utils.log] INFO: Scrapy 2.7.1 started (bot: scrapybot)
2022-12-02 22:22:22 [scrapy.utils.log] INFO: Versions: lxml 4.9.1.0, libxml2 2.9.12, cssselect 1.2.0, parsel 1.7.0, w3lib 2.1.0, Twisted 22.10.0, Python 3.10.8 (tags/v3.10.8:aaaf517, Oct 11 2022, 16:50:30) [MSC v.1933 64 bit (AMD64)], pyOpenSSL 22.1.0 (OpenSSL 3.0.7 1 Nov 2022), cryptography 38.0.4, Platform Windows-10-10.0.19045-SP0
2022-12-02 22:22:23 [scrapy.crawler] INFO: Overridden settings:
{'CLOSESPIDER_TIMEOUT': 10, 'LOGSTATS_INTERVAL': 1, 'LOG_LEVEL': 'INFO'}
2022-12-02 22:22:23 [py.warnings] WARNING: D:\app\python_env\myenv10_scrapy\lib\site-packages\scrapy\utils\request.py:231: ScrapyDeprecationWarning: '2.6' is a deprecated value for the 'REQUEST_FINGERPRINTER_IMPLEMENTATION' setting.

It is also the default value. In other words, it is normal to get this warning if you have not defined a value for the 'REQUEST_FINGERPRINTER_IMPLEMENTATION' setting. This is so for backward compatibility reasons, but it will change in a future version of Scrapy.

See the documentation of the 'REQUEST_FINGERPRINTER_IMPLEMENTATION' setting for information on how to handle this deprecation.
return cls(crawler)

2022-12-02 22:22:24 [scrapy.extensions.telnet] INFO: Telnet Password: e7858b3f4e0433a6
2022-12-02 22:22:24 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.closespider.CloseSpider',
'scrapy.extensions.logstats.LogStats']
2022-12-02 22:22:24 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2022-12-02 22:22:24 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2022-12-02 22:22:25 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2022-12-02 22:22:25 [scrapy.core.engine] INFO: Spider opened
2022-12-02 22:22:25 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-12-02 22:22:25 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2022-12-02 22:22:26 [scrapy.extensions.logstats] INFO: Crawled 61 pages (at 3660 pages/min), scraped 0 items (at 0 items/min)
2022-12-02 22:22:27 [scrapy.extensions.logstats] INFO: Crawled 109 pages (at 2880 pages/min), scraped 0 items (at 0 items/min)
2022-12-02 22:22:28 [scrapy.extensions.logstats] INFO: Crawled 157 pages (at 2880 pages/min), scraped 0 items (at 0 items/min)
2022-12-02 22:22:29 [scrapy.extensions.logstats] INFO: Crawled 205 pages (at 2880 pages/min), scraped 0 items (at 0 items/min)
2022-12-02 22:22:30 [scrapy.extensions.logstats] INFO: Crawled 261 pages (at 3360 pages/min), scraped 0 items (at 0 items/min)
2022-12-02 22:22:31 [scrapy.extensions.logstats] INFO: Crawled 293 pages (at 1920 pages/min), scraped 0 items (at 0 items/min)
2022-12-02 22:22:32 [scrapy.extensions.logstats] INFO: Crawled 333 pages (at 2400 pages/min), scraped 0 items (at 0 items/min)
2022-12-02 22:22:33 [scrapy.extensions.logstats] INFO: Crawled 365 pages (at 1920 pages/min), scraped 0 items (at 0 items/min)
2022-12-02 22:22:34 [scrapy.extensions.logstats] INFO: Crawled 413 pages (at 2880 pages/min), scraped 0 items (at 0 items/min)
2022-12-02 22:22:35 [scrapy.core.engine] INFO: Closing spider (closespider_timeout)
2022-12-02 22:22:35 [scrapy.extensions.logstats] INFO: Crawled 445 pages (at 1920 pages/min), scraped 0 items (at 0 items/min)
2022-12-02 22:22:36 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 192019,
'downloader/request_count': 461,
'downloader/request_method_count/GET': 461,
'downloader/response_bytes': 1277946,
'downloader/response_count': 461,
'downloader/response_status_count/200': 461,
'elapsed_time_seconds': 10.698746,
'finish_reason': 'closespider_timeout',
'finish_time': datetime.datetime(2022, 12, 2, 14, 22, 36, 32014),
'log_count/INFO': 20,
'log_count/WARNING': 1,
'request_depth_max': 16,
'response_received_count': 461,
'scheduler/dequeued': 461,
'scheduler/dequeued/memory': 461,
'scheduler/enqueued': 9220,
'scheduler/enqueued/memory': 9220,
'start_time': datetime.datetime(2022, 12, 2, 14, 22, 25, 333268)}
2022-12-02 22:22:36 [scrapy.core.engine] INFO: Spider closed (closespider_timeout)

scrapy fetch (fetch a web page)

(myenv10_scrapy) D:\work\git\python_crawler>scrapy fetch https://www.google.com
2022-12-02 22:29:41 [scrapy.utils.log] INFO: Scrapy 2.7.1 started (bot: scrapybot)
2022-12-02 22:29:41 [scrapy.utils.log] INFO: Versions: lxml 4.9.1.0, libxml2 2.9.12, cssselect 1.2.0, parsel 1.7.0, w3lib 2.1.0, Twisted 22.10.0, Python 3.10.8 (tags/v3.10.8:aaaf517, Oct 11 2022, 16:50:30) [MSC v.1933 64 bit (AMD64)], pyOpenSSL 22.1.0 (OpenSSL 3.0.7 1 Nov 2022), cryptography 38.0.4, Platform Windows-10-10.0.19045-SP0
2022-12-02 22:29:41 [scrapy.crawler] INFO: Overridden settings:
{}
2022-12-02 22:29:41 [py.warnings] WARNING: D:\app\python_env\myenv10_scrapy\lib\site-packages\scrapy\utils\request.py:231: ScrapyDeprecationWarning: '2.6' is a deprecated value for the 'REQUEST_FINGERPRINTER_IMPLEMENTATION' setting.

It is also the default value. In other words, it is normal to get this warning if you have not defined a value for the 'REQUEST_FINGERPRINTER_IMPLEMENTATION' setting. This is so for backward compatibility reasons, but it will change in a future version of Scrapy.

See the documentation of the 'REQUEST_FINGERPRINTER_IMPLEMENTATION' setting for information on how to handle this deprecation.
return cls(crawler)

......

run scrapy for worldometers

create project

(myenv10_scrapy) D:\work\git\python_crawler\101-scrapy>scrapy startproject worldmeters
New Scrapy project 'worldmeters', using template directory 'D:\app\python_env\myenv10_scrapy\lib\site-packages\scrapy\templates\project', created in:
D:\work\git\python_crawler\101-scrapy\worldmeters

You can start your first spider with:
cd worldmeters
scrapy genspider example example.com

create spider

# https://www.worldometers.info/world-population/population-by-country/
# genspider defaults to http:, and the trailing "/" is removed (scrapy adds it back automatically)

(myenv10_scrapy) D:\work\git\python_crawler\101-scrapy\worldmeters>scrapy genspider countries www.worldometers.info/world-population/population-by-country
Created spider 'countries' using template 'basic' in module:
worldmeters.spiders.countries
# worldmeters\worldmeters\spiders\countries.py
import scrapy


class CountriesSpider(scrapy.Spider):
    name = 'countries'
    allowed_domains = ['www.worldometers.info']
    # change http: to https:
    start_urls = ['https://www.worldometers.info']

    def parse(self, response):
        pass

scrapy shell

(myenv10_scrapy) D:\work\git\python_crawler\101-scrapy\worldmeters>scrapy shell
D:\app\python_env\myenv10_scrapy\lib\site-packages\scrapy\spiderloader.py:37: UserWarning: There are several spiders with the same name:

CountriesSpider named 'countries' (in worldmeters.spiders.countries - 複製)

CountriesSpider named 'countries' (in worldmeters.spiders.countries)

CountriesSpider named 'countries' (in worldmeters.spiders.countries_update)

This can cause unexpected behavior.
warnings.warn(
2022-12-03 10:04:03 [scrapy.utils.log] INFO: Scrapy 2.7.1 started (bot: worldmeters)
2022-12-03 10:04:03 [scrapy.utils.log] INFO: Versions: lxml 4.9.1.0, libxml2 2.9.12, cssselect 1.2.0, parsel 1.7.0, w3lib 2.1.0, Twisted 22.10.0, Python 3.10.8 (tags/v3.10.8:aaaf517, Oct 11 2022, 16:50:30) [MSC v.1933 64 bit (AMD64)], pyOpenSSL 22.1.0 (OpenSSL 3.0.7 1 Nov 2022), cryptography 38.0.4, Platform Windows-10-10.0.19045-SP0
2022-12-03 10:04:03 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'worldmeters',
'DUPEFILTER_CLASS': 'scrapy.dupefilters.BaseDupeFilter',
'LOGSTATS_INTERVAL': 0,
'NEWSPIDER_MODULE': 'worldmeters.spiders',
'REQUEST_FINGERPRINTER_IMPLEMENTATION': '2.7',
'ROBOTSTXT_OBEY': True,
'SPIDER_MODULES': ['worldmeters.spiders'],
'TWISTED_REACTOR': 'twisted.internet.asyncioreactor.AsyncioSelectorReactor'}
2022-12-03 10:04:03 [asyncio] DEBUG: Using selector: SelectSelector
2022-12-03 10:04:03 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.asyncioreactor.AsyncioSelectorReactor
2022-12-03 10:04:03 [scrapy.utils.log] DEBUG: Using asyncio event loop: asyncio.windows_events._WindowsSelectorEventLoop
2022-12-03 10:04:03 [scrapy.extensions.telnet] INFO: Telnet Password: 896a4cf5aba51b02
2022-12-03 10:04:03 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole']
2022-12-03 10:04:04 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2022-12-03 10:04:04 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2022-12-03 10:04:04 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2022-12-03 10:04:04 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2022-12-03 10:04:04 [asyncio] DEBUG: Using selector: SelectSelector
[s] Available Scrapy objects:
[s] scrapy scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s] crawler <scrapy.crawler.Crawler object at 0x0000029E3EC8F3D0>
[s] item {}
[s] settings <scrapy.settings.Settings object at 0x0000029E3EC8F430>
[s] Useful shortcuts:
[s] fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
[s] fetch(req) Fetch a scrapy.Request and update local objects
[s] shelp() Shell help (print this help)
[s] view(response) View response in a browser
2022-12-03 10:04:05 [asyncio] DEBUG: Using selector: SelectSelector
In [1]:

worldometers has no robots.txt (the request returns 404), so crawling is not restricted

DEBUG: Crawled (404) <GET https://www.worldometers.info/robots.txt> (referer: None)

In [1]: fetch("https://www.worldometers.info/world-population/population-by-cou
...: ntry/")
2022-12-03 10:08:03 [scrapy.core.engine] INFO: Spider opened
2022-12-03 10:08:05 [scrapy.core.engine] DEBUG: Crawled (404) <GET https://www.worldometers.info/robots.txt> (referer: None)
2022-12-03 10:08:05 [protego] DEBUG: Rule at line 2 without any user agent to enforce it on.
2022-12-03 10:08:05 [protego] DEBUG: Rule at line 10 without any user agent to enforce it on.
2022-12-03 10:08:05 [protego] DEBUG: Rule at line 12 without any user agent to enforce it on.
2022-12-03 10:08:05 [protego] DEBUG: Rule at line 14 without any user agent to enforce it on.
2022-12-03 10:08:05 [protego] DEBUG: Rule at line 16 without any user agent to enforce it on.
2022-12-03 10:08:05 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.worldometers.info/world-population/population-by-country/> (referer: None)

In [2]: 2022-12-03 10:08:06 [scrapy.core.scraper] ERROR: Spider error processing <GET https://www.worldometers.info/world-population/population-by-country/> (referer: None)
Traceback (most recent call last):
File "D:\app\python_env\myenv10_scrapy\lib\site-packages\twisted\internet\defer.py", line 892, in _runCallbacks
current.result = callback( # type: ignore[misc]
File "D:\app\python_env\myenv10_scrapy\lib\site-packages\scrapy\utils\defer.py", line 285, in f
return deferred_from_coro(coro_f(*coro_args, **coro_kwargs))
File "D:\app\python_env\myenv10_scrapy\lib\site-packages\scrapy\utils\defer.py", line 272, in deferred_from_coro
event_loop = get_asyncio_event_loop_policy().get_event_loop()
File "D:\app\Python\Python310\lib\asyncio\events.py", line 656, in get_event_loop
raise RuntimeError('There is no current event loop in thread %r.'
RuntimeError: There is no current event loop in thread 'Thread-1 (start)'.
2022-12-03 10:08:06 [py.warnings] WARNING: D:\app\python_env\myenv10_scrapy\lib\site-packages\twisted\internet\defer.py:892: RuntimeWarning: coroutine 'SpiderMiddlewareManager.scrape_response.<locals>.process_callback_output' was never awaited
current.result = callback( # type: ignore[misc]

show body (the RuntimeError above comes from running fetch() in the shell with the asyncio reactor on Windows; the page was still fetched, so the response object is usable)

In [5]: response.body
Out[5]: b'\n\n<!DOCTYPE html><!--[if IE 8]> <html lang="en" class="ie8"> <![endif]--><!--[if IE 9]> <html lang="en" class="ie9"> <![endif]--><!--[if !IE]><!--> <html lang="en"> <!--<![endif]--> <head> <meta charset="utf-8"> <meta http-equiv="X-UA-Compatible" content="IE=edge"> <meta name="viewport" content="width=device-width, initial-scale=1"> <title>Population by Country (2022) - Worldometer</title><meta name="description" content="List of countries and dependencies in the world ranked by population, from the most populated. Growth rate, median age, fertility rate, area, density, population density, urbanization, urban population, share of world population."><!-- Favicon --><link rel="shortcut icon" href="/favicon/favicon.ico" type="image/x-icon"><link rel="apple-touch-icon" sizes="57x57" href="/favicon/apple-icon-57x57.png"><link rel="apple-touch-icon" sizes="60x60" href="/favicon/apple-icon-60x60.png"><link rel="apple-touch-icon" sizes="72x72" href="/favicon/apple-icon-72x72.png"><link rel="apple-touch-icon" sizes="76x76" href="/favicon/apple-icon-76x76.png"><link rel="apple-touch-icon" sizes="114x114" href="/favicon/apple-icon-114x114.png"><link rel="apple-touch-icon" sizes="120x120" href="/favicon/apple-icon-120x120.png"><link rel="apple-touch-icon" sizes="144x144" href="/favicon/apple-icon-144x144.png"><link rel="apple-touch-icon" sizes="152x152" href="/favicon/apple-icon-152x152.png"><link rel="apple-touch-icon" sizes="180x180" href="/favicon/apple-icon-180x180.png"><link rel="icon"
......

scrapy view

  • Ctrl+Shift+I (open Chrome DevTools)
  • Ctrl+Shift+P (command palette)
    • Disable JavaScript
  • Ctrl+R (refresh)
    In [6]: view(response)
    Out[6]: True

XPath expression & CSS selectors

  • Ctrl+Shift+C (inspect an element)
    # XPath expression
    In [16]: title = response.xpath("//h1")

    In [17]: title
    Out[17]: [<Selector xpath='//h1' data='<h1>Countries in the world by populat...'>]

    In [18]: title = response.xpath("//h1/text()")

    In [19]: title
    Out[19]: [<Selector xpath='//h1/text()' data='Countries in the world by population ...'>]

    In [20]: title.get()
    Out[20]: 'Countries in the world by population (2022)'

    # CSS selectors
    In [22]: title_css = response.css("h1::text")

    In [23]: title_css
    Out[23]: [<Selector xpath='descendant-or-self::h1/text()' data='Countries in the world by population ...'>]

    In [26]: title_css.get()
    Out[26]: 'Countries in the world by population (2022)'


    # XPath expression
    In [30]: countries = response.xpath("//td/a/text()").getall()

    In [31]: countries
    Out[31]:
    ['China',
    'India',
    'United States',
    'Indonesia',
    'Pakistan',
    'Brazil',
    'Nigeria',
    ......

    'Holy See']

    # CSS selectors
    In [34]: countries_css = response.css("td a::text").getall()

    In [35]: countries_css
    Out[35]:
    ['China',
    'India',
    'United States',
    'Indonesia',
    'Pakistan',
    'Brazil',
    'Nigeria',
    ......

    'Holy See']
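
    The same selectors can also be tried outside the Scrapy shell. A minimal standalone sketch, assuming only the parsel package (the selector library Scrapy uses internally) and a tiny hand-written HTML stand-in for the worldometers page:

    from parsel import Selector

    # tiny hand-written HTML stand-in for the worldometers page
    html = ("<html><body><h1>Countries in the world by population (2022)</h1>"
            "<table><tr><td><a href='/c1'>China</a></td>"
            "<td><a href='/c2'>India</a></td></tr></table></body></html>")
    sel = Selector(text=html)

    print(sel.xpath("//h1/text()").get())        # XPath: first match
    print(sel.css("h1::text").get())             # CSS: same result
    print(sel.xpath("//td/a/text()").getall())   # all matches as a list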

modify worldmeters\worldmeters\spiders\countries.py to use XPath expressions

import scrapy


class CountriesSpider(scrapy.Spider):
    name = 'countries'
    allowed_domains = ['www.worldometers.info']
    # start_urls = ['https://www.worldometers.info/']
    start_urls = ['https://www.worldometers.info/world-population/population-by-country']

    def parse(self, response):
        title = response.xpath("//h1/text()").get()
        countries = response.xpath("//td/a/text()").getall()

        # note: the keys 'tittle' and 'counties' are misspelled here,
        # matching the crawl output below
        yield {
            'tittle': title,
            'counties': countries
        }

run scrapy crawler

(myenv10_scrapy) D:\work\run\python_crawler\101-scrapy\worldmeters>scrapy crawl countries
2022-12-06 12:03:18 [scrapy.utils.log] INFO: Scrapy 2.7.1 started (bot: worldmeters)
2022-12-06 12:03:18 [scrapy.utils.log] INFO: Versions: lxml 4.9.1.0, libxml2 2.9.12, cssselect 1.2.0, parsel 1.7.0, w3lib 2.1.0, Twisted 22.10.0, Python 3.10.6 (tags/v3.10.6:9c7b4bd, Aug 1 2022, 21:53:49) [MSC v.1932 64 bit (AMD64)], pyOpenSSL 22.1.0 (OpenSSL 3.0.7 1 Nov 2022), cryptography 38.0.4, Platform Windows-10-10.0.19044-SP0
2022-12-06 12:03:18 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'worldmeters',
'NEWSPIDER_MODULE': 'worldmeters.spiders',
'REQUEST_FINGERPRINTER_IMPLEMENTATION': '2.7',
'ROBOTSTXT_OBEY': True,
'SPIDER_MODULES': ['worldmeters.spiders'],
'TWISTED_REACTOR': 'twisted.internet.asyncioreactor.AsyncioSelectorReactor'}
2022-12-06 12:03:18 [asyncio] DEBUG: Using selector: SelectSelector
2022-12-06 12:03:18 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.asyncioreactor.AsyncioSelectorReactor
2022-12-06 12:03:18 [scrapy.utils.log] DEBUG: Using asyncio event loop: asyncio.windows_events._WindowsSelectorEventLoop
2022-12-06 12:03:18 [scrapy.extensions.telnet] INFO: Telnet Password: 6b21a23169d5dea4
2022-12-06 12:03:18 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.logstats.LogStats']
2022-12-06 12:03:18 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2022-12-06 12:03:18 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2022-12-06 12:03:18 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2022-12-06 12:03:18 [scrapy.core.engine] INFO: Spider opened
2022-12-06 12:03:18 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-12-06 12:03:18 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2022-12-06 12:03:19 [scrapy.core.engine] DEBUG: Crawled (404) <GET https://www.worldometers.info/robots.txt> (referer: None)
2022-12-06 12:03:19 [protego] DEBUG: Rule at line 2 without any user agent to enforce it on.
2022-12-06 12:03:19 [protego] DEBUG: Rule at line 10 without any user agent to enforce it on.
2022-12-06 12:03:19 [protego] DEBUG: Rule at line 12 without any user agent to enforce it on.
2022-12-06 12:03:19 [protego] DEBUG: Rule at line 14 without any user agent to enforce it on.
2022-12-06 12:03:19 [protego] DEBUG: Rule at line 16 without any user agent to enforce it on.
2022-12-06 12:03:20 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.worldometers.info/world-population/population-by-country/> (referer: None)
2022-12-06 12:03:20 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.worldometers.info/world-population/population-by-country/>
{'tittle': 'Countries in the world by population (2022)', 'counties': ['China', 'India', 'United States', 'Indonesia', 'Pakistan', 'Brazil', 'Nigeria', 'Bangladesh', 'Russia', 'Mexico', 'Japan', 'Ethiopia', 'Philippines', 'Egypt', 'Vietnam', 'DR Congo', 'Turkey', 'Iran', 'Germany', 'Thailand', 'United Kingdom', 'France', 'Italy', 'Tanzania', 'South Africa', 'Myanmar', 'Kenya', 'South Korea', 'Colombia', 'Spain', 'Uganda', 'Argentina', 'Algeria', 'Sudan', 'Ukraine', 'Iraq', 'Afghanistan', 'Poland', 'Canada', 'Morocco', 'Saudi Arabia', 'Uzbekistan', 'Peru', 'Angola', 'Malaysia', 'Mozambique', 'Ghana', 'Yemen', 'Nepal', 'Venezuela', 'Madagascar', 'Cameroon', "Côte d'Ivoire", 'North Korea', 'Australia', 'Niger', 'Taiwan', 'Sri Lanka', 'Burkina Faso', 'Mali', 'Romania', 'Malawi', 'Chile', 'Kazakhstan', 'Zambia', 'Guatemala', 'Ecuador', 'Syria', 'Netherlands', 'Senegal', 'Cambodia', 'Chad', 'Somalia', 'Zimbabwe', 'Guinea', 'Rwanda', 'Benin', 'Burundi', 'Tunisia', 'Bolivia', 'Belgium', 'Haiti', 'Cuba', 'South Sudan', 'Dominican Republic', 'Czech Republic (Czechia)', 'Greece', 'Jordan', 'Portugal', 'Azerbaijan', 'Sweden', 'Honduras', 'United Arab Emirates', 'Hungary', 'Tajikistan', 'Belarus', 'Austria', 'Papua New Guinea', 'Serbia', 'Israel', 'Switzerland', 'Togo', 'Sierra Leone', 'Hong Kong', 'Laos', 'Paraguay', 'Bulgaria', 'Libya', 'Lebanon', 'Nicaragua', 'Kyrgyzstan', 'El Salvador', 'Turkmenistan', 'Singapore', 'Denmark', 'Finland', 'Congo', 'Slovakia', 'Norway', 'Oman', 'State of Palestine', 'Costa Rica', 'Liberia', 'Ireland', 'Central African Republic', 'New Zealand', 'Mauritania', 'Panama', 'Kuwait', 'Croatia', 'Moldova', 'Georgia', 'Eritrea', 'Uruguay', 'Bosnia and Herzegovina', 'Mongolia', 'Armenia', 'Jamaica', 'Qatar', 'Albania', 'Puerto Rico', 'Lithuania', 'Namibia', 'Gambia', 'Botswana', 'Gabon', 'Lesotho', 'North Macedonia', 'Slovenia', 'Guinea-Bissau', 'Latvia', 'Bahrain', 'Equatorial Guinea', 'Trinidad and Tobago', 'Estonia', 'Timor-Leste', 'Mauritius', 'Cyprus', 'Eswatini', 'Djibouti', 'Fiji', 'Réunion', 'Comoros', 'Guyana', 'Bhutan', 'Solomon Islands', 'Macao', 'Montenegro', 'Luxembourg', 'Western Sahara', 'Suriname', 'Cabo Verde', 'Micronesia', 'Maldives', 'Malta', 'Brunei ', 'Guadeloupe', 'Belize', 'Bahamas', 'Martinique', 'Iceland', 'Vanuatu', 'French Guiana', 'Barbados', 'New Caledonia', 'French Polynesia', 'Mayotte', 'Sao Tome & Principe', 'Samoa', 'Saint Lucia', 'Channel Islands', 'Guam', 'Curaçao', 'Kiribati', 'Grenada', 'St. Vincent & Grenadines', 'Aruba', 'Tonga', 'U.S. Virgin Islands', 'Seychelles', 'Antigua and Barbuda', 'Isle of Man', 'Andorra', 'Dominica', 'Cayman Islands', 'Bermuda', 'Marshall Islands', 'Northern Mariana Islands', 'Greenland', 'American Samoa', 'Saint Kitts & Nevis', 'Faeroe Islands', 'Sint Maarten', 'Monaco', 'Turks and Caicos', 'Saint Martin', 'Liechtenstein', 'San Marino', 'Gibraltar', 'British Virgin Islands', 'Caribbean Netherlands', 'Palau', 'Cook Islands', 'Anguilla', 'Tuvalu', 'Wallis & Futuna', 'Nauru', 'Saint Barthelemy', 'Saint Helena', 'Saint Pierre & Miquelon', 'Montserrat', 'Falkland Islands', 'Niue', 'Tokelau', 'Holy See']}
2022-12-06 12:03:20 [scrapy.core.engine] INFO: Closing spider (finished)
2022-12-06 12:03:20 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 491,
'downloader/request_count': 2,
'downloader/request_method_count/GET': 2,
'downloader/response_bytes': 18989,
'downloader/response_count': 2,
'downloader/response_status_count/200': 1,
'downloader/response_status_count/404': 1,
'elapsed_time_seconds': 1.387354,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2022, 12, 6, 4, 3, 20, 290084),
'httpcompression/response_bytes': 96257,
'httpcompression/response_count': 2,
'item_scraped_count': 1,
'log_count/DEBUG': 11,
'log_count/INFO': 10,
'response_received_count': 2,
'robotstxt/request_count': 1,
'robotstxt/response_count': 1,
'robotstxt/response_status_count/404': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2022, 12, 6, 4, 3, 18, 902730)}
2022-12-06 12:03:20 [scrapy.core.engine] INFO: Spider closed (finished)

Worldometers: get countries' population

try xpath
(myenv10_scrapy) D:\work\run\python_crawler\101-scrapy\worldmeters>scrapy shell "https://www.worldometers.info/world-population/population-by-country/"
2022-12-09 12:24:35 [scrapy.utils.log] INFO: Scrapy 2.7.1 started (bot: worldmeters)
......

[s] shelp() Shell help (print this help)
[s] view(response) View response in a browser
2022-12-09 12:24:37 [asyncio] DEBUG: Using selector: SelectSelector
In [1]:

In [1]: countries = response.xpath("//td/a")

In [2]: countries
Out[2]:
[<Selector xpath='//td/a' data='<a href="/world-population/china-popu...'>,
<Selector xpath='//td/a' data='<a href="/world-population/india-popu...'>,
......
<Selector xpath='//td/a' data='<a href="/world-population/tokelau-po...'>,
<Selector xpath='//td/a' data='<a href="/world-population/holy-see-p...'>]

In [3]:
countries.py
import scrapy


class CountriesSpider(scrapy.Spider):
    name = 'countries'
    allowed_domains = ['www.worldometers.info']
    # start_urls = ['https://www.worldometers.info/']
    start_urls = ['https://www.worldometers.info/world-population/population-by-country']

    def parse(self, response):
        countries = response.xpath("//td/a")
        for country in countries:
            name = country.xpath(".//text()").get()
            link = country.xpath(".//@href").get()

            yield {
                'country_name': name,
                'country_link': link
            }
run
(myenv10_scrapy) D:\work\run\python_crawler\101-scrapy\worldmeters>scrapy crawl countries
2022-12-09 13:54:27 [scrapy.utils.log] INFO: Scrapy 2.7.1 started (bot: worldmeters)
......
2022-12-09 13:54:28 [protego] DEBUG: Rule at line 16 without any user agent to enforce it on.
2022-12-09 13:54:29 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.worldometers.info/world-population/population-by-country/> (referer: None)
2022-12-09 13:54:29 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.worldometers.info/world-population/population-by-country/>
{'country_name': 'China', 'country_link': '/world-population/china-population/'}
2022-12-09 13:54:29 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.worldometers.info/world-population/population-by-country/>
{'country_name': 'India', 'country_link': '/world-population/india-population/'}
2022-12-09 13:54:29 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.worldometers.info/world-population/population-by-country/>
{'country_name': 'United States', 'country_link': '/world-population/us-population/'}
......
2022-12-09 13:54:29 [scrapy.core.engine] INFO: Spider closed (finished)
countries.py
import scrapy


class CountriesSpider(scrapy.Spider):
    name = 'countries'
    allowed_domains = ['www.worldometers.info']
    # start_urls = ['https://www.worldometers.info/']
    start_urls = ['https://www.worldometers.info/world-population/population-by-country']

    def parse(self, response):
        countries = response.xpath("//td/a")
        for country in countries:
            name = country.xpath(".//text()").get()
            link = country.xpath(".//@href").get()

            # absolute url
            # absolute_url = f'https://www.worldometers.info{link}'
            # absolute_url = response.urljoin(link)
            # yield scrapy.Request(url=absolute_url)

            # relative url
            yield response.follow(url=link)
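
As a quick standard-library check of what the commented-out urljoin alternative above does with the relative links scraped here:

from urllib.parse import urljoin

base = 'https://www.worldometers.info/world-population/population-by-country/'
print(urljoin(base, '/world-population/china-population/'))
# https://www.worldometers.info/world-population/china-population/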
run
(myenv10_scrapy) D:\work\run\python_crawler\101-scrapy\worldmeters>scrapy crawl countries
2022-12-09 14:17:54 [scrapy.utils.log] INFO: Scrapy 2.7.1 started (bot: worldmeters)
2022-12-09 14:17:54 [scrapy.utils.log] INFO: Versions: lxml 4.9.1.0, libxml2 2.9.12, cssselect 1.2.0, parsel 1.7.0, w3lib 2.1.0, Twisted 22.10.0, Python 3.10.6 (tags/v3.10.6:9c7b4bd, Aug 1 2022, 21:53:49) [MSC v.1932 64 bit (AMD64)], pyOpenSSL 22.1.0 (OpenSSL 3.0.7 1 Nov 2022), cryptography 38.0.4, Platform Windows-10-10.0.19044-SP0
2022-12-09 14:17:54 [scrapy.crawler] INFO: Overridden settings:
......
2022-12-09 14:17:55 [protego] DEBUG: Rule at line 16 without any user agent to enforce it on.
2022-12-09 14:17:56 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.worldometers.info/world-population/population-by-country/> (referer: None)
2022-12-09 14:17:56 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.worldometers.info/world-population/iran-population/> (referer: https://www.worldometers.info/world-population/population-by-country/)
2022-12-09 14:17:57 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.worldometers.info/world-population/mexico-population/> (referer: https://www.worldometers.info/world-population/population-by-country/)
......
'scheduler/dequeued/memory': 236,
'scheduler/enqueued': 236,
'scheduler/enqueued/memory': 236,
'start_time': datetime.datetime(2022, 12, 9, 6, 17, 55, 174514)}
2022-12-09 14:18:05 [scrapy.core.engine] INFO: Spider closed (shutdown)

get each country's year and population

countries.py
import scrapy
import logging


class CountriesSpider(scrapy.Spider):
    name = 'countries'
    allowed_domains = ['www.worldometers.info']
    # start_urls = ['https://www.worldometers.info/']
    start_urls = ['https://www.worldometers.info/world-population/population-by-country']

    def parse(self, response):
        countries = response.xpath("//td/a")
        for country in countries:
            name = country.xpath(".//text()").get()
            link = country.xpath(".//@href").get()

            # absolute url
            # absolute_url = f'https://www.worldometers.info{link}'
            # absolute_url = response.urljoin(link)
            # yield scrapy.Request(url=absolute_url)

            # relative url
            yield response.follow(url=link, callback=self.parse_country)

    def parse_country(self, response):
        # show log
        # logging.info(response.url)
        rows = response.xpath("(//table[@class='table table-striped table-bordered table-hover table-condensed table-list'])[1]/tbody/tr")
        for row in rows:
            year = row.xpath("./td[1]/text()").get()
            population = row.xpath("./td[2]/strong/text()").get()
            yield {
                'year': year,
                'population': population
            }
run
(myenv10_scrapy) D:\work\run\python_crawler\101-scrapy\worldmeters>scrapy crawl countries
2022-12-09 17:07:25 [scrapy.utils.log] INFO: Scrapy 2.7.1 started (bot: worldmeters)
......
2022-12-09 17:07:27 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.worldometers.info/world-population/philippines-population/>
{'year': '2020', 'population': '109,581,078'}
2022-12-09 17:07:27 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.worldometers.info/world-population/philippines-population/>
{'year': '2019', 'population': '108,116,615'}
2022-12-09 17:07:27 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.worldometers.info/world-population/philippines-population/>
{'year': '2018', 'population': '106,651,394'}
......

use a class-level variable for the country name (does not work: requests run concurrently, so self.country_name is overwritten before the callbacks fire, mismatching names as the log below shows …)

countries.py
import scrapy
import logging


class CountriesSpider(scrapy.Spider):
    name = 'countries'
    allowed_domains = ['www.worldometers.info']
    # start_urls = ['https://www.worldometers.info/']
    start_urls = ['https://www.worldometers.info/world-population/population-by-country']
    country_name = ''

    def parse(self, response):
        countries = response.xpath("//td/a")
        for country in countries:
            name = country.xpath(".//text()").get()
            link = country.xpath(".//@href").get()

            # class-level variable shared across all requests
            self.country_name = name

            # absolute url
            # absolute_url = f'https://www.worldometers.info{link}'
            # absolute_url = response.urljoin(link)
            # yield scrapy.Request(url=absolute_url)

            # relative url
            yield response.follow(url=link, callback=self.parse_country)

    def parse_country(self, response):
        # show log
        # logging.info(response.url)
        rows = response.xpath("(//table[@class='table table-striped table-bordered table-hover table-condensed table-list'])[1]/tbody/tr")
        for row in rows:
            year = row.xpath("./td[1]/text()").get()
            population = row.xpath("./td[2]/strong/text()").get()
            yield {
                'name': self.country_name,
                'year': year,
                'population': population
            }
run
(myenv10_scrapy) D:\work\run\python_crawler\101-scrapy\worldmeters>scrapy crawl countries
......
2022-12-12 15:17:38 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.worldometers.info/world-population/china-population/>
{'name': 'Denmark', 'year': '2020', 'population': '1,439,323,776'}
2022-12-12 15:17:38 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.worldometers.info/world-population/china-population/>
{'name': 'Norway', 'year': '2019', 'population': '1,433,783,686'}
2022-12-12 15:17:38 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.worldometers.info/world-population/china-population/>
{'name': 'Norway', 'year': '2018', 'population': '1,427,647,786'}
......

pass a parameter to the callback via meta

countries.py
import scrapy


class CountriesSpider(scrapy.Spider):
    name = 'countries'
    allowed_domains = ['www.worldometers.info']
    # start_urls = ['https://www.worldometers.info/']
    start_urls = ['https://www.worldometers.info/world-population/population-by-country']

    def parse(self, response):
        countries = response.xpath("//td/a")
        for country in countries:
            name = country.xpath(".//text()").get()
            link = country.xpath(".//@href").get()

            # absolute url
            # absolute_url = f'https://www.worldometers.info{link}'
            # absolute_url = response.urljoin(link)
            # yield scrapy.Request(url=absolute_url)

            # relative url
            # add meta for callback parameter
            yield response.follow(url=link, callback=self.parse_country, meta={'country_name': name})

    def parse_country(self, response):
        # read the meta passed along with the request
        name = response.request.meta['country_name']
        rows = response.xpath("(//table[@class='table table-striped table-bordered table-hover table-condensed table-list'])[1]/tbody/tr")
        for row in rows:
            year = row.xpath("./td[1]/text()").get()
            population = row.xpath("./td[2]/strong/text()").get()
            yield {
                'country_name': name,
                'year': year,
                'population': population
            }
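
As an aside, Scrapy 1.7+ also supports cb_kwargs, which delivers the value as a real keyword argument of the callback instead of going through meta. A sketch of the same spider using it:

import scrapy


class CountriesSpider(scrapy.Spider):
    name = 'countries'
    allowed_domains = ['www.worldometers.info']
    start_urls = ['https://www.worldometers.info/world-population/population-by-country']

    def parse(self, response):
        for country in response.xpath("//td/a"):
            name = country.xpath(".//text()").get()
            link = country.xpath(".//@href").get()
            # cb_kwargs entries become keyword arguments of parse_country
            yield response.follow(url=link, callback=self.parse_country,
                                  cb_kwargs={'country_name': name})

    def parse_country(self, response, country_name):
        rows = response.xpath("(//table[@class='table table-striped table-bordered table-hover table-condensed table-list'])[1]/tbody/tr")
        for row in rows:
            yield {
                'country_name': country_name,
                'year': row.xpath("./td[1]/text()").get(),
                'population': row.xpath("./td[2]/strong/text()").get()
            }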
run
(myenv10_scrapy) D:\work\run\python_crawler\101-scrapy\worldmeters>scrapy crawl countries
......
2022-12-12 15:25:10 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.worldometers.info/world-population/uk-population/>
{'country_name': 'United Kingdom', 'year': '2020', 'population': '67,886,011'}
2022-12-12 15:25:10 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.worldometers.info/world-population/uk-population/>
{'country_name': 'United Kingdom', 'year': '2019', 'population': '67,530,172'}
2022-12-12 15:25:10 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.worldometers.info/world-population/uk-population/>
{'country_name': 'United Kingdom', 'year': '2018', 'population': '67,141,684'}
......

export the scraped data as a dataset (JSON, CSV, XML)

# generate json file
scrapy crawl countries -o population_dataset.json
# generate csv file
scrapy crawl countries -o population_dataset.csv
# generate xml file
scrapy crawl countries -o population_dataset.xml
[
{"country_name": "China", "year": "2020", "population": "1,439,323,776"},
{"country_name": "China", "year": "2019", "population": "1,433,783,686"},
{"country_name": "China", "year": "2018", "population": "1,427,647,786"},
{"country_name": "China", "year": "2017", "population": "1,421,021,791"},
......
{"country_name": "India", "year": "1960", "population": "450,547,679"},
{"country_name": "India", "year": "1955", "population": "409,880,595"}
]
country_name,year,population
China,2020,"1,439,323,776"
China,2019,"1,433,783,686"
China,2018,"1,427,647,786"
......
DR Congo,1965,"17,369,883"
DR Congo,1960,"15,248,251"
DR Congo,1955,"13,517,513"
<?xml version="1.0" encoding="utf-8"?>
<items>
<item><country_name>China</country_name><year>2020</year><population>1,439,323,776</population></item>
<item><country_name>China</country_name><year>2019</year><population>1,433,783,686</population></item>
......
<item><country_name>Philippines</country_name><year>1960</year><population>26,269,734</population></item>
<item><country_name>Philippines</country_name><year>1955</year><population>22,177,058</population></item>
</items>
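
Instead of passing -o on every run, the export targets can also live in settings.py. A sketch assuming Scrapy >= 2.1 (where the FEEDS setting was introduced); the file names are just examples:

# settings.py
FEEDS = {
    'population_dataset.json': {'format': 'json', 'encoding': 'utf8'},
    'population_dataset.csv': {'format': 'csv'},
    'population_dataset.xml': {'format': 'xml'},
}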

Debt to GDP ratio by country

create spider

(myenv10_scrapy) D:\work\run\python_crawler\101-scrapy\worldmeters>scrapy genspider gdp_debt worldpopulationreview.com/countries/countries-by-national-debt
Created spider 'gdp_debt' using template 'basic' in module:
worldmeters.spiders.gdp_debt

gdp_debt.py

import scrapy


class GdpDebtSpider(scrapy.Spider):
    name = 'gdp_debt'
    allowed_domains = ['worldpopulationreview.com']
    # start_urls = ['http://worldpopulationreview.com/']
    start_urls = ['https://worldpopulationreview.com/country-rankings/countries-by-national-debt']

    def parse(self, response):
        rows = response.xpath("//tbody/tr")
        for row in rows:
            name = row.xpath("./td[1]/a/text()").get()
            debt_rate = row.xpath("./td[2]/text()").get()

            yield {
                'country_name': name,
                'debt_rate': debt_rate
            }

run (needs a different approach)

The "Debt to GDP Ratio by Country" table cannot be scraped this way: the site generates the data with JavaScript, so it is not present in the downloaded HTML.
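
One common workaround is to let a real browser render the page first. A hedged sketch using the third-party scrapy-playwright plugin (pip install scrapy-playwright, then playwright install); the spider name here is hypothetical, and the XPath is the one used above, which would only match if the rendered DOM actually has that structure:

# settings.py additions for scrapy-playwright:
# DOWNLOAD_HANDLERS = {
#     "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
#     "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
# }
# TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"

import scrapy


class GdpDebtJsSpider(scrapy.Spider):
    name = 'gdp_debt_js'  # hypothetical spider name

    def start_requests(self):
        yield scrapy.Request(
            url='https://worldpopulationreview.com/country-rankings/countries-by-national-debt',
            meta={'playwright': True},  # ask Playwright to render the page first
        )

    def parse(self, response):
        for row in response.xpath("//tbody/tr"):
            yield {
                'country_name': row.xpath("./td[1]/a/text()").get(),
                'debt_rate': row.xpath("./td[2]/text()").get(),
            }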

tinydeal

open robots.txt

Search for the page name in robots.txt; if it is not listed under a Disallow rule, there is no restriction on scraping it.

User-agent: *
Allow: /*main_page=top_brands
Allow: /*main_page=ws_search_result
Allow: /*index.php?main_page=zone_2dollars
Disallow: /*main_page=*
Disallow: /bg/
Disallow: /cs/
Disallow: /da/
Disallow: /el/
Disallow: /fi/
Disallow: /hu/
Disallow: /hr/
Disallow: /lt/
Disallow: /no/
Disallow: /pl/
Disallow: /ro/
Disallow: /sk/
Disallow: /sl/
Disallow: /sv/
Disallow: /tr/
Disallow: /ja/
Disallow: /jp/
Disallow: /ko/
Disallow: /wordpress/
Disallow: /new/
Disallow: /*/includes/
Disallow: /shop/products/
Disallow: /index.php/*-si-*.html
Disallow: /*-c-*-pg-1.html
Disallow: */buy/*surl=
Disallow: */buy/*-c-
Disallow: /*pagesize=
Disallow: /*sk=
Disallow: /*?dp=
Disallow: /*fb_comment_id=
Disallow: /*reviews_id=
Disallow: /*gotowhere
Disallow: /*/cheap-product
Disallow: /es/compra*-t-*
Disallow: /pt/compra*-t-*
Disallow: /fr/bon*-t-*
Disallow: /de/kaufen*-t-*
Disallow: /it/economico*-t-*
Disallow: /ru/Купи*-t-*
Disallow: /nl/*goedkoop*-t-*
Disallow: /ar/بالأسعار-المعقولة*-t-*
Disallow: /*is_input=


User-agent: Yandex
Allow: /*main_page=top_brands
Allow: /*main_page=ws_search_result
Allow: /*index.php?main_page=zone_2dollars
Disallow: /*main_page=*
Disallow: /es/
Disallow: /it/
Disallow: /pt/
Disallow: /fr/
Disallow: /de/
Disallow: /ar/
Disallow: /bg/
Disallow: /cs/
Disallow: /da/
Disallow: /el/
Disallow: /fi/
Disallow: /hu/
Disallow: /hr/
Disallow: /lt/
Disallow: /nl/
Disallow: /no/
Disallow: /pl/
Disallow: /ro/
Disallow: /sk/
Disallow: /sl/
Disallow: /sv/
Disallow: /tr/
Disallow: /ja/
Disallow: /jp/
Disallow: /ko/
Disallow: /wordpress/
Disallow: /customers_photo/
Disallow: /new/
Disallow: /*/includes/
Disallow: /shop/products/
Disallow: /index.php/*-si-*.html
Disallow: /*-c-*-pg-1.html
Disallow: */buy/*surl=
Disallow: */buy/*-c-
Disallow: /*pagesize=
Disallow: /*sk=
Disallow: /*?dp=
Disallow: /*fb_comment_id=
Disallow: /*reviews_id=
Disallow: /*gotowhere
Disallow: /*/cheap-product
Disallow: /es/compra*-t-*
Disallow: /pt/compra*-t-*
Disallow: /fr/bon*-t-*
Disallow: /de/kaufen*-t-*
Disallow: /it/economico*-t-*
Disallow: /ru/Купи*-t-*
Disallow: /nl/*goedkoop*-t-*
Disallow: /ar/بالأسعار-المعقولة*-t-*
Disallow: /*is_input=

User-Agent: Baiduspider
Disallow: /
User-Agent: 360Spider
Disallow: /
User-Agent: Sogouspider
Disallow: /
User-Agent: Sosospider
Disallow: /
User-agent: YoudaoBot
Disallow: /
User-agent: magpie-crawler
Disallow: /

User-agent: AdsBot-Google
Disallow:
User-agent: Googlebot-Image
Disallow:

Sitemap: http://www.tinydeal.com/sitemap.xml
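
The same check can be done programmatically. A minimal sketch, assuming the protego package (the robots.txt parser Scrapy itself uses) and a trimmed-down copy of the rules above:

from protego import Protego

robots_txt = """
User-agent: *
Disallow: /shop/products/
"""
rp = Protego.parse(robots_txt)
print(rp.can_fetch("https://www.tinydeal.com/specials.html", "*"))      # True: not listed
print(rp.can_fetch("https://www.tinydeal.com/shop/products/abc", "*"))  # False: disallowed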
https://web.archive.org/web/20190225123327/https://www.tinydeal.com/specials.html

open the web site, then disable JavaScript

  • open Chrome DevTools
  • run command (Ctrl+Shift+P): Disable JavaScript
  • refresh the web page

create project and spider

(myenv10_scrapy) D:\work\run\python_crawler\101-scrapy>scrapy startproject tinydeal
New Scrapy project 'tinydeal', using template directory 'D:\app\python_env\myenv10_scrapy\lib\site-packages\scrapy\templates\project', created in:
D:\work\run\python_crawler\101-scrapy\tinydeal

You can start your first spider with:
cd tinydeal
scrapy genspider example example.com

(myenv10_scrapy) D:\work\run\python_crawler\101-scrapy>cd tinydeal

(myenv10_scrapy) D:\work\run\python_crawler\101-scrapy\tinydeal>scrapy genspider special_offers https://web.archive.org/web/20190225123327/https://www.tinydeal.com/specials.html
Created spider 'special_offers' using template 'basic' in module:
tinydeal.spiders.special_offers

update special_offers.py

import scrapy


class SpecialOffersSpider(scrapy.Spider):
    name = 'special_offers'
    allowed_domains = ['web.archive.org']
    # start_urls = ['http://web.archive.org/']
    # change web site
    start_urls = ['https://web.archive.org/web/20190225123327/https://www.tinydeal.com/specials.html']

    def parse(self, response):
        pass

update special_offers.py (get product information)

import scrapy


class SpecialOffersSpider(scrapy.Spider):
    name = 'special_offers'
    allowed_domains = ['web.archive.org']
    # start_urls = ['http://web.archive.org/']
    # change web site
    start_urls = ['https://web.archive.org/web/20190225123327/https://www.tinydeal.com/specials.html']

    def parse(self, response):
        for product in response.xpath('//ul[@class="productlisting-ul"]/div/li'):
            yield {
                'title': product.xpath('.//a[@class="p_box_title"]/text()').get(),
                'url': response.urljoin(product.xpath('.//a[@class="p_box_title"]/@href').get()),
                'discounted_price': product.xpath('.//div[@class="p_box_price"]/span[1]/text()').get(),
                'original_price': product.xpath('.//div[@class="p_box_price"]/span[2]/text()').get()
            }

run

scrapy crawl special_offers -o dataset.json

settings.py - export JSON as UTF-8 (avoid \uXXXX Unicode escapes)

# set JSON utf-8 format
FEED_EXPORT_ENCODING = 'utf-8'
[
{
"title": "SanDisk A1 32GB UHS-I / Class 10 up to 98MB / s Micro SDHC Memory Card\u00a0EFM-530161",
"url": "https://web.archive.org/web/20190225123327/https:/www.tinydeal.com/sandisk-a1-32gb-uhs-i-class-10-up-to-98mb-s-micro-sdhc-memory-card-p-165914.html",
"discounted_price": "$6.72",
"original_price": "$12.09 "
},
{
"title": "18g Super Strong Sealant Fix Metal Adhesive Sealing Glue Bond\u00a0HHI-557389",
"url": "https://web.archive.org/web/20190225123327/https:/www.tinydeal.com/18g-super-strong-sealant-fix-metal-adhesive-sealing-glue-bond-p-177571.html",
"discounted_price": "$1.40",
"original_price": "$3.76 "
},


# set FEED_EXPORT_ENCODING = 'utf-8'
[
{
"title": "SanDisk A1 32GB UHS-I / Class 10 up to 98MB / s Micro SDHC Memory Card EFM-530161",
"url": "https://web.archive.org/web/20190225123327/https:/www.tinydeal.com/sandisk-a1-32gb-uhs-i-class-10-up-to-98mb-s-micro-sdhc-memory-card-p-165914.html",
"discounted_price": "$6.72",
"original_price": "$12.09 "
},
{
"title": "18g Super Strong Sealant Fix Metal Adhesive Sealing Glue Bond HHI-557389",
"url": "https://web.archive.org/web/20190225123327/https:/www.tinydeal.com/18g-super-strong-sealant-fix-metal-adhesive-sealing-glue-bond-p-177571.html",
"discounted_price": "$1.40",
"original_price": "$3.76 "
},

special_offers.py - dealing with pagination (CSV output)

import scrapy


class SpecialOffersSpider(scrapy.Spider):
    name = 'special_offers'
    allowed_domains = ['web.archive.org']
    # start_urls = ['http://web.archive.org/']
    # change web site
    start_urls = ['https://web.archive.org/web/20190225123327/https://www.tinydeal.com/specials.html']

    def parse(self, response):
        for product in response.xpath('//ul[@class="productlisting-ul"]/div/li'):
            yield {
                'title': product.xpath('.//a[@class="p_box_title"]/text()').get(),
                'url': response.urljoin(product.xpath('.//a[@class="p_box_title"]/@href').get()),
                'discounted_price': product.xpath('.//div[@class="p_box_price"]/span[1]/text()').get(),
                'original_price': product.xpath('.//div[@class="p_box_price"]/span[2]/text()').get()
            }

        # follow the "next page" link until it disappears
        next_page = response.xpath('//a[@class="nextPage"]/@href').get()
        if next_page:
            yield scrapy.Request(url=next_page, callback=self.parse)
# only gets up to page 9
scrapy crawl special_offers -o dataset.csv
......
{'downloader/request_bytes': 7557,
'downloader/request_count': 19,
'downloader/request_method_count/GET': 19,
'downloader/response_bytes': 541040,
'downloader/response_count': 19,
'downloader/response_status_count/200': 9,
'downloader/response_status_count/302': 9,
'downloader/response_status_count/404': 1,
'elapsed_time_seconds': 7.098086,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2022, 12, 14, 6, 8, 40, 825326),
'httpcompression/response_bytes': 3199518,
'httpcompression/response_count': 9,
'item_scraped_count': 495,
'log_count/DEBUG': 517,
'log_count/INFO': 10,
'request_depth_max': 8,
'response_received_count': 10,
'robotstxt/request_count': 1,
'robotstxt/response_count': 1,
'robotstxt/response_status_count/404': 1,
'scheduler/dequeued': 18,
'scheduler/dequeued/memory': 18,
'scheduler/enqueued': 18,
'scheduler/enqueued/memory': 18,
'start_time': datetime.datetime(2022, 12, 14, 6, 8, 33, 727240)}
2022-12-14 14:08:40 [scrapy.core.engine] INFO: Spider closed (finished)

(myenv10_scrapy) D:\work\run\python_crawler\101-scrapy\tinydeal>

change User-Agent

check scrapy headers
(myenv10_scrapy) D:\work\run\python_crawler\101-scrapy\tinydeal>scrapy shell "https://web.archive.org/web/20190225123327/https://www.tinydeal.com/specials.html"
......
2022-12-14 14:19:55 [asyncio] DEBUG: Using selector: SelectSelector
# show the flow
[s] Available Scrapy objects:
[s] scrapy scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s] crawler <scrapy.crawler.Crawler object at 0x00000265CAFC3850>
[s] item {}
[s] request <GET https://web.archive.org/web/20190225123327/https://www.tinydeal.com/specials.html>
[s] response <200 https://web.archive.org/web/20190225123327/https://www.tinydeal.com/specials.html>
[s] settings <scrapy.settings.Settings object at 0x00000265CAFC37F0>
[s] spider <SpecialOffersSpider 'special_offers' at 0x265cb41db70>
[s] Useful shortcuts:
[s] fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
[s] fetch(req) Fetch a scrapy.Request and update local objects
[s] shelp() Shell help (print this help)
[s] view(response) View response in a browser
2022-12-14 14:19:56 [asyncio] DEBUG: Using selector: SelectSelector
# request header
In [1]: request.headers
Out[1]:
{b'Accept': b'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
b'Accept-Language': b'en',
b'User-Agent': b'Scrapy/2.7.1 (+https://scrapy.org)',
b'Accept-Encoding': b'gzip, deflate'}

# response request headers
In [3]: response.request.headers
Out[3]:
{b'Accept': b'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
b'Accept-Language': b'en',
b'User-Agent': b'Scrapy/2.7.1 (+https://scrapy.org)',
b'Accept-Encoding': b'gzip, deflate'}

In [4]:
check browser user agent
change User-Agent via settings.py (two options)
# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'tinydeal (+http://www.yourdomain.com)'
# change user agent
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36'
# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}
# change the default headers
DEFAULT_REQUEST_HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36'
}
change User-Agent in the spider (.py)
import scrapy


class SpecialOffersSpider(scrapy.Spider):
    name = 'special_offers'
    allowed_domains = ['web.archive.org']
    # start_urls = ['http://web.archive.org/']
    # change web site

    # change user agent: replace start_urls with start_requests
    # start_urls = ['https://web.archive.org/web/20190225123327/https://www.tinydeal.com/specials.html']
    def start_requests(self):
        yield scrapy.Request(url='https://web.archive.org/web/20190225123327/https://www.tinydeal.com/specials.html', callback=self.parse, headers={
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36'
        })

    def parse(self, response):
        for product in response.xpath('//ul[@class="productlisting-ul"]/div/li'):
            yield {
                'title': product.xpath('.//a[@class="p_box_title"]/text()').get(),
                'url': response.urljoin(product.xpath('.//a[@class="p_box_title"]/@href').get()),
                'discounted_price': product.xpath('.//div[@class="p_box_price"]/span[1]/text()').get(),
                'original_price': product.xpath('.//div[@class="p_box_price"]/span[2]/text()').get(),
                # show the User-Agent actually sent with the request
                'User-Agent': response.request.headers['User-Agent']
            }

        next_page = response.xpath('//a[@class="nextPage"]/@href').get()
        if next_page:
            # keep the custom user agent on follow-up requests
            yield scrapy.Request(url=next_page, callback=self.parse, headers={
                'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36'
            })
(myenv10_scrapy) D:\work\run\python_crawler\101-scrapy\tinydeal>scrapy crawl special_offers
......
2022-12-14 15:20:51 [scrapy.core.scraper] DEBUG: Scraped from <200 https://web.archive.org/web/20190225123327/https://www.tinydeal.com/specials.html>
{'title': '18g Super Strong Sealant Fix Metal Adhesive Sealing Glue Bond\xa0HHI-557389', 'url': 'https://web.archive.org/web/20190225123327/https:/www.tinydeal.com/18g-super-strong-sealant-fix-metal-adhesive-sealing-glue-bond-p-177571.html', 'discounted_price': '$1.40', 'original_price': '$3.76 ', 'User-Agent': b'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36'}
2022-12-14 15:20:51 [scrapy.core.scraper] DEBUG: Scraped from <200 https://web.archive.org/web/20190225123327/https://www.tinydeal.com/specials.html>
{'title': '64GB USB 2.0 Flash Drive USB Pen Drive U Disk\xa0EFM-561923', 'url': 'https://web.archive.org/web/20190225123327/https:/www.tinydeal.com/64gb-usb-20-flash-drive-usb-pen-drive-u-disk-p-178875.html', 'discounted_price': '$6.42', 'original_price': '$19.08 ', 'User-Agent': b'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36'}
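
To avoid repeating the header on every Request, a downloader middleware can set (or rotate) the User-Agent centrally. A hedged sketch (hypothetical module tinydeal/middlewares.py; the UA list is illustrative):

# tinydeal/middlewares.py
import random

USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36',
    'Mozilla/5.0 (X11; Linux x86_64; rv:107.0) Gecko/20100101 Firefox/107.0',
]


class RotateUserAgentMiddleware:
    def process_request(self, request, spider):
        # pick a User-Agent at random for every outgoing request
        request.headers['User-Agent'] = random.choice(USER_AGENTS)
        return None  # let the request continue through the pipeline

# enable it in settings.py:
# DOWNLOADER_MIDDLEWARES = {
#     'tinydeal.middlewares.RotateUserAgentMiddleware': 400,
# }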

glassesshop

create project and spider

(myenv10_scrapy) D:\work\run\python_crawler\101-scrapy>scrapy startproject glassesshop
New Scrapy project 'glassesshop', using template directory 'D:\app\python_env\myenv10_scrapy\lib\site-packages\scrapy\templates\project', created in:
D:\work\run\python_crawler\101-scrapy\glassesshop

You can start your first spider with:
cd glassesshop
scrapy genspider example example.com

(myenv10_scrapy) D:\work\run\python_crawler\101-scrapy>cd glassesshop
(myenv10_scrapy) D:\work\run\python_crawler\101-scrapy\glassesshop>scrapy genspider products https://www.glassesshop.com/bestsellers
Created spider 'products' using template 'basic' in module:
glassesshop.spiders.products

https://www.glassesshop.com/robots.txt

User-agent: *
Disallow: /login/
Disallow: /register/
Disallow: /promotion/
Disallow: /cart/
Disallow: /lens?*
Disallow: /lens/new?*
Disallow: *?currency*
Disallow: *?source*
Disallow: *?sort*
Disallow: *?utm_source*
Disallow: *&currency*
Disallow: *?referer*
Disallow: *?PageSpeed*

Sitemap: https://www.glassesshop.com/sitemap.xml

products.py

import scrapy


class ProductsSpider(scrapy.Spider):
    name = 'products'
    allowed_domains = ['www.glassesshop.com']
    start_urls = ['https://www.glassesshop.com/bestsellers']
    page_index = 1  # tracked manually so each item records the page it came from

    def parse(self, response):
        for product in response.xpath('//div[@class="col-12 pb-5 mb-lg-3 col-lg-4 product-list-row text-center product-list-item"]'):
            yield {
                'product_name': product.xpath('.//div[@class="p-title"]/a/text()').get().strip(),
                'product_price': product.xpath('.//div[@class="p-price"]/div/span/text()').get(),
                'product_url': product.xpath('.//div[@class="product-img-outer"]/a/@href').getall(),
                # drop the query string from the lazy-loaded image URL
                'product_image': product.xpath('.//img[@class="lazy d-block w-100 product-img-default"]/@data-src').get().split('?')[0],
                'page_number': self.page_index
            }

        self.page_index += 1
        next_page = response.xpath('//a[@class="page-link"][@rel="next"]/@href').get()
        if next_page:
            yield {
                'link': next_page
            }
            yield scrapy.Request(url=next_page, callback=self.parse)
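Before committing to XPaths like these, they can be tried interactively in the Scrapy shell; a quick session sketch (the value shown matches the JSON output below, but the live page may differ):

scrapy shell https://www.glassesshop.com/bestsellers
>>> response.xpath('//div[@class="p-title"]/a/text()').get().strip()
'Union'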

run

scrapy crawl products -o products.json
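Note that -o appends to an existing file; since Scrapy 2.1 the capital -O variant overwrites it instead:

# overwrite instead of append
scrapy crawl products -O products.json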

json

[
    {
        "product_name": "Union",
        "product_price": "$35.95",
        "product_url": [
            "https://www.glassesshop.com/eyeglasses/fz1750",
            "https://www.glassesshop.com/eyeglasses/fz1733",
            "https://www.glassesshop.com/eyeglasses/fz1731"
        ],
        "product_image": "https://res.glassesshop.com/products/202108/610a547c82bdc.jpg",
        "page_number": 1
    },
    {
        "product_name": "Placerville",
        "product_price": "$14.98",
        "product_url": [
            "https://www.glassesshop.com/eyeglasses/fz2025",
            "https://www.glassesshop.com/eyeglasses/fz2022",
            "https://www.glassesshop.com/eyeglasses/fz2023",
            "https://www.glassesshop.com/eyeglasses/fz2024"
        ],
        "product_image": "https://res.glassesshop.com/products/202209/63292118e1589.jpg",
        "page_number": 1
    },
    ......
    {
        "product_name": "Cloud",
        "product_price": "$45.95",
        "product_url": [
            "https://www.glassesshop.com/eyeglasses/sup1238",
            "https://www.glassesshop.com/eyeglasses/sup1239",
            "https://www.glassesshop.com/eyeglasses/sup1240"
        ],
        "product_image": "https://res.glassesshop.com/products/202109/613efbdd8b577.jpg",
        "page_number": 4
    }
]

imdb(crawl template)

create project and spider

(myenv10_scrapy) D:\work\run\python_crawler\101-scrapy>scrapy startproject imdb
New Scrapy project 'imdb', using template directory 'D:\app\python_env\myenv10_scrapy\lib\site-packages\scrapy\templates\project', created in:
D:\work\run\python_crawler\101-scrapy\imdb

You can start your first spider with:
cd imdb
scrapy genspider example example.com

(myenv10_scrapy) D:\work\run\python_crawler\101-scrapy>cd imdb
(myenv10_scrapy) D:\work\run\python_crawler\101-scrapy\imdb>scrapy genspider -t crawl best_movies imdb.com
Created spider 'best_movies' using template 'crawl' in module:
imdb.spiders.best_movies

best_movies.py

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class BestMoviesSpider(CrawlSpider):
    name = 'best_movies'
    allowed_domains = ['imdb.com']
    start_urls = ['http://imdb.com/']

    rules = (
        Rule(LinkExtractor(allow=r'Items/'), callback='parse_item', follow=True),
        # other link extractor conditions
        # Rule(LinkExtractor(deny=r'Items/'), callback='parse_item', follow=True),
        # Rule(LinkExtractor(restrict_xpaths='//a[@class="active"]'), callback='parse_item', follow=True),
        # Rule(LinkExtractor(restrict_css=''), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        item = {}
        # item['domain_id'] = response.xpath('//input[@id="sid"]/@value').get()
        # item['name'] = response.xpath('//div[@id="name"]').get()
        # item['description'] = response.xpath('//div[@id="description"]').get()
        return item
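The generated rules are only placeholders (allow=r'Items/' matches nothing on IMDb). A LinkExtractor can be tested on its own in the Scrapy shell before wiring it into a Rule; a short sketch:

# inside: scrapy shell "https://www.imdb.com/search/title/?genres=drama&groups=top_250&sort=user_rating,desc"
>>> from scrapy.linkextractors import LinkExtractor
>>> le = LinkExtractor(restrict_xpaths='//h3[@class="lister-item-header"]/a')
>>> le.extract_links(response)[:3]  # first few Link objects found on the page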
best_movies.py
# best_movies.py
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class BestMoviesSpider(CrawlSpider):
    name = 'best_movies'
    allowed_domains = ['imdb.com']
    start_urls = ['https://www.imdb.com/search/title/?genres=drama&groups=top_250&sort=user_rating,desc']

    rules = (
        Rule(LinkExtractor(restrict_xpaths='//h3[@class="lister-item-header"]/a'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        print("************")
        print(response.url)
settings.py
# settings.py
# without overriding the default headers, the site answers with a 403 response
# change the default request headers
DEFAULT_REQUEST_HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36'
}
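An equivalent fix is the USER_AGENT setting, which the built-in UserAgentMiddleware applies to every request:

# settings.py: alternative to DEFAULT_REQUEST_HEADERS
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36'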
run
# run
(myenv10_scrapy) D:\work\run\python_crawler\101-scrapy\imdb>scrapy crawl best_movies
......
2022-12-21 11:35:30 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.imdb.com/search/title/?genres=drama&groups=top_250&sort=user_rating,desc> (referer: None)
2022-12-21 11:35:31 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.imdb.com/title/tt0167260/?ref_=adv_li_tt> (referer: https://www.imdb.com/search/title/?genres=drama&groups=top_250&sort=user_rating,desc)
************
https://www.imdb.com/title/tt0167260/?ref_=adv_li_tt
2022-12-21 11:35:31 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.imdb.com/title/tt0468569/?ref_=adv_li_tt> (referer: https://www.imdb.com/search/title/?genres=drama&groups=top_250&sort=user_rating,desc)
2022-12-21 11:35:31 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.imdb.com/title/tt0071562/?ref_=adv_li_tt> (referer: https://www.imdb.com/search/title/?genres=drama&groups=top_250&sort=user_rating,desc)
2022-12-21 11:35:31 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.imdb.com/title/tt0111161/?ref_=adv_li_tt> (referer: https://www.imdb.com/search/title/?genres=drama&groups=top_250&sort=user_rating,desc)
************
https://www.imdb.com/title/tt0468569/?ref_=adv_li_tt
************
https://www.imdb.com/title/tt0071562/?ref_=adv_li_tt
************
https://www.imdb.com/title/tt0111161/?ref_=adv_li_tt
2022-12-21 11:35:31 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.imdb.com/title/tt0068646/?ref_=adv_li_tt> (referer: https://www.imdb.com/search/title/?genres=drama&groups=top_250&sort=user_rating,desc)
************
https://www.imdb.com/title/tt0068646/?ref_=adv_li_tt
2022-12-21 11:35:31 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.imdb.com/title/tt0108052/?ref_=adv_li_tt> (referer: https://www.imdb.com/search/title/?genres=drama&groups=top_250&sort=user_rating,desc)
2022-12-21 11:35:31 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.imdb.com/title/tt0050083/?ref_=adv_li_tt> (referer: https://www.imdb.com/search/title/?genres=drama&groups=top_250&sort=user_rating,desc)
......

get movies information

runner.py
# runner.py for imdb.spiders.best_movies
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
# import the spider class to run
from imdb.spiders.best_movies import BestMoviesSpider

# load the project settings (settings.py)
process = CrawlerProcess(settings=get_project_settings())
# register the spider and start the crawl
process.crawl(BestMoviesSpider)
process.start()
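With this runner the spider starts as a plain Python script, which also makes debugging from an IDE straightforward; run it from the project root so the imdb package resolves:

(myenv10_scrapy) D:\work\run\python_crawler\101-scrapy\imdb>python runner.py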
best_movies.py
# best_movies.py
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class BestMoviesSpider(CrawlSpider):
    name = 'best_movies'
    allowed_domains = ['imdb.com']
    start_urls = ['https://www.imdb.com/search/title/?genres=drama&groups=top_250&sort=user_rating,desc']

    rules = (
        Rule(LinkExtractor(restrict_xpaths='//h3[@class="lister-item-header"]/a'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        yield {
            'title': response.xpath("//div[@class='sc-80d4314-1 fbQftq']/h1/text()").get(),
            'year': response.xpath("//span[@class='sc-8c396aa2-2 itZqyK']/text()").get(),
            'duration': ''.join(response.xpath("//ul[@class='ipc-inline-list ipc-inline-list--show-dividers sc-8c396aa2-0 kqWovI baseAlt']/li[3]/text()").getall()),
            'genre': response.xpath("//div[@class='ipc-chip-list__scroller']/a/span/text()").getall(),
            'rating': response.xpath("//div[@data-testid='hero-rating-bar__aggregate-rating__score']/span[1]/text()").get(),
            'movie_url': response.url
        }
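A caveat on these selectors: class names like sc-80d4314-1 fbQftq are generated by IMDb's CSS tooling and change often, so data-testid attributes (as already used for the rating) are more durable where they exist. A hypothetical example, to be verified against the live page first:

# hypothetical: 'hero-title-block__title' is an assumed data-testid; check it in the browser
title = response.xpath("//h1[@data-testid='hero-title-block__title']/text()").get()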
run
(myenv10_scrapy) D:\work\run\python_crawler\101-scrapy\imdb>scrapy crawl best_movies
......
2022-12-21 15:57:52 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.imdb.com/title/tt0050083/?ref_=adv_li_tt>
{'title': '十二怒漢', 'year': '1957', 'duration': '1h 36m', 'genre': ['Crime', 'Drama'], 'rating': '9.0', 'movie_url': 'https://www.imdb.com/title/tt0050083/?ref_=adv_li_tt'}
2022-12-21 15:57:52 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.imdb.com/title/tt0108052/?ref_=adv_li_tt> (referer: https://www.imdb.com/search/title/?genres=drama&groups=top_250&sort=user_rating,desc)
2022-12-21 15:57:52 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.imdb.com/title/tt0068646/?ref_=adv_li_tt> (referer: https://www.imdb.com/search/title/?genres=drama&groups=top_250&sort=user_rating,desc)
2022-12-21 15:57:52 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.imdb.com/title/tt0108052/?ref_=adv_li_tt>
{'title': '辛德勒的名單', 'year': '1993', 'duration': '3h 15m', 'genre': ['Biography', 'Drama', 'History'], 'rating': '9.0', 'movie_url': 'https://www.imdb.com/title/tt0108052/?ref_=adv_li_tt'}
2022-12-21 15:57:52 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.imdb.com/title/tt0068646/?ref_=adv_li_tt>
{'title': '教父', 'year': '1972', 'duration': '2h 55m', 'genre': ['Crime', 'Drama'], 'rating': '9.2', 'movie_url': 'https://www.imdb.com/title/tt0068646/?ref_=adv_li_tt'}
2022-12-21 15:57:52 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.imdb.com/title/tt0110912/?ref_=adv_li_tt> (referer: https://www.imdb.com/search/title/?genres=drama&groups=top_250&sort=user_rating,desc)
2022-12-21 15:57:52 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.imdb.com/title/tt0167260/?ref_=adv_li_tt> (referer: https://www.imdb.com/search/title/?genres=drama&groups=top_250&sort=user_rating,desc)
2022-12-21 15:57:53 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.imdb.com/title/tt0110912/?ref_=adv_li_tt>
{'title': '黑色追緝令', 'year': '1994', 'duration': '2h 34m', 'genre': ['Crime', 'Drama'], 'rating': '8.9', 'movie_url': 'https://www.imdb.com/title/tt0110912/?ref_=adv_li_tt'}
2022-12-21 15:57:53 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.imdb.com/title/tt0167260/?ref_=adv_li_tt>
......
{'downloader/request_bytes': 30424,
'downloader/request_count': 52,
'downloader/request_method_count/GET': 52,
'downloader/response_bytes': 9412156,
'downloader/response_count': 52,
'downloader/response_status_count/200': 52,
'elapsed_time_seconds': 15.156407,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2022, 12, 21, 9, 5, 16, 429558),
'httpcompression/response_bytes': 51156690,
'httpcompression/response_count': 50,
# item_scraped_count
'item_scraped_count': 50,
'log_count/DEBUG': 109,
'log_count/INFO': 10,
'request_depth_max': 1,
'response_received_count': 52,
'robotstxt/request_count': 1,
'robotstxt/response_count': 1,
'robotstxt/response_status_count/200': 1,
'scheduler/dequeued': 51,
'scheduler/dequeued/memory': 51,
'scheduler/enqueued': 51,
'scheduler/enqueued/memory': 51,
'start_time': datetime.datetime(2022, 12, 21, 9, 5, 1, 273151)}
2022-12-21 17:05:16 [scrapy.core.engine] INFO: Spider closed (finished)

Following links in pagination

best_movies.py
# best_movies.py
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class BestMoviesSpider(CrawlSpider):
    name = 'best_movies'
    allowed_domains = ['imdb.com']
    start_urls = ['https://www.imdb.com/search/title/?genres=drama&groups=top_250&sort=user_rating,desc']

    rules = (
        Rule(LinkExtractor(restrict_xpaths="//h3[@class='lister-item-header']/a"), callback='parse_item', follow=True),
        # follow the "next page" link of the result list
        Rule(LinkExtractor(restrict_xpaths="(//a[@class='lister-page-next next-page'])[2]"))
    )

    def parse_item(self, response):
        yield {
            'title': response.xpath("//div[@class='sc-80d4314-1 fbQftq']/h1/text()").get(),
            'year': response.xpath("//span[@class='sc-8c396aa2-2 itZqyK']/text()").get(),
            'duration': ''.join(response.xpath("//ul[@class='ipc-inline-list ipc-inline-list--show-dividers sc-8c396aa2-0 kqWovI baseAlt']/li[3]/text()").getall()),
            'genre': response.xpath("//div[@class='ipc-chip-list__scroller']/a/span/text()").getall(),
            'rating': response.xpath("//div[@data-testid='hero-rating-bar__aggregate-rating__score']/span[1]/text()").get(),
            'movie_url': response.url
        }
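The second Rule has no callback, so it only follows the extracted next-page links (follow defaults to True when callback is None). Once a crawl starts walking every page like this, throttling is worth considering; a small settings.py sketch with example values:

# settings.py: politeness settings (values are examples)
DOWNLOAD_DELAY = 1           # seconds between requests to the same site
AUTOTHROTTLE_ENABLED = True  # let Scrapy adapt the delay to server load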
run
# run
(myenv10_scrapy) D:\work\run\python_crawler\101-scrapy\imdb>scrapy crawl best_movies
......
2022-12-21 17:12:56 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.imdb.com/title/tt0050083/?ref_=adv_li_tt> (referer: https://www.imdb.com/search/title/?genres=drama&groups=top_250&sort=user_rating,desc)
2022-12-21 17:12:56 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.imdb.com/title/tt0108052/?ref_=adv_li_tt> (referer: https://www.imdb.com/search/title/?genres=drama&groups=top_250&sort=user_rating,desc)
2022-12-21 17:12:56 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.imdb.com/title/tt0050083/?ref_=adv_li_tt>
{'title': '十二怒漢', 'year': '1957', 'duration': '1h 36m', 'genre': ['Crime', 'Drama'], 'rating': '9.0', 'movie_url': 'https://www.imdb.com/title/tt0050083/?ref_=adv_li_tt'}
2022-12-21 17:12:56 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.imdb.com/title/tt0468569/?ref_=adv_li_tt> (referer: https://www.imdb.com/search/title/?genres=drama&groups=top_250&sort=user_rating,desc)
2022-12-21 17:12:56 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.imdb.com/title/tt0108052/?ref_=adv_li_tt>
{'title': '辛德勒的名單', 'year': '1993', 'duration': '3h 15m', 'genre': ['Biography', 'Drama', 'History'], 'rating': '9.0', 'movie_url': 'https://www.imdb.com/title/tt0108052/?ref_=adv_li_tt'}
2022-12-21 17:12:56 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.imdb.com/title/tt0111161/?ref_=adv_li_tt> (referer: https://www.imdb.com/search/title/?genres=drama&groups=top_250&sort=user_rating,desc)
2022-12-21 17:12:56 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.imdb.com/title/tt0468569/?ref_=adv_li_tt>
{'title': '黑暗騎士', 'year': '2008', 'duration': '2h 32m', 'genre': ['Action', 'Crime', 'Drama'], 'rating': '9.0', 'movie_url': 'https://www.imdb.com/title/tt0468569/?ref_=adv_li_tt'}
2022-12-21 17:12:56 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.imdb.com/title/tt0111161/?ref_=adv_li_tt>
{'title': '刺激1995', 'year': '1994', 'duration': '2h 22m', 'genre': ['Drama'], 'rating': '9.3', 'movie_url': 'https://www.imdb.com/title/tt0111161/?ref_=adv_li_tt'}
......
{'downloader/request_bytes': 135320,
'downloader/request_count': 186,
'downloader/request_method_count/GET': 186,
'downloader/response_bytes': 33685481,
'downloader/response_count': 186,
'downloader/response_status_count/200': 186,
'elapsed_time_seconds': 37.338221,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2022, 12, 21, 9, 15, 57, 911814),
'httpcompression/response_bytes': 183562546,
'httpcompression/response_count': 181,
# item_scraped_count
'item_scraped_count': 181,
'log_count/DEBUG': 374,
'log_count/INFO': 10,
'request_depth_max': 4,
'response_received_count': 186,
'robotstxt/request_count': 1,
'robotstxt/response_count': 1,
'robotstxt/response_status_count/200': 1,
'scheduler/dequeued': 185,
'scheduler/dequeued/memory': 185,
'scheduler/enqueued': 185,
'scheduler/enqueued/memory': 185,
'start_time': datetime.datetime(2022, 12, 21, 9, 15, 20, 573593)}
2022-12-21 17:15:57 [scrapy.core.engine] INFO: Spider closed (finished)

add request headers

best_movies.py
# best_movies.py
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class BestMoviesSpider(CrawlSpider):
    name = 'best_movies'
    allowed_domains = ['imdb.com']

    # change user agent: issue the first request manually instead of using start_urls
    # start_urls = ['https://www.imdb.com/search/title/?genres=drama&groups=top_250&sort=user_rating,desc']
    user_agent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36'

    def start_requests(self):
        yield scrapy.Request(url='https://www.imdb.com/search/title/?genres=drama&groups=top_250&sort=user_rating,desc', headers={
            'User-Agent': self.user_agent
        })

    rules = (
        Rule(LinkExtractor(restrict_xpaths="//h3[@class='lister-item-header']/a"), callback='parse_item', follow=True, process_request='set_user_agent'),
        # next page rule
        Rule(LinkExtractor(restrict_xpaths="(//a[@class='lister-page-next next-page'])[2]"))
    )

    # since Scrapy 2.0 a Rule's process_request callable receives (request, response)
    def set_user_agent(self, request, response):
        request.headers['User-Agent'] = self.user_agent
        return request

    def parse_item(self, response):
        yield {
            'title': response.xpath("//div[@class='sc-80d4314-1 fbQftq']/h1/text()").get(),
            'year': response.xpath("//span[@class='sc-8c396aa2-2 itZqyK']/text()").get(),
            'duration': ''.join(response.xpath("//ul[@class='ipc-inline-list ipc-inline-list--show-dividers sc-8c396aa2-0 kqWovI baseAlt']/li[3]/text()").getall()),
            'genre': response.xpath("//div[@class='ipc-chip-list__scroller']/a/span/text()").getall(),
            'rating': response.xpath("//div[@data-testid='hero-rating-bar__aggregate-rating__score']/span[1]/text()").get(),
            'movie_url': response.url,
            'user-agent': response.request.headers['User-Agent']
        }
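process_request only covers the links matched by that Rule; to stamp every request project-wide, a small custom downloader middleware is another option. A sketch, assuming it lives in the project's middlewares.py:

# middlewares.py: sketch of a downloader middleware that sets the User-Agent
class CustomUserAgentMiddleware:
    ua = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36'

    def process_request(self, request, spider):
        request.headers['User-Agent'] = self.ua
        return None  # None means: continue handling this request normally

# settings.py: enable it (the 400 priority is an example)
# DOWNLOADER_MIDDLEWARES = {
#     'imdb.middlewares.CustomUserAgentMiddleware': 400,
# }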
run
(myenv10_scrapy) D:\work\run\python_crawler\101-scrapy\imdb>scrapy crawl best_movies
......
2022-12-21 20:08:51 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.imdb.com/search/title/?genres=drama&groups=top_250&sort=user_rating,desc> (referer: None)
2022-12-21 20:08:53 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.imdb.com/title/tt0110912/?ref_=adv_li_tt> (referer: https://www.imdb.com/search/title/?genres=drama&groups=top_250&sort=user_rating,desc)
2022-12-21 20:08:53 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.imdb.com/title/tt0110912/?ref_=adv_li_tt>
{'title': 'Pulp Fiction', 'year': '1994', 'duration': '2h 34m', 'genre': ['Crime', 'Drama'], 'rating': '8.9', 'movie_url': 'https://www.imdb.com/title/tt0110912/?ref_=adv_li_tt', 'user-agent': b'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36'}
2022-12-21 20:08:53 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.imdb.com/title/tt0071562/?ref_=adv_li_tt> (referer: https://www.imdb.com/search/title/?genres=drama&groups=top_250&sort=user_rating,desc)
2022-12-21 20:08:53 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.imdb.com/title/tt0167260/?ref_=adv_li_tt> (referer: https://www.imdb.com/search/title/?genres=drama&groups=top_250&sort=user_rating,desc)
2022-12-21 20:08:54 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.imdb.com/title/tt0468569/?ref_=adv_li_tt> (referer: https://www.imdb.com/search/title/?genres=drama&groups=top_250&sort=user_rating,desc)
2022-12-21 20:08:54 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.imdb.com/title/tt0111161/?ref_=adv_li_tt> (referer: https://www.imdb.com/search/title/?genres=drama&groups=top_250&sort=user_rating,desc)
2022-12-21 20:08:54 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.imdb.com/title/tt0108052/?ref_=adv_li_tt> (referer: https://www.imdb.com/search/title/?genres=drama&groups=top_250&sort=user_rating,desc)
2022-12-21 20:08:54 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.imdb.com/title/tt0071562/?ref_=adv_li_tt>
{'title': 'The Godfather Part II', 'year': '1974', 'duration': '3h 22m', 'genre': ['Crime', 'Drama'], 'rating': '9.0', 'movie_url': 'https://www.imdb.com/title/tt0071562/?ref_=adv_li_tt', 'user-agent': b'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36'}
2022-12-21 20:08:54 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.imdb.com/title/tt0167260/?ref_=adv_li_tt>
{'title': 'The Lord of the Rings: The Return of the King', 'year': '2003', 'duration': '3h 21m', 'genre': ['Action', 'Adventure', 'Drama'], 'rating': '9.0', 'movie_url': 'https://www.imdb.com/title/tt0167260/?ref_=adv_li_tt', 'user-agent': b'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36'}
2022-12-21 20:08:54 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.imdb.com/title/tt0468569/?ref_=adv_li_tt>
{'title': 'The Dark Knight', 'year': '2008', 'duration': '2h 32m', 'genre': ['Action', 'Crime', 'Drama'], 'rating': '9.0', 'movie_url': 'https://www.imdb.com/title/tt0468569/?ref_=adv_li_tt', 'user-agent': b'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36'}
2022-12-21 20:08:54 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.imdb.com/title/tt0050083/?ref_=adv_li_tt> (referer: https://www.imdb.com/search/title/?genres=drama&groups=top_250&sort=user_rating,desc)
2022-12-21 20:08:54 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.imdb.com/title/tt0111161/?ref_=adv_li_tt>
{'title': 'The Shawshank Redemption', 'year': '1994', 'duration': '2h 22m', 'genre': ['Drama'], 'rating': '9.3', 'movie_url': 'https://www.imdb.com/title/tt0111161/?ref_=adv_li_tt', 'user-agent': b'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36'}
2022-12-21 20:08:54 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.imdb.com/title/tt0108052/?ref_=adv_li_tt>
{'title': "Schindler's List", 'year': '1993', 'duration': '3h 15m', 'genre': ['Biography', 'Drama', 'History'], 'rating': '9.0', 'movie_url': 'https://www.imdb.com/title/tt0108052/?ref_=adv_li_tt', 'user-agent': b'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36'}
2022-12-21 20:08:54 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.imdb.com/title/tt0068646/?ref_=adv_li_tt> (referer: https://www.imdb.com/search/title/?genres=drama&groups=top_250&sort=user_rating,desc)
2022-12-21 20:08:54 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.imdb.com/title/tt0050083/?ref_=adv_li_tt>
{'title': '12 Angry Men', 'year': '1957', 'duration': '1h 36m', 'genre': ['Crime', 'Drama'], 'rating': '9.0', 'movie_url': 'https://www.imdb.com/title/tt0050083/?ref_=adv_li_tt', 'user-agent': b'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36'}
2022-12-21 20:08:54 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.imdb.com/title/tt0068646/?ref_=adv_li_tt>
{'title': 'The Godfather', 'year': '1972', 'duration': '2h 55m', 'genre': ['Crime', 'Drama'], 'rating': '9.2', 'movie_url': 'https://www.imdb.com/title/tt0068646/?ref_=adv_li_tt', 'user-agent': b'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36'}
2022-12-21 20:08:55 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.imdb.com/title/tt15097216/?ref_=adv_li_tt> (referer: https://www.imdb.com/search/title/?genres=drama&groups=top_250&sort=user_rating,desc)
2022-12-21 20:08:55 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.imdb.com/title/tt15097216/?ref_=adv_li_tt>
.....
{'downloader/request_bytes': 153035,
'downloader/request_count': 186,
'downloader/request_method_count/GET': 186,
'downloader/response_bytes': 33336260,
'downloader/response_count': 186,
'downloader/response_status_count/200': 186,
'elapsed_time_seconds': 37.364846,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2022, 12, 21, 12, 9, 27, 445944),
'httpcompression/response_bytes': 182944864,
'httpcompression/response_count': 181,
'item_scraped_count': 181,
'log_count/DEBUG': 374,
'log_count/INFO': 10,
'request_depth_max': 4,
'response_received_count': 186,
'robotstxt/request_count': 1,
'robotstxt/response_count': 1,
'robotstxt/response_status_count/200': 1,
'scheduler/dequeued': 185,
'scheduler/dequeued/memory': 185,
'scheduler/enqueued': 185,
'scheduler/enqueued/memory': 185,
'start_time': datetime.datetime(2022, 12, 21, 12, 8, 50, 81098)}
2022-12-21 20:09:27 [scrapy.core.engine] INFO: Spider closed (finished)

books.toscrape.com

generate project and spider

(myenv10_scrapy) D:\work\run\python_crawler\101-scrapy>scrapy startproject toscrape
New Scrapy project 'toscrape', using template directory 'D:\app\python_env\myenv10_scrapy\lib\site-packages\scrapy\templates\project', created in:
D:\work\run\python_crawler\101-scrapy\toscrape
You can start your first spider with:
cd toscrape
scrapy genspider example example.com

(myenv10_scrapy) D:\work\run\python_crawler\101-scrapy>cd toscrape
(myenv10_scrapy) D:\work\run\python_crawler\101-scrapy\toscrape>scrapy genspider -t crawl books www.udemy.com/course/web-scraping-in-python-using-scrapy-and-splash/learn/lecture/16263060#overview
Created spider 'books' using template 'crawl' in module:
toscrape.spiders.books

books.py (note: the URL handed to genspider above was apparently pasted by mistake; the spider below was edited to crawl books.toscrape.com)

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class BooksSpider(CrawlSpider):
    name = 'books'
    allowed_domains = ['books.toscrape.com']
    start_urls = ['http://books.toscrape.com/']

    rules = (
        Rule(LinkExtractor(restrict_xpaths='//h3/a'), callback='parse_item', follow=True),
        Rule(LinkExtractor(restrict_xpaths="//li[@class='next']/a"))
    )

    def parse_item(self, response):
        yield {
            'book_name': response.xpath("//h1/text()").get(),
            'book_price': response.xpath("//p[@class='price_color']/text()").get()
        }
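Feed exports can also be declared in settings.py instead of on the command line (the FEEDS setting exists since Scrapy 2.1, its overwrite key since 2.4); a minimal sketch equivalent to the -o flag used below:

# settings.py: declarative feed export
FEEDS = {
    'books.csv': {'format': 'csv', 'overwrite': True},
}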

run

scrapy crawl books -o books.csv
......
2022-12-21 22:31:40 [scrapy.core.scraper] DEBUG: Scraped from <200 http://books.toscrape.com/catalogue/jane-eyre_27/index.html>
{'book_name': 'Jane Eyre', 'book_price': '£38.43'}
2022-12-21 22:31:40 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-50.html> (referer: http://books.toscrape.com/catalogue/page-49.html)
2022-12-21 22:31:41 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/frankenstein_20/index.html> (referer: http://books.toscrape.com/catalogue/page-50.html)
2022-12-21 22:31:41 [scrapy.core.scraper] DEBUG: Scraped from <200 http://books.toscrape.com/catalogue/frankenstein_20/index.html>
{'book_name': 'Frankenstein', 'book_price': '£38.00'}
2022-12-21 22:31:41 [scrapy.core.engine] INFO: Closing spider (finished)
2022-12-21 22:31:41 [scrapy.extensions.feedexport] INFO: Stored csv feed (1000 items) in: books.csv
2022-12-21 22:31:41 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 379636,
'downloader/request_count': 1051,
'downloader/request_method_count/GET': 1051,
'downloader/response_bytes': 22126017,
'downloader/response_count': 1051,
'downloader/response_status_count/200': 1050,
'downloader/response_status_count/404': 1,
'dupefilter/filtered': 5979,
'elapsed_time_seconds': 50.440181,
'feedexport/success_count/FileFeedStorage': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2022, 12, 21, 14, 31, 41, 402564),
'item_scraped_count': 1000,
'log_count/DEBUG': 2055,
'log_count/INFO': 11,
'request_depth_max': 51,
'response_received_count': 1051,
'robotstxt/request_count': 1,
'robotstxt/response_count': 1,
'robotstxt/response_status_count/404': 1,
'scheduler/dequeued': 1050,
'scheduler/dequeued/memory': 1050,
'scheduler/enqueued': 1050,
'scheduler/enqueued/memory': 1050,
'start_time': datetime.datetime(2022, 12, 21, 14, 30, 50, 962383)}
2022-12-21 22:31:41 [scrapy.core.engine] INFO: Spider closed (finished)