
How the Python Scrapy Framework Works

Date: 2022-06-30 14:19:23

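Writing crawlers in Python relies on two important skills: regular expressions and the Scrapy framework. Regular expressions work in much the same way across programming languages, and plenty of learning resources for them are available online.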

The hand-drawn diagram below illustrates how the Scrapy framework works.

[Figure: hand-drawn diagram of the Scrapy framework architecture]

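Below is a spider created with Scrapy. It is generated from the built-in crawl template so that it builds on Scrapy's CrawlSpider class. Compared with a basic spider, CrawlSpider provides extra attributes and methods that are useful when crawling a site: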

$ scrapy genspider country_or_district example.python-scraping.com --template=crawl

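Running the genspider command creates example/spiders/country_or_district.py from the template. Filled in with crawl rules and a parse_item callback, the spider looks like this: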

# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from example.items import CountryOrDistrictItem


class CountryOrDistrictSpider(CrawlSpider):
    name = 'country_or_district'
    allowed_domains = ['example.python-scraping.com']
    start_urls = ['http://example.python-scraping.com/']

    rules = (
        # Follow index (pagination) links, but never user pages.
        Rule(LinkExtractor(allow=r'/index/', deny=r'/user/'),
             follow=True),
        # Pass detail pages to parse_item, again skipping user pages.
        Rule(LinkExtractor(allow=r'/view/', deny=r'/user/'),
             callback='parse_item'),
    )

    def parse_item(self, response):
        item = CountryOrDistrictItem()
        # Select the name cell with a CSS selector.
        name_css = 'tr#places_country_or_district__row td.w2p_fw::text'
        item['name'] = response.css(name_css).extract()
        # Select the population cell with XPath; the inner quotes must
        # differ from the outer ones so the string literal stays valid.
        pop_xpath = '//tr[@id="places_population__row"]/td[@class="w2p_fw"]/text()'
        item['population'] = response.xpath(pop_xpath).extract()
        return item
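To make the two selectors in parse_item more concrete, here is a minimal sketch that runs an equivalent path query with the standard library's xml.etree.ElementTree instead of Scrapy. The SAMPLE_HTML fragment, its placeholder values, and the extract_fields helper are assumptions made for illustration; the real page markup is only inferred from the row ids that appear in the selectors.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed fragment modelled on the row ids used in the
# spider's selectors; the real page may be structured differently.
SAMPLE_HTML = """
<table>
  <tr id="places_country_or_district__row">
    <td class="w2p_fl">Country or District:</td>
    <td class="w2p_fw">Exampleland</td>
  </tr>
  <tr id="places_population__row">
    <td class="w2p_fl">Population:</td>
    <td class="w2p_fw">12345</td>
  </tr>
</table>
"""


def extract_fields(markup):
    """Return the same two fields parse_item collects, as lists of strings."""
    root = ET.fromstring(markup)

    def cell_texts(row_id):
        # ElementTree supports only a subset of XPath, so the text is read
        # from .text rather than with a trailing /text() step.
        path = ".//tr[@id='%s']/td[@class='w2p_fw']" % row_id
        return [td.text for td in root.findall(path)]

    return {
        "name": cell_texts("places_country_or_district__row"),
        "population": cell_texts("places_population__row"),
    }


fields = extract_fields(SAMPLE_HTML)
print(fields)
```

Each field comes back as a list of strings, which mirrors the fact that extract() in the spider returns every match rather than a single value.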

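The spider class has the following attributes:

name: a string that identifies the spider.
allowed_domains: a list of domains that may be crawled. If this attribute is not set, any domain may be crawled.
start_urls: a list of URLs where the crawl begins.
rules: a tuple of Rule objects, defined with regular expressions, that tell the spider which links to follow and which links lead to pages with content worth scraping.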
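The way the allow and deny patterns in rules sort links can be sketched with the standard re module. This is a simplified imitation, not Scrapy's actual implementation: the classify helper, the RULES table, and the sample URLs are hypothetical, and the sketch assumes that a URL is accepted when it matches the allow pattern and does not match the deny pattern, with the first accepting rule winning.

```python
import re

# (allow, deny, action) triples mirroring the two Rule objects in the spider.
# "follow" means keep crawling links found on that page;
# "parse_item" means hand the page to the scraping callback.
RULES = [
    (re.compile(r"/index/"), re.compile(r"/user/"), "follow"),
    (re.compile(r"/view/"), re.compile(r"/user/"), "parse_item"),
]


def classify(url):
    """Return the action of the first rule that accepts url, or None."""
    for allow, deny, action in RULES:
        if allow.search(url) and not deny.search(url):
            return action
    return None


# Hypothetical URLs on the example domain, used only to exercise the rules.
urls = [
    "http://example.python-scraping.com/places/default/index/1",
    "http://example.python-scraping.com/places/default/view/Example-1",
    "http://example.python-scraping.com/places/default/user/login?_next=/places/default/index/1",
    "http://example.python-scraping.com/places/default/edit/Example-1",
]
results = {url: classify(url) for url in urls}
for url, action in results.items():
    print(action, url)
```

The third URL shows why the deny pattern matters: it contains /index/ in its query string, yet it is rejected because it also matches /user/.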
